Textbooks, Not the Internet, Trained This Powerful AI

Wait 5 sec.

:::infoAuthors:Yuanzhi Li, Microsoft ResearchS´ebastien Bubeck, Microsoft Research Ronen Eldan, Microsoft Research Allie Del Giorno, Microsoft Research Suriya Gunasekar, Microsoft Research Yin Tat Lee, Microsoft Research :::\AbstractWe continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories – a 10 million parameter model that can produce coherent English – and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate “textbook quality” data as a way to enhance the learning process compared to traditional web data. We follow the “Textbooks Are All You Need” approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good –such as the ability to “think step by step” or perform some rudimentary in-context learning– and bad, including hallucinations and the potential for toxic and biased generations –encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.\ \1        IntroductionOver the past few years, Large Language Models (LLMs) have transformed the field of Natural Language Processing. More broadly, they hold the promise of a paradigm shift for human-computer interaction. These advancements have far-reaching economic implications, as well as the potential to redefine our conceptual frameworks of artificial intelligence and perhaps even cognition itself. Moreover, the latest generation of models such as GPT-4 [Ope23] have demonstrated remarkable improvements over their predecessors, offering capabilities previously thought to be unattainable in the short term; see for example [BCE+23] for an in-depth comparison between GPT-4 and its predecessor GPT-3.5.The improvement from one generation of LLMs to the next seems at the moment to primarily stem from scale, with the most powerful models nearing trillions of parameters and trillion of tokens for training data (for example, PaLM [CND+22] has 540 billion parameters and was trained on 780 billion tokens). A natural question arises: Is this large scale indispensable for achieving high levels of capability? Far from being merely an academic question, answering this holds implications across several dimensions. Economically, the cost of training, deploying, and maintaining such large models can be substantial. Scientifically, understanding whether similar capabilities can be achieved at a smaller scale could provide insights into the architectures and development of intelligent systems. From a responsible AI standpoint, the energy consumption of large-scale models is becoming an increasing concern, as is the question of how controllable or governable these large models can be. Finally, the ability to train compact models with cutting-edge capabilities would democratize advanced AI, enabling a broader range of individuals and organizations to study and deploy them, instead of being an exclusive domain of a few with vast computational resources.In this work we continue the investigation into the fundamental question of “how small can a LLM be to achieve certain capabilities”. The prior work [EL23] considered this question for the task of “speaking fluent English”, while the subsequent work [GZA+23] considered the more challenging task of coding simple functions in Python. Here we focus on the more elusive concept of common sense reasoning, a notoriously challenging task for AI [SBBC21]. Our results are summarized in Figure 1. In a nutshell we build phi-1.5, a 1.3 billion parameter model trained on a dataset of 30 billion tokens, which achieves common sense reasoning benchmark results comparable to models ten times its size that were trained on datasets more than ten times larger. Moreover, our dataset consists almost exclusively of synthetically generated data (closely following the approach from [GZA+23], see next section for more details), which has important implications for the potential to control for the notoriously challenging issue of toxic and biased content generation with LLMs [BGMMS21]. Additionally, we discuss the performance of a related filtered web data enhanced version of phi-1.5, which we call phi-1.5-web .We open-source our raw phi-1.5 model (without instruction fine-tuning or any other stage of align-ment) to empower the research community in its work on some of the most urgent questions around LLMs: in-context learning, mechanistic interpretability, and mitigation strategies for hallucinations, toxic content generation, and biased outputs. Indeed, phi-1.5 is the first LLM at the one billion param-eters scale to exhibit most of the relevant traits of larger LLMs for research on these topics. We hope that phi-1.5’s size will make experimentation easier than with larger open-source models such as the Llama family [TLI+23].\ \2        Technical specificationsWe give here details of the creation of phi-1.5 . We also describe two other models created to investigate the value of web data compared to our synthetic data, phi-1.5-web-only and phi-1.5-web .\2.1       ArchitectureThe architecture for phi-1.5 (and its variants) is exactly the same as our previous model phi-1 in [GZA+23]. It is a Transformer [VSP+17] with 24 layers, 32 heads, and each head has dimension 64. We use rotary embedding with rotary dimension 32, and context length 2048. We also use flash-attention [DFE+22, Dao23] for training speed up, and we use the tokenizer of codegen-mono [NPH+22].\2.2       Training dataOur training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+23]).We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.\2.3       Training detailsWe train phi-1.5 starting from random initialization with constant learning rate 2e − 4 (no warm up)1, weight decay 0*.1. We use Adam optimizer with momentum 0.9,* 0*.98, and epsilon 1e* − 7. We use fp16 with DeepSpeed ZeRO Stage 2 [RRRH20]. We use batch size 2048, and train for 150B tokens, with 80% from the newly created synthetic data and 20% from phi-1 ’s training data.\2.4       Filtered web dataTo probe the importance of traditional web data we created two other models, phi-1.5-web-only and phi-1.5-web . To do so we create a dataset of 95B tokens of filtered web data following the filtering technique in [GZA+23]. This filtered web data consists of 88B tokens filtered from the Falcon refined web dataset [PMH+23], and 7B tokens of code data filtered from The Stack [KLA+22] and StackOverflow.Our phi-1.5-web-only model is trained purely on the filtered web data with about 80% training tokens from NLP data sources and 20% from code datasets (no synthetic data). Our phi-1.5-web model on the other hand is trained on a mix of all our datasets: a subset of the filtered web data, phi-1’s code data, and our newly created synthetic NLP data in proportions roughly 40%, 20%, 40%, respectively.Remark: None of our models have undergrone instruction finetuning or RLHF. Neverthe-less, they can be prompted to follow instructions in a question-answering formats, but not perfectly.\n 3        Benchmark resultsWe evaluate our models on standard natural language benchmarks, including common sense reasoning, language understanding, mathematics and coding. For common sense we pick five of the most widely used ones: WinoGrande [SLBBC19], ARC-Easy [PRR19], ARC-Challenge [Fer21], BoolQ [CLC+19], and SIQA [BB21]. We report zero-shot accuracy using LM-Eval Harness [GTB+21]. phi-1.5 achieves comparable results to Llama2-7B, Falcon-7B and Vicuna-13B on nearly all of the benchmarks.\ \Interestingly, one can see that our phi-1.5-web-only model trained purely on filtered web data al-ready outperforms all existing models of similar size. The comparison with Falcon-rw-1.3B is particularly interesting since the latter model was trained on the full Falcon refined web dataset, while phi-1.5-web-only was trained on only 15% of that dataset. Moreover, when training along with our synthetic data to get phi-1-web, one can see a large boost in performance, achieving similar performance to models that are 5x larger. Without any web data at all, phi-1.5 is also comparable to all of the other models. Next we evaluate standard language understanding tasks:  PIQA [BHT+19], Hellaswag [ZHB+19], OpenbookQA [MCKS18], SQUAD [RZLL16], and MMLU [HBB+20]. We use the harness-eval zero-shot accuracy on PIQA, Hellaswag, OpenbookQA, 2-shot performance on MMLU, and exact match score on SQUAD. Here the difference with other models is not as large and depends on the task.\ \Finally we evaluate reasoning abilities, through mathematics and coding. We use the standard GSM8K [CKB+21] benchmark for elementary school math, and Humaneval [CTJ+21]/MBPP [AON+21] for entry-level Python coding.  We only consider zero-shot pass@1 accuracy.  We can see that phi-1.5 outperforms all existing models, including Llama 65B on coding tasks. One can also see that the web data does help more here, as phi-1.5-web outperforms phi-1.5 somewhat significantly on those reasoning tasks. Interestingly we can see that phi-1.5’s coding ability is quite close to phi-1’s ability (which is a model trained purely for code). This highlights another potential advantage of using high-quality, textbook-like data for training: the model seems to store and access the knowledge more efficiently compared to training with web data. Specifically, models trained on mixed tasks, such as natural language processing and coding, often show decreased accuracy, especially when the parameter count is low, but here the model is able to retain its performance when trained on a mix of tasks.\ \4        Addressing Toxicity and BiasesToxic and biased content generation remains an ongoing challenge for language models [WUR+22, HPA23]. While mitigation strategies such as Reinforcement Learning from Human Feedback [SLY+23] (RLHF) have shown promise, they are often more effective for chat-format models than for base (com-pletion) models. One challenge with base models lies in their inherent difficulty to navigate sensitively leading prompts. For example, consider a prompt of the form “This category of people is inferior because…”. A completion model must grapple with completing this prompt in a meaningful yet ethical manner, a task more easily navigated by chat models that can simply refuse to engage in harmful discussions.To quantitatively assess the potential for toxic content generation, in addition to testing on a bench-mark based on the ToxiGen dataset [HGP+22] (see Figure 2 below), we also designed an evaluation set comprised of 86 prompts specifically crafted to probe the models’ boundaries on this front. We graded the model response manually as ‘fail’ (bad), ‘pass’ (good), and ‘did not understand’. Of the 86 prompts, phi-1.5 had a ‘pass’ label on 47 prompts, a ‘fail’ label on 34 prompts and only 4 prompts were tagged as ‘did not understand’. While these numbers are far from ideal, they are substantially better than Llama2-7B and Falcon-7B, which failed on 54 and 50 prompts respectively, and had a ‘did not understand’ tag on 13 and 17 prompts, respectively, thus passing on