Table of Links

Abstract and 1. Introduction

2. Method

3. Experiments on real data

3.1. Benefits scale with model size and 3.2. Faster inference

3.3. Learning global patterns with multi-byte prediction and 3.4. Searching for the optimal n

3.5. Training for multiple epochs and 3.6. Finetuning multi-token predictors

3.7. Multi-token prediction on natural language

4. Ablations on synthetic data and 4.1. Induction capability

4.2. Algorithmic reasoning

5. Why does it work? Some speculation and 5.1. Lookahead reinforces choice points

5.2. Information-theoretic argument

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements, and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

C. Training speeds

:::info
This paper is available on arxiv under CC BY 4.0 DEED license.
:::

:::info
Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and contributed equally;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and contributed equally;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and the last author;

(5) Gabriel Synnaeve, FAIR at Meta and the last author.
:::