Efficient Training of Language Models to Fill in the Middle

We show that autoregressive language models can learn to infill text after we apply a simple transformation to the dataset, which moves a span of text from the middle of a document to its end. While this data augmentation has attracted much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill in the middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices for training FIM models. We have released our best infilling model trained with these best practices in our API, and we are publishing our infilling benchmarks to aid future research.
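To make the transformation concrete, here is a minimal character-level sketch of the FIM data augmentation described above. The function name, the `fim_rate` default, and the `<PRE>`/`<SUF>`/`<MID>` sentinel strings are illustrative placeholders; in practice the transformation operates on token sequences and uses dedicated sentinel tokens reserved in the tokenizer.

```python
import random

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    """Sketch of the fill-in-the-middle (FIM) data transformation.

    With probability `fim_rate`, split the document at two random
    points into (prefix, middle, suffix) and reorder the pieces so
    the middle span appears at the end, marked by sentinel strings.
    Otherwise, leave the document in ordinary left-to-right order.
    """
    if random.random() >= fim_rate:
        return document

    # Choose two cut points uniformly at random and sort them.
    lo, hi = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]

    # Prefix-suffix-middle (PSM) ordering: the middle span is moved
    # to the end, so a left-to-right model learns to generate it last.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

print(fim_transform("The quick brown fox jumps over the lazy dog."))
```

The `fim_rate` hyperparameter corresponds to the data transformation frequency ablated in the paper; the ordering of the pieces corresponds to the transformation-structure ablation.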