Many recent language model (LM) successes have been achieved within a “static paradigm”, where the focus is on improving performance on benchmarks created without considering the temporal aspect of the data. For instance, answering questions about events that the model could have learned about during training, or evaluating on text sub-sampled from the same period as the training data. However, our language and knowledge are dynamic and ever-evolving. Therefore, to enable more realistic evaluation of question-answering models for the next leap in performance, it’s essential to ensure they are flexible and robust when encountering new and unseen data.
In 2021, we published Mind the Gap: Assessing Temporal Generalization in Neural Language Models, along with dynamic language modeling benchmarks for WMT and arXiv, to facilitate evaluation of language models that takes temporal dynamics into account. In this paper, we highlighted issues that current state-of-the-art large LMs face with temporal generalization and found that knowledge-intensive tokens take a considerable performance hit.
Today, we’re publishing two papers and a new benchmark that further advance research on this topic. In StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models, we study the downstream task of question answering on our newly proposed benchmark, StreamingQA: we want to understand how parametric and retrieval-augmented, semi-parametric question-answering models adapt to new information, in order to answer questions about new events. In Internet-augmented language models through few-shot prompting for open-domain question answering, we explore the power of combining a few-shot prompted large language model with Google Search as a retrieval component. In doing so, we aim to improve the model’s factuality, while ensuring it has access to up-to-date information for answering a diverse set of questions.
StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models
The knowledge and language understanding of models, evaluated through question answering (QA), have commonly been studied on static knowledge snapshots, such as Wikipedia. To study how semi-parametric QA models and their underlying parametric LMs adapt to evolving knowledge, we built a new large-scale benchmark, StreamingQA, with human-written and automatically generated questions, asked on a given date, that must be answered from 14 years of time-stamped news articles (see Figure 2). We show that parametric models can be updated without full retraining, while avoiding catastrophic forgetting. For semi-parametric models, adding new articles to the search space allows for rapid adaptation; however, models with an outdated underlying LM underperform those with a retrained LM.
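To make this setup concrete, here is a minimal sketch of how a semi-parametric model’s search space can be restricted to articles available on the question date. The record fields, the example data, and the `searchable_corpus` helper are illustrative assumptions, not the benchmark’s actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    published: date  # timestamp within the 14-year news corpus
    text: str

@dataclass
class Question:
    asked_on: date   # the date on which the question is asked
    text: str

def searchable_corpus(corpus, question):
    """Only evidence published on or before the question date counts
    as "known" at ask time (an illustrative assumption)."""
    return [a for a in corpus if a.published <= question.asked_on]

# Invented example data, purely for illustration.
corpus = [
    Article(date(2019, 2, 22), "The Beresheet lunar lander launched today..."),
    Article(date(2021, 8, 9), "A new climate assessment report was released..."),
]
q = Question(date(2020, 1, 15), "Which lunar lander launched last year?")
evidence = searchable_corpus(corpus, q)  # only the 2019 article is retrievable
```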
Internet-augmented language models through few-shot prompting for open-domain question answering
We aim to capitalize on the unique few-shot capabilities of large-scale language models to help them overcome some of their challenges with respect to grounding in factual and up-to-date information. Motivated by semi-parametric LMs, which ground their decisions in externally retrieved evidence, we use few-shot prompting to learn to condition LMs on information returned from the web via Google Search, a broad and constantly updated source of knowledge. Our approach involves no fine-tuning or learning of additional parameters, making it applicable to virtually any language model. And indeed, we find that LMs conditioned on the web outperform closed-book models of similar, or even larger, size on open-domain question answering.
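As a rough sketch of what conditioning on retrieved evidence via few-shot prompting can look like in practice: the prompt interleaves a handful of (evidence, question, answer) demonstrations with freshly retrieved snippets for the new question. The `web_search` and `lm_complete` stubs below are hypothetical placeholders, not the paper’s actual retrieval or model APIs.

```python
def web_search(query: str, top_k: int = 2) -> list:
    """Hypothetical stand-in for a web-search call (e.g. Google Search);
    in this sketch it just returns canned snippets."""
    return ["The Eiffel Tower is 330 metres tall."][:top_k]

def lm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a large LM's completion API."""
    return "330 metres."

# Few-shot demonstrations teaching the model to ground its answer in
# the given evidence. Contents are invented for illustration.
FEW_SHOT = [
    ("Evidence: Mount Everest rises 8,849 metres above sea level.",
     "Question: How tall is Mount Everest?",
     "Answer: 8,849 metres."),
]

def answer(question: str) -> str:
    # Retrieve up-to-date evidence for the new question from the web.
    evidence = " ".join(web_search(question))
    # Build the prompt: demonstrations first, then the grounded query.
    # No fine-tuning or extra parameters: everything lives in the prompt.
    blocks = [f"{e}\n{q}\n{a}" for e, q, a in FEW_SHOT]
    blocks.append(f"Evidence: {evidence}\nQuestion: {question}\nAnswer:")
    return lm_complete("\n\n".join(blocks))

print(answer("How tall is the Eiffel Tower?"))  # -> "330 metres."
```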