New benchmark for evaluating multimodal systems based on real-world video, audio and text data
From the Turing test to ImageNet, benchmarks have played an instrumental role in the development of artificial intelligence (AI) by helping define research goals and allowing researchers to measure progress toward those goals. Incredible breakthroughs over the past 10 years, such as AlexNet in computer vision and AlphaFold in protein folding, have been closely linked to the use of benchmark datasets, which allow researchers to rank model design and training choices and iterate to improve their models. As we work toward the goal of building artificial general intelligence (AGI), developing robust and effective benchmarks that expand the capabilities of AI models is as important as developing the models themselves.
Perception – the process of experiencing the world through the senses – is an important part of intelligence. Building agents with human-level perceptual understanding of the world is a central but difficult task, one that is becoming increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. So today we are introducing the Perception Test, a multimodal benchmark that uses real-world videos to help evaluate a model's perceptual capabilities.
Developing a perception benchmark
Many perception-related benchmarks are currently used in AI research, such as Kinetics for video action recognition, AudioSet for audio event classification, MOT for object tracking, or VQA for image question answering. These benchmarks have driven astonishing advances in how AI model architectures and training methods are built and developed, but each targets only narrow aspects of perception: image benchmarks exclude temporal aspects; visual question answering tends to focus on high-level semantic scene understanding; object tracking tasks typically capture the lower-level appearance of individual objects, such as color or texture. And very few benchmarks define tasks over both the audio and visual modalities.
Multimodal models, such as Perceiver, Flamingo, or BEiT-3, aim to be more general models of perception. But their evaluations have relied on multiple specialized datasets, because no dedicated benchmark was available. This process is slow and expensive, and it provides incomplete coverage of general perceptual abilities like memory, making it difficult for researchers to compare methods.
To address many of these issues, we created a dataset of purposefully designed videos of real-world activities, labeled according to six different task types (a sketch of the per-task inputs follows the list):
- Object tracking: a box is provided around an object at the start of the video; the model must return a full track throughout the whole video (including through occlusions).
- Point tracking: a point is selected at the start of the video; the model must track the point throughout the video (also through occlusions).
- Temporal action localization: the model must temporally localize and classify a predefined set of actions.
- Temporal sound localization: the model must temporally localize and classify a predefined set of sounds.
- Multiple-choice video question answering: textual questions about the video, each with three choices from which to select the answer.
- Grounded video question answering: textual questions about the video; the model must return one or more object tracks as the answer.
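To make the task interface concrete, here is a minimal sketch in Python of how a per-task specification could be represented; the class and field names are our own illustration, not the benchmark's actual schema. High-level text carries the question-answering tasks, while low-level coordinates carry the tracking tasks.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical task-specification container; field names are illustrative
# stand-ins, not the benchmark's released schema.
@dataclass
class TaskSpec:
    task_type: str  # e.g. "object_tracking", "point_tracking", "mc_vqa"
    # Low-level inputs for the tracking tasks:
    init_box: Optional[Tuple[float, float, float, float]] = None  # (x0, y0, x1, y1)
    init_point: Optional[Tuple[float, float]] = None              # (x, y)
    init_frame: int = 0  # frame index at which the box/point is given
    # High-level inputs for the question-answering tasks:
    question: Optional[str] = None
    options: List[str] = field(default_factory=list)  # three choices for multiple-choice QA

# An object-tracking query: a box around the target at the start of the video.
track_spec = TaskSpec(task_type="object_tracking",
                      init_box=(0.10, 0.20, 0.35, 0.60), init_frame=0)

# A multiple-choice question with its three candidate answers.
vqa_spec = TaskSpec(task_type="mc_vqa",
                    question="Was the ball under the cup at the end?",
                    options=["yes", "no", "cannot tell"])
```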
We took inspiration from the way children's perception is assessed in developmental psychology, as well as from synthetic datasets like CATER and CLEVRER, and designed 37 video scripts, each with different variations to ensure a balanced dataset. Each variation was filmed by at least a dozen crowdsourced participants (similar to previous work on Charades and Something-Something), with a total of more than 100 participants, resulting in 11,609 videos with an average length of 23 seconds.
The videos show simple games or daily activities, which allow us to define tasks that require the following skills to solve:
- Knowledge of semantics: testing aspects such as task completion and the recognition of objects, actions, or sounds.
- Understanding of physics: collisions, motion, occlusions, spatial relationships.
- Temporal reasoning or memory: temporal ordering of events, counting over time, detection of changes in a scene.
- Abstraction abilities: shape matching, same/different notions, pattern detection.
Crowdsourced participants labeled the videos with spatial and temporal annotations (object bounding box tracks, point tracks, action segments, and sound segments). Our research team designed the questions per script type for the multiple-choice and grounded video question answering tasks to ensure a good diversity of skills tested, for example, questions that probe the ability to reason counterfactually or to provide explanations for a given situation. The corresponding answers for each video were again provided by crowdsourced participants.
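Annotations of this kind are typically distributed as structured records; as an illustration, the following sketch shows what one video's labels could look like in JSON form. The field names are our own and may differ from the released format.

```python
import json

# Illustrative annotation record for one video, mirroring the annotation types
# described above (box tracks, point tracks, action and sound segments); the
# exact released schema may differ.
annotation = {
    "video_id": "video_0001",
    "object_tracks": [
        {"object_id": 0, "label": "cup",
         # One (frame, x0, y0, x1, y1) entry per annotated frame.
         "boxes": [[0, 0.10, 0.20, 0.35, 0.60], [1, 0.11, 0.20, 0.36, 0.61]]},
    ],
    "point_tracks": [
        {"point_id": 0, "points": [[0, 0.22, 0.40], [1, 0.23, 0.41]]},  # (frame, x, y)
    ],
    "action_segments": [
        {"label": "picking up a cup", "start_s": 2.0, "end_s": 4.5},
    ],
    "sound_segments": [
        {"label": "object collision", "start_s": 4.4, "end_s": 4.7},
    ],
}
print(json.dumps(annotation, indent=2))
```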
Evaluating multimodal systems with the Perception Test
We assume that models have been pre-trained on external datasets and tasks. The Perception Test includes a small fine-tuning set (20%) that model creators can optionally use to convey the nature of the tasks to their models. The remaining data (80%) consists of a public validation split and a held-out test split where performance can only be evaluated via our evaluation server.
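In practice, that workflow might look like the sketch below; the split names and the loading helper are hypothetical stand-ins, not an actual API of the release.

```python
# Hypothetical workflow around the benchmark splits; `load_split` and the
# split names are illustrative stand-ins, not a real API.
def load_split(name):
    """Stand-in loader returning a list of (video, audio, task_spec, label) tuples."""
    return []  # placeholder for actual data loading

finetune_set = load_split("finetune")      # small (20%) split for optional fine-tuning
validation_set = load_split("validation")  # public validation split
# The held-out test split ships without labels: predictions are submitted to
# the evaluation server rather than scored locally.
```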
Here we show a diagram of the evaluation setup: the inputs are a video and audio sequence, plus a task specification. The task specification can take the form of high-level text for the visual question answering tasks, or of low-level input, such as the coordinates of an object's bounding box for the object tracking task.
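The interface implied by this setup can be summarized in a short sketch; the function below is our illustration of the input/output contract, not the benchmark's actual API.

```python
import numpy as np

# Sketch of the evaluation interface: the model consumes video frames, an
# audio waveform, and a task specification, and returns a task-dependent
# output (a track, a set of segments, or an answer index).
def predict(frames, audio, spec):
    if spec["task_type"] == "object_tracking":
        # Trivial static baseline: repeat the initial box for every frame.
        return [spec["init_box"]] * len(frames)
    if spec["task_type"] == "mc_vqa":
        return 0  # index of the selected answer among the three choices
    raise NotImplementedError(spec["task_type"])

frames = np.zeros((8, 224, 224, 3), dtype=np.uint8)  # 8 dummy RGB frames
audio = np.zeros(16000, dtype=np.float32)            # 1 s of dummy mono audio
spec = {"task_type": "object_tracking", "init_box": (0.10, 0.20, 0.35, 0.60)}
print(len(predict(frames, audio, spec)))  # -> 8, one box per frame
```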
The evaluation results are reported in detail across several dimensions, and we measure abilities across all six computational tasks. For the visual question answering tasks, we also provide a mapping of questions to the types of situations shown in the videos and the types of reasoning required to answer them, enabling more fine-grained analysis (see our paper for more details). An ideal model would maximize scores across all radar plots and dimensions. This detailed assessment of a model's skills allows us to narrow down areas for improvement.
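As a toy illustration of this kind of reporting, per-question results can be averaged within each skill area before plotting; the skill names and numbers below are made up.

```python
from collections import defaultdict

# Toy aggregation of per-question correctness into per-skill-area scores,
# as one might compute before drawing a radar plot; all values are made up.
results = [
    ("semantics", 1), ("semantics", 0), ("physics", 1),
    ("memory", 0), ("memory", 1), ("abstraction", 1),
]
totals = defaultdict(int)
correct = defaultdict(int)
for area, is_correct in results:
    totals[area] += 1
    correct[area] += is_correct

scores = {area: correct[area] / totals[area] for area in totals}
print(scores)  # e.g. {'semantics': 0.5, 'physics': 1.0, 'memory': 0.5, 'abstraction': 1.0}
```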
Ensuring the diversity of the participants and of the scenes shown in the videos was a key consideration when developing the benchmark. To do this, we selected participants from different countries, ethnicities, and genders, and aimed for diverse representation within each type of video script.
Learning more about the Perception Test
The Perception Test benchmark is publicly available here, and further details are available in our paper. A leaderboard and a challenge server will also be available soon.
On October 23, 2022, we are hosting a workshop about general perception models at the European Conference on Computer Vision (ECCV 2022) in Tel Aviv, where we will discuss our approach, and how to design and evaluate general perception models, with other leading experts in the field.
We hope that the Perception Test will inspire and guide further research toward general perception models. Going forward, we hope to collaborate with the multimodal research community to introduce additional annotations, tasks, metrics, or even new languages to the benchmark.
Get in touch by emailing perception-test@google.com if you are interested in contributing!