The Internet is full of instructional videos that can teach curious viewers everything from making the perfect pancake to performing a life-saving Heimlich maneuver.
But identifying when and where a particular action occurs in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, a user could simply describe the action they're looking for, and an AI model would skip to its location in the video.
However, training machine learning models to do this typically requires a large amount of expensive video data that has been painstakingly labeled by hand.
A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatiotemporal grounding, using only videos and their automatically generated transcripts.
The researchers teach a model to understand unlabeled video in two distinct ways: by looking at small details to determine where objects are located (spatial information) and by looking at the bigger picture to understand when an action occurs (temporal information).
Compared to other AI approaches, their method more accurately identifies actions in longer videos that contain multiple activities. Interestingly, they found that training on spatial and temporal information simultaneously allows a model to better identify each of them individually.
In addition to streamlining e-learning and virtual training processes, this technique could also be useful in healthcare settings by quickly finding key moments in videos of diagnostic procedures, for example.
“We untangle the challenge of trying to encode spatial and temporal information at the same time and instead think about it like two experts working on their own, which turns out to be a more explicit way of encoding the information. Our model, which combines these two separate branches, leads to the best performance,” says Brian Chen, lead author of a paper on this technique.
Chen, a 2023 graduate of Columbia University who conducted this research while a visiting student at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.
Global and local learning
Researchers typically teach models to perform spatiotemporal grounding using videos in which humans have annotated the start and end times of particular tasks.
Not only is generating this data expensive, but it can be difficult for humans to know exactly what to label. If the action is to “cook a pancake”, does that action begin when the chef begins to mix the batter or when he pours it into the pan?
“This time the task might be cooking, but next time it might be fixing a car. There are so many different domains that people can annotate. But if we can learn everything without labels, that’s a more general solution,” says Chen.
For their approach, the researchers use unlabeled instructional videos and accompanying text transcriptions from a website like YouTube as training data. These do not require any special preparation.
They split the training process into two parts. In the first, they teach a machine-learning model to watch the video in its entirety to understand what actions are happening at certain times. This high-level information is called a global representation.
In the second, they teach the model to focus on a specific region in the parts of the video where the action takes place. In a large kitchen, for example, the model may need to focus only on the wooden spoon a chef uses to mix pancake batter, rather than the entire counter. This fine-grained information is called a local representation.
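To make the two-branch idea more concrete, here is a minimal, hypothetical sketch, not the authors' published architecture, of how a global “when” branch and a local “where” branch could each be aligned with transcript sentences through a contrastive objective. The module names, feature dimensions, and loss are illustrative assumptions.

```python
# Hypothetical sketch of two-branch training: a global (temporal) branch aligns
# whole-video features with a transcript sentence, while a local (spatial) branch
# aligns per-region features with the same sentence. The feature sizes and the
# contrastive loss are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchGrounding(nn.Module):
    def __init__(self, video_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        self.global_proj = nn.Linear(video_dim, embed_dim)   # "when" branch
        self.local_proj = nn.Linear(video_dim, embed_dim)    # "where" branch
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, frame_feats, region_feats, text_feats):
        # frame_feats:  (batch, num_frames, video_dim)  features from the whole video
        # region_feats: (batch, num_regions, video_dim) features from candidate regions
        # text_feats:   (batch, text_dim)               transcript sentence features
        g = F.normalize(self.global_proj(frame_feats).mean(dim=1), dim=-1)
        l = F.normalize(self.local_proj(region_feats).mean(dim=1), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return g, l, t

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # InfoNCE-style loss: matching video/text pairs in a batch are pulled
    # together, mismatched pairs are pushed apart.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = TwoBranchGrounding()
frames = torch.randn(8, 32, 512)    # 8 videos, 32 sampled frames each
regions = torch.randn(8, 16, 512)   # 16 candidate regions per video
sentences = torch.randn(8, 512)     # one transcript sentence per video
g, l, t = model(frames, regions, sentences)
loss = contrastive_loss(g, t) + contrastive_loss(l, t)  # train both branches jointly
loss.backward()
```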
The researchers incorporate an additional component into their framework to mitigate the misalignments that occur between narration and video. Maybe the chef talks about cooking the pancake first and carries out the action later.
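One way such a component could work, offered purely as an illustrative assumption rather than the paper's actual mechanism, is to let each transcript sentence match its best-scoring frame within a window around its spoken timestamp, so narration that runs ahead of the action is not penalized. The window size and embedding dimensions below are placeholders.

```python
# Illustrative handling of narration/video misalignment: rather than pairing a
# sentence only with the frame at its spoken timestamp, score it against a
# window of nearby frames and keep the best match. The window size is a guess.
import torch
import torch.nn.functional as F

def windowed_similarity(frame_embs, text_emb, spoken_idx, window=8):
    # frame_embs: (num_frames, dim) normalized frame embeddings
    # text_emb:   (dim,) normalized sentence embedding
    start = max(spoken_idx - window, 0)
    end = min(spoken_idx + window + 1, frame_embs.size(0))
    sims = frame_embs[start:end] @ text_emb   # similarity to each nearby frame
    return sims.max()                         # best match wins, tolerating lag

frames = F.normalize(torch.randn(64, 256), dim=-1)
sentence = F.normalize(torch.randn(256), dim=-1)
print(windowed_similarity(frames, sentence, spoken_idx=30))
```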
To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques train on clips of a few seconds that someone has trimmed to show a single action.
A new benchmark
But when they evaluated their approach, the researchers couldn't find an effective benchmark for testing a model on these longer, uncut videos. So they created one.
To build their benchmark dataset, the researchers designed a new annotation technique that works well for identifying multi-step actions. They asked users to mark the intersection of objects, like the point where the edge of a knife cuts a tomato, rather than drawing a box around important objects.
“This is more clearly defined and speeds up the annotation process, reducing human labor and costs,” says Chen.
Additionally, having multiple people annotate points on the same video can better capture actions that occur over time, such as the flow of milk being poured. Not all annotators will mark the exact same point in the liquid flow.
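As an illustration of how such point annotations could be used at evaluation time, the sketch below scores a prediction by checking whether the model's predicted location falls within a pixel radius of any annotator's point, a pointing-game-style metric. The data layout, threshold, and function names are hypothetical and not the benchmark's published protocol.

```python
# Hypothetical pointing-style evaluation for point annotations: a prediction
# counts as correct if it lands within a pixel radius of any annotator's point.
# The radius and data layout are illustrative assumptions.
from dataclasses import dataclass
from math import hypot

@dataclass
class PointAnnotation:
    frame: int     # frame index where the interaction was marked
    x: float       # annotated interaction point, e.g. where a knife edge meets a tomato
    y: float

def is_hit(pred_x, pred_y, annotations, radius=20.0):
    """Return True if the predicted point is within `radius` pixels of any annotation."""
    return any(hypot(pred_x - a.x, pred_y - a.y) <= radius for a in annotations)

def pointing_accuracy(predictions, annotations_per_frame, radius=20.0):
    """predictions: {frame: (x, y)}; annotations_per_frame: {frame: [PointAnnotation, ...]}."""
    hits = 0
    for frame, (px, py) in predictions.items():
        if is_hit(px, py, annotations_per_frame.get(frame, []), radius):
            hits += 1
    return hits / max(len(predictions), 1)

# Multiple annotators marking the same pouring action capture its spatial spread.
annotations = {12: [PointAnnotation(12, 104.0, 230.0), PointAnnotation(12, 111.0, 238.0)]}
predictions = {12: (108.0, 233.0)}
print(pointing_accuracy(predictions, annotations))  # 1.0 for this toy example
```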
When they used this benchmark to test their approach, the researchers found that it was more accurate at identifying actions than other AI techniques.
Their method was also better at focusing on human-object interactions. For example, if the action is “serve a pancake,” many other approaches could focus only on key objects, like a stack of pancakes sitting on a counter. Instead, their method focuses on the actual moment the chef flips a pancake onto a plate.
“Existing approaches rely heavily on labeled data from humans and are therefore not very scalable. This work takes a step toward addressing that problem by providing new methods for localizing events in space and time using the speech that naturally occurs within them. This type of data is ubiquitous, so in theory it would be a powerful learning signal. However, it is often only loosely related to what is on screen, making it difficult to use in machine-learning systems. This work helps address that issue, making it easier for researchers to create systems that use this form of multimodal data in the future,” says Andrew Owens, an assistant professor of electrical engineering and computer science at the University of Michigan, who was not involved in this work.
Next, the researchers plan to improve their approach so that models can automatically detect when text and narration are not aligned and switch between modalities. They also want to extend their framework to audio data, because there are typically strong correlations between actions and sounds made by objects.
“AI research has made incredible progress toward creating models like ChatGPT that understand images. But our progress in understanding video lags far behind. This work represents a significant step forward in that direction,” says Kate Saenko, a professor in the Department of Computer Science at Boston University, who was not involved in this work.
This research is funded, in part, by the MIT-IBM Watson AI Lab.