Explaining the behavior of trained neural networks remains a compelling puzzle, especially as these models grow in size and sophistication. Like other scientific challenges throughout history, reverse engineering the workings of artificial intelligence systems requires a substantial amount of experimentation: formulating hypotheses, intervening on behavior, and even dissecting large networks to examine individual neurons. To date, most successful experiments have required extensive human oversight. Explaining every computation inside models the size of GPT-4 and larger will almost certainly require more automation – perhaps even using AI models themselves.
To facilitate this timely endeavor, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new approach that uses AI models to conduct experiments on other systems and explain their behavior. Their method uses agents built from pre-trained language models to produce intuitive explanations of computations within trained networks.
At the heart of this strategy is the “automated interpretability agent” (AIA), designed to mimic the experimental process of a scientist. Interpretability agents plan and run tests on other computational systems, ranging in scale from individual neurons to entire models, and produce explanations of those systems in various forms: linguistic descriptions of what a system does and where it fails, and code that reproduces the system’s behavior. Unlike existing interpretability procedures that passively classify or summarize examples, the AIA actively participates in hypothesis formation, experimental testing, and iterative learning, refining its understanding of other systems in real time.
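To make that loop concrete, here is a minimal sketch of what such a hypothesis-test cycle might look like. It is not the authors’ implementation: `query_system` and `ask_language_model` are hypothetical stand-ins for black-box access to the system under study and to the pre-trained language model driving the agent.

```python
# Minimal sketch of an AIA-style hypothesis-test loop (illustrative only).
# `query_system` and `ask_language_model` are hypothetical stand-ins for the
# system under study and the language model driving the agent.

def interpret(query_system, ask_language_model, history=None, max_rounds=10):
    """Iteratively probe a black-box system and refine an explanation of it."""
    history = list(history or [])  # (input, observed output) pairs so far
    for _ in range(max_rounds):
        # Ask the LM to propose the next experiments, given all observations.
        test_inputs = ask_language_model(
            f"Given these observations: {history}, propose inputs that best "
            "distinguish between competing hypotheses about the system."
        )
        # Run the proposed experiments on the system under study.
        for x in test_inputs:
            history.append((x, query_system(x)))
    # Finally, ask the LM to summarize its best explanation of the behavior.
    return ask_language_model(
        f"Given these observations: {history}, describe what the system computes."
    )
```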
The AIA method is complemented by the new “function interpretation and description” (FIND) benchmark, a testbed of functions resembling computations inside trained networks, accompanied by descriptions of their behavior. One major challenge in assessing the quality of descriptions of real-world network components is that descriptions are only as good as their explanatory power: researchers don’t have access to ground-truth labels of units or descriptions of the learned computations. FIND addresses this long-standing problem in the field by providing a reliable standard for evaluating interpretability procedures: explanations of functions (e.g., produced by an AIA) can be evaluated against the function descriptions in the benchmark.
For example, FIND contains synthetic neurons designed to mimic the behavior of real neurons inside language models, some of which are selective for individual concepts such as “land transportation.” AIAs are given black-box access to the synthetic neurons and design inputs (such as “tree,” “happiness,” and “car”) to test a neuron’s response. After noticing that a synthetic neuron produces higher response values for “car” than for other inputs, an AIA might design finer-grained tests to distinguish the neuron’s selectivity for cars from other means of transportation, such as planes and boats. When the AIA produces a description such as “this neuron is selective for road transportation, not air or sea travel,” this description is evaluated against the ground-truth description of the synthetic neuron (“selective for land transportation”) in FIND. The benchmark can then be used to compare the capabilities of AIAs to other methods in the literature.
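That probing step can be illustrated with a toy stand-in for such a neuron. The `synthetic_neuron` below and its word list are invented for illustration and are not taken from FIND itself.

```python
# Toy stand-in for a FIND-style synthetic neuron selective for land
# transportation (illustrative only; not an actual benchmark function).
LAND_TRANSPORTATION = {"car", "truck", "bus", "train", "bicycle"}

def synthetic_neuron(word: str) -> float:
    """Return a high activation for land vehicles, near zero otherwise."""
    return 1.0 if word.lower() in LAND_TRANSPORTATION else 0.05

# An AIA might first screen broad concepts...
for probe in ["tree", "happiness", "car"]:
    print(probe, synthetic_neuron(probe))   # only "car" fires strongly

# ...then run finer-grained tests to separate land from air and sea transport.
for probe in ["truck", "train", "plane", "boat"]:
    print(probe, synthetic_neuron(probe))   # "plane" and "boat" stay near zero
```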
Sarah Schwettmann PhD ’21, co-senior author of a paper on the new work and a researcher at CSAIL, highlights the advantages of this approach. “The ability of AIAs to autonomously generate and test hypotheses may surface behaviors that would otherwise be difficult for scientists to detect. It is remarkable that language models, when equipped with tools for probing other systems, are capable of this type of experimental design,” says Schwettmann. “Clean, simple benchmarks with ground-truth answers have been a major driver of more general capabilities in language models, and we hope FIND can play a similar role in interpretability research.”
Large language models still hold their status as the in-demand celebrities of the tech world. Recent advances in LLMs have highlighted their ability to perform complex reasoning tasks across diverse domains. The CSAIL team recognized that, given these capabilities, language models could serve as the backbone of generalized agents for automated interpretability. “Interpretability has historically been a very multifaceted field,” says Schwettmann. “There is no one-size-fits-all approach; most procedures are very specific to individual questions we might have about a system, and to individual modalities like vision or language. Existing approaches to labeling individual neurons inside vision models have required training specialized models on human data, where these models perform only this single task. Interpretability agents built from language models could provide a general interface for explaining other systems – synthesizing results across experiments, integrating over different modalities, even discovering new experimental techniques at a very fundamental level.”
As we enter a regime where the models doing the explaining are themselves black boxes, external evaluations of interpretability methods become increasingly vital. The team’s new benchmark meets this need with a suite of functions of known structure, modeled after behaviors observed in the wild. FIND’s functions span a diversity of domains, from mathematical reasoning to symbolic operations on strings to synthetic neurons built from word-level tasks. The dataset of interactive functions is procedurally constructed; real-world complexity is introduced into simple functions by adding noise, composing functions, and simulating biases. This allows interpretability methods to be compared in a setting that translates to real-world performance.
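As a rough illustration of that procedural construction, the sketch below builds a benchmark-style entry by composing simple functions and corrupting them with noise. The specific base function, noise model, and entry format are assumptions made for illustration, not FIND’s actual code.

```python
# Hypothetical sketch of procedurally constructing a FIND-style function:
# start from a simple base function, then add noise and compose with others.
import random

def base_fn(x: float) -> float:
    return 2 * x + 1                     # simple, fully describable behavior

def with_noise(fn, sigma=0.1):
    """Corrupt a function's outputs with small Gaussian noise."""
    return lambda x: fn(x) + random.gauss(0.0, sigma)

def compose(f, g):
    """Chain two functions to introduce more realistic complexity."""
    return lambda x: f(g(x))

# Each benchmark entry pairs an opaque callable with a ground-truth
# description that candidate explanations can later be scored against.
entry = {
    "function": with_noise(compose(abs, base_fn)),
    "description": "computes |2x + 1| with small additive noise",
}
```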
In addition to the function dataset, the researchers introduced a new evaluation protocol to assess the effectiveness of AIAs and existing automated interpretability methods. This protocol involves two approaches. For tasks that require replicating the function in code, the evaluation directly compares the AI-generated estimates against the ground-truth functions. For tasks involving natural language descriptions of functions, evaluation becomes more complex: accurately assessing the quality of these descriptions requires an automated understanding of their semantic content. To tackle this challenge, the researchers developed a specialized “third-party” language model. This model is specifically trained to evaluate the accuracy and coherence of natural language descriptions provided by AI systems, comparing them against the behavior of the ground-truth function.
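A minimal sketch of this two-track protocol might look as follows. The `judge_lm` interface, its prompt, and the 0-to-1 scale are assumptions made for illustration, not the researchers’ actual evaluator.

```python
# Illustrative sketch of the two evaluation tracks (not the actual protocol).

def eval_code_replication(candidate_fn, true_fn, test_inputs):
    """Score a code explanation by behavioral agreement on held-out inputs."""
    matches = sum(candidate_fn(x) == true_fn(x) for x in test_inputs)
    return matches / len(test_inputs)

def eval_language_description(judge_lm, candidate_desc, true_desc):
    """Ask a 'third-party' judge LM how well a description matches ground truth."""
    prompt = (
        f"Ground-truth description: {true_desc}\n"
        f"Candidate description: {candidate_desc}\n"
        "On a scale from 0 to 1, how well does the candidate capture the "
        "ground-truth behavior? Answer with a single number."
    )
    return float(judge_lm(prompt))
```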
Evaluation with FIND reveals that we are still far from fully automating interpretability: although AIAs outperform existing interpretability approaches, they still fail to accurately describe almost half of the functions in the benchmark. Tamar Rott Shaham, co-lead author of the study and a postdoc at CSAIL, notes that “while this generation of AIAs is effective at describing high-level functionality, they still often overlook finer-grained details, particularly in function subdomains with noise or irregular behavior. This likely stems from insufficient sampling in these areas. One issue is that the AIAs’ effectiveness can be hampered by their initial exploratory data. To counter this, we tried guiding the AIAs’ exploration by initializing their search with specific, relevant inputs, which significantly improved interpretation accuracy.” This approach combines new AIA methods with previous techniques that use precomputed examples to initiate the interpretation process.
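In terms of the loop sketched earlier, this seeding strategy amounts to prefilling the agent’s observation history before it proposes its own tests. The example below reuses the earlier toy `synthetic_neuron` and `interpret` sketches; `my_lm` is again a hypothetical language-model interface.

```python
# Hypothetical example of seeding the AIA loop with precomputed, relevant
# exemplars so the agent does not start exploring blind.
seed_inputs = ["car", "truck", "train", "plane", "boat"]
description = interpret(
    query_system=synthetic_neuron,        # toy neuron from the earlier sketch
    ask_language_model=my_lm,             # hypothetical LM interface
    history=[(x, synthetic_neuron(x)) for x in seed_inputs],
)
```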
The researchers are also developing a toolkit to augment the AIAs’ ability to conduct more precise experiments on neural networks, in both black-box and white-box settings. This toolkit aims to equip AIAs with better instruments for selecting inputs and refining hypothesis-testing capabilities for more nuanced and accurate neural network analysis. The team is also tackling practical challenges in AI interpretability, focusing on determining the right questions to ask when analyzing models in real-world scenarios. Their goal is to develop automated interpretability procedures that could eventually help people audit systems – e.g., for autonomous driving or facial recognition – to diagnose potential failure modes, hidden biases, or surprising behaviors before deployment.
Watching the watchers
The team envisions one day developing nearly autonomous AIAs that can audit other systems, with human scientists providing oversight and guidance. Advanced AIAs could develop new types of experiments and questions, potentially beyond the initial considerations of human scientists. The focus is on expanding the interpretability of AI to include more complex behaviors, such as entire neural circuits or subnetworks, and predicting which inputs may lead to undesired behaviors. This development represents a significant advance in AI research, aimed at making AI systems more understandable and reliable.
“A good benchmark is a powerful tool for tackling difficult challenges,” says Martin Wattenberg, a computer science professor at Harvard University who was not involved in the study. “It’s wonderful to see this sophisticated benchmark for interpretability, one of the most important challenges in machine learning today. I’m particularly impressed with the automated interpretability agent the authors created. It’s a kind of interpretability jiu-jitsu, turning AI back on itself in order to aid human understanding.”
Schwettmann, Rott Shaham, and their colleagues presented their work at NeurIPS 2023 in December. Additional MIT co-authors, all affiliates of CSAIL and the Department of Electrical Engineering and Computer Science (EECS), include graduate student Joanna Materzynska, undergraduate student Neil Chowdhury, Shuang Li PhD ’23, Assistant Professor Jacob Andreas, and Professor Antonio Torralba. Northeastern University Assistant Professor David Bau is an additional co-author.
The work was supported, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, an Amazon Research Award, Hyundai NGV, the US Army Research Laboratory, the US National Science Foundation, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.