Researchers at MIT and NVIDIA have developed two techniques that accelerate the processing of sparse tensors, a type of data structure used for high-performance computing tasks. The complementary techniques could lead to significant improvements in the performance and energy efficiency of systems such as the massive machine learning models that drive generative artificial intelligence.
Tensors are data structures used by machine learning models. Both of the new methods seek to efficiently exploit what is known as sparsity (zero values) in tensors. When processing these tensors, the hardware can skip over the zeros and save on both computation and memory. For instance, anything multiplied by zero is zero, so that operation can be skipped entirely. And the tensor can be compressed (zeros don’t need to be stored) so that a larger portion fits in on-chip memory.
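To make the idea concrete, here is a minimal sketch in Python, an illustration of the general principle rather than the researchers’ hardware design: a sparse vector is stored as (index, value) pairs, so zeros cost no memory, and a dot product skips over them, so they cost no multiplications either.

```python
def compress(dense):
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_dot(sparse, dense):
    """Dot product that touches only the stored non-zeros."""
    return sum(v * dense[i] for i, v in sparse)

weights = [0, 0, 3, 0, 5, 0, 0, 2]   # five of the eight values are zero
inputs  = [1, 2, 3, 4, 5, 6, 7, 8]

packed = compress(weights)            # 3 stored entries instead of 8
print(sparse_dot(packed, inputs))     # 3*3 + 5*5 + 2*8 = 50
```

Only three multiplications run instead of eight, and only three values need to be stored.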
However, exploiting sparsity presents several challenges. Finding the non-zero values in a large tensor is not an easy task. Existing approaches often limit the locations of non-zero values by enforcing a sparsity pattern to simplify the search, but this restricts the variety of sparse tensors that can be processed efficiently.
Another challenge is that the number of non-zero values can vary across different regions of the tensor, which makes it difficult to determine how much space is needed to store each region in memory. To make sure a region fits, more space is often allocated than it needs, leaving the storage buffer underutilized. This increases off-chip memory traffic, which in turn increases energy consumption.
The MIT and NVIDIA researchers developed two solutions to these problems. For one, they created a technique that allows the hardware to efficiently find the non-zero values for a wider variety of sparsity patterns.
For the other, they created a method that can handle the case where the data does not fit in memory, which increases the utilization of the storage buffer and reduces off-chip memory traffic.
Both methods boost the performance and reduce the energy demands of hardware accelerators specifically designed to speed up the processing of sparse tensors.
“Typically, when you use more specialized or domain-specific hardware accelerators, you lose the flexibility that you would get from a more general-purpose processor, like a CPU. What stands out about these two works is that we show it is possible to maintain flexibility and adaptability while being specialized and efficient,” says Vivienne Sze, associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Research Laboratory of Electronics (RLE), and co-senior author of papers on both advances.
Her co-authors include lead authors Yannan Nellie Wu PhD ’23 and Zi Yu Xue, a graduate student in electrical engineering and computer science; and co-senior author Joel Emer, an MIT professor of the practice in computer science and electrical engineering and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), along with others at NVIDIA. Both papers will be presented at the IEEE/ACM International Symposium on Microarchitecture.
HighLight: Finding zeros efficiently
Sparsity can arise in a tensor for a variety of reasons. For example, researchers sometimes “prune” unnecessary pieces of machine learning models by replacing certain tensor values with zeros, creating sparsity. The degree of sparsity (the percentage of zeros) and the locations of those zeros can vary from model to model.
To make it easier to find the remaining non-zero values in a model with billions of individual values, researchers often restrict the locations of the non-zero values so that they fall into a certain pattern. However, each hardware accelerator is typically designed to support one specific sparsity pattern, which limits its flexibility.
In contrast, the hardware accelerator the MIT researchers designed, called HighLight, can handle a wide variety of sparsity patterns and still perform well when running models that have no zero values at all.
They use a technique they call “hierarchical structured sparsity” to efficiently represent a wide variety of sparsity patterns that are composed of several simple sparsity patterns. This approach divides the values in a tensor into smaller blocks, where each block has its own simple sparsity pattern (perhaps two zeros and two non-zeros in a block of four values).
Then they combine the blocks into a hierarchy, where each collection of blocks also has its own simple sparsity pattern (perhaps one zero block and three non-zero blocks in a level with four blocks). They continue combining blocks into larger levels, but the patterns remain simple at each stage.
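A toy sketch in Python may help illustrate the structure. This is a hypothetical two-level encoding chosen for illustration, not the paper’s actual representation: each four-value block keeps at most two non-zeros, and each group of four blocks keeps at most three non-zero blocks, so the hardware only ever has to check small, simple patterns at each level.

```python
def fits_pattern(values, block_size=4, max_nonzeros_per_block=2,
                 group_size=4, max_nonzero_blocks=3):
    """Check a (hypothetical) two-level hierarchical structured sparsity pattern."""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    # Level 1: every block obeys its simple per-block pattern.
    if any(sum(v != 0 for v in b) > max_nonzeros_per_block for b in blocks):
        return False
    # Level 2: every group of blocks obeys its simple per-group pattern.
    for g in range(0, len(blocks), group_size):
        group = blocks[g:g + group_size]
        if sum(any(v != 0 for v in b) for b in group) > max_nonzero_blocks:
            return False
    return True

# One zero block plus three blocks with <= 2 non-zeros each: pattern holds.
print(fits_pattern([0, 0, 0, 0,  1, 0, 2, 0,  0, 3, 0, 0,  4, 0, 0, 5]))  # True
# A block with three non-zeros breaks the level-1 pattern.
print(fits_pattern([1, 2, 3, 0,  1, 0, 2, 0,  0, 3, 0, 0,  4, 0, 0, 5]))  # False
```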
This simplicity enables HighLight to find and skip zeros more efficiently, so it can take full advantage of opportunities to cut excess computation. On average, their accelerator design had about six times better energy-delay product (a metric related to energy efficiency) than other approaches.
“In the end, the HighLight accelerator is able to efficiently accelerate dense models because it does not introduce a lot of overhead, and at the same time it is able to exploit workloads with different amounts of zero values based on hierarchical structured sparsity,” Wu explains.
In the future, she and her collaborators want to apply hierarchical structured sparsity to more types of machine learning models and different types of tensors in the models.
Tailors and Swiftiles: Effective overbooking to speed up workloads
Researchers can also leverage sparsity to move and process data more efficiently on a computer chip.
Because tensors are often larger than what can be stored in the chip’s memory buffer, the chip only grabs and processes a chunk of the tensor at a time. The chunks are called tiles.
To maximize the utilization of that buffer and limit the number of times the chip must access off-chip memory, which often dominates energy consumption and limits processing speed, researchers seek to use the largest tile that will fit into the buffer.
But in a sparse tensor, many of the data values are zero, so an even larger tile can fit into the buffer than one might expect based on its raw capacity. Zero values don’t need to be stored.
But the number of zero values can vary across different regions of the tensor, so it can also vary for each tile. This makes it difficult to determine a tile size that will fit in the buffer. As a result, existing approaches often cautiously assume there are no zeros and end up selecting a smaller tile, which results in wasted blank spaces in the buffer.
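Some back-of-the-envelope arithmetic shows the bind (hypothetical numbers, not figures from the papers): sizing tiles for the worst case wastes most of the buffer, while sizing them for the average density risks overflow.

```python
buffer_capacity = 1024                     # non-zero values the buffer holds
density = 0.25                             # suppose ~75% of values are zero

# Conservative tiling: assume every value could be non-zero.
safe_tile = buffer_capacity                # 1024 values per tile
slots_used = int(safe_tile * density)      # only ~256 slots actually filled

# Optimistic tiling: size the tile for the average density.
big_tile = int(buffer_capacity / density)  # 4096 values per tile
# ...but a tile that happens to be 30% dense overflows the buffer.
unlucky = int(big_tile * 0.30)             # 1228 non-zeros > 1024 slots

print(slots_used, big_tile, unlucky)
```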
To address this uncertainty, the researchers propose the use of “overbooking” to allow them to increase the tile size, along with a way to tolerate it if the tile doesn’t fit in the buffer.
The approach works much the way an airline overbooks tickets for a flight: if all the passengers show up, the airline must compensate the ones who get bumped from the plane, but usually not all the passengers show up.
In a sparse tensor, a tile size can be chosen such that the tiles usually have enough zeros that most of them still fit in the buffer. But occasionally a tile will have more non-zero values than will fit. In that case, the excess data gets bumped out of the buffer.
The researchers’ solution lets the hardware re-fetch only the bumped data, without grabbing and reprocessing the entire tile again. They modify the “tail end” of the buffer to handle this, hence the name of the technique, Tailors.
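A simplified Python sketch conveys the overbooking mechanic (hypothetical code, not the authors’ hardware design): a tile’s non-zeros stream into a buffer sized for the typical tile, and anything that overflows is set aside so that only the remainder, not the whole tile, needs a second pass.

```python
BUFFER_CAPACITY = 6  # non-zero slots, sized for the *typical* tile

def load_tile(tile):
    """Return (buffered non-zeros, overflow that must be re-fetched)."""
    nonzeros = [(i, v) for i, v in enumerate(tile) if v != 0]
    return nonzeros[:BUFFER_CAPACITY], nonzeros[BUFFER_CAPACITY:]

typical_tile = [0, 7, 0, 0, 1, 0, 0, 4, 0, 0, 0, 9]  # sparse: fits easily
dense_tile   = [5, 7, 2, 0, 1, 3, 0, 4, 8, 0, 6, 9]  # overbooked: overflows

for tile in (typical_tile, dense_tile):
    buffered, overflow = load_tile(tile)
    print(len(buffered), "buffered,", len(overflow), "to re-fetch")
```

The sparse tile loads with nothing left over; the dense one leaves only its three excess values to fetch again, rather than the entire tile.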
Then they also created an approach for finding the size of the tiles that takes advantage of overbooking. This method, called Swiftiles, swiftly estimates the ideal tile size so that a specific percentage of tiles, set by the user, are overbooked. (The names “Tailors” and “Swiftiles” pay homage to Taylor Swift, whose recent Eras Tour was fraught with overbooked presale codes for tickets.)
Swiftiles reduces the number of times hardware needs to check the tensor to identify an ideal tile size, saving on computation. The combination of Tailors and Swiftiles more than doubles the speed while requiring only half the energy demand of existing hardware accelerators that cannot handle overbooking.
“Swiftiles allows us to estimate how large these tiles need to be without requiring multiple iterations to refine the estimate. This only works because overbooking is supported. Even if you are off by a decent amount, you can still extract a fair bit of speedup because of the way the non-zeros are distributed,” says Xue.
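A hypothetical sketch conveys the flavor of such an estimate (the actual Swiftiles algorithm differs): sample a handful of candidate tiles rather than scanning the whole tensor, and grow the tile size until more than the target fraction of sampled tiles would overflow the buffer.

```python
import random

def estimate_tile_size(tensor, buffer_capacity, overbook_frac=0.1,
                       samples=32, seed=0):
    """Grow the tile until more than `overbook_frac` of sampled tiles
    would hold more non-zeros than the buffer has slots for."""
    rng = random.Random(seed)
    size = buffer_capacity              # safe start: assume no zeros at all
    while size * 2 <= len(tensor):
        bigger = size * 2
        counts = []                     # non-zeros in a few sampled tiles
        for _ in range(samples):
            start = rng.randrange(len(tensor) - bigger + 1)
            counts.append(sum(v != 0 for v in tensor[start:start + bigger]))
        counts.sort()
        # Non-zero count that all but `overbook_frac` of tiles stay under.
        typical = counts[int((1 - overbook_frac) * (samples - 1))]
        if typical > buffer_capacity:
            break                       # too many tiles would overflow
        size = bigger
    return size

rng = random.Random(1)
tensor = [1 if rng.random() < 0.2 else 0 for _ in range(10_000)]  # ~80% zeros
print(estimate_tile_size(tensor, buffer_capacity=64))  # far larger than 64
```

Because a modest fraction of overflowing tiles is tolerated (and handled by the Tailors-style re-fetch), a single coarse pass like this suffices instead of repeated refinement.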
In the future, researchers want to apply the idea of overbooking to other aspects of computer architecture and also work to improve the process of estimating the optimal level of overbooking.
This research is funded, in part, by the MIT AI Hardware Program.