The emergence of large language models (LLMs) has inspired a variety of applications, including chatbots like ChatGPT, messaging assistants, and coding tools. Substantial work has been devoted to improving the efficiency of these models for large-scale deployment, which has enabled ChatGPT to serve over 100 million weekly active users. Text generation, however, represents only a fraction of what these models make possible.
Text-To-Image (TTI) and Text-To-Video (TTV) models have workload characteristics that differ markedly from text generation, so an in-depth review is required to identify optimization opportunities when serving TTI/TTV workloads at scale. Despite notable algorithmic advances in image and video generation in recent years, relatively little effort has gone into optimizing the deployment of these models from a systems perspective.
Researchers from Harvard University and Meta take a quantitative approach to delineate the current landscape of Text-To-Image (TTI) and Text-To-Video (TTV) models by examining various design dimensions, including latency and computational intensity. To do so, they create a suite of eight representative text-to-image and text-to-video generation tasks and compare them to widely used language models such as LLaMA.
They find notable distinctions, demonstrating that new system performance bottlenecks emerge even after state-of-the-art optimizations such as Flash Attention are applied. For example, convolution accounts for up to 44% of execution time in diffusion-based TTI models, while linear layers consume up to 49% of execution time in transformer-based TTI models.
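A back-of-the-envelope FLOP count illustrates why these two operator types dominate their respective architectures. The sketch below uses standard multiply-accumulate formulas; the layer shapes are illustrative assumptions, not figures taken from the paper.

```python
# Rough FLOP estimates for the two operator types that dominate TTI models:
# convolutions (diffusion U-Nets) and linear layers (transformer decoders).
# Shapes below are hypothetical, chosen only to be plausible for each class.

def conv2d_flops(h, w, c_in, c_out, k=3):
    """Multiply-accumulate count for a stride-1, 'same'-padded 2D convolution."""
    return 2 * h * w * c_in * c_out * k * k

def linear_flops(seq_len, d_in, d_out):
    """Multiply-accumulate count for a linear layer applied to each token."""
    return 2 * seq_len * d_in * d_out

# A 64x64 feature map with 320 channels, plausible for a latent-diffusion U-Net stage
conv = conv2d_flops(64, 64, 320, 320)
# A 4096-token sequence through a 1024 -> 4096 projection, plausible for a transformer MLP
lin = linear_flops(4096, 1024, 4096)

print(f"conv2d : {conv / 1e9:.1f} GFLOPs")  # -> 7.5 GFLOPs
print(f"linear : {lin / 1e9:.1f} GFLOPs")   # -> 34.4 GFLOPs
```

Even at moderate resolutions a single convolution stage runs billions of multiply-accumulates, which is why per-operator breakdowns, not just attention, matter for these workloads.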
Furthermore, they find that the temporal attention bottleneck grows exponentially with the number of generated frames, an observation that highlights the need for future system optimizations targeting this stage. They also develop an analytical framework to model how memory requirements and FLOPs evolve throughout the forward pass of a diffusion model.
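The frame-count sensitivity can be sketched analytically: temporal attention treats the frame axis as the sequence, so its score matrix is frames × frames at every spatial location. The dimensions below are illustrative assumptions, and this simple model only captures the quadratic matmul term, not the full cost the paper's framework accounts for.

```python
# Sketch of how temporal-attention cost grows with the number of generated
# frames in a TTV model. Temporal attention attends across frames at each
# spatial location, so the attention matmuls scale with frames squared.
# Feature dimension and spatial size are hypothetical.

def temporal_attention_flops(frames, h, w, d):
    """QK^T and attention-weighted V matmuls, summed over all spatial locations."""
    per_location = 2 * (2 * frames * frames * d)  # two (frames x d) x (d x frames)-shaped matmuls
    return h * w * per_location

d, h, w = 512, 32, 32
for frames in (8, 16, 32, 64):
    gflops = temporal_attention_flops(frames, h, w, d) / 1e9
    print(f"{frames:3d} frames -> {gflops:.2f} GFLOPs")
# Doubling the frame count quadruples the temporal-attention FLOPs in this model.
```

Under this model, extending a clip from 8 to 64 frames multiplies the temporal-attention cost by 64, which is consistent with the paper's observation that this stage becomes the dominant bottleneck as videos get longer.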
Large language models (LLMs) are characterized by a fixed sequence length, or context window, which bounds how many tokens the model can take into account when predicting the next token. In state-of-the-art Text-To-Image (TTI) and Text-To-Video (TTV) models, by contrast, the sequence length is directly determined by the size of the image being processed.
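This coupling between resolution and sequence length can be made concrete with a small calculation. The factor-8 VAE downsampling below matches Stable Diffusion's latent space; the stage-downsampling parameter is an illustrative simplification of a U-Net's multi-resolution structure.

```python
# In a latent diffusion model such as Stable Diffusion, U-Net self-attention
# operates on flattened spatial positions, so sequence length is set by the
# image resolution rather than by a fixed context window.

def attention_seq_len(height, width, vae_factor=8, stage_downsample=1):
    """Token count seen by a U-Net self-attention layer at a given stage."""
    lat_h = height // (vae_factor * stage_downsample)
    lat_w = width // (vae_factor * stage_downsample)
    return lat_h * lat_w

for side in (512, 768, 1024):
    print(f"{side}x{side} image -> sequence length {attention_seq_len(side, side)}")
# 512x512  -> 4096 tokens
# 1024x1024 -> 16384 tokens
```

Doubling the image side length quadruples the attention sequence length, so naive attention cost, quadratic in sequence length, grows sixteenfold, which is why image-size scaling stresses these models differently than longer prompts stress an LLM.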
They conducted a case study on the Stable Diffusion model to understand the impact of image-size scaling more concretely and to characterize the sequence length distribution during Stable Diffusion inference. They find that once techniques such as Flash Attention are applied, convolution scales more strongly with image size than attention does.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advancement, and he is passionate about understanding nature through tools such as mathematical models, ML models, and AI.