The emergence of large language models (LLMs) such as GPT, Claude, Gemini, LLaMA, and Mistral has significantly accelerated recent progress in natural language processing (NLP). Instruction tuning is a well-known approach to training LLMs: using large-scale, well-formatted instruction data, it refines a model's pre-trained representations so the model can follow human instructions. However, these instruction-following tasks are complex in themselves, making model fitting difficult. For general tasks, a model of limited capacity may struggle to minimize the losses of many tasks at once, resulting in poor performance.
Increasing model capacity can improve the effectiveness of instruction tuning for general tasks. However, most LLMs are dense pre-trained models built on the transformer architecture, which significantly limits scalability during instruction tuning. Transforming dense models into mixture-of-experts (MoE) models offers an opportunity to achieve exceptional performance on general tasks under instruction tuning. To make this change, the expert layers of the MoE model are initially configured as duplicates of the original feed-forward network (FFN) layers. Given the large parameter scale of existing LLMs, however, training such massive models is hampered by the computational cost and GPU memory required to update the expert weights in the MoE layers.
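As a rough illustration of this dense-to-MoE "upcycling" step, the PyTorch sketch below shows how each expert can be initialized as a copy of the pre-trained FFN, with a small router dispatching each token to its top-k experts. This is a minimal toy, not the paper's implementation; the names `MoELayer`, `num_experts`, and `top_k` are our own.

```python
# Minimal sketch: upcycle a dense transformer FFN into an MoE layer.
# Each expert starts as an exact duplicate of the pre-trained FFN.
import copy
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_dim: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Experts are initialized as duplicates of the original FFN weights.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        self.router = nn.Linear(hidden_dim, num_experts)  # token -> expert logits
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden_dim). Route each token to its top-k experts
        # and mix their outputs by the routing probabilities (unnormalized
        # over the top-k for simplicity in this sketch).
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out
```

Because every expert starts as an exact duplicate of the dense FFN, the upcycled model initially behaves like the dense one; only subsequent training differentiates the experts, which is exactly where the cost problem described above arises.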
New research from the Shanghai Artificial Intelligence Laboratory and the Chinese University of Hong Kong introduces Parameter-Efficient Sparsity Crafting (PESC), a method for transforming dense models into sparse models with an MoE architecture. By integrating adapters into the MoE layers of the sparse models, PESC allows the experts to be differentiated without updating their weights individually. This significantly reduces GPU memory requirements and computational expense, and because the adapters are small, the model's capacity can be expanded with only a minimal increase in parameters.
To differentiate the experts without updating each expert's weights in the MoE layers, PESC inserts adapters into the MoE layers of the sparse models. The researchers also update the other weights of the sparse model using QLoRA, a popular parameter-efficient fine-tuning (PEFT) method.
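A minimal sketch of this idea, assuming a standard bottleneck adapter design, might look like the following. All names here (`BottleneckAdapter`, `AdaptedExpert`) are illustrative rather than taken from the paper's code:

```python
# Hypothetical sketch: each expert's FFN stays frozen, and only a small
# bottleneck adapter receives gradients, so experts can differentiate
# without updating their full weight matrices.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start as identity (residual is zero)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class AdaptedExpert(nn.Module):
    def __init__(self, frozen_ffn: nn.Module, hidden_dim: int):
        super().__init__()
        self.ffn = frozen_ffn
        for p in self.ffn.parameters():
            p.requires_grad = False       # shared expert weights are not updated
        self.adapter = BottleneckAdapter(hidden_dim)  # per-expert trainable part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.ffn(x))
```

Since only the adapter's down- and up-projections (plus the router) require gradients, the trainable parameter count per expert stays tiny compared with updating the full FFN weights, which is what makes the approach memory-friendly.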
To illustrate the model's learning capabilities, the researchers trained the sparse model with MoE layers on a variety of skills simultaneously, including coding, mathematics, and other general abilities across many domains. For instruction tuning, this training integrated three distinct datasets from different domains: SlimORCA, Magicoder, and MetaMathQA. After filtering and sampling, the final dataset contained 520,000 instructions.
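For a concrete sense of what assembling such a mixture can look like, here is a hedged sketch using the Hugging Face `datasets` library. The dataset IDs and per-source sample counts below are assumptions for illustration, not the paper's exact filtering and sampling recipe:

```python
# Illustrative only: one way to assemble a mixed instruction-tuning corpus
# from the three sources named above. Dataset IDs and sample counts are
# assumptions, not the paper's recipe.
from datasets import load_dataset, concatenate_datasets

slimorca = load_dataset("Open-Orca/SlimOrca", split="train")
magicoder = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
metamath = load_dataset("meta-math/MetaMathQA", split="train")

# Downsample each source so the combined corpus is on the order of 520k rows.
mixed = concatenate_datasets([
    slimorca.shuffle(seed=42).select(range(300_000)),
    magicoder.shuffle(seed=42).select(range(70_000)),
    metamath.shuffle(seed=42).select(range(150_000)),
]).shuffle(seed=42)
```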
Additionally, they used the PESC method to create Camelidae, a family of sparse models. Camelidae-8×34B outperforms GPT-3.5 on general tasks and achieves state-of-the-art (SOTA) performance among all open-source sparse models.
Check out the Paper and Model. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's changing world.