Understand the essential techniques behind BERT architecture choices to produce a compact and efficient model
In recent years, the evolution of large language models has been explosive. BERT has become one of the most popular and effective models for solving a wide range of NLP tasks with high accuracy. After BERT, a number of other models appeared, also demonstrating excellent results.
An obvious trend is that, over time, large language models (LLMs) tend to become more complex, with the number of parameters and the amount of training data growing exponentially. Deep learning research has shown that such scaling generally leads to better results. Unfortunately, the machine learning world has already run into several problems with LLMs, and scalability has become the main obstacle to training, storing, and using them effectively.
As a result, new LLMs have recently been developed to address scalability issues. In this article, we will discuss ALBERT, which was introduced in 2020 with the goal of significantly reducing the number of BERT parameters.
To understand the mechanisms underlying ALBERT, we will refer to its official paper. For the most part, ALBERT reuses the same architecture as BERT. There are three main differences in the architectural choices, which are discussed and explained below.
The pre-training setup of ALBERT is analogous to that of BERT: like BERT, ALBERT is pre-trained on English Wikipedia (2,500 million words) and BookCorpus (800 million words).
When an input sequence is tokenized, each token is mapped to one of the vocabulary embeddings. These embeddings are then used as the input to BERT.
Let V be the vocabulary size (the total number of possible embeddings) and H the embedding dimensionality. Then, for each of the V embeddings, we need to store H values, resulting in a V×H embedding matrix. In practice, this matrix is usually huge and requires a lot of memory to store. An even larger problem is that, most of the time, the elements of the embedding matrix are trainable, so the model needs a lot of resources to learn the appropriate parameters.
For example, the base BERT model has a vocabulary of 30,000 tokens, each represented by a 768-dimensional embedding. In total, that amounts to about 23 million weights to store and train. For larger models, this number is even higher.
This problem can be avoided by using matrix factorization. The original V×H vocabulary matrix can be decomposed into a pair of smaller matrices of sizes V×E and E×H.
Consequently, instead of using O(V×H) parameters, the decomposition requires only O(V×E + E×H) weights. Obviously, this method is effective when H ≫ E.
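To make the savings concrete, here is a quick back-of-the-envelope calculation using the sizes mentioned in this article (V = 30,000, H = 768) together with ALBERT's embedding size E = 128:

```python
# Parameter count of the embedding layer before and after factorization,
# using V = 30,000, H = 768 (from the article) and ALBERT's E = 128.
V, H, E = 30_000, 768, 128

full = V * H                  # original V x H embedding matrix
factorized = V * E + E * H    # V x E lookup + E x H projection

print(f"{full:,}")            # 23,040,000 parameters
print(f"{factorized:,}")      # 3,938,304 parameters (~5.9x fewer)
```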
Another interesting aspect of matrix factorization is that it does not change the lookup process for obtaining token embeddings: each row of the left decomposed matrix V×E maps a token to its corresponding embedding in the same simple way as the original matrix V×H did. The only difference is that the embedding dimensionality decreases from H to E.
However, with the decomposed matrices, to obtain the input for BERT, the looked-up embeddings must then be projected into BERT's hidden space: this is done by multiplying the corresponding row of the left matrix by the right E×H matrix.
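As a minimal sketch of how this factorized embedding could look in code (assuming PyTorch and the same sizes as above; this is an illustration, not ALBERT's actual implementation):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sketch of ALBERT-style factorized embedding parameterization.

    Token ids are first mapped to small E-dimensional embeddings (V x E lookup),
    then projected into the H-dimensional hidden space (E x H projection).
    """

    def __init__(self, vocab_size=30_000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_dim)             # V x E
        self.project = nn.Linear(embed_dim, hidden_dim, bias=False)   # E x H

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, hidden_dim)
        return self.project(self.lookup(token_ids))

emb = FactorizedEmbedding()
tokens = torch.randint(0, 30_000, (2, 16))   # a dummy batch of token ids
print(emb(tokens).shape)                      # torch.Size([2, 16, 768])
```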
Another way to reduce the number of model parameters is to share them, meaning that different parts of the model use the same values. Essentially, this only reduces the memory required to store the weights: standard algorithms such as inference or backpropagation still have to be executed over all parameters.
One of the most effective ways to share weights is to place them in different but structurally similar blocks of the model. This increases the chance that most of the computations involving the shared parameters during forward or backward propagation will be identical, which gives more opportunities to design an efficient computational framework.
This idea is implemented in ALBERT, which consists of a stack of Transformer blocks with identical structure, making parameter sharing more efficient (a minimal sketch follows the list below). In fact, there are several ways to share parameters across Transformer layers:
- share only the attention parameters;
- share only the feed-forward network (FFN) parameters;
- share all parameters (used in ALBERT).
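Below is a minimal PyTorch sketch of the full-sharing variant used in ALBERT: a single Transformer encoder layer is stored and simply re-applied at every depth, so the stored weights correspond to one block while the computation is still repeated for each layer. The hyperparameters here are illustrative, not ALBERT's exact configuration.

```python
import torch
import torch.nn as nn

hidden_dim, num_layers = 768, 12

# One encoder layer holds the only copy of the weights.
shared_layer = nn.TransformerEncoderLayer(
    d_model=hidden_dim, nhead=12, dim_feedforward=3072, batch_first=True
)

def shared_encoder(x):
    # The same weights are applied at every depth; only the activations differ.
    for _ in range(num_layers):
        x = shared_layer(x)
    return x

x = torch.randn(2, 16, hidden_dim)    # (batch, seq_len, hidden)
print(shared_encoder(x).shape)         # torch.Size([2, 16, 768])
```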
In general, it is possible to divide all Transformer layers into N groups of size M each, where each group shares parameters among its layers. The researchers found that the smaller the group size M, the better the results. However, decreasing the group size M also leads to a significant increase in the total number of parameters.
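As a hypothetical illustration of this grouping scheme (again in PyTorch, with made-up sizes), layer i can simply reuse the weights of group i // M; setting M equal to the number of layers recovers ALBERT's full sharing, while M = 1 recovers unshared BERT-style layers:

```python
import torch
import torch.nn as nn

num_layers, M = 12, 4   # 12 layers split into 3 groups of 4 consecutive layers

# One set of weights per group.
group_layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072, batch_first=True)
    for _ in range(num_layers // M)
])

def grouped_encoder(x):
    for i in range(num_layers):
        x = group_layers[i // M](x)   # layer i uses the parameters of group i // M
    return x

x = torch.randn(2, 16, 768)
print(grouped_encoder(x).shape)        # torch.Size([2, 16, 768])
```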
During pre-training, BERT focuses on mastering two objectives: masked language modeling (MLM) and next sentence prediction (NSP). In general, MLM was designed to improve BERT's ability to acquire linguistic knowledge, while the goal of NSP was to improve BERT's performance on particular downstream tasks.
Nevertheless, several studies have shown that getting rid of the NSP objective could be beneficial, mainly because of its simplicity compared to MLM. Following this idea, the ALBERT researchers also decided to remove the NSP task and replace it with a sentence order prediction (SOP) objective, whose goal is to predict whether two sentences appear in the correct or reversed order.
As for the training data, positive pairs of input sentences are consecutive sentences taken from the same text passage (the same method as in BERT). Negative pairs are built on the same principle, except that the two sentences appear in reversed order.
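Here is a hypothetical sketch of how such SOP training pairs could be constructed from a passage; the function name and labels are illustrative and not taken from the original implementation:

```python
import random

def make_sop_pairs(sentences, p_negative=0.5):
    """Build sentence-order-prediction (SOP) examples from consecutive sentences.

    Positive pairs keep the original order (label 1);
    negative pairs are the same two sentences swapped (label 0).
    """
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        if random.random() < p_negative:
            pairs.append((second, first, 0))   # reversed order -> negative example
        else:
            pairs.append((first, second, 1))   # original order -> positive example
    return pairs

passage = ["He went to the store.", "He bought some milk.", "Then he walked home."]
print(make_sop_pairs(passage))
```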
It has been shown that models trained with the NSP objective cannot accurately solve SOP tasks, while models trained with the SOP objective perform well on NSP problems. These experiments suggest that ALBERT is better suited than BERT to solving various downstream tasks.
The detailed comparison between BERT and ALBERT is shown in the diagram below.
Here are the most interesting observations:
- With only 70% of BERT large's parameters, ALBERT xxlarge achieves better performance on downstream tasks.
- ALBERT large achieves performance comparable to BERT large while being 1.7 times faster, thanks to its massive parameter compression.
- All ALBERT models use an embedding size of 128. As shown in the ablation studies in the paper, this is the optimal value. Increasing the embedding size, for example to 768, improves the metrics, but by no more than 1% in absolute terms, which does not justify the increased model complexity.
- Although ALBERT xxlarge processes a single iteration of data 3.3 times slower than BERT large, experiments have shown that when these two models are trained for the same amount of time, ALBERT xxlarge achieves significantly better average performance on benchmarks than BERT large (88.7% versus 87.2%).
- Experiments showed that ALBERT models with large hidden sizes (≥ 1024) do not benefit much from an increase in the number of layers. This is one of the reasons why the number of layers was reduced from 24 in the ALBERT large version to 12 in the xxlarge version.
- A similar phenomenon occurs when increasing the hidden layer size: raising it beyond 4096 degrades model performance.
At first glance, ALBERT seems a preferable choice to the original BERT models because it outperforms them on downstream tasks. However, ALBERT requires much more computation because of its larger layer dimensions. A good example is ALBERT xxlarge, which has 235 million parameters and 12 encoder layers. Most of these 235M weights belong to a single Transformer block whose parameters are shared across all 12 layers. Thus, during training or inference, the algorithm effectively operates on more than 2 billion parameters!
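A rough estimate makes this concrete. The numbers below assume the xxlarge configuration reported in the paper (hidden size 4096, 12 layers) plus a conventional feed-forward size of 4 × the hidden size, and ignore biases, layer norms, and embeddings, so treat them as an approximation rather than an exact count:

```python
# Back-of-the-envelope estimate of the weights ALBERT xxlarge effectively
# computes with during a forward pass (assumptions: H = 4096, FFN = 4 * H,
# 12 layers; biases, layer norms, and embeddings are ignored).
H, layers = 4096, 12

attention_per_layer = 4 * H * H      # Q, K, V, and output projections
ffn_per_layer = 2 * H * (4 * H)      # two feed-forward matrices
per_layer = attention_per_layer + ffn_per_layer

print(f"{per_layer:,}")              # ~201M weights in the single shared block
print(f"{per_layer * layers:,}")     # ~2.4B effective weights over 12 applications
```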
For these reasons, ALBERT is better suited to problems where speed can be sacrificed for greater accuracy. Ultimately, the NLP field never stands still and is constantly progressing towards new optimization techniques. It is very likely that ALBERT's speed will be improved in the near future. The paper's authors have already mentioned methods such as sparse attention and block attention as potential algorithms for accelerating ALBERT.
All images, unless otherwise noted, are by the author