Starting at a high level, Transformers require two pieces of information as input: token embeddings and positional encodings. Token embeddings start from a tokenizer such as `tiktoken`, which uses a fixed vocabulary size to generate a unique key for each token. Through training, the model then learns the query and value for each token so that it can successfully generate the next token from that information.

In addition to embeddings, we also need positional information to tell the LLM where each token sits in a sentence. The equations above show the most abstract view of how position information is passed along. We have 3 functions, 1 for each element of the token (query, key, and value), and 2 word embedding vectors (X*m* and X*n*, where *m* and *n* denote the positions of the two tokens).
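For reference, this abstract view is usually written along the following lines in the RoPE literature. The symbol names (q, k, v for query, key, and value, and f for each function) are assumptions of this sketch rather than notation taken from this article:

```latex
\begin{aligned}
q_m &= f_q(x_m, m) \\
k_n &= f_k(x_n, n) \\
v_n &= f_v(x_n, n)
\end{aligned}
```

Each function takes a word embedding and its position and returns a vector that already carries the position information.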

One approach is to simply create a new vector for each token you see, so that the position is perfectly unique. Naturally, the tradeoff is that a fully unique vector makes it hard for the model to see similarities in the training data, which degrades performance.

A second approach is to create, for each token, a vector that has a similarity factor to other vectors. This way, we still pass along information about how similar one situation is to another distinct situation. However, because these vectors can collide, this methodology can lead to confusion.

How do we find the best combination of these approaches?

The industry has largely settled on RoPE as a way to get the best of both worlds. Without going too deep into the math, RoPE uses sinusoidal functions to assign positional values to tokens. Because sinusoidal functions are repetitive by design, some positional values will be very similar to others. Consequently, similar items have a quantitative value indicating just how similar they are.

As you can see in the equation above, we have a sparse matrix filled with different functions that all revolve around the value θ, which is passed in to keep all of the positional encodings related.
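To make the rotation idea concrete, here is a minimal pure-Python sketch. It assumes the standard RoPE frequency schedule (θ_i = 10000^(-2i/d)) and rotates consecutive pairs of coordinates; the helper names `apply_rope` and `dot`, and the toy vectors, are made up for illustration. The point it demonstrates is that the score between a rotated query and a rotated key depends only on how far apart the two positions are.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by angles that grow with position."""
    dim = len(vec)
    rotated = []
    for i in range(dim // 2):
        # Pair i rotates by position * theta_i, with theta_i = base^(-2i/dim).
        angle = position * base ** (-2 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.append(x * math.cos(angle) - y * math.sin(angle))
        rotated.append(x * math.sin(angle) + y * math.cos(angle))
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional query and key vectors (arbitrary values).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, at very different absolute positions.
score_near = dot(apply_rope(query, 5), apply_rope(key, 2))
score_far = dot(apply_rope(query, 105), apply_rope(key, 102))
print(score_near, score_far)
```

The two printed scores agree up to floating-point error, while changing the offset between the positions changes the score.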

Exactly how these θs are related is shown below:

The most critical part of this equation for context size is the value 10,000. As we tried to create larger contexts with finite ranges of numbers, the value of 10,000 became a limiting factor. After all, there are only so many vectors you can create with that number as your base.
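One way to see why the base matters is to look at how many positions each rotating pair needs before its angle repeats. The sketch below assumes the standard schedule θ_i = base^(-2i/d) and a per-head dimension of 128; both the dimension and the alternative base of 500,000 are illustrative assumptions, not values from this article.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def wavelengths(dim, base=10000.0):
    """Positions each coordinate pair needs to complete one full rotation."""
    return [2 * math.pi / theta for theta in rope_frequencies(dim, base)]

dim = 128  # assumed per-head dimension
short_base = wavelengths(dim, base=10000.0)
large_base = wavelengths(dim, base=500000.0)  # hypothetical larger base

print(f"fastest pair repeats every {short_base[0]:.2f} positions")
print(f"slowest pair, base 10,000:  {short_base[-1]:,.0f} positions")
print(f"slowest pair, base 500,000: {large_base[-1]:,.0f} positions")
```

A larger base stretches the slowest rotations over many more positions, which is the room a longer context needs.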

Although you can train a new model from scratch using a larger base value for your positional encodings, two things prevent most people from doing so. First, training from scratch comes at a huge cost. Only a few organizations in the world currently have the resources to do it, so the burden is heavy. Second, it is extremely difficult to find a large volume of high-quality long text. Since training requires billions of tokens, finding quality long data at that scale is a major challenge.

Therefore, researchers have proposed different methodologies to extend RoPE to larger thetas.

The first method is linear positional interpolation (PI), where you expand the number of possible positions by scaling theta down by some factor λ. The equation below uses β to represent the θ^(2/d) term we used to connect all of the thetas earlier.

While this works, the paper's authors note that there is a crowding effect in which some information ends up being lost after the reduction.
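A small sketch of the idea, assuming the rotation angle for position n and pair i has the form n / (λ · β^i) with β = θ^(2/d). The lengths 4,096 and 16,384 and the helper name `rope_angle` are illustrative assumptions. With λ set to the ratio of the two lengths, the last position of the longer window lands on an angle the model already saw in training, but neighbouring positions end up closer together, which is one way to picture the crowding.

```python
def rope_angle(position, pair_index, dim, base=10000.0, scale=1.0):
    """Angle n / (scale * beta^i), where beta = base^(2/dim)."""
    beta = base ** (2 / dim)
    return position / (scale * beta ** pair_index)

dim = 128                 # assumed per-head dimension
original_length = 4096    # assumed training window
target_length = 16384     # assumed extended window
scale = target_length / original_length  # lambda = 4

pairs = (0, 32, 63)
trained_angles = [rope_angle(original_length, i, dim) for i in pairs]
pi_angles = [rope_angle(target_length, i, dim, scale=scale) for i in pairs]

# The gap between two neighbouring positions shrinks by the same factor.
step_before = rope_angle(1, 0, dim)
step_after = rope_angle(1, 0, dim, scale=scale)

print(trained_angles)
print(pi_angles)
print(step_before, step_after)
```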

The second method is YaRN (Yet another RoPE extensioN method), where we divide the RoPE dimensions into 3 groups and assign a different scaling factor to each of them. The basic idea is that dimensions that rotate at high frequency should not be modified (their λ := 1) while those that rotate at lower frequency should be. From the graph below, we can see that this works well for extending up to a 128k context length. The issue at stake here is determining the groupings. The groups are chosen by people, so suboptimal decisions can be made that reduce performance.
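To illustrate what a frequency-based grouping can look like, here is a simplified sketch. It is not YaRN's exact rule: the thresholds of 1 and 32 full rotations, the blending formula for the middle group, and the function name `grouped_scale_factors` are all assumptions chosen for illustration. These hand-picked thresholds are exactly the kind of human decision the paragraph above refers to.

```python
import math

def grouped_scale_factors(dim, original_length, scale, base=10000.0,
                          low_turns=1.0, high_turns=32.0):
    """Per-pair factors: 1 for fast pairs, `scale` for slow pairs, a blend between."""
    factors = []
    for i in range(dim // 2):
        theta = base ** (-2 * i / dim)
        # Full rotations this pair completes inside the original window.
        turns = original_length * theta / (2 * math.pi)
        if turns >= high_turns:
            factors.append(1.0)        # high frequency: leave untouched
        elif turns <= low_turns:
            factors.append(scale)      # low frequency: full interpolation
        else:
            # Middle group: blend the scaled and unscaled frequency.
            keep = (turns - low_turns) / (high_turns - low_turns)
            factors.append(1.0 / ((1.0 - keep) / scale + keep))
    return factors

factors = grouped_scale_factors(dim=128, original_length=4096, scale=4.0)
print(factors[0], factors[len(factors) // 2], factors[-1])
```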

So while YaRN and linear positional interpolation (PI) both work, they have limitations that hold them back. LongRoPE takes the best of each idea and finds a clever way to combine them.

To improve on previous methods, the LongRoPE researchers introduced two key ideas: (1) the distribution of good λ values is irregular, so it is better to search for λ than to assume a correct answer, and (2) there is a subset of tokens whose positions simply should not be changed.

Both of these findings appear in the formula below. To find the optimal λ, they created a loss function that they could minimize. The formula below is a reformatted version of RoPE, with the result of 𝕀 and (n/β^i) representing the scaling applied to our positional vector. When they find the smallest loss, they choose the corresponding λ.

The 𝕀 step function is how we update the subset of tokens that should not be changed. By choosing a value of 1, we signal that the position encodings must remain the same. To limit the search, they only considered n-hat values of {0, 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 64, 128, 256}. The higher the value of n-hat, the more tokens retain their original position encodings.
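The step function can be sketched in a few lines. This assumes the angle has the form 𝕀 · n/β^i, where 𝕀 is 1 for the first n-hat positions and 1/λ_i afterwards; the λ values below are made up for illustration and are not the result of any search, and `longrope_angle` is a hypothetical helper name.

```python
def longrope_angle(position, pair_index, dim, scale_factors, n_hat, base=10000.0):
    """Angle I * n / beta^i with beta = base^(2/dim)."""
    beta = base ** (2 / dim)
    if position < n_hat:
        indicator = 1.0                               # early positions: unchanged
    else:
        indicator = 1.0 / scale_factors[pair_index]   # later positions: rescaled
    return indicator * position / beta ** pair_index

dim = 8
scale_factors = [1.0, 1.5, 3.0, 8.0]  # made-up per-pair lambdas
n_hat = 4

early = longrope_angle(2, 3, dim, scale_factors, n_hat)
late = longrope_angle(100, 3, dim, scale_factors, n_hat)
plain_early = 2 / (10000.0 ** (2 / dim)) ** 3
plain_late = 100 / (10000.0 ** (2 / dim)) ** 3

print(early, plain_early)  # same angle: position 2 is below n_hat
print(late, plain_late)    # late is the plain angle divided by 8
```

A search procedure would then try different `scale_factors` and `n_hat` candidates and keep the combination with the lowest loss.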

Now that we've covered the theory, let's see the results!

LongRoPE works both with and without fine-tuning. The graph above shows the performance of LongRoPE when applied to LLaMA2-7B. The original context window of this model was 4k. By finding the optimal λ, they were able to expand the context window to 32k tokens with no noticeable change in perplexity! The incredible thing is that the computation required to make a change like this is almost negligible compared to the cost of fine-tuning. An 8x expansion with no major compute expense is incredible.

Getting a huge expansion requires combining fine-tuning with the search for the optimal λ. The paper's researchers achieved a 512x expansion following this methodology. They first extended the model to context sizes of 128k and 256k. They fine-tuned for 400 steps with the 128k factors, then switched to the 256k factors for an additional 600 steps. Since this worked better than fine-tuning directly at 256k, it appears that learning a more general distribution, rather than just one of the scaled distributions, gives better performance. They then searched for the best λ again and arrived at a context window of 2048k, an increase of 512x over the original 4k context window!

One of the difficulties of a larger context is a loss of performance for tasks with small contexts. This behavior has been observed before and the theory is that the data at the beginning is condensed into a smaller range, leading to some loss of attention.

They solved this problem in the 2048k context window model by finding the ideal λ for shorter lengths (in the paper, 4k and 8k). During inference, if the context is determined to be small, the LLM dynamically switches to the smaller λ for its positional encodings.
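The switching logic can be pictured as a simple lookup. The article does not give the exact rule, so the sketch below is an assumption: it keeps one hypothetical set of factors per searched length and picks the smallest window that fits the input. The function name, table layout, and factor values are placeholders.

```python
def pick_scale_factors(sequence_length, factor_table):
    """Return the factors for the smallest searched window that fits the sequence."""
    for max_length in sorted(factor_table):
        if sequence_length <= max_length:
            return factor_table[max_length]
    raise ValueError(f"sequence of {sequence_length} tokens exceeds every window")

# Placeholder factors keyed by window size in tokens (values are made up).
factor_table = {
    4 * 1024: [1.0, 1.0, 1.1, 1.3],
    8 * 1024: [1.0, 1.1, 1.6, 2.2],
    2048 * 1024: [1.0, 4.0, 90.0, 480.0],
}

short_prompt = pick_scale_factors(3000, factor_table)
medium_prompt = pick_scale_factors(5000, factor_table)
long_prompt = pick_scale_factors(1_000_000, factor_table)
print(short_prompt, medium_prompt, long_prompt)
```

Short prompts keep factors close to 1, so their positional encodings stay near what the original model was trained on.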

LLMs are good at reasoning and they continue to amaze us with their real-world applications. With a larger context window, especially one that can be obtained at a limited cost while remaining efficient, we will only see their applications grow.

An interesting question is whether dynamic positional encoding calculations are the way of the future. If you can fine-tune on multiple positional encodings and get quality performance for 2 λs, then we may end up with 1 model that can seamlessly switch between multiple λs at inference time.

One of the things I find most interesting about the LLM space is the ability to sift through data. While the Internet has done a remarkable job of democratizing access to information, it has unfortunately also flooded our lives with noise. There are a lot of things that are shown to us online that have almost no consequence for us. With a tool that can extract important information from mundane and even harmful information, we can use the Internet to its full potential.

With larger context windows, the LLM's ability to summarize and condense information can be used to even greater effect. There may even come a time when great progress is made by giving LLMs two seemingly disparate sets of information and having them discover something new that can be reasoned about given the premises of each set.

It’s an exciting time to build.