It has been said that information theory and machine learning are "two sides of the same coin" because of their close relationship. One notable connection is the fundamental equivalence between probabilistic models of data and lossless compression. The key result behind this equivalence is the source coding theorem, which states that the expected message length in bits of an optimal entropy encoder equals the negative log2 probability assigned by the statistical model. In other words, minimizing the number of bits needed per message is equivalent to maximizing the log2 likelihood. Lossless compression with a probabilistic model can be achieved with several techniques, including Huffman coding, arithmetic coding, and asymmetric numeral systems.
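To make this correspondence concrete, here is a minimal Python sketch (with a made-up symbol distribution, not a model from the paper) showing that the ideal code length of a message is simply its negative log2 likelihood under the model:

```python
import math

# Toy illustration of the source coding theorem: under an ideal entropy coder,
# a symbol with model probability p costs -log2(p) bits, so the total code
# length of a message equals its negative log2 likelihood.
# The probabilities below are made up for illustration.
model = {"A": 0.5, "I": 0.25, "X": 0.25}

def ideal_code_length(message, probs):
    """Sum of -log2 P(symbol) over the message, in bits."""
    return sum(-math.log2(probs[s]) for s in message)

message = "AIXI"
bits = ideal_code_length(message, model)   # 1 + 2 + 2 + 2 = 7 bits
log_likelihood = -bits                     # log2 P(message) = -7
print(f"{bits:.1f} bits, log2-likelihood = {log_likelihood:.1f}")
```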
Figure 1 | Arithmetic coding of the sequence "AIXI" with a probabilistic (language) model P (both in blue) yields the binary code "0101001" (in green). Arithmetic coding compresses the data by assigning each symbol an interval whose width is given by the probability under P. It progressively refines these intervals to produce the compressed bits, which replace the original message. During decoding, arithmetic coding initializes an interval based on the incoming compressed bits. To reconstruct the original message, it iteratively matches the intervals to the symbols using the probabilities provided by P.
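The interval narrowing described in the caption can be sketched in a few lines of Python. This is a toy encoder using exact rational arithmetic and a fixed, hypothetical symbol distribution standing in for the language model P, so its output will not necessarily match the "0101001" of the figure; production coders additionally use finite-precision renormalization:

```python
from fractions import Fraction

# Toy arithmetic encoder (exact rational arithmetic, no renormalization) that
# illustrates the interval narrowing of Figure 1. The fixed symbol
# probabilities below are hypothetical stand-ins for the language model P.
PROBS = {"A": Fraction(1, 2), "I": Fraction(1, 4), "X": Fraction(1, 4)}

def encode(message, probs):
    # Give each symbol a sub-interval of [0, 1) whose width is its probability.
    starts, total = {}, Fraction(0)
    for s, p in probs.items():
        starts[s] = total
        total += p

    low, high = Fraction(0), Fraction(1)
    for s in message:  # narrow the interval symbol by symbol
        width = high - low
        low, high = low + width * starts[s], low + width * (starts[s] + probs[s])

    # Emit enough leading bits of the interval midpoint to land inside [low, high).
    width, mid = high - low, (low + high) / 2
    n = 0
    while Fraction(1, 2 ** n) > width / 2:  # 2^-n <= width/2 ensures the truncation fits
        n += 1
    bits, x = "", Fraction(0)
    for i in range(1, n + 1):
        step = Fraction(1, 2 ** i)
        if x + step <= mid:
            x, bits = x + step, bits + "1"
        else:
            bits += "0"
    return bits

print(encode("AIXI", PROBS))  # a short bit string identifying the final interval
```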
Since arithmetic coding is optimal in terms of coding length, the overall compression performance depends on the capabilities of the probabilistic model (Fig. 1). Moreover, large pre-trained Transformers, so-called foundation models, have recently demonstrated excellent performance across a wide range of prediction tasks and are therefore attractive candidates for use with arithmetic coding. Transformer-based compression with arithmetic coding has produced state-of-the-art results in both the online and offline settings. The offline setting, which they consider in their work, trains the model on an external dataset before using it to compress a (possibly different) data stream. In the online setting, a (pseudo-)randomly initialized model is trained directly on the data stream to be compressed. Offline compression therefore uses a fixed set of model parameters and is performed in-context.
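The offline setup can be sketched roughly as follows; `model.next_token_probs` and the `encoder` object are hypothetical placeholders rather than an API from the paper, and the point is simply that the model's weights stay frozen while prediction adapts through the context:

```python
# Hedged sketch of the offline setting: a pretrained model's next-token
# probabilities drive an arithmetic coder. The model and encoder interfaces
# here are invented for illustration only.

def compress_offline(token_stream, model, encoder):
    """Compress a token stream with fixed model parameters (in-context prediction)."""
    context = []
    for token in token_stream:
        probs = model.next_token_probs(context)  # P(next token | context), weights frozen
        encoder.encode(token, probs)             # arithmetic-code the token under P
        context.append(token)                    # conditioning only; no gradient updates
    return encoder.get_bits()

# In the online setting, one would instead insert a training step inside the
# loop, updating the model's parameters on the very stream being compressed.
```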
Transformers are well suited to offline compression because they have demonstrated exceptional in-context learning capabilities. As described in this work, Transformers that learn to compress efficiently must therefore have strong in-context learning skills. Context length, a critical limiting factor in offline compression, determines the maximum number of bytes a model can compress at once. Transformers are computationally expensive and can only compress a modest amount of data at a time (a "token" typically encodes 2 or 3 bytes). Since many difficult prediction tasks (such as algorithmic reasoning or long-term memory) require long contexts, extending the context length of these models is an important problem that is receiving growing attention. The in-context compression view highlights how today's foundation models fail. Researchers from Google DeepMind, Meta AI, and Inria advocate using compression to explore the prediction problem and evaluate how well large (foundation) models compress data.
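Concretely, a limited context window means a long byte stream has to be cut into context-sized chunks that are compressed independently, as in this small sketch (the chunk size is illustrative, not a value from the paper):

```python
# Hedged sketch: the context window bounds how many bytes the model can see,
# so a long byte stream is split into chunks and each chunk is compressed
# on its own.

CONTEXT_TOKENS = 2048          # example context length
BYTES_PER_TOKEN = 2            # a token covers roughly 2-3 bytes
CHUNK_BYTES = CONTEXT_TOKENS * BYTES_PER_TOKEN

def chunks(data: bytes, size: int = CHUNK_BYTES):
    """Yield context-sized slices; no information flows between slices,
    which is one cost of a limited context."""
    for i in range(0, len(data), size):
        yield data[i:i + size]
```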
They make the following contributions:
• They perform an empirical study of the ability of foundation models to perform lossless compression. To this end, they review how predictive models can be turned into lossless compressors via arithmetic coding and draw attention to the connection between the two fields of study.
• They demonstrate that foundation models with in-context learning capabilities, trained primarily on text, are general-purpose compressors. For example, Chinchilla 70B achieves compression ratios (lower is better; see the short ratio sketch after this list) of 43.4% on ImageNet patches and 16.4% on LibriSpeech samples, outperforming domain-specific compressors such as PNG (58.5%) and FLAC (30.3%), respectively.
• They present a new perspective on scaling laws by demonstrating that scaling is not a magic solution and that the size of the dataset sets a strict upper limit on the size of the model in terms of compression performance.
• They use compressors as generative models, leveraging the prediction-compression equivalence to visually illustrate the performance of the underlying compressor.
• They show that tokenization, which can be viewed as a form of pre-compression, does not, on average, improve compression performance. Instead, it allows models to pack more information into their context and is generally used to improve prediction performance.
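As a quick note on how the compression ratios in the second bullet are read: a ratio is simply compressed size divided by raw size, so lower is better. The byte counts in this sketch are invented; only the percentages come from the article:

```python
# Compression ratio = compressed size / raw size (lower is better).
# The raw size below is a made-up example; the percentages are the ones
# reported above for Chinchilla 70B and PNG on ImageNet patches.

def compression_ratio(compressed_bytes: int, raw_bytes: int) -> float:
    return compressed_bytes / raw_bytes

raw = 1_000_000  # hypothetical 1 MB of raw ImageNet patches
print(compression_ratio(434_000, raw))  # 0.434 -> Chinchilla 70B (43.4%)
print(compression_ratio(585_000, raw))  # 0.585 -> PNG (58.5%)
```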
Check out the Paper. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate studies in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.