Summary
Modern large language models (LLMs) rely almost universally on tokenization as a preprocessing step. Tokenization maps raw text, which is ultimately composed of characters or bytes, onto a sequence of units drawn from a fixed, pre-defined vocabulary of subword tokens. Although tokenization significantly reduces sequence length and thus computational load, it also hardwires certain linguistic assumptions and biases into the model, often leading to drawbacks such as domain sensitivity, vulnerability to input noise, unequal treatment of languages, and challenges in modeling fine-grained morphological or orthographic information.
The paper introduces the Byte Latent Transformer (BLT), a tokenizer-free architecture designed to overcome the limitations of fixed-vocabulary tokenization and match state-of-the-art performance at scale. BLT uses raw bytes as input and dynamically groups them into “patches”—variable-length segments—enabling the model to adaptively allocate compute based on data complexity. Through extensive scaling studies, the authors show that BLT can match the performance of token-based models while introducing gains in inference efficiency and robustness, as well as improved handling of long-tail linguistic phenomena and noisy inputs.
Motivation and Background
Tokenization-based large language models such as GPT and Llama have succeeded in tackling a wide range of tasks across languages and domains. However, tokenization introduces several issues:
- Heuristic and Fixed Granularity: Tokenization attempts to compress text into subword units before training, encoding a rigid, one-size-fits-all segmentation scheme. This can be suboptimal for domains that differ from the training text domain, as well as for languages and writing systems poorly represented in the original training data.
- Inflexibility and Domain Sensitivity: Different domains might benefit from different forms of segmentation, but typical tokenizers cannot dynamically adjust. For instance, a technical domain might have long, predictable terms where finer segmentation is unnecessary, while more creative or unpredictable text might demand a more fine-grained representation.
- Input Noise and Robustness: Tokenization-based models are often less robust to perturbations at the character level. Small orthographic changes, random casing, or insertion of unusual characters can break tokens and degrade performance.
- Inequality Across Languages: Multilingual models often rely on subword vocabularies heavily influenced by high-resource languages, marginalizing rare languages whose words are tokenized into disproportionately long sequences, leading to poorer performance.
Training directly at the byte level could remove these issues, since bytes are a universal lowest-level representation of text. But naive byte-level modeling inflates sequence length and, consequently, computational cost. Prior attempts, both by the authors and by others in the field, found that byte-level Transformers struggle to scale efficiently.
Key Idea: Dynamic Patching
The paper’s central contribution is an architecture and methodology to operate at the byte level without prohibitive computational costs. The Byte Latent Transformer (BLT) divides a byte sequence into patches—groups of consecutive bytes—on the fly. In contrast to fixed-vocabulary tokens, these patches are not stored in a lookup table and do not represent a fixed set of subword units. Instead, the model dynamically decides how to form patches based on the complexity of the upcoming prediction task.
A patch corresponds to a single time-step at the global Transformer level. By controlling the patch size, BLT can reduce the effective sequence length of the global model. Not all parts of a text require the same amount of computational capacity: some parts are more predictable and repetitive (low-entropy), while others are more uncertain, such as the start of a new word or sentence. BLT allocates more patches (and thus more compute steps) where the entropy of prediction is high, and fewer patches where the text is predictable. This adaptive approach ensures efficient utilization of model capacity.
Contrast with Tokenization
Tokenization-based models have a fixed set of tokens that are chosen offline before training. These tokens compress bytes in a heuristic manner (e.g., Byte-Pair Encoding). The resulting tokens must always be modeled at the same level of granularity, which can be wasteful. In BLT, patches are formed dynamically and can be larger when text is predictable or repetitive, thus requiring fewer global steps. They can also be smaller when the model encounters unfamiliar or difficult content, dedicating more representational capacity to challenging predictions.
An important conceptual difference is that in BLT, the model retains full access to the underlying bytes through a local encoder/decoder stack. Token-based models, once they have mapped text to a fixed set of tokens, lose direct connection to the raw characters and rely entirely on learned token embeddings. BLT, on the other hand, can still refer “back” to bytes at any time, making it inherently more robust to character-level manipulations or rare lexical phenomena.
BLT Architecture
BLT consists of three main components:
- Local Encoder: A small transformer that takes raw bytes as input and produces local byte-level representations. This module is lightweight, with few layers. It applies cross-attention to pool byte representations into patch representations. Byte embeddings at this stage can be enriched with hash-based n-gram features, giving the model direct access to subword structures. This local encoder efficiently compresses a sequence of bytes into a single vector that represents a patch.
- Global Latent Transformer: A large, autoregressive transformer that operates over sequences of patch representations. The patches serve as the main time steps for the global model, similar to how tokens serve as steps for token-based LLMs. Since patch-level steps are fewer than byte-level steps, the global model can be significantly larger without incurring a linear scaling in compute cost.
- Local Decoder: Another small transformer model that takes the global transformer’s patch-level output and decodes it back into a sequence of bytes. The decoding step ensures that at generation time, for each predicted patch, BLT can reconstruct the actual byte sequence.
This three-part architecture allows BLT to model language at two complementary levels: the global transformer focuses on semantic coherence and long-range dependencies at a coarse patch scale, while the local modules handle fine-grained details at the byte scale. The local encoder and decoder are computationally cheap, enabling large savings over a purely byte-level global transformer.
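To make the data flow concrete, here is a minimal PyTorch-style sketch of how the three components could fit together. It is an illustrative reading of the description above, not the paper's implementation: the module sizes, the single pooled query per patch, and the way patch outputs are broadcast back to bytes are simplifying assumptions, and causal masking is omitted.

```python
# Minimal sketch of the BLT forward pass (illustrative; module sizes, pooling, and
# broadcasting are simplifying assumptions, and causal masks are omitted).
import torch
import torch.nn as nn

class ByteLatentSketch(nn.Module):
    def __init__(self, d_local=256, d_global=1024, n_local=2, n_global=8):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_local)
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, 4, batch_first=True), n_local)
        # Cross-attention pooling: one learned query per patch attends over that patch's bytes.
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_local))
        self.pool_attn = nn.MultiheadAttention(d_local, 4, batch_first=True)
        self.up = nn.Linear(d_local, d_global)
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, 8, batch_first=True), n_global)
        self.down = nn.Linear(d_global, d_local)
        self.local_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, 4, batch_first=True), n_local)
        self.byte_head = nn.Linear(d_local, 256)

    def forward(self, byte_ids, patch_spans):
        # byte_ids: (1, T) byte values; patch_spans: list of (start, end) half-open byte ranges.
        h = self.local_encoder(self.byte_emb(byte_ids))              # byte-level states
        pooled = [self.pool_attn(self.pool_query, h[:, s:e], h[:, s:e])[0]
                  for s, e in patch_spans]                           # one vector per patch
        p = self.global_model(self.up(torch.cat(pooled, dim=1)))     # large latent transformer over patches
        # Broadcast each patch's output back to its bytes, then predict next bytes locally.
        per_byte = torch.cat([self.down(p[:, i:i + 1]).expand(-1, e - s, -1)
                              for i, (s, e) in enumerate(patch_spans)], dim=1)
        return self.byte_head(self.local_decoder(h + per_byte))      # (1, T, 256) next-byte logits

# Example: 23 bytes grouped into 3 patches.
model = ByteLatentSketch()
ids = torch.tensor([list(b"Byte Latent Transformer")])
logits = model(ids, [(0, 5), (5, 12), (12, 23)])   # -> shape (1, 23, 256)
```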
Patching Methods
The authors consider several approaches to segment bytes into patches:
- Strided Patching: Group every K bytes into a patch. This is simple but wasteful. If K is large, predictions are coarse-grained; if too small, too many global steps are needed.
- Space Patching: End patches whenever a space or space-like character occurs. Spaces often correlate with word boundaries in many languages, so this scheme allocates more compute to predicting the first character of a word and less to the subsequent characters. This improves over strided patching but still relies on a heuristic that may not generalize well to all languages or scripts.
- Entropy Patching: The most flexible and data-driven approach. A small byte-level language model is trained to estimate the probability distribution over the next byte. Using these probabilities, the authors compute the entropy of the next byte prediction. If the entropy is high (i.e., the model is uncertain), a patch boundary is created. This ensures the global transformer is frequently invoked in complex areas. When entropy is low, the patch can be extended further, letting the model cover large predictable sequences in one step. Two variants are introduced: a global threshold on entropy, and a heuristic to detect local entropy spikes, approximating monotonic segments of decreasing entropy.
Entropy patching is incremental and domain-agnostic, making it more powerful and flexible than the previous methods. It also provides a natural mechanism to control average patch size, since one can adjust the threshold to achieve a desired average patch length.
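The global-threshold variant is simple enough to sketch directly. The following is an illustrative implementation, not the paper's code: the entropy model is abstracted as any callable returning a next-byte distribution, and the threshold value is a placeholder.

```python
# Sketch of global-threshold entropy patching (illustrative; the entropy model and
# threshold are placeholders, not the paper's exact configuration).
import math
from typing import Callable, List, Sequence, Tuple

def entropy_patches(
    byte_ids: Sequence[int],
    next_byte_probs: Callable[[Sequence[int]], Sequence[float]],
    threshold: float = 2.0,
) -> List[Tuple[int, int]]:
    """Start a new patch at position i whenever the entropy of predicting byte i,
    given the preceding bytes, exceeds `threshold` (in nats)."""
    boundaries = [0]
    for i in range(1, len(byte_ids)):
        probs = next_byte_probs(byte_ids[:i])        # distribution over the 256 possible next bytes
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        if entropy > threshold:                      # uncertain position -> give the global model a new step
            boundaries.append(i)
    return list(zip(boundaries, boundaries[1:] + [len(byte_ids)]))
```

Raising the threshold produces fewer boundaries and therefore longer patches on average, which is the knob for controlling average patch size mentioned above.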
Training Setup and Compute Control
A key focus of the paper is scaling. The authors conduct a large-scale study, training BLT models of up to 8B parameters on up to 4 trillion bytes of data, aiming to match or surpass well-established LLM baselines. They compare BLT to Llama 2 and Llama 3 models, which are strong token-based baselines.
The authors adopt a flop-controlled setting to ensure a fair comparison. They measure the flops required for both training and inference and compare models with the same training or inference cost. By carefully controlling for computational budgets, they can isolate differences in architecture and representation without conflating them with training scale advantages.
They also employ the “compute-optimal” concept: for a given training compute budget, there is an approximately optimal ratio of model parameters to training data. This ratio, established in prior work (e.g., Chinchilla and Llama 3), ensures that training compute is used most efficiently. BLT is tested under these optimal conditions and beyond, so that differences in scaling behavior are well understood.
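For orientation, the usual rules of thumb from the scaling-law literature can be written as follows; the last line is an illustrative back-of-the-envelope decomposition for BLT (not the paper's exact accounting), with N_g and N_l denoting global and local parameters, D the number of training bytes, and p the average patch size in bytes.

```latex
% Rule-of-thumb FLOP accounting (approximations, not the paper's exact formulas)
C_{\text{train}} \approx 6\,N\,D                                          % dense transformer: N parameters, D tokens or bytes
D^{*} \approx 20\,N                                                       % Chinchilla-style compute-optimal amount of training data
C_{\text{train}}^{\text{BLT}} \approx 6\,N_g\,\frac{D}{p} + 6\,N_\ell\,D  % global model runs once per patch, local modules once per byte
```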
Scaling Trends and Results
- Matching or Surpassing Token-Based Models: The paper shows that BLT models match the training-flop-controlled performance of a state-of-the-art tokenizer-based model (Llama 3) at scale. Even more notably, BLT achieves this with up to 50% fewer inference flops. This is a key accomplishment: previously, byte-level models lagged behind token-based models in efficiency and performance, but BLT breaks this barrier.
- Improved Robustness and Character-Level Tasks: Direct byte-level modeling makes BLT more robust to input noise. Evaluations on tasks where characters are randomly cased, dropped, repeated, or otherwise noised show that BLT outperforms token-based models. BLT also excels at tasks requiring detailed character-level manipulations (e.g., substituting characters, checking orthographic similarity, and phonetic transcription). Because token-based models lack direct access to character information, their performance on such tasks is weaker.
- Long-Tail Generalization: BLT performs better on multilingual and rare language tasks, improving equitability across languages. On FLORES-101 translations, BLT outperforms a matched Llama 3 model on low-resource language pairs, confirming that byte-level modeling avoids biases inherent in fixed token vocabularies.
- Inference Cost vs. Model Size Trade-Off: Since patch size is not fixed by a pre-defined vocabulary, BLT can increase patch size to reduce steps and thereby reallocate saved compute to a larger global model. This creates a new scaling axis: one can increase both model size and patch size while keeping the same inference budget. Larger patch sizes initially underperform at small scales but become competitive or even superior as the model and training budget grow. The authors observe that as BLT scales, big patches plus a large global model yield better performance than token-based models at an equivalent training and inference cost.
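The trade-off can be made concrete with the same rule-of-thumb accounting used above (an approximation, not the paper's exact figures): a forward pass through N parameters costs roughly 2N FLOPs per step, so per generated byte

```latex
% Per-byte inference cost under the same rule of thumb (approximate)
C_{\text{infer}} \approx \frac{2\,N_g}{p} + 2\,N_\ell    % global model amortized over a patch of p bytes
```

Under this approximation, doubling the average patch size p roughly halves the global term, so the global model can be made roughly twice as large while holding the per-byte inference budget fixed.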
- From Tokens to Bytes (Retrofit Existing Models): Another interesting experiment shows that one can initialize BLT’s global transformer from a pre-trained token-based model (like Llama 3.1) and then train the local encoder and decoder. This “byte-ifying” process leverages existing pretraining and can yield substantial improvements on certain tasks with less training compute. Although additional tuning might be needed, this suggests a future direction: starting from a well-trained token-based model and converting it into a tokenizer-free model for enhanced byte-level capabilities.
Ablations and Analysis
The paper includes extensive ablation studies to identify which architectural choices are crucial to BLT’s success:
- Cross-Attention and n-gram Embeddings: BLT relies on cross-attention layers in the local encoder and decoder to integrate byte and patch representations. Ablation results show that these cross-attention modules are essential, with the cross-attention in the decoder contributing the largest improvement. Similarly, adding hash-based n-gram embeddings to the byte-level representations leads to large improvements in compression and representational power (a sketch of the hashing idea appears after this list). Smaller n-grams (sizes 3 to 5) are the most impactful, and enlarging the hash vocabularies for these embeddings shows diminishing returns beyond a certain point.
- Entropy Model Size: The performance of entropy-based patching improves as the entropy model (the small byte-level LM used to determine patch boundaries) grows larger and sees a wider context window. Larger and more capable entropy models yield better patch segmentation and subsequent improvement in BLT’s scaling trends.
- Local Model Depth: The authors find that an extremely lightweight local encoder (even a single layer) combined with multiple decoder layers and n-gram embeddings is effective. This suggests that the critical complexity is in the global model and the ability to quickly and efficiently pool information from bytes to patches (encoder) and from patches back to bytes (decoder).
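The hash-based n-gram embeddings referenced above can be sketched as follows. This is an illustrative version only: the rolling hash, the n-gram sizes, and the table size are placeholder choices rather than the paper's exact scheme.

```python
# Sketch of hash-based n-gram embeddings added on top of byte embeddings
# (the hash function, n-gram sizes, and table size are placeholders, not the paper's exact choices).
import torch
import torch.nn as nn

class HashNGramEmbedding(nn.Module):
    def __init__(self, d_model=256, ngram_sizes=(3, 4, 5), table_size=100_000):
        super().__init__()
        self.ngram_sizes = ngram_sizes
        self.table_size = table_size
        self.tables = nn.ModuleList(nn.Embedding(table_size, d_model) for _ in ngram_sizes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, T) integers in [0, 255]; returns (batch, T, d_model) to add to byte embeddings.
        out = 0
        for n, table in zip(self.ngram_sizes, self.tables):
            # Polynomial hash of the n bytes ending at each position t (bytes t-n+1 .. t).
            h = torch.zeros_like(byte_ids)
            for k in range(n):
                shifted = torch.roll(byte_ids, shifts=k, dims=1)
                shifted[:, :k] = 0                      # positions without a full n-gram fall into a default slot
                h = (h * 257 + shifted) % self.table_size
            out = out + table(h)                        # look up and sum the hashed n-gram embeddings
        return out
```

In the local encoder, such features would be summed with the plain byte embeddings before the transformer layers, giving each byte position direct evidence about its surrounding character n-grams.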
Comparison With Other Byte-Level Architectures
BLT is not the first attempt at byte-level Transformers. Previous works such as CharFormer, ByT5, and MegaByte tried to tackle the bottleneck of long byte sequences. ByT5 models bytes end-to-end but incurs heavy computational costs. MegaByte introduced static (fixed-size) patching and showed efficiency improvements, yet it still lags behind token-based models at scale. BLT surpasses these methods by employing:
- Dynamic patching (especially entropy patching) rather than fixed-size patching.
- A distinct three-part architecture that uses both local encoder-decoder modules and a global transformer, with cross-attention layers carefully placed to mediate between these levels.
- Hash-based n-gram embeddings to enhance local modeling capabilities.
- Systematic exploration of scaling laws, ensuring that the model competes at the frontiers of modern LLM performance and does not rely solely on small-scale experiments.
Implications and Future Work
BLT’s success suggests that tokenization is not an immutable necessity for large language models. Modeling directly from bytes opens avenues to eliminate bias induced by subword segmentation and handle diverse languages fairly. The capacity to scale model size at fixed inference cost by adjusting patch size introduces a new dimension for scaling LLMs. BLT’s robustness to noisy inputs and improved character-level understanding holds promise for applications in which input data may not be well-formed text, including OCR outputs, transliterated text, and domains with rich morphological complexity.
Despite these advantages, there are certain acknowledged limitations. The scaling laws used were derived from token-based models, so there might be even more optimal ratios for BLT. The current implementation may require further engineering for optimal wall-clock speeds. Future directions could involve end-to-end training where the entropy model is integrated and jointly optimized, rather than trained separately. Another direction is to refine techniques for retrofitting pre-trained token-based models, making it easier to convert them into BLT-like models without a large additional training cost.
Conclusion
The Byte Latent Transformer (BLT) represents a step forward in tokenizer-free language modeling. Through dynamic patching, BLT can match the performance of state-of-the-art subword tokenized models while improving inference efficiency and robustness. It provides better scaling trends than token-based architectures, especially beyond the compute-optimal training regime, and offers unique advantages in handling noise, morphological complexity, and multilingual data. The success of BLT shows that large language models can achieve strong performance directly on raw bytes and that doing so confers critical benefits in flexibility, efficiency, and fairness. It challenges the entrenched paradigm that tokenization is a mandatory step, potentially revolutionizing how future LLMs are built and scaled.