As large language models (LLMs) become increasingly vital across various industries, companies must choose the best LLMs to suit their specific needs and long-term objectives. One of the most important aspects to consider in this decision-making process is the context length of the model. This guide will delve into context length, why it matters, and the pros and cons of different context lengths.
Defining Context Length and Its Importance
Context length, or the context window, refers to the maximum amount of information an LLM can process in a single input. The larger the context length, the more information a user can include in a prompt to generate a response.
While it's common to think of context length in terms of words, language models measure it in tokens. A token typically represents four characters (in English) or roughly ¾ of a word. Therefore, 100 tokens equate to about 75 words.
To give you an idea, here are the context lengths of some well-known LLMs:
Mistral 7B: 8K
Palm-2: 8K, with Google's Gemini model supporting up to 32K
Claude: 9K
Claude 2: 100K (in beta at the time of writing)
Llama: 2K
Llama 2: 4K
GPT-3.5-turbo: 4K, with a variant (GPT-3.5-16k) supporting 16K
GPT-4: 8K, with another variant (GPT-4-32k) supporting 32K
The significance of context length lies in its impact on an LLM's functionality and performance in several ways:
Input Scope and Complexity: Larger context lengths enable models to handle more complex and detailed inputs, directly influencing their suitability for various tasks. For example, a 4K context window (like those in GPT-3.5 or Llama 2) is equivalent to about six pages of text, while a 32K context length could encompass up to 49 pages. The larger the window, the more comprehensive tasks like summarization can be.
Coherence: Since LLMs don't have memory in the traditional sense, context length determines how much previous input they can recall, affecting the coherence and accuracy of their output.
Accuracy: A larger context window increases the potential for the model to provide a relevant response by allowing it to consider a more extensive set of information.
Pros and Cons of Different Context Lengths
Now that we understand why context length is important, let's explore the advantages and disadvantages of short and long context lengths.
Short Context Windows
Advantages
Faster Response Times: With less input to process, LLMs with shorter context lengths can quickly generate responses, enhancing user experience.
Resource Efficiency: Shorter context windows require less computational power, memory, and energy, making these models more cost-effective and accessible on a wider range of devices.
Disadvantages
Limited "Memory": LLMs are stateless, so their context window acts as their memory. A shorter window means the model can only recall a limited portion of a conversation or input. If a user refers to information outside this window, this leads to irrelevant or incoherent responses.
Reduced Contextual Understanding: A smaller context length limits the information a user can input. This may result in the model misunderstanding the context and producing less accurate or incomplete output. This often necessitates prompt engineering or providing more precise and detailed input.
Long Context Windows
Advantages
Broader Application Range: Larger context windows can handle more extensive and complex inputs, making them suitable for tasks involving documents, large databases, codebases, and multiple data sources. This also enables models to tackle more intricate, multi-step tasks.
Deeper Understanding: A more extensive context allows the model to understand user requests better, leading to more accurate, coherent, and relevant responses.
Improved Efficiency: Larger context windows save time by processing more information simultaneously, reducing the need for iterative steps to complete tasks.
Disadvantages
Inaccuracy: While longer context lengths offer more capabilities, maintaining accuracy across extended windows can be challenging. Research has shown that LLMs often struggle to accurately recall information in the middle of long contexts, a phenomenon known as the "missing middle."
Higher Resource Demands: Larger context windows require significantly more computational power and memory, which can strain resources and limit the number of users who can effectively utilize such models.
Slower Response Times: The increased computational demands of longer context lengths often lead to slower inference times, which can negatively impact real-time performance and user experience.
Training and Deployment Challenges: Models with longer context lengths have more parameters, leading to longer training times and greater complexity in deployment, particularly on devices with limited computational power.
Addressing the "Missing Middle" Problem
Given the benefits of longer context lengths, what can be done to address the "missing middle" issue that current LLMs with extended windows face? Fortunately, recent research into context length extrapolation—extending an LLM's context window beyond its pre-trained limit—offers promising solutions. However, this requires modifications to the transformer's self-attention mechanism.
Transformers, Attention Mechanisms, and Positional Encoding
Transformers handle input by converting it into a numerical representation, called input embedding. This involves mapping each token in the context onto a multi-dimensional vector space, where the distance between vectors represents the relationship between tokens.
To maintain the input's sequence, transformers use positional embedding, which adds a vector to each token to remember its order without storing it in the neural network. This reduces memory requirements and necessitates larger context windows.
RoPE and Positional Interpolation
Rotary positional embedding (RoPE), used in models like Llama 2 and PaLM, forms the basis for context length extrapolation. RoPE rotates a token’s vector according to its position, allowing for more accurate vector space relationships while maintaining efficiency.
Researchers have successfully introduced positional interpolation (PI) to extend the context window. Instead of extending the encoding to positions outside the pre-trained context length, PI repositions the positional embedding within the original context length by scaling the position vector. This method has shown promising results in tasks requiring long contexts, such as document summarization.
Alternative RoPE Extensions
Despite the progress, the RoPE and PI methods aren't without flaws. Interpolating positional embeddings can degrade performance on tasks with shorter contexts, potentially due to the loss of important high-frequency information. To address this, researchers have developed an alternative extension method (yet another RoPE extensioN), which applies a more targeted interpolation technique called NTK-by-parts. This approach helps maintain performance across short and long contexts, achieving strong results on tasks with context lengths up to 128K.
Conclusion
While more considerable context lengths offer greater flexibility, efficiency, and applicability, they come with accuracy, resource demands, and deployment challenges. As LLM development advances, solutions to these challenges are emerging, paving the way for more effective and reliable models.
References
[1] Lost in the Middle: How Language Models Use Long Contexts (2023)
[2] Extending Context Window of Large Language Models via Positional Interpolation (2023)
[3] Giraffe: Adventures in Expanding Context Lengths in LLMs (2023)