Google has introduced Gemini 1.5, a new version of the Gemini model. It offers major gains in speed and efficiency and adds a significant experimental feature: the long context window. The context window determines how many tokens (the smallest building blocks the model processes, such as part of a word, image, or video) it can take in at once. To shed light on the significance of this development, we spoke with the Google DeepMind project team about what long context windows are and the many ways this experimental feature can benefit developers.
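To make tokens concrete, here is a minimal sketch of counting the tokens a prompt consumes, using the google-generativeai Python SDK; the model name and the count_tokens call reflect the public SDK at the time of writing, but treat the exact identifiers as assumptions to check against current documentation.

```python
import google.generativeai as genai

# Assumes an API key from AI Studio; the model name is illustrative.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

prompt = "Context windows determine how much a model can read at once."

# count_tokens reports how many tokens the prompt consumes;
# the context window limit is measured in exactly these units.
print(model.count_tokens(prompt).total_tokens)
```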
Context windows play a crucial role in enabling AI models to recall information during a session. Just as we can forget a detail partway through a conversation, an AI model loses track of information once it falls outside its context window. Long context windows aim to address this limitation.
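As a toy illustration of why a small window "forgets," a fixed-size context can be pictured as a buffer that evicts its oldest tokens once full. This is a conceptual sketch in plain Python, not a description of how Gemini is implemented.

```python
from collections import deque

def run_session(token_stream, window_size):
    """Toy model of a fixed-size context window.

    Tokens beyond window_size are evicted oldest-first, so anything
    said early in a long session falls out of the model's view.
    """
    window = deque(maxlen=window_size)
    for token in token_stream:
        window.append(token)
    return list(window)

# With a 4-token window, the start of the conversation is lost.
print(run_session(["my", "name", "is", "Ada", "what", "is", "my", "name"], 4))
# ['what', 'is', 'my', 'name']  (the name "Ada" has fallen out of context)
```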
Previously, Gemini could process up to 32,000 tokens at a time. However, the new 1.5 Pro model, the first of the 1.5 series available for early testing, boasts a context window of up to 1 million tokens — making it the largest context window of any large-scale foundation model to date. Moreover, the ability to process up to 10 million tokens has been successfully tested in research. A longer context window allows the model to absorb and process more text, images, audio, code, or video.
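For a rough sense of scale, the back-of-the-envelope sketch below converts these window sizes into English words, using the common rule of thumb of roughly 0.75 words per token; that ratio is an assumption that varies by tokenizer, language, and content type, not a Gemini-specific figure.

```python
# Rough rule of thumb: ~0.75 English words per token. This is an
# assumption; actual ratios depend on tokenizer and content.
WORDS_PER_TOKEN = 0.75

for tokens in (32_000, 128_000, 1_000_000, 10_000_000):
    print(f"{tokens:>10,} tokens ~= {int(tokens * WORDS_PER_TOKEN):>9,} words")
```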
According to Google DeepMind Research Scientist Nikolay Savinov, the original goal was a 128,000-token context window. By aiming well beyond it, the team ended up far exceeding that goal, reaching a 1 million token context window in research.
The leap forward was made possible by a series of deep learning innovations that set off an unexpected chain of breakthroughs. As Google DeepMind Engineer Denis Teplyashin explains, each breakthrough opened up new possibilities, carrying the model from 128,000 tokens to 512,000 tokens, then to 1 million tokens, and most recently to 10 million tokens in internal research.
The enhanced capacity of the 1.5 Pro model enables entirely new ways of interacting with it. Where earlier versions could summarize a document dozens of pages long, it can now summarize documents thousands of pages long. Similarly, where the previous model could analyze thousands of lines of code, 1.5 Pro's long context window lets it analyze tens of thousands of lines in a single pass.
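In practice, a long context window means an entire document can be sent in a single prompt rather than being chunked and summarized piecewise. A minimal sketch, again assuming the google-generativeai SDK; the file name is illustrative.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# With a long context window, the whole document fits in one prompt;
# no chunking or map-reduce style summarization is needed.
with open("annual_report.txt") as f:  # illustrative file name
    document = f.read()

response = model.generate_content(
    f"Summarize the key points of this document:\n\n{document}"
)
print(response.text)
```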
In one test, the model generated documentation for an entire code base. In another, it accurately answered questions about the 1924 film “Sherlock Jr.” after being given the entire 45-minute movie to process.
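Video question-answering of this kind can be sketched with the SDK's File API upload helper, which existed at the time of writing; the file name, polling loop, and question are illustrative, and the exact upload interface is an assumption to verify against current documentation.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video through the File API (file name is illustrative).
video = genai.upload_file(path="sherlock_jr_1924.mp4")

# Video files are processed asynchronously; poll until ready.
while video.state.name == "PROCESSING":
    time.sleep(10)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content(
    [video, "What happens right after the projectionist falls asleep?"]
)
print(response.text)
```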
The 1.5 Pro model can also reason across the data provided in a single prompt. In one example from Google DeepMind Research Scientist Machel Reid, the model translated a rare language called Kalamang from nothing more than a grammar manual and sample sentences, demonstrating proficiency comparable to a person learning from the same material.
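This kind of in-context learning amounts to placing the reference material directly in the prompt. A minimal sketch under the same SDK assumption; the file names and prompt structure are illustrative, not the team's actual setup.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# A million-token window lets a full grammar manual and a sentence
# list ride along in the prompt as in-context reference material.
grammar = open("kalamang_grammar.txt").read()      # illustrative
sentences = open("kalamang_sentences.txt").read()  # illustrative

response = model.generate_content(
    "Using the grammar manual and example sentences below, "
    "translate the final sentence into English.\n\n"
    f"GRAMMAR MANUAL:\n{grammar}\n\n"
    f"EXAMPLE SENTENCES:\n{sentences}\n\n"
    "SENTENCE TO TRANSLATE: <Kalamang sentence here>"
)
print(response.text)
```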
Gemini 1.5 Pro comes with a standard 128,000-token context window, and a limited group of developers and enterprise customers can test a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview. The full 1 million token window currently carries higher latency, and further optimizations to reduce it are underway as the model is scaled out.
Looking ahead, the team is focused on enhancing the model’s speed, efficiency, and safety, while also aiming to expand the long context window, improve the underlying architectures, and integrate new hardware improvements. According to Nikolay, the model’s capability to process 10 million tokens at once is already near the thermal limit of their Tensor Processing Units, indicating the potential for even greater capability as hardware continues to improve.
The team is eager to see the innovative uses that developers and the wider community will find for these new capabilities. Machel admits he initially wondered what anyone would do with a million tokens of context, but says he has since watched people's imaginations grow, and he anticipates a wealth of creative applications.