The development of artificial intelligence (AI) language models seems to have hit a wall. OpenAI’s upcoming model, codenamed “Orion,” reportedly barely outperforms its predecessor, GPT-4. This slowdown isn’t just OpenAI’s issue; it affects the entire AI industry. Experts are beginning to question whether we’ve reached a temporary ceiling in AI capabilities.
Smaller Gains Than Expected
According to a new report by The Information, Orion delivers much smaller performance gains than expected. The quality improvement between GPT-4 and Orion is less significant than the leap from GPT-3 to GPT-4. In the past, each new model brought significant advancements. For instance, GPT-3 was a major step up from GPT-2, and GPT-4 improved on GPT-3 in many ways.
However, with Orion, the gains are minimal. In some areas, such as programming, Orion doesn’t consistently beat GPT-4. Its clearest improvements are in language capabilities, such as understanding and generating text, and even those are modest. This raises concerns about the value of investing in larger models that offer only slight enhancements.
What’s more, running Orion in data centers could cost more than running previous models. Larger models require more computational power and energy, which makes them more expensive to operate. Companies may find it hard to justify the increased costs for minimal performance gains. The economic feasibility of deploying such large models at scale is in question.
Running Out of Training Data
One reason for the slowdown is a lack of high-quality training data. OpenAI researchers point out that most publicly available texts and data have already been used. Over the years, AI models have been trained on vast amounts of data from the internet, books, articles, and other sources. But now, there isn’t much new data left to feed into these models. And no, Reddit doesn’t count as quality data.
To address this issue, OpenAI has created a “Foundations Team” led by Nick Ryder. This team is tasked with finding new ways to gather and generate training data. They are exploring methods to make better use of the data they already have. This includes cleaning existing data and finding novel sources of information.
This move aligns with CEO Sam Altman’s statement in June. He said that while data exists in sufficient quantities, the focus should shift to learning more from less data. In other words, it’s not just about having more data but about using it more effectively. Altman suggests that models can be trained to extract more value from the same amount of information.
Using Synthetic Data
One of the strategies OpenAI is employing is the use of synthetic data. This is training material generated by AI models themselves. By creating data artificially, they hope to overcome the shortage of new information. Synthetic data can include text, images, or other types of content produced by existing models.
Orion has already been partially trained on synthetic data from GPT-4 and OpenAI’s new “reasoning” model called o1. However, this approach has its risks. There’s a concern that the new model will end up resembling older models too closely. An OpenAI employee expressed worries that training on synthetic data could lead to a kind of “echo chamber,” where the model keeps reinforcing the same patterns without introducing truly new knowledge.
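For illustration, a synthetic-data pipeline often follows a simple pattern: prompt an existing model to invent examples, then filter them before they reach the training set. The sketch below uses a placeholder generate() call standing in for whatever model API is used; the prompt and the filtering rule are illustrative assumptions, not OpenAI’s actual pipeline.

```python
import json

def generate(prompt: str) -> str:
    """Placeholder for a call to an existing model (e.g. a chat-completion API)."""
    raise NotImplementedError

def make_synthetic_examples(topics, per_topic=5):
    """Ask an existing model to invent Q&A pairs, then keep only well-formed ones."""
    examples = []
    for topic in topics:
        prompt = (
            f"Write {per_topic} question-answer pairs about {topic} "
            'as a JSON list of objects with "question" and "answer" fields.'
        )
        try:
            pairs = json.loads(generate(prompt))
        except (json.JSONDecodeError, NotImplementedError):
            continue  # skip malformed or unavailable generations
        for pair in pairs:
            # Simple quality filter: both fields present and non-trivial in length.
            if len(pair.get("question", "")) > 10 and len(pair.get("answer", "")) > 10:
                examples.append(pair)
    return examples
```

A real pipeline would also need to deduplicate against existing data and enforce diversity; skipping those steps is exactly how the “echo chamber” effect described above can arise.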
Industry-Wide Slowdown
The slowdown in language model progress isn’t limited to OpenAI. Other tech giants are facing similar challenges. For example, Google’s upcoming Gemini 2.0 is also falling short of internal targets, as reported by The Verge. Gemini was expected to be a major advancement but seems to be struggling to meet expectations. This suggests that even with significant resources, achieving substantial improvements is becoming harder.
Anthropic, another AI company, is rumored to have halted development on version 3.5 of its flagship model Opus. Instead, they released an improved model called Sonnet. This move might be an attempt to avoid disappointing users and investors with a model that doesn’t offer significant improvements. It indicates that companies are rethinking their strategies in light of the challenges.
Open-Source Models Catching Up
Over the past 18 months, open-source models have been catching up to proprietary ones that cost billions to develop. This suggests an industry-wide plateau. If major tech companies could effectively convert their massive investments into better AI performance, we wouldn’t see open-source models closing the gap so quickly.
A scatter plot of MMLU scores for AI models from 2022 to 2024 illustrates the convergence between closed-source and open-source performance. While earlier model generations showed clear performance gaps, MMLU scores have converged since 2023, suggesting a temporary performance ceiling. (Image: Maxime Labonne via X)
What the Data Shows
The MMLU (Massive Multitask Language Understanding) benchmark is used to evaluate the performance of AI models across a wide range of tasks. The convergence of scores means that improvements are becoming harder to achieve. Models are reaching a point where making them better requires much more effort for smaller gains.
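To make the benchmark concrete, here is a minimal scoring loop for a multiple-choice benchmark in the MMLU style. The ask_model() function and the sample item are placeholders for illustration, not the official evaluation harness or real MMLU data.

```python
def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: return the letter (A-D) the model picks for this question."""
    raise NotImplementedError

def mmlu_style_accuracy(dataset) -> float:
    """Score a model on multiple-choice items: fraction of correctly chosen letters."""
    correct = 0
    for item in dataset:
        predicted = ask_model(item["question"], item["choices"])
        if predicted.strip().upper() == item["answer"]:
            correct += 1
    return correct / len(dataset)

# Example item shape (illustrative, not real MMLU data):
sample = [{
    "question": "Which planet is known as the Red Planet?",
    "choices": ["A. Venus", "B. Mars", "C. Jupiter", "D. Mercury"],
    "answer": "B",
}]
```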
This plateau raises questions about the future of AI development. If adding more data and increasing model sizes no longer lead to significant improvements, researchers may need to find new approaches. The traditional method of scaling up models might not be sustainable.
Optimism Amidst Challenges
Despite these challenges, OpenAI CEO Sam Altman remains optimistic. In a recent interview, he said that the path to artificial general intelligence (AGI) is clear. He believes that what is needed is a creative use of existing models. Altman could be referring to the combination of language models with reasoning approaches like o1 and agentic AI.
Agentic AI involves models that can take actions and make decisions, rather than just generating text based on input. By integrating reasoning and decision-making capabilities, AI could become more versatile and powerful without necessarily becoming larger.
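A rough sketch of such an agentic loop looks like the following: the model decides on an action, a tool executes it, and the result is fed back until the model declares the task finished. The plan_next_action() call and the toy tools are hypothetical stand-ins, not any vendor’s actual implementation.

```python
def plan_next_action(goal: str, history: list[str]) -> dict:
    """Placeholder model call: returns {'tool': name, 'input': str},
    or {'tool': 'finish', 'input': final_answer} when done."""
    raise NotImplementedError

TOOLS = {
    "search": lambda query: f"(search results for: {query})",  # stand-in tool
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy evaluator, arithmetic only
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    """Alternate between model decisions and tool execution until the model finishes."""
    history = []
    for _ in range(max_steps):
        action = plan_next_action(goal, history)
        if action["tool"] == "finish":
            return action["input"]
        observation = TOOLS[action["tool"]](action["input"])
        history.append(f"{action['tool']}({action['input']}) -> {observation}")
    return "stopped after max_steps"
```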
Noam Brown, a prominent AI developer at OpenAI and former Meta employee who helped create o1, agrees with Altman. He says that Altman’s statement reflects the views of most OpenAI researchers. Brown believes that focusing on inference and reasoning offers a “new dimension for scaling.”
A New Approach to Scaling
The new o1 model aims to create fresh scaling opportunities. It shifts the focus from training to inference—the computing time AI models need to complete tasks. Instead of just making models bigger, this approach looks at making them smarter in how they process information.
By improving inference, AI models can perform better without requiring more data or larger architectures. This can involve optimizing algorithms, enhancing computational efficiency, and integrating new reasoning capabilities.
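One widely used way to spend extra compute at inference time is self-consistency: sample several independent answers and keep the one they agree on most. The sketch below assumes a hypothetical sample_answer() model call; it illustrates the general idea of inference-time scaling, not o1’s internal mechanism.

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Placeholder: one independently sampled answer from the model (temperature > 0)."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    """Trade inference compute for quality: majority vote over n sampled answers."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer
```

The cost grows roughly linearly with the number of samples, which is one reason inference-focused scaling still raises hardware and energy questions.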
However, this will require billions of dollars and significant energy use. Building models that focus on inference means investing in new types of hardware and software. This raises a key question for the industry: Does building ever-more-powerful AI models—and the massive data centers they need—make economic and environmental sense? OpenAI seems to think so, but others are skeptical.
Environmental Concerns
The environmental impact of large AI models is becoming a significant concern. Training and running these models consume vast amounts of energy. As models get bigger, their carbon footprint increases. Some experts worry that the benefits of slightly improved AI capabilities may not justify the environmental costs.
Companies are exploring ways to make AI more energy-efficient. This includes developing specialized hardware that consumes less power and optimizing algorithms to be more efficient. However, these solutions may not be enough if model sizes continue to grow exponentially.
Criticism from Experts
Not everyone agrees with the current direction of AI development. Google AI expert François Chollet criticized scaling language models for mathematical tasks. He called it “especially obtuse” to cite progress in mathematical benchmarks as proof of AGI.
Chollet argues that empirical data shows deep learning and large language models can’t solve math problems independently. Instead, they need discrete search methods. These are systematic approaches that check various solution paths, rather than predicting likely answers like language models do.
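To make the contrast concrete, here is a generic breadth-first search over partial solutions, with hypothetical expand() and is_solution() functions. A search like this systematically explores and verifies many solution paths, whereas a plain language model commits to one likely continuation at a time.

```python
from collections import deque

def expand(state):
    """Placeholder: return the candidate next steps reachable from this partial solution."""
    raise NotImplementedError

def is_solution(state) -> bool:
    """Placeholder: check whether a state is a complete, verified solution."""
    raise NotImplementedError

def search(start, max_states: int = 10_000):
    """Breadth-first search: explore solution paths systematically instead of guessing once."""
    frontier = deque([start])
    seen = 0
    while frontier and seen < max_states:
        state = frontier.popleft()
        seen += 1
        if is_solution(state):
            return state
        frontier.extend(expand(state))
    return None  # no verified solution found within the budget
```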
He also criticized using “LLM” (Large Language Model) as a marketing term for all current AI advances, even when they are unrelated to language models. He pointed to Gemini’s integration into Google DeepMind’s AlphaProof as “basically cosmetic and for marketing purposes.” This suggests that some companies might be overstating the capabilities of their models for marketing reasons.
The Need for New Methods
Chollet’s comments highlight a broader issue. Relying solely on scaling up language models may not be the best path forward. To solve complex problems like mathematical proofs or advanced reasoning, AI may need new architectures and methods.
Researchers are exploring hybrid approaches, often called neurosymbolic AI, that combine the statistical learning of neural networks with symbolic logic and reasoning. These approaches aim to overcome the limitations of current models.
Economic Considerations
The cost of developing and running large AI models is another critical factor. Companies invest billions of dollars in research and infrastructure. If these investments yield diminishing returns, it could lead to a reevaluation of strategies.
Smaller companies and startups may find it increasingly difficult to compete. The barrier to entry becomes higher as the cost of training state-of-the-art models rises. This could lead to consolidation in the industry, with only a few big players dominating the field.
User Expectations and Trust
As AI models become more integrated into everyday applications, user expectations increase. People expect AI to understand context, provide accurate information, and even exhibit reasoning abilities. If new models offer only slight improvements, users may become disillusioned.
Moreover, overpromising and underdelivering can erode trust. If companies market their models as revolutionary but fail to meet expectations, it could harm their reputation. Transparency about capabilities and limitations is crucial.
Conclusion
The AI industry faces a significant challenge. The rapid progress we’ve seen over the past few years may be slowing down. OpenAI’s Orion barely outperforms GPT-4, and other companies face similar hurdles. While some remain optimistic about new approaches, others question the economic and environmental costs.
The convergence of model performance suggests that simply making models bigger isn’t enough. Researchers may need to find new methods and focus on different aspects of AI. The debate between scaling up and developing new techniques is likely to continue.
Only time will tell if the industry can overcome these obstacles. For now, the focus may need to shift from scaling up models to finding new ways to improve them. This could involve more efficient algorithms, new architectures, or entirely different approaches to AI. Collaboration between companies, researchers, and policymakers might be key to navigating this new landscape.
Sources
- The Information: OpenAI’s Orion Model Barely Outperforms GPT-4
- Sam Altman’s Interview on AGI: OpenAI CEO Discusses Future of AI
- Maxime Labonne via X: Scatter Plot of MMLU Scores
- François Chollet’s Criticism: Scaling Language Models for Math Tasks