Video Models Are Zero-shot Learners And Reasoners – Paper Review

By Curtis Pyke
September 28, 2025

TLDR

Video Models Are Zero-Shot Learners and Reasoners: The Coming GPT-3 Moment for Computer Vision

Google DeepMind just dropped a research bomb that could fundamentally reshape how we think about computer vision. Remember when GPT-3 made everyone realize large language models could solve tasks they weren’t explicitly trained for? Well, Veo 3 is doing the same thing for vision—and it’s absolutely wild.

The researchers tested Veo 3 on 62 different visual tasks it was never specifically trained for, from basic perception (edge detection, segmentation) to complex reasoning (solving mazes, visual analogies). The kicker? It just figures them out through simple text prompts. No fine-tuning, no task-specific training heads, no specialized architectures—just “hey, detect edges in this image” and boom, it works.

Here’s what’s genuinely mind-bending: Veo 3 can perceive objects, model physics (understanding buoyancy, material properties, gravity), manipulate scenes (background removal, style transfer, 3D-aware editing), and even reason visually through what the authors call “chain-of-frames”—essentially thinking step-by-step through video generation like language models do with chain-of-thought.

The performance leap from Veo 2 to Veo 3 is dramatic. On maze solving, Veo 3 hits 78% success rate on 5×5 grids while Veo 2 managed just 14%. For visual symmetry tasks, Veo 3 achieves 95% accuracy compared to Veo 2’s 47%. Even on classic computer vision tasks like edge detection, it’s approaching specialized model performance (0.77 vs 0.90 SOTA) despite being completely zero-shot.

But here’s the deeper implication: just as NLP shifted from task-specific models to unified foundation models, computer vision appears to be hitting its own inflection point. Today’s vision AI landscape—with separate models for segmentation (SAM), detection (YOLO), depth estimation, etc.—mirrors NLP circa 2019. Veo 3 suggests we’re moving toward a future where one generalist video model handles everything.

The economics are compelling too. While video generation is expensive now, inference costs historically fall 9-900× annually. Early GPT-3 was deemed “prohibitively expensive,” yet here we are.

This isn’t just about better computer vision—it’s about vision AI that can reason about the visual world in ways we’ve never seen before. The researchers demonstrate early forms of visual planning, physics simulation, and spatial reasoning that emerge naturally from scale and training, not careful engineering.

The paradigm shift is already beginning.


Video Models Are Zero-Shot Learners and Reasoners: The GPT-3 Moment for Computer Vision Has Arrived

The Quiet Revolution Happening in Plain Sight

Let me paint you a picture of where we are right now in computer vision. It’s 2025, and most AI systems still work like highly specialized craftspeople—you’ve got your Segment Anything Model for cutting out objects, your YOLO variants for detection, your depth estimation models, your edge detection algorithms. Each one brilliant at its specific task, each one requiring careful training and fine-tuning for new scenarios.

Sound familiar? It should, because this is exactly where natural language processing was sitting around 2019. A Tower of Babel situation where every task demanded its own custom-built solution.

Then GPT-3 happened, and everything changed. Suddenly, one model could write code, translate languages, answer questions, and compose poetry—all through the simple magic of text prompts. No fine-tuning required. The age of foundation models had begun.

Google DeepMind’s latest research suggests we’re witnessing the same tectonic shift in computer vision, and their weapon of choice is Veo 3, a video generation model that’s quietly becoming something far more powerful than its original purpose might suggest.

When Video Models Start Seeing the World

The core insight driving this research is almost embarrassingly simple: video models trained on massive datasets with a generative objective might naturally develop general visual understanding. After all, to generate convincing videos, a model needs to understand objects, physics, spatial relationships, temporal dynamics, and causality. Those are the same ingredients needed for… well, pretty much every computer vision task ever conceived.

The DeepMind team decided to test this hypothesis by throwing 62 different visual tasks at Veo 3—none of which it was explicitly trained for. The results read like a greatest hits album of computer vision capabilities:

Perception: Edge detection, segmentation, object tracking, super-resolution, denoising, visual search
Modeling: Physics simulation, material properties, 3D understanding, temporal reasoning
Manipulation: Image editing, style transfer, 3D-aware transformations, scene composition
Reasoning: Maze solving, visual analogies, pattern completion, spatial planning

And here’s the kicker—it accomplishes all of this through simple text prompts. Want edge detection? “All edges in this image become more salient by transforming into black outlines.” Need segmentation? “Each distinct entity is overlaid in a different flat color.” Solve a maze? “The red square slides smoothly along the white path, stopping perfectly on the green square.”
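To make that concrete, here is a minimal sketch of what driving a video model with task prompts could look like. The prompt strings are the ones quoted above; run_zero_shot_task and the generate_video callable it expects are hypothetical placeholders for whatever video-generation API you actually use, not the paper's evaluation harness.

```python
from typing import Callable

# Task prompts quoted above; the video model is injected as a callable because
# the concrete SDK call (Veo or otherwise) will differ between setups.
TASK_PROMPTS = {
    "edge_detection": ("All edges in this image become more salient by "
                       "transforming into black outlines."),
    "segmentation": "Each distinct entity is overlaid in a different flat color.",
    "maze_solving": ("The red square slides smoothly along the white path, "
                     "stopping perfectly on the green square."),
}

def run_zero_shot_task(task: str,
                       image_path: str,
                       generate_video: Callable[[str, str], str]) -> str:
    """Run one visual task purely through a text prompt.

    generate_video(prompt, image_path) stands in for your video-model API and
    should return the path of the generated clip; the answer (edge map,
    segmentation overlay, solved maze) is read off the final frames.
    """
    return generate_video(TASK_PROMPTS[task], image_path)
```

The point is less the code than the interface: the model zoo collapses into a prompt dictionary.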

The Numbers Don’t Lie: A Capability Explosion

Let me hit you with some concrete performance metrics that made me do a double-take:

Edge Detection: Veo 3 achieves 0.77 OIS@10 on the BIPED dataset—remember, this is completely zero-shot against specialized models’ 0.90. The gap isn’t a chasm; it’s a crack that’s rapidly closing.

Maze Solving: On 5×5 grids, Veo 3 hits 78% success rate at 10 attempts, obliterating Veo 2’s measly 14%. Even more impressively, it outperforms Gemini 2.5 Pro when the maze is presented as an image rather than text.

Visual Symmetry: 95% accuracy for shapes, 98% for random patterns—tasks requiring precise spatial reasoning and pattern completion.

Instance Segmentation: 0.74 mIoU on LVIS dataset, essentially matching the specialized Nano Banana model at 0.73.
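If mIoU is unfamiliar: it is simply the average overlap between predicted and ground-truth masks. A minimal numpy sketch, assuming predictions have already been matched one-to-one with ground truth (which simplifies the real LVIS protocol):

```python
import numpy as np

def mean_iou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> float:
    """Mean intersection-over-union over matched pairs of boolean masks."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        intersection = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(intersection / union if union > 0 else 1.0)
    return float(np.mean(ious))
```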

But the raw numbers tell only part of the story. The trajectory is what should worry traditional computer vision. The jump from Veo 2 to Veo 3 isn't incremental; it's a step change, and we're watching capabilities emerge that simply weren't there a generation earlier.

Chain-of-Frames: When Video Models Start Thinking

Here’s where things get philosophically interesting. The researchers introduce a concept they call “chain-of-frames” (CoF)—essentially the visual equivalent of chain-of-thought reasoning in language models.

When Veo 3 solves a maze, it doesn’t just output a solution path. It generates a video where you can watch it think step-by-step, frame-by-frame, as a red dot navigates through the maze. When it solves visual symmetry puzzles, you see the pattern completion happen gradually. When it demonstrates tool use, you watch the reasoning process unfold temporally.

This isn’t just clever—it’s a fundamental shift in how we might approach visual reasoning. Language models can manipulate human-invented symbols through time (tokens in sequence). Video models can manipulate the physical world through time and space (pixels across frames). The dimensional advantage is profound.

Consider the implications: while LLMs are constrained to the symbolic realm, video models operate in the same spatial-temporal reality we do. They don’t just understand concepts—they can simulate, predict, and manipulate visual scenarios with physics-aware precision.
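One practical upside of chain-of-frames is that the reasoning is literally visible, so grading it can be mechanical. The paper's scoring harness isn't reproduced here, but a plausible way to check a generated maze video, sketched with OpenCV, is to track the red square frame by frame and test whether its final position lands on the goal:

```python
import cv2
import numpy as np

def red_centroid(frame_bgr: np.ndarray):
    """Centroid (x, y) of red pixels in a BGR frame, or None if none are found."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Red wraps around hue 0 in OpenCV's HSV space, so combine two ranges.
    mask = cv2.inRange(hsv, (0, 120, 80), (10, 255, 255)) | \
           cv2.inRange(hsv, (170, 120, 80), (180, 255, 255))
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())

def maze_solved(video_path: str, goal_xy: tuple[float, float], tol: float = 15.0) -> bool:
    """True if the red square's last visible position is within tol pixels of the goal."""
    cap = cv2.VideoCapture(video_path)
    last = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pos = red_centroid(frame)
        if pos is not None:
            last = pos
    cap.release()
    return last is not None and \
        abs(last[0] - goal_xy[0]) <= tol and abs(last[1] - goal_xy[1]) <= tol
```

A stricter grader would also verify that every intermediate position stays on the white path, which is exactly what the frame-by-frame trace makes possible.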

The Physics Engine in the Neural Net

One of the most striking aspects of Veo 3’s emergent abilities is its intuitive physics understanding. Without any explicit physics training, it demonstrates knowledge of:

  • Buoyancy: Bottle caps float, rocks sink
  • Gravity and air resistance: Feathers fall slower than bowling balls on Earth, equally fast on the Moon
  • Material properties: Glass inverts reflections, mirrors don’t; paper burns, metal doesn’t
  • Rigid vs. soft body dynamics: Vases maintain shape when moved, silk scarves drape naturally
  • Optical phenomena: Refraction, reflection, additive vs. subtractive color mixing

This isn’t programmed physics simulation—it’s learned intuition emerging from pattern recognition across millions of videos. The model has essentially developed a world model that captures the statistical regularities of how the physical world behaves.

The Prompt Engineering Goldmine

Here’s something that deserves more attention: prompt engineering for vision is about to become as crucial as it is for language models. The researchers found that small changes in prompts can shift performance by 40-64 percentage points on symmetry tasks.

Want better segmentation? Use a green background instead of white (0.74 vs 0.66 mIoU)—likely because of training data bias toward green screens. Need the model to stop animating after task completion? Add a “motion outlet” like a spinning color wheel to signal when to freeze the solution.
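Treating prompt variants as an explicit experiment rather than folklore is straightforward. Here is a small sketch of a background-color sweep in the spirit of the green-vs-white finding; run_segmentation and score_miou are placeholders for your generation call and evaluation harness:

```python
from typing import Callable

def sweep_backgrounds(image_path: str,
                      run_segmentation: Callable[[str, str], str],
                      score_miou: Callable[[str], float],
                      colors=("white", "green", "grey")) -> dict:
    """Score the same segmentation prompt under different background colors."""
    results = {}
    for color in colors:
        prompt = (f"Each distinct entity is overlaid in a different flat color, "
                  f"and the background turns plain {color}.")
        results[color] = score_miou(run_segmentation(prompt, image_path))
    return results
```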

This suggests we’re entering an era where visual prompt engineering becomes its own specialized skill. The ChatGPT prompt optimization industry was worth hundreds of millions; visual prompting could be worth billions.

The Economics of Generalist vs. Specialist Models

“But wait,” you might say, “isn’t video generation hideously expensive compared to running YOLO on an edge device?”

Fair point. Today’s video models are computationally intensive. But here’s the thing about technology costs—they fall with predictable ruthlessness. Epoch AI estimates LLM inference costs drop 9-900× annually for equivalent performance. Early GPT-3 was deemed “challenging to deploy” due to computational costs. Today, you can run comparable models on a smartphone.

The economics of generalist models follow a different curve than specialist ones. Instead of maintaining dozens of task-specific models, you maintain one foundation model that handles everything. The operational simplicity alone—no more managing separate training pipelines, deployment systems, or integration headaches—could justify higher per-inference costs.

Plus, consider the qualitative advantages: 3D-aware image editing, physics-consistent transformations, temporal reasoning, cross-task transfer learning. These aren’t just performance improvements—they’re entirely new categories of capability that no combination of specialized models can match.

Jack of All Trades, Master of… Actually, Getting Pretty Good

The traditional criticism of generalist models is that they’re “jack of all trades, master of none.” And yes, Veo 3 doesn’t outperform specialized models on every task. Its colorization hits only 8% success rate, while visual analogy completion struggles with rotational reasoning.

But this criticism misses the forest for the trees. Early GPT-3 performed worse than fine-tuned models on most benchmarks. That didn’t stop it from revolutionizing NLP because the aggregate value of general capability outweighed specialized performance gaps.

More importantly, the performance trajectory matters more than absolute numbers. The Veo 2 → Veo 3 improvement is dramatic across virtually every task. We’re not looking at a plateau—we’re looking at a steep slope that shows no signs of leveling off.

And there’s another factor: inference-time scaling. The researchers show consistent improvement from pass@1 to pass@10 across tasks, suggesting that techniques like self-consistency and multi-sample verification could boost performance further without model changes.
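Pass@k numbers like these come from sampling n attempts per task and counting successes. The standard unbiased estimator, popularized in the code-generation literature, applies just as well to sampled videos:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k draws (without replacement)
    from n attempts, of which c succeeded, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 maze attempts, 4 correct -> pass@1 = 0.4, pass@10 = 1.0
```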

The Prompt is the Program

What we’re witnessing is the visual equivalent of programming’s shift from machine code to high-level languages. Instead of carefully engineering CNN architectures for specific tasks, we’re moving toward natural language as the primary interface for visual computation.

“Remove the background” replaces complex segmentation pipelines. “Make this look like a Hundertwasser painting” replaces style transfer networks. “Solve this maze” replaces path-planning algorithms.

This isn’t just more convenient—it’s fundamentally more powerful. Natural language allows for compositional complexity that specialized models can’t match. Want edge detection + style transfer + background removal + 3D rotation? That’s one prompt instead of three separate models plus integration code.
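In code, the compositional claim is simply that one prompt string replaces a pipeline. The following is illustrative only, reusing the hypothetical generate_video callable from the earlier sketch:

```python
composed_prompt = (
    "The background disappears, the remaining scene is redrawn in the style of a "
    "Hundertwasser painting, and the camera slowly orbits the main object to show "
    "it from a new angle."
)
# result = generate_video(composed_prompt, "product_photo.png")
```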

Failure Modes and Cognitive Biases

Of course, Veo 3 isn’t perfect. The researchers candidly document failure cases: laundry folding instructions that make no sense, spatial reasoning errors on complex puzzles, systematic biases in rotation tasks.

But here’s what’s fascinating about these failures—they often reveal the model’s underlying cognitive architecture. The systematic bias against rotational transformations in visual analogies suggests the training data might under-represent certain geometric transformations. The difficulty with complex tool use reflects the challenging nature of multi-step procedural reasoning.

These aren’t random failures—they’re informative errors that point toward specific areas for improvement. When a model fails at something humans find easy, it’s usually because we take vast amounts of implicit knowledge for granted. Making that knowledge explicit and trainable is an engineering problem, not a theoretical impossibility.

The Cambrian Explosion of Visual AI

If video models become the foundation models for vision, we’re looking at a Cambrian explosion of visual AI applications. Today’s computer vision pipeline involves:

  1. Choose task-specific model architecture
  2. Collect and label training data
  3. Train/fine-tune the model
  4. Deploy and maintain specialized infrastructure
  5. Integrate with other vision models for multi-task scenarios

Tomorrow’s pipeline might look like:

  1. Write a text prompt
  2. Run inference on foundation model
  3. Done

This isn’t just a productivity boost—it’s a complete democratization of computer vision. The barrier to entry drops from “PhD in computer vision + months of development time” to “ability to write descriptive English.”

Consider what happens when every developer, designer, and content creator has access to general visual intelligence through simple text prompts. We’re not just talking about better computer vision—we’re talking about an entirely new medium for human-computer interaction.

Beyond Computer Vision: Toward Visual Intelligence

The deeper implication of this research extends beyond computer vision per se. What we’re seeing is the emergence of visual intelligence—systems that don’t just process images but understand, reason about, and manipulate the visual world in contextually appropriate ways.

This touches on fundamental questions in AI and cognition. How much of human intelligence is grounded in our ability to perceive, model, and manipulate spatial-temporal relationships? If video models can develop similar capabilities through pure pattern recognition, what does that tell us about the nature of visual understanding itself?

The researchers position this work within the broader trajectory toward artificial general intelligence, and while that might seem hyperbolic, the pattern is hard to ignore. Language models achieved general-purpose linguistic competence through scale and training on diverse text. Video models appear to be following the same path toward general-purpose visual competence.

The Road Ahead: Challenges and Opportunities

Several critical challenges remain before video models can fully replace specialized computer vision systems:

Computational Efficiency: Video generation is inherently more computationally intensive than image processing. While costs are falling rapidly, real-time applications on edge devices remain challenging.

Reliability and Controllability: Generative models can be unpredictable. For safety-critical applications, we need better methods for ensuring consistent, reliable outputs.

Evaluation and Benchmarking: How do we systematically evaluate general visual intelligence? Current benchmarks were designed for specialized models and may not capture the full capabilities of foundation models.

Ethical and Safety Considerations: General visual intelligence raises new questions about privacy, misuse, and societal impact that the computer vision community is only beginning to grapple with.

But these are engineering challenges, not fundamental limitations. The core insight—that video models trained on web-scale data naturally develop general visual understanding—appears robust and replicable.

The GPT-3 Moment for Vision

Looking back, GPT-3’s release in 2020 marked a clear inflection point for NLP. Not because it was perfect, but because it demonstrated that general-purpose language understanding was possible through scaling simple primitives. The specialized models that dominated NLP for decades suddenly looked like evolutionary dead ends.

Veo 3 feels like the same kind of moment for computer vision. It’s not the final answer, but it’s the proof of concept that general visual intelligence is achievable through video generation at scale. The implications ripple far beyond academic computer vision into product development, creative industries, robotics, and human-computer interaction.

We’re watching the early stages of a paradigm shift that will reshape how we build, deploy, and interact with visual AI systems. The age of specialized computer vision models isn’t over yet, but you can see the sunset from here.

The future of computer vision isn’t just about better object detection or more accurate segmentation. It’s about systems that can see, understand, and reason about the visual world as flexibly as humans do. And if this research is any indication, that future is arriving faster than most of us expected.

The revolution will be visualized.


Project page: video-zero-shot.github.io
Paper: arXiv:2509.20328v1
