TL;DR
Evo 2 is a fully open-source genomic language model—essentially a DNA-focused LLM—trained on 9.3 trillion base pairs from bacteria, archaea, eukaryotes, bacteriophages, and more. It can process up to 1 million base pairs at once, identifies previously unknown evolutionary patterns, generates entire chromosomes and synthetic genomes, and predicts harmful mutations in both coding and noncoding regions (including BRCA1, splicing variants, etc.) without specialized finetuning.
Using a unique StripedHyena 2 architecture (combining convolution and attention), Evo 2 autonomously learns features like exon–intron boundaries, transcription factor binding sites, and protein domains. In proof-of-concept “generative epigenomics” experiments, it has embedded Morse code into synthetic DNA controlling chromatin accessibility. Beyond mere demonstration, the arrival of Evo 2 heralds a new era of AI-driven synthetic biology where entire genomes may be programmed from scratch—lowering barriers to bioengineering, accelerating precision medicine, and paving the way for future cell-scale modeling.
A (Re)Introduction: The Fusion of Genomics and AI
Is biology becoming a computational discipline? Many say yes, pointing to the confluence of high-throughput multi-omics data, advanced machine learning, and cheaper DNA synthesis. “AI is moving beyond describing biology to designing it,” as one observer tweeted, ushering in an age where life itself can be programmed with increasing sophistication. Evo 2 emerges squarely in this domain. Though reminiscent of large language models (LLMs) for text, it plies its trade on DNA, reading entire chromosomes like sprawling paragraphs, discovering motifs akin to punctuation, and generating new sequences as linguistically coherent as an English essay—except that, in Evo 2’s realm, the “language” is genomic.
In “Genome modeling and design across all domains of life with Evo 2” [https://arcinstitute.org/tools/evo/evo-designer], the authors showcase a model that digests 9.3 trillion nucleotides from every domain of life—bacteria, archaea, eukaryotes, viruses (mostly prokaryotic phages)—and learns to interpret, predict, and even create entire genomes without direct, fine-labeled training. This stands in stark contrast to specialized approaches that rely on multiple sequence alignments or curated annotations for each target region. By adopting a universal approach, Evo 2 emerges as a broad generalist, bridging local motifs like splice sites with global phenomena such as epigenomic patterns across a 1,000,000-bp context.
As the authors and various ancillary sources put it, “think of it as a DNA-focused LLM. Instead of text, it generates genomic sequences. It interprets complex DNA, including noncoding regions usually considered junk, generates entire chromosomes and new genomes, and predicts disease-causing mutations—even those not previously understood.” Indeed, that pithy descriptor captures the essence of Evo 2: the blueprint for an epoch of “biology hacking.”

1. Breadth of Data: 9.3 Trillion Base Pairs
The impetus for Evo 2’s design was to gather as wide a genomic sample as possible, capturing all major lineages from the Tree of Life—prokaryotic, eukaryotic, organellar, bacteriophages—while excluding known eukaryotic viruses to mitigate biosecurity concerns. This curated set ultimately ballooned to 9.3 trillion base pairs, forming the “OpenGenome2” dataset, freely accessible at [https://huggingface.co/datasets/arcinstitute/opengenome2].
A key Twitter snippet reads, “It was trained on a dataset of 9.3 trillion base pairs from bacteria, archaea, eukaryotes, and bacteriophages,” underscoring the comprehensiveness of the corpus. The data curation cleaned redundancies, pruned out suspicious sequences, and balanced representation so that the model could glean fundamental genetic rules across an exceptionally broad diversity of life.
Beyond size, the training data reflect an emphasis on functional elements. The authors highlight that naive training on “raw eukaryotic references” (composed mostly of repetitive noncoding regions) degrades performance on mutational tasks. Hence, masked weighting strategies for repetitive segments were employed, ensuring robust learning for both genic and regulatory contexts.
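To make the downweighting idea concrete, here is a minimal sketch of a per-token loss reweighting, assuming soft-masked repeat annotations and an illustrative `repeat_weight` of 0.1; the paper's exact weighting scheme and values may differ.

```python
import numpy as np

def weighted_nll(token_nll, is_repeat, repeat_weight=0.1):
    # Give soft-masked repetitive positions a reduced loss weight so that
    # genic and regulatory tokens dominate what the model learns.
    # `repeat_weight` is an assumed value for illustration only.
    w = np.where(is_repeat, repeat_weight, 1.0)
    return float((w * token_nll).sum() / w.sum())
```

Because repeats are downweighted rather than dropped, the model still sees them in context but is not pushed to spend capacity memorizing them.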

2. The StripedHyena 2 Architecture and Large Context Windows
One of Evo 2’s hallmark achievements is its 1 million base pair context window. That is not a misprint: 1,000,000 nucleotides in a single forward pass. According to the text, “it processes up to 1 million base pairs in a single context window, covering entire chromosomes. It identifies evolutionary patterns previously unseen by humans.” Implementation demanded a specialized approach. The authors side-stepped the typical Transformer scaling pitfalls by using StripedHyena 2, a convolution-based multi-hybrid architecture [https://github.com/zymrael/savanna]. Instead of naive attention (which can become expensive at length scales beyond tens of thousands of tokens), StripedHyena 2 integrates short explicit (SE), medium regularized (MR), and long implicit (LI) Hyena operators interspersed with pockets of self-attention.
This confluence of convolution and selective attention yields a system capable of simultaneously handling local motifs (e.g., TATA boxes, splice junctions) and wide-range dependencies (e.g., interactions between distant enhancers and promoters). Another tweet: “Evo-2 uses stripedhyena 2, combining convolution and attention mechanisms, not transformers. It models DNA at multiple scales, capturing long-range interactions, and autonomously learns features like exon-intron boundaries…” Indeed, the authors systematically demonstrate that Evo 2’s performance at million-token contexts significantly improves “needle-in-a-haystack” retrieval tasks, presumably letting it recall exact sequences from far upstream within a single forward pass.
Moreover, training a 40B-parameter model (along with a smaller 7B variant) at such scale required massive parallelism: tensor, pipeline, context, and data parallelism combined. They employed advanced HPC methods and carefully orchestrated distributed computing to push cross-entropy training reliably over 9.3 trillion tokens.
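The interleaving of operator types can be illustrated with a toy NumPy sketch of one "stripe": a short causal convolution standing in for the SE Hyena operator, followed by a causal self-attention pocket. All names, shapes, and the single-head attention are illustrative assumptions, not the actual StripedHyena 2 implementation.

```python
import numpy as np

def short_conv(x, kernel):
    # Causal short explicit convolution (SE-operator analogue): each
    # position mixes a small window of preceding tokens.
    L, D = x.shape
    out = np.zeros_like(x)
    for t in range(L):
        for k in range(len(kernel)):
            if t - k >= 0:
                out[t] += kernel[k] * x[t - k]
    return out

def causal_attention(x):
    # Single-head self-attention with a causal mask, standing in for the
    # sparse attention pockets interleaved between Hyena operators.
    L, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    scores[np.triu(np.ones((L, L), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def striped_block(x, kernel):
    # One "stripe": a convolution operator followed by an attention
    # pocket, each with a residual connection.
    x = x + short_conv(x, kernel)
    x = x + causal_attention(x)
    return x
```

The convolution handles local motifs cheaply at any sequence length, while the occasional attention pocket provides exact long-range retrieval; the real model stacks many such stripes with MR and LI Hyena operators as well.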

3. Zero-Shot Mutational Effect Predictions
3.1. Noncoding Pathogenicity and Disease Variants
One of Evo 2’s most astonishing claims is that “it predicts whether mutations are harmful or benign without specific training on human disease data… even outperforms specialized models on BRCA1 variants.” This emerges from the fundamental property of language modeling: the next-token probability. If a mutation disrupts essential motifs, the probability shifts in ways correlated with actual functional disruption.
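The delta-log-likelihood idea can be made runnable with a toy stand-in for the language model. Here a first-order Markov model over nucleotides plays Evo 2's role; the function names, the smoothing, and the toy model itself are assumptions for illustration, while the real scoring uses Evo 2's sequence log-likelihoods.

```python
import math
from collections import Counter

def train_markov(seq):
    # Toy stand-in for Evo 2: a first-order Markov model giving
    # log P(next base | previous base), with add-one smoothing
    # over the 4-letter alphabet.
    pairs = Counter(zip(seq, seq[1:]))
    totals = Counter(seq[:-1])
    def logp(prev, nxt):
        return math.log((pairs[(prev, nxt)] + 1) / (totals[prev] + 4))
    return logp

def log_likelihood(seq, logp):
    return sum(logp(a, b) for a, b in zip(seq, seq[1:]))

def variant_score(ref_seq, pos, alt, logp):
    # Zero-shot score: log-likelihood of the mutated sequence minus the
    # reference. Strongly negative deltas flag disruptive variants.
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return log_likelihood(alt_seq, logp) - log_likelihood(ref_seq, logp)
```

A mutation that breaks a learned pattern scores negative; a change the model considers equally likely scores near zero, which is exactly how pathogenic and benign variants separate under this scheme.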
The authors tested Evo 2 on ClinVar [https://www.ncbi.nlm.nih.gov/clinvar/] for known pathogenic vs. benign variants, especially noncoding or splice-altering changes. Because many prior genomics models focus on coding changes, regulatory disruptions are often beyond their reach. Evo 2, however, “understands noncoding DNA, which regulates gene expression and is involved in many genetic diseases, achieving state-of-the-art performance on noncoding variant pathogenicity.” Indeed, the paper reported that Evo 2’s zero-shot log-likelihood-based scoring both distinguished known splicing disruptions and predicted new potential splicing variants with impressive accuracy.
The same principle applied to BRCA1 / BRCA2 variant classification. A notable tweet states, “Evo-2 predicts the functional impact of mutations in these regions, achieving state-of-the-art performance for noncoding variant pathogenicity and BRCA1 variant classification. This could lead to advances in precision medicine.” The authors also described training a simpler supervised classifier on top of Evo 2 embeddings, yielding best-in-class results for certain orthogonal metrics.
3.2. Mechanistic Interpretations
Intrigued by how Evo 2 discerns “syntax rules,” the authors turned to interpretability. They discovered internal features that highlight intron–exon boundaries, TATA motifs, Shine-Dalgarno sequences, and sometimes entire prophage or transposon segments. This was done with a form of dictionary learning or sparse autoencoders, revealing that Evo 2 “is not just memorizing, it’s understanding biology.”
One might call it an emergent phenomenon: the unsupervised model spontaneously organizes genomic signals in ways that echo classical biology. The learned representation extends to noncoding transcripts, regulatory motifs, codon usage signals, etc.
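A sparse autoencoder of the kind used in this style of interpretability can be sketched in a few lines: an overcomplete ReLU encoder, a linear decoder, and a reconstruction-plus-L1 objective. The function name, shapes, and `l1_coeff` are assumptions; the real analysis trains such an autoencoder on Evo 2's internal activations and inspects where each code fires along the genome.

```python
import numpy as np

def sae_forward(acts, W_enc, b_enc, W_dec, l1_coeff=1e-3):
    # Sparse autoencoder over model activations: overcomplete ReLU codes
    # plus an L1 penalty encourage one interpretable feature per concept.
    codes = np.maximum(acts @ W_enc + b_enc, 0.0)  # ReLU encoder
    recon = codes @ W_dec                          # linear decoder
    mse = np.mean((recon - acts) ** 2)
    sparsity = l1_coeff * np.abs(codes).sum(axis=1).mean()
    return codes, recon, mse + sparsity
```

After training, scanning `codes` across positions reveals which features light up at, say, exon-intron boundaries or TATA motifs, which is how the reported features were surfaced.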

4. Genome-Scale Generation
In addition to predictions, Evo 2 can generate truly massive sequences. The paper recounts prompting with partial sequences from organellar, prokaryotic, or eukaryotic genomes, which Evo 2 completes into full “synthetic” chromosomes. In the words of a tweet, “Evo-2 has demonstrated practical generation abilities, creating synthetic yeast chromosomes, mitochondrial genomes, and minimal bacterial genomes. This is computational design in action.” Indeed:
- Mitochondrial Genomes: Prompted with the first few kilobases of the human mitochondrial genome, Evo 2 outputs ~16 kb containing plausible tRNAs, rRNAs, and protein-coding genes such as the COI/COII subunits. Tools like MitoZ [https://github.com/linzhi2019/MitoZ] can annotate them with only minor deviations from the standard organellar map.
- Minimal Bacterium (Mycoplasma genitalium): The authors feed ~10 kb from M. genitalium, and Evo 2 returns ~580 kb. Bacterial gene-finding algorithms (Prodigal [https://github.com/hyattpd/Prodigal]) plus Pfam domain checks confirm these synthetic genes are biologically coherent. The model does not replicate the reference exactly, but it matches the minimal genome’s “style.”
- Yeast Chromosome: Prompting with 10.5 kb of S. cerevisiae chromosome III yields ~330 kb of synthetic eukaryotic sequence, replete with genes, introns, tRNA loci, and promoter motifs. True, the gene and tRNA densities are slightly below those of the real chromosome. But as the authors note, “this is unconstrained generation,” meaning no specialized inference or feedback loop was used.
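The prompt-completion workflow above is ordinary autoregressive sampling at genome scale. Here is a minimal sketch, where `logp_fn` is an assumed callable returning the model's conditional log-probability of the next base given the context; the toy model in the usage below is purely illustrative.

```python
import math
import random

def generate(prompt, logp_fn, n_new, temperature=1.0, seed=0):
    # Autoregressive completion: repeatedly sample the next base from the
    # model's conditional distribution given everything generated so far.
    rng = random.Random(seed)
    seq = list(prompt)
    bases = "ACGT"
    for _ in range(n_new):
        logits = [logp_fn(seq, b) / temperature for b in bases]
        m = max(logits)
        probs = [math.exp(l - m) for l in logits]
        total = sum(probs)
        seq.append(rng.choices(bases, weights=[p / total for p in probs])[0])
    return "".join(seq)
```

With Evo 2 in place of `logp_fn` and a few kilobases of real genome as the prompt, this same loop runs out to hundreds of kilobases; temperature controls how conservative the completion is.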
Hence, we see hints of a near-future where scientists can instruct a model to “design me a minimal eukaryotic chromosome with these essential genes plus minimal transposons,” bridging textual prompts to actual synthetic DNA.
5. Generative Epigenomics and “Morse Code in Chromatin”
A highlight demonstration: “Evo-2 generates DNA sequences that influence chromatin accessibility, controlling gene expression. It has embedded simple Morse code into epigenomic designs as proof of concept.” The authors used Enformer [https://github.com/lucidrains/enformer-pytorch] and Borzoi (another high-performing epigenomic predictor) as scoring functions for partial Evo 2 outputs. By chunkwise beam search, Evo 2 was guided to produce sequences with open-chromatin peaks only in user-specified intervals.
To illustrate, they replaced a region of the mouse genome with newly generated sequences. Each partial generation was scored according to how well its predicted DNase hypersensitivity profile matched the “desired peaks.” By iterating in 128-bp increments, they could produce final sequences whose predicted open vs. closed states spelled out short Morse-coded words (e.g., “LO,” “ARC,” “EVO2”).
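Stripped of the biology, the procedure is a chunkwise beam search. The sketch below assumes two hypothetical callables: `propose`, standing in for sampling a candidate chunk from the generative model (128 bp in the paper), and `score`, standing in for the Enformer/Borzoi match between a sequence's predicted accessibility profile and the desired peaks.

```python
def guided_generate(prompt, propose, score, n_chunks,
                    beam_width=4, samples_per_beam=8):
    # Chunkwise beam search: extend every beam with several candidate
    # chunks from the generator, score each full extension with the
    # epigenomic predictor, and keep only the top-scoring beams.
    beams = [prompt]
    for _ in range(n_chunks):
        candidates = [seq + propose(seq, i)
                      for seq in beams
                      for i in range(samples_per_beam)]
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams[0]
```

Because scoring happens on full partial sequences, the search can trade off a locally worse chunk for a better overall accessibility profile, which is what lets the method place peaks exactly where the target pattern demands.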
While obviously a whimsical demonstration, it illuminates a fundamental principle: “AI is moving beyond describing biology to designing it.” The ability to re-engineer large genomic contexts for custom epigenetic patterns underlines the synergy of a generative model (Evo 2) with sequence-to-function scorers (Enformer/Borzoi). “This is biology hacking,” as one tweet said. Another label used is “Generative Epigenomics,” wherein an AI autonomously sculpts DNA regions to exhibit a particular 3D-chromatin or histone-modification pattern.
Crucially, the model remains “natural” enough to maintain plausible k-mer frequencies. “The new sequences match the mm39 reference genome’s dinucleotide frequencies, suggesting they remain biologically coherent,” the authors write.
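A composition check of this kind is straightforward to sketch; the function names are illustrative, and the paper's exact comparison metric against mm39 may differ from the L1 distance used here.

```python
from collections import Counter

def dinuc_freqs(seq):
    # Relative frequencies of all overlapping dinucleotides.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    return {k: v / total for k, v in pairs.items()}

def freq_distance(seq_a, seq_b):
    # L1 distance between dinucleotide profiles; small values mean the
    # generated sequence keeps reference-like composition.
    fa, fb = dinuc_freqs(seq_a), dinuc_freqs(seq_b)
    keys = set(fa) | set(fb)
    return sum(abs(fa.get(k, 0.0) - fb.get(k, 0.0)) for k in keys)
```

Checks like this guard against a guided search drifting into low-complexity or biologically implausible sequence while chasing the scorer's objective.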
6. Safety, Security, and Ethical Dimensions
When discussing a system with the capacity to produce entire synthetic chromosomes, the authors and the community highlight the potential dual-use risk. The paper notes multiple steps to mitigate misuse:
- They excluded eukaryotic viruses from the training set, aiming to degrade performance on potentially pathogenic viruses.
- Testing on known viral sequences shows Evo 2 yields random or incomplete results, limiting direct malicious generation.
- The entire work references [Responsible AI x Biodesign, 2024 (https://responsiblebiodesign.ai/#values-and-principles)] guidelines for safe usage.
Yet, they do not claim these measures fully eliminate all risk. They emphasize that open-sourcing fosters scrutiny and collaboration, while also acknowledging that “accident or misuse” may arise. “We deliberately tried to hamper performance on eukaryotic viruses.” Indeed, reported perplexity is much higher in those domains, suggesting a partial success.
On the bright side, open release also “will lead to massive widespread innovation in bioengineering, lowering barriers to genome design. It’s a revolution moment for the field.” The authors emphasize how researchers can inspect the code, replicate or refine the training procedure, and systematically experiment with new data or scoring constraints.

7. Global Availability and the Arc Institute’s Ambitions
A frequent social media refrain is: “Evo 2 is FULLY OPEN SOURCE, including model parameters, training data, and code. This will lead to massive widespread innovation.” Indeed, the authors confirm that everything from the Evo 2 7B and 40B weights to the “OpenGenome2” dataset is publicly accessible. They provide links:
- Code: [https://github.com/zymrael/savanna] (training), [https://github.com/zymrael/vortex] (inference)
- Data: [https://huggingface.co/datasets/arcinstitute/opengenome2]
- Models: [https://huggingface.co/arcinstitute/evo2_40b], [https://huggingface.co/arcinstitute/evo2_7b]
- Interactive Tools: [https://arcinstitute.org/tools/evo/evo-designer], [https://arcinstitute.org/tools/evo/evo-mech-interp]
Multiple tweets hail the significance of this open release, contending that “this is how biology becomes truly computational,” enabling “the Arc Institute’s quest to model entire cells, possibly entire organisms in silico.” Indeed, the authors mention future expansions: coupling Evo 2’s DNA generation with advanced transcriptomics, proteomics, metabolic or structural data, culminating in “whole-cell simulation.”
As one tweet puts it, “The Arc Institute aims to model entire cells, moving beyond DNA to whole organisms. This could lead to AI creating new life forms and synthetic biology becoming AI-driven.” While that may sound hyperbolic, it is increasingly within the realm of possibility given the trajectory from Evo 1 to Evo 2, and eventually beyond.
8. Broader Implications: The Dawn of AI-Driven Synthetic Biology
“The era of biotech is here,” proclaims a cryptic tweet. The authors themselves do not go as far in rhetorical drama, but they do underscore that Evo 2’s generalist approach may transform how we approach:
- Precision Medicine: Zero-shot analysis of novel mutations, especially in noncoding or splicing contexts. Real-time triaging of variants from personal genome sequencing, bridging a key gap in variant-of-unknown-significance classification.
- Industrial / Synthetic Biology: Automated design of microbial strains with specific metabolic or epigenetic features, quickly generating plasmids or entire bacterial chromosomes that produce desired compounds or degrade toxins, etc.
- Evolutionary Studies: In silico exploration of “what if” evolutionary paths by systematically mutating regions and measuring predicted viability or function.
- Educational Tools: Students or researchers can prompt Evo 2 to see how it writes, say, a hypothetical chromosome for a new species, or to identify putative regulatory motifs in an uncharted genome.
One does not have to look far to see expansions in the near future. “We see a future involving programmable life at scale,” the authors suggest. The synergy of generative LLMs with high-throughput experimental assays to close the design-build-test cycle is central to next-generation biotech.
9. Conclusion and Forward-Looking Horizons
In sum, “Genome modeling and design across all domains of life with Evo 2” signifies a major leap. “Evo 2 is fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset LMFAO,” as one irreverent social media post exclaims. But that irreverence belies the sincerity of a scientific milestone: a universal genomic language model that can process entire chromosomes (1,000,000 bp), glean fundamental biology from unsupervised sequence data, predict variant pathogenicity, generate new organisms, and even manipulate epigenomic states.
At a high level, “it’s not just memorizing, it’s understanding biology,” bridging local motifs and global architecture. This new capacity stands at the cusp of expanding into higher-level modeling, such as 3D genome structure, multi-omics data integration, or entire cell-level simulations reminiscent of Karr et al.’s [2012] “Whole-cell” concept.
The authors emphasize next steps along multiple lines:
- Refined Inference: Constrained sampling or beam search can systematically build synthetic sequences for specialized tasks (e.g., epigenetic wave patterns, engineered minimal plasmids).
- Integration With Structural Omics: Pairing with next-gen AlphaFold-like systems that evaluate protein-level or transcript-level fitness in real time.
- Fine-Tuning or RLHF: Incorporating actual lab-based variant measurements or user feedback to further guide the model’s generative capabilities.
- Scaling to Full Chromosomes in Eukaryotes: Possibly designing entire synthetic mammalian chromosomes with user-specified epigenetic or transcriptomic patterns.
Lastly, the authors underscore the “biology hacking” theme. Will it be used responsibly? Will we see an influx of unregulated synthetic designs? The paper’s open stance, allied with disclaimers and risk analyses, aims to spark mainstream scientific and regulatory conversation.
Regardless of the vantage, one thing is clear: “the future involves programming life at increasing scales,” from modulating a single gene to rewriting entire chromosome arms. Evo 2 is a critical stepping stone in that direction.
Extended Discussion
This entire integrated summary underscores how Evo 2 shifts the spotlight from purely reading and annotating the “book of life” to actively writing new chapters. The synergy of advanced HPC, creative architecture design (StripedHyena 2), and the impetus to unify eukaryotic, prokaryotic, and organellar datasets seemed like a pipe dream just a few years ago. Yet here it is—publicly available for researchers, hobbyists, and potential industrial applications.
Among the more evocative lines from the social media swirl is “Make me blonde!”, a tongue-in-cheek request for genome editing to alter a physical phenotype that might soon be in the realm of possibility. Of course, realistically, toggling hair color in a complex organism is not as trivial as flipping a single mutation, but the comedic flair underscores the sense that “we are officially hacking biology.”
The authors also mention that while Evo 2’s epigenomic design demonstration used a toy “Morse code” example, the underlying concept can be extended to more practical designs: controlling expression of clusters of genes, building synthetic gene circuits, or systematically optimizing entire metabolic pathways. Because it is a single universal model—rather than a patchwork of domain-specific models—Evo 2 elegantly transitions from small tasks (like scoring single-nucleotide variants) to scaling tasks (like generating entire eukaryotic chromosome arms with user-defined features).
In short, Evo 2’s open-source debut might indeed constitute a watershed moment in merging AI and biology: it is probable that the coming years will see more advanced iterations with fewer limitations, whether refined general models or domain-targeted variants. Coupled to cheap DNA synthesis and CRISPR editing, the possibilities range from personalizing gene therapy to designing new industrial microbes to exploring the boundaries of “new life forms.”
Closing Remarks
“Evo 2 is… a revolution moment for the field,” as repeated in the user-shared tweets. The authors maintain a tempered optimism, acknowledging potential misuses but also championing open collaboration as the best path forward. They envision that “the future involves programming life at increasing scales,” presaging that entire cells or multicellular organisms could one day be engineered with partial or substantial end-to-end design.
All code, data, and model weights remain publicly accessible via the paper’s references and repository links. The open invitation stands for the broader scientific community to test, refine, interpret, or even challenge Evo 2’s capabilities. Undoubtedly, the synergy of large language model innovation with genomics is accelerating. Evo 2 emerges as an early exemplar, offering a glimpse of how AI plus biology might well shape the fundamental fabric of living systems.
Sources
- Evo 2 Paper: Genome modeling and design across all domains of life with Evo 2 (2025)
- Code and Models: Evo 2 GitHub Repository
- OpenGenome2 Dataset: Hugging Face Dataset: arcinstitute/opengenome2
- Interactive Tools: Evo Designer; Evo Mech Interp Visualizer
- Responsible Use Guidelines: Responsible AI x Biodesign Community Values