1. Introduction
Image generation models grounded in diffusion processes have evolved at a meteoric pace, becoming catalysts for breakthroughs not only in static image tasks but also in video modeling, robotics, and cross-domain generative endeavors (Ho et al., 2020; Rombach et al., 2022; Saharia et al., 2022). However, these generative models, particularly at scale, confront formidable computational and infrastructural demands. They conventionally rely on large, monolithic clusters of GPUs that synchronize gradients after every step, incurring intense bandwidth and power requirements. This environment weighs heavily on the ability of researchers with only modest computational resources to replicate or innovate upon state-of-the-art diffusion model training regimes (McAllister et al., 2025).
In “Decentralized Diffusion Models,” McAllister, Tancik, Song, and Kanazawa bring forth a proposition that upends the typical monolithic blueprint of diffusion training. By dispensing with the high-bandwidth, centralized interconnect and enabling modular sub-models to be learned in isolation, they proffer a route to more flexible, robust, and cost-effective generative modeling. Their framework, which they label Decentralized Diffusion Models (DDM), apportions the dataset into disjoint clusters and trains individual “expert” diffusion sub-models on these clusters without inter-expert communication. Eventually, a separate router—trained as a lightweight classification or posterior-probability predictor—enables these experts to be combined en masse (or partially) at inference time. As they elegantly illustrate, the aggregated ensemble of specialized sub-models collectively optimizes the same objective as a single monolithic system would.
This summary traverses the entire conceptual geography of the original paper: from the impetus behind decentralization, through their mathematical derivations of Decentralized Flow Matching, to exhaustive experiments on ImageNet and LAION. Finally, it underscores how such designs open new frontiers for “foundation-scale” generative modeling, reinforcing the notion that large-scale diffusion training may no longer demand supermassive clusters and enormous synergy among thousands of GPUs.
2. Background and Motivation
Training colossal diffusion models has historically entailed distributing data batches across myriad GPUs, followed by synchronized gradient aggregation after every backpropagation step. That synergy, although powerful, locks participants into a centralized training apparatus. In practice, only specialized industrial labs possess the proprietary, tightly integrated networks capable of sustaining such workloads. Academics, smaller startups, and those who lack the capital for centralized clusters find themselves at a severe disadvantage.
The burdens are not merely financial. When hundreds or thousands of GPUs must remain in permanent communication, hardware faults, network latencies, or bandwidth bottlenecks undermine progress (An et al., 2024; Dubey et al., 2024). In that sense, centralized training frameworks become not only expensive but precarious, vulnerable to myriad issues that can stymie or regress the entire training cycle.
Simultaneously, diffusion models are continually blossoming in size. For example, the well-known Stable Diffusion 1.5 consumed over 6,000 A100 GPU days (Chen et al., 2023). More recent text-to-video diffusion systems—like MovieGen from Meta—push GPU usage to extremes reminiscent of large language model undertakings such as GPT-3 (Brown et al., 2020; Polyak et al., 2024). Meanwhile, to run these massively parallel computations, data centers must build out specialized high-bandwidth interconnect fabrics, incurring even more expense.
Against this backdrop, McAllister et al. (2025) propose an approach that liberates diffusion training from the necessity of cross-cluster synchronization. Instead of a single gargantuan parametric function, the authors’ Decentralized Diffusion Models revolve around distributing the load across multiple “compute islands.” Each island handles a slice of the dataset—dubbed a cluster—with minimal overhead, thanks to the modular architecture. The result is a synergy of specialized “experts” that approximate the global generative distribution.
3. Relationship to Prior Work
Before outlining their novel solution, the authors anchor their ideas within a wide tapestry of existing research:
- Accelerating Diffusion Models: Recent developments have tried to expedite diffusion training by focusing on aspects such as token masking (Sehwag et al., 2024) or alignment with pretrained representations (Yu et al., 2024). Another approach, flow matching, seeks to unify continuous normalizing flows with diffusion-based models (Chen et al., 2018; Lipman et al., 2022; Liu et al., 2022). Decentralized Diffusion Models, while building on the same general principle of accelerating training, adopt an orthogonal perspective—splitting the training load among independent sub-models rather than compressing or optimizing each individual model’s training loop.
- Mixture of Experts (MoE): The literature on sparse mixture-of-experts has gained traction in language modeling, from Switch Transformers (Fedus et al., 2022) and beyond. Typically, MoE approaches route specific tokens to specialized sub-layers, controlling which parameters are activated for each token. In effect, MoEs scale up capacity without proportional increases in compute. Although DDM draws heavily on the concept of experts, it differs in that each entire diffusion network is trained over a partitioned data manifold, rather than mixing specialized feed-forward segments in a single architecture.
- Federated and Low-Communication Learning: The impetus to train models in a decentralized manner, without transferring local data, appears extensively in federated learning (McMahan et al., 2017; Reddi et al., 2020). The objective there is often data privacy, but system designs must likewise grapple with severely limited bandwidth (Douillard et al., 2023). DDM is not necessarily a privacy solution; however, the independence and isolation of its compute nodes resonate with these earlier lines of research, suggesting interesting synergies and future extensions (Li et al., 2022).
Thus, McAllister et al. (2025) differentiate themselves by focusing on how to unify partial sub-distributions across experts into a single generative model, concurrently achieving the overarching goal of covering the entire data distribution as a single monolith would. By doing so, they effectively circumvent the steep communication overhead typically inherent in large-scale diffusion training.
4. Decentralized Flow Matching (DFM)
At the crux of the authors' proposed method lies Decentralized Flow Matching (DFM), a novel objective adapted from the standard flow matching approach to diffusion (Lipman et al., 2022; Song et al., 2020). Flow matching conceptualizes diffusion as learning to reverse a deterministic forward noising process $x_t = \alpha_t x_0 + \sigma_t \epsilon$. The flow $u_t(x_t)$ is a velocity field bridging the noisy variable $x_t$ back to the data manifold at $t = 0$.
- Flow Matching Fundamentals: In the conventional viewpoint, the marginal flow field $u_t(x_t)$ can be derived by integrating conditional flows $u_t(x_t \mid x_0)$ weighted by probabilities $p_t(x_t \mid x_0)$. That summation (or integral) ensures the learned flow marginally accounts for the entire data distribution. Each step in the diffusion sampling procedure uses this learned velocity to incrementally denoise (Ho et al., 2020).
- Partitioning the Data: In DDM, the authors break the data into $K$ non-overlapping sets, or clusters, each assigned to an "expert." That means one can write the marginal flow as
  $$u_t(x_t) = \sum_{k=1}^{K} \frac{p_{t,S_k}(x_t)}{p_t(x_t)} \left[ \frac{1}{p_{t,S_k}(x_t)} \sum_{x_0 \in S_k} u_t(x_t \mid x_0)\, p_t(x_t \mid x_0)\, q(x_0) \right].$$
  The outer sum effectively becomes a categorical distribution over clusters. Within each cluster's distribution, the inner sum is the local marginal flow. Put differently, each expert learns to reconstruct (via flow matching) the distribution of its assigned subset, with no knowledge of other subsets.
- Router Model: A separate, lightweight router $r_\theta(x_t, t)$ is then trained to predict the probability that any given noisy sample $x_t$ belongs to cluster $k$. This is cast as a standard classification problem: feed the noisy sample (and possibly conditioning) into a compact transformer that outputs a distribution over the $K$ clusters. During inference, these router probabilities serve as the mixing coefficients that linearly blend the experts' predictions. The brilliance lies in how straightforwardly the router can be trained offline, using the ground-truth cluster label of each training sample.
- Test-Time Combinations: Once the router and experts are trained, one can adopt multiple combination strategies at inference (a minimal sketch follows this list):
- Full Ensemble: Summing all experts, weighted by router probabilities.
- Top-$k$ Experts: Selecting a subset of the highest-scoring experts to reduce computational overhead.
- Single Expert (“Top-1”): Minimizing GPU usage by picking only the expert with the maximal router score.
- Knowledge Distillation: Despite the computational elegance of activating only one expert at a time, the total model memory footprint is multiplied by $K$. This might hamper production deployment if one must load all expert weights. Hence, McAllister et al. (2025) propose a teacher-student distillation scheme, compressing the multi-expert ensemble back into a single dense model. Each training sample is guided by whichever expert is specialized in that cluster, effectively making the student mimic the teacher's predictions. They discover that the distilled model matches or even slightly surpasses the performance of a direct monolith, with fewer effective training FLOPs.
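To make the test-time combination strategies above concrete, here is a minimal sketch (our own illustration under assumed interfaces, not the authors' released code) of router-weighted blending at a single denoising step; `experts` is any list of networks mapping $(x_t, t)$ to a velocity, and `router` returns logits over the $K$ clusters:

```python
import torch

@torch.no_grad()
def combined_velocity(x_t, t, experts, router, mode="top1", top_k=2):
    """Blend per-expert flow predictions using router probabilities.

    x_t:     noisy latents, shape (B, C, H, W)
    t:       diffusion times, shape (B,)
    experts: list of K expert networks, each mapping (x_t, t) -> velocity
    router:  network mapping (x_t, t) -> logits over the K clusters
    mode:    "full" (all experts), "topk", or "top1"
    """
    probs = torch.softmax(router(x_t, t), dim=-1)                 # (B, K)

    if mode == "full":
        # Weighted sum over every expert: closest to the exact marginal flow,
        # but per-step cost grows with the number of experts.
        preds = torch.stack([e(x_t, t) for e in experts], dim=1)  # (B, K, C, H, W)
        weights = probs.view(*probs.shape, 1, 1, 1)
        return (weights * preds).sum(dim=1)

    # Sparse modes: evaluate only the highest-scoring expert(s) per sample.
    k = 1 if mode == "top1" else top_k
    top_p, top_idx = probs.topk(k, dim=-1)                        # (B, k)
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)               # renormalize weights

    out = torch.zeros_like(x_t)                                   # velocity shares x_t's shape
    for j in range(k):
        for i, expert in enumerate(experts):
            sel = top_idx[:, j] == i                              # samples routed to expert i
            if sel.any():
                w = top_p[sel, j].view(-1, 1, 1, 1)
                out[sel] += w * expert(x_t[sel], t[sel])
    return out
```

With `mode="top1"`, each sample activates a single expert per denoising step, which is why the per-step inference cost stays close to that of one monolithic model regardless of $K$.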
5. Methodological Details
5.1 Clustering the Dataset
To isolate "clusters" in a high-dimensional image corpus without undue computational overhead, the authors turn to the approach described by Ma et al. (2024) (cited in the original references). The method consists of two levels: first, a large number (e.g., 1024) of fine-grained k-means centroids are found from a self-supervised feature embedding (like DINOv2; Oquab et al., 2023). Then, these 1024 "micro-clusters" can be merged into $K$ coarser clusters that are more manageable for direct training. The assignment of each image to a coarse cluster is trivially done by choosing the nearest centroid. This multi-stage approach outperforms naive strategies like random partitioning, as the latter fails to exploit sub-manifold structure in the dataset (Brown et al., 2022; Wang et al., 2024).
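As a rough illustration of this two-level scheme (not the authors' exact pipeline), the sketch below assumes DINOv2 embeddings have already been extracted for every image and uses plain scikit-learn k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

def two_level_clusters(features, n_fine=1024, n_coarse=8, seed=0):
    """Assign each image to one of `n_coarse` clusters via two-level k-means.

    features: (N, D) array of precomputed self-supervised embeddings
              (e.g., DINOv2 features for every training image).
    Returns an (N,) array of coarse cluster ids in [0, n_coarse).
    """
    # Level 1: many fine-grained "micro-cluster" centroids over the embeddings.
    fine = KMeans(n_clusters=n_fine, random_state=seed).fit(features)

    # Level 2: merge the fine centroids themselves into K coarse clusters.
    coarse = KMeans(n_clusters=n_coarse, random_state=seed).fit(fine.cluster_centers_)

    # Each image inherits the coarse label of its nearest fine centroid.
    return coarse.labels_[fine.labels_]

# Toy usage: 10,000 synthetic 768-d "embeddings" split into 8 coarse clusters.
feats = np.random.randn(10_000, 768).astype(np.float32)
shard_id = two_level_clusters(feats, n_fine=256, n_coarse=8)
```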
5.2 Architecture and Hyperparameters
For ImageNet experiments, they rely on the DiT XL/2 architecture introduced in the diffusion transformers literature (Peebles & Xie, 2023). Each expert has ~895M parameters, and the router (with only ~158M parameters) remains notably smaller. Training uses typical hyperparameters: the AdamW optimizer, a 1e-4 learning rate, an exponential moving average of model weights, and training runs of up to 800k steps or more.
On a subset of LAION-Aesthetics (Schuhmann et al., 2022), they incorporate text conditioning via cross-attention to a pretrained language encoder (CLIP or T5, depending on the experiment). The dataset is large (153.6 million images), so they adopt a bigger overall batch size of 1024. The load is divided equally among the experts, ensuring the total compute cost matches that of a monolith baseline.
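For quick reference, the reported settings can be collected into a configuration sketch; the field names below are our own shorthand, and only the values stated above are taken from the paper:

```python
# Hypothetical configuration collecting the hyperparameters reported above;
# field names are illustrative, not from the authors' code release.
IMAGENET_EXPERT = {
    "architecture": "DiT-XL/2",        # Peebles & Xie (2023)
    "params_per_expert": "~895M",
    "router_params": "~158M",          # lightweight classification transformer
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_ema": True,                # exponential moving average of weights
    "max_steps": 800_000,              # "up to 800k or more"
}

LAION_RUN = {
    "dataset": "LAION-Aesthetics subset, 153.6M images",
    "text_conditioning": "cross-attention to CLIP or T5 embeddings",
    "global_batch_size": 1024,         # split evenly across the experts
}
```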
5.3 Flow Matching vs. Score Matching
Although their main derivation uses the flow matching perspective, they provide an analogous derivation for score matching in an appendix. The result is the same: a sum of partial score fields that combine into a single global score. This robust theoretical grounding cements the notion that each expert’s specialized domain can be learned in complete isolation, then seamlessly recombined to yield a global generator.
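To see why the score-matching version works out the same way, note (in our own restatement, using the unnormalized shard marginals $p_{t,S_k}$ from Section 8.1, for which $p_t(x_t) = \sum_k p_{t,S_k}(x_t)$):
$$\nabla_{x_t} \log p_t(x_t) = \frac{\sum_{k=1}^{K} \nabla_{x_t}\, p_{t,S_k}(x_t)}{p_t(x_t)} = \sum_{k=1}^{K} \frac{p_{t,S_k}(x_t)}{p_t(x_t)}\, \nabla_{x_t} \log p_{t,S_k}(x_t) = \sum_{k=1}^{K} p(k \mid x_t, t)\, \nabla_{x_t} \log p_{t,S_k}(x_t).$$
That is, the global score is the router-weighted sum of per-shard scores, mirroring the flow decomposition.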
6. Experimental Results
The authors conduct rigorous testing on two mainstream image datasets: ImageNet (Russakovsky et al., 2015) and LAION-Aesthetics. The latter is more representative of real-world, web-scale data with textual captions, obtained by filtering a portion of the vast LAION corpus by aesthetic quality.
6.1 Test-Time Combination Strategies
A first exploration is to examine how best to fuse expert predictions once the entire system is trained:
- Full: Weighted sum across all $K$ experts. This most faithfully approximates the global flow but scales inference cost linearly with $K$.
- Top-1: Only the single cluster predicted with the highest probability. They discover that, astonishingly, this yields better performance than Full on certain metrics while being drastically cheaper in compute.
- Stochastic: Either nucleus sampling from the router’s distribution (Holtzman et al., 2020) or random sampling. These methods can add variability at test-time but often degrade coverage or produce less consistent sample quality.
Empirically, Top-1 emerges as a sweet spot, overshadowing other sampling strategies in terms of FID and computational overhead (McAllister et al., 2025).
6.2 Number of Experts
They experiment with $K \in \{1, 4, 8, 16\}$. $K = 1$ corresponds to a standard monolithic model. Four or eight experts appear to be feasible partitions, whereas sixteen can reduce per-expert batch sizes so much that training becomes unstable or suboptimal. On ImageNet, 8 experts achieved the best results in balanced conditions, while on LAION, 8 experts similarly delivered robust performance. This outcome suggests a natural sweet spot where each cluster is neither too large nor too small.
6.3 Performance Gains Over Monoliths
Perhaps the most pivotal claim is that DDM can outperform monolithic diffusion under a fixed FLOP budget. In side-by-side comparisons on ImageNet, an 8-expert ensemble converged to a 28% lower FID than a single model trained with the same total FLOPs. On LAION-Aesthetics, the decentralized approach reached an FID of 6.48 at 200k optimization steps, surpassing the monolith's best FID of 6.52 (achieved at 800k steps). In other words, DDM can deliver roughly a 4x reduction in training steps to reach comparable quality.
6.4 Clustering Ablation
When substituting the sophisticated feature-based clusters with purely random partitions, the authors observe a drastic drop in performance. This indicates that coherent sub-manifold specialization—i.e., grouping visually or semantically related images—facilitates more potent expert models. Indeed, the underlying data manifold is believed to be composed of interconnected local structures (Brown et al., 2022; Wang et al., 2024), so partitioning according to these structures fosters more effective localized learning.
6.5 Distillation
Top-1 expert selection reduces inference-time FLOP cost but not the total memory overhead of storing all experts. The authors therefore propose a teacher-student distillation pipeline, in which a single dense model is taught to replicate the experts' predictions. Over the course of 400k training steps, a distilled student can match the monolith's FID without requiring as large a batch size. This result is significant for practical deployment: production-grade models often prefer a single, contiguous parameter set to avoid distributed parameter loading, and the authors demonstrate they can preserve performance while trimming the memory bloat.
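A minimal sketch of one such teacher-student step, under the assumption (as described above) that the teacher signal for each sample is simply the prediction of the expert that owns its cluster; the function and argument names are hypothetical, and `alpha`/`sigma` stand for the noise-schedule coefficients:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, experts, x0, cluster_id, optimizer, alpha, sigma):
    """One distillation step: the student regresses the velocity predicted by
    the expert that owns each sample's cluster.

    x0:         clean training images, (B, C, H, W)
    cluster_id: ground-truth cluster index per sample, (B,)
    """
    B = x0.shape[0]
    t = torch.rand(B, device=x0.device)                  # random timestep per sample
    eps = torch.randn_like(x0)
    x_t = alpha(t).view(B, 1, 1, 1) * x0 + sigma(t).view(B, 1, 1, 1) * eps

    # Teacher target: the specialized expert for each sample's cluster.
    with torch.no_grad():
        target = torch.zeros_like(x0)
        for k, expert in enumerate(experts):
            sel = cluster_id == k
            if sel.any():
                target[sel] = expert(x_t[sel], t[sel])

    loss = F.mse_loss(student(x_t, t), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```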
6.6 Scaling to 24 Billion Parameters
Finally, the authors push the approach to an extreme: building an 8-expert ensemble in which each expert has 3 billion parameters (using a scaled-up variant of the MMDiT architecture). This yields a total parameter count of ~24B when all experts are considered together, an enormous capacity on par with large-scale diffusion or language model paradigms. Intriguingly, they highlight that with DDM, each expert can be trained on "just eight individual GPU nodes" in less than a week, if done in parallel or distributed across available resources. This breaks from the tradition of needing a single, integrated supercluster to train a 24B-parameter diffusion model.
They provide sample outputs, showing diverse and visually compelling images that rival or surpass smaller monoliths. Empirically, larger experts continue to exhibit improved performance, and the authors see no signs of saturation at the scales tested.
7. Discussion and Future Directions
This approach to Decentralized Diffusion Models broadens the horizons for generative training across multiple angles:
- Cost-Efficiency and Accessibility: By obviating the centralized, high-bandwidth interconnect, smaller labs can piece together “scattered” resources—on-demand cloud GPU nodes, departmental clusters, or multi-organizational cooperatives. This fosters inclusivity, allowing more research groups to undertake ambitious training objectives.
- Fault Tolerance and Robustness: If one cluster fails or becomes temporarily unavailable, the rest can continue training independently, unaffected by that compute island’s downtime. The specialized ensemble can either be completed once the straggling node recovers or be used in partial capacity.
- Data Privacy and Localized Training: The authors hint that this structure could integrate well with federated or private data paradigms. Experts could be trained on site at various data silos (e.g., hospitals, different companies) without having to share raw data. Only the final router needs minimal knowledge to produce global sampling coherence.
- Applicability to Other Domains: While the study focuses on images (with some text conditioning in LAION), the underlying mathematics apply to general diffusion or flow-based modeling. One might apply DDM to tasks in robotics, video generation (Brooks et al., 2024), or even geospatial data.
- Hybrid Approaches: The authors mention synergy with other acceleration methods, such as consistency distillation (Song et al., 2023) or partial token masking (Sehwag et al., 2024). This invites a new category of multi-expert systems that can incorporate more advanced or domain-specific training accelerations.
- Sparsity in Inference: The synergy with MoE’s established principle of sparse activation opens intriguing design possibilities. In the future, one may conceive a hierarchical router for sub-cluster specialization or incorporate parameter-efficient fine-tuning methods that let each expert adapt to a narrower domain.
Ultimately, the authors conclude that the days of requiring centralized superclusters for advanced generative modeling may be coming to an end. Decentralized Diffusion Models exemplify how a straightforward shift in perspective—splitting data and training tasks among multiple experts—can circumvent the hardware tyranny of data-parallel behemoths.
8. Detailed Theoretical Insights
8.1 Mathematical Formulation of Flow Partition
Flow matching represents the distribution transformation from noisy latents $x_t$ to the data distribution at time $t = 0$. Let $\alpha_t$ and $\sigma_t$ define how the data is gradually noised:
$$x_t = \alpha_t\, x_0 + \sigma_t\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I).$$
The learned velocity field $v_{\theta,t}(x_t)$ is closely related to the score $\nabla_{x_t} \log p_t(x_t)$ and describes the flow carrying the distribution of $x_t$ to the distribution at $t - \Delta t$. The ground-truth "flow target" can be computed analytically by summing over the entire dataset or approximated through objective-based sampling (Ho et al., 2020; Song et al., 2020).
Now, dividing the dataset into $K$ shards $\{S_1, \dots, S_K\}$ yields partial flows:
$$u_t(x_t) = \sum_{k=1}^{K} p(k \mid x_t, t)\, u_{t,k}(x_t),$$
where
$$p(k \mid x_t, t) = \frac{p_{t,S_k}(x_t)}{p_t(x_t)}, \qquad u_{t,k}(x_t) = \frac{1}{p_{t,S_k}(x_t)} \sum_{x_0 \in S_k} u_t(x_t \mid x_0)\, p_t(x_t \mid x_0)\, q(x_0).$$
Hence, each "expert" attempts to regress $u_{t,k}(x_t)$ from data in $S_k$. The router then merely learns $p(k \mid x_t, t)$, which is a classification over partitions.
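Concretely, each expert $k$ only ever minimizes a standard conditional flow matching loss on its own shard $S_k$. The sketch below assumes, purely for illustration, the linear noising path $x_t = (1 - t)\,x_0 + t\,\epsilon$, for which the conditional target is $\epsilon - x_0$:

```python
import torch
import torch.nn.functional as F

def expert_fm_loss(expert_k, x0_shard):
    """Conditional flow matching loss for one expert on its own shard S_k.

    Assumes the linear noising path x_t = (1 - t) * x0 + t * eps,
    whose conditional velocity target is (eps - x0).
    No other expert, and no other shard, is ever involved.
    """
    B = x0_shard.shape[0]
    t = torch.rand(B, device=x0_shard.device)            # random timestep per sample
    eps = torch.randn_like(x0_shard)
    t_ = t.view(B, 1, 1, 1)

    x_t = (1.0 - t_) * x0_shard + t_ * eps               # noisy sample on the shard
    target = eps - x0_shard                              # conditional flow u_t(x_t | x0)
    return F.mse_loss(expert_k(x_t, t), target)
```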
8.2 Expert Specialization
Each expert is free to develop specialized representations. For instance, if one partition is dominated by flora images while another has mechanical objects, each sub-model is not forced to cover the full diversity of the dataset. A sub-model dedicated to mechanical visuals can refine its internal features specifically for metallic textures, symmetrical shapes, or typical color palettes. Meanwhile, the flora-focused expert might home in on organic color variations and fractal-like patterns. This specialization can lead to better representational compactness (Zhou et al., 2022) and improved generative fidelity for each sub-domain, culminating in an overall performance lift once integrated.
8.3 Router Training
The router is trained via cross-entropy on the cluster label. For each training example, we sample $(x_0, k)$ from the dataset, draw a random time step $t$, and generate $x_t = \alpha_t x_0 + \sigma_t \epsilon$. We then pass $x_t$ into the router transformer. Its final classification token is fed to a linear layer that outputs logits over the $K$ clusters. Minimizing cross-entropy with the ground-truth label $k$ ensures the router's predictions $p(k \mid x_t)$ approximate the genuine cluster posterior. Because each sample's cluster membership is known a priori, this stage does not require any interaction with the experts; it can be done entirely offline or concurrently.
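A sketch of that loop under the same assumed interfaces as above (hypothetical names; `alpha` and `sigma` are the noise-schedule coefficients, and `router` returns logits over the $K$ clusters):

```python
import torch
import torch.nn.functional as F

def router_step(router, x0, cluster_id, optimizer, alpha, sigma):
    """One router training step: classify noisy samples into their cluster.

    The cluster label is known a priori from the offline partitioning,
    so this stage never needs to touch the experts.
    """
    B = x0.shape[0]
    t = torch.rand(B, device=x0.device)                  # random timestep per sample
    eps = torch.randn_like(x0)
    x_t = alpha(t).view(B, 1, 1, 1) * x0 + sigma(t).view(B, 1, 1, 1) * eps

    logits = router(x_t, t)                              # (B, K) logits over clusters
    loss = F.cross_entropy(logits, cluster_id)           # ground-truth cluster labels

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```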
9. Limitations and Practical Considerations
Despite its manifold advantages, Decentralized Diffusion Models come with caveats:
- Training Overhead for Many Experts: Although no cross-communication is needed, launching $K$ separate training runs can be operationally cumbersome. Automated scheduling or pipeline orchestration is often necessary, especially at scale (Zhao et al., 2023).
- Expert Memory Footprint at Inference: Without the optional distillation step, storing all experts can be unwieldy for extremely large $K$. This is partially mitigated by the fact that only the relevant experts are needed to generate a single sample if "top-1" or "top-k" selection is used. But for full coverage, all experts must exist somewhere.
- Cluster Imbalance: If the dataset naturally has extremely unbalanced cluster sizes, certain experts might starve for data or remain undertrained. A more sophisticated clustering scheme could aim for balanced partition sizes.
- Potential Over-Specialization: If clusters become too disjoint or small, experts might degenerate to memorizing a micro-domain. The authors note that an optimal $K$ must be found to strike the right trade-off between specialization and data coverage.
Even so, the authors argue that the benefits—especially for those lacking in integrated HPC clusters—make DDM a transformative innovation for scaling diffusion training.
10. Conclusions
In their paper, “Decentralized Diffusion Models,” McAllister et al. (2025) carve out a new direction in the field of generative modeling by demonstrating that world-class diffusion models can be trained across disjoint GPU “islands” with zero cross-communication. By harnessing the synergy between a router that learns partition probabilities and an ensemble of data-partitioned experts, they show that the global diffusion or flow matching objective can be satisfied piecewise. Each expert focuses on a specialized sub-manifold, achieving more effective local representation.
The final results are compelling across multiple fronts:
- Strong Performance Gains: DDM consistently outruns a similarly sized monolith in terms of FID, effectively leveraging specialized experts to handle domain-specific complexities.
- Lower Infrastructure Requirements: Because experts train without synchronous gradient sharing, traditional HPC cluster fabrics (with terabit-level bandwidth) become optional.
- Flexibility and Fault Tolerance: The approach is robust to node failures, random slowdowns, or cluster fragmentation, since each sub-model is independent.
- Production Feasibility via Distillation: A single dense model distilled from the multi-expert ensemble can be used to circumvent memory overhead, thereby matching a single large model’s inference pipeline while benefiting from the specialized training dynamic.
Thus, Decentralized Diffusion Models represent a pivotal milestone. They champion a future in which smaller labs, multiple institutions, or ephemeral on-demand cloud compute can orchestrate the training of advanced diffusion networks that, until now, seemed reserved for the biggest players with integrated high-throughput hardware.
The authors foresee further expansions in applying DDM to privacy-sensitive data, to real-time or streaming data in multi-node setups, and to additional modalities such as text, video, and robotics. Their math and experiments lay the foundation for a new era where distributing generative models across modest, independently operating clusters becomes not only viable but advantageous.
References
Below is a collated list of the sources cited in this summary, mapping to the citations in the original paper. Where possible, entries follow the reference style used by McAllister et al. (2025).
- An, W., Bi, X., Chen, G., Chen, S., Deng, C., Ding, H., … et al. (2024). Fire-flyer ai-hpc: A cost-effective software-hardware co-design for deep learning. arXiv preprint arXiv:2408.14158.
- Biggs, B., Seshadri, A., Zou, Y., Jain, A., Golatkar, A., Xie, Y., … Soatto, S. (2024). Diffusion soup: Model merging for text-to-image diffusion models. arXiv preprint arXiv:2406.08431.
- Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., … Luhman, T. (2024). Video generation models as world simulators. OpenAI Research Blog, 3–2024.
- Brown, B. C. A., Caterini, A. L., Ross, B. L., Cresswell, J. C., & Loaiza-Ganem, G. (2022). Verifying the union of manifolds hypothesis for image data. arXiv preprint arXiv:2207.02862.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., … Li, Z. (2023). PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426.
- Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31.
- Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., … Song, S. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 02783649241273668.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Letman, A., Mathur, A., … Xiao, B. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., … Ommer, B. (2024). Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning.
- Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
- Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
- Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. Proceedings of ICLR 2020.
- Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., & Zettlemoyer, L. (2022). Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306.
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
- Liu, X., Gong, C., & Liu, Q. (2022). Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
- Ma, J., Huang, P.-Y., Xie, S., Li, S.-W., Zettlemoyer, L., Chang, S.-F., & Xu, H. (2024). MoDE: CLIP data experts via clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26354–26363.
- McAllister, D., Tancik, M., Song, J., & Kanazawa, A. (2025). Decentralized diffusion models. arXiv preprint arXiv:2501.05450.
- McMahan, B., Moore, E., Ramage, D., Hampson, S., & Aguera y Arcas, B. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.
- Morey, M. (2024). Data center owners turn to nuclear as potential electricity source—u.s. energy information administration (EIA). Online Resource.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., … Jitsev, J. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4195–4205.
- Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., … Xue, H. (2024). Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695.
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252.
- Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … Salimans, T. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.
- Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., … Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, 25278–25294.
- Sehwag, V., Kong, X., Li, J., Spranger, M., & Lyu, L. (2024). Stretching each dollar: Diffusion training from scratch on a micro-budget. arXiv preprint arXiv:2407.15811.
- Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.
- Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
- Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency models. arXiv preprint arXiv:2303.01469.
- Wang, P., Zhang, H., Zhang, Z., Chen, S., Ma, Y., & Qu, Q. (2024). Diffusion models learn low-dimensional distributions via subspace clustering. arXiv preprint arXiv:2409.02426.
- Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., & Xie, S. (2024). Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.
- Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., … Goyal, N. (2023). PyTorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.
- Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., … Le, Q. (2022). Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35, 7103–7114.
11. Final Observations
The quest to decentralize diffusion model training, as manifested in DDM, demonstrates that the guiding principle of dividing and conquering the data manifold can yield a robust generative model. The specialized local sub-models, each devoted to a curated slice of data, collectively reconstruct the entire distribution through a lightweight router. This synergy not only reimagines hardware usage but also redefines accessibility.
Armed with empirical proof that DDM can match—if not outdo—monolithic behemoths while sidestepping the complexities of heavily integrated GPU clusters, practitioners now have a new blueprint for tackling large-scale generative tasks. Whether the future expansions move toward fully federated medical image training (where data privacy is paramount) or flexible consumer-oriented GPU clouds, Decentralized Diffusion Models seem poised to amplify the democratization of generative modeling research.
In sum, McAllister et al. (2025) deliver an elegant yet impactful method for rethinking how we coordinate large-scale diffusion training across an often-fragmented computational environment. By harnessing the synergy of specialized experts and a posterior-predicting router, they unchain diffusion from the constraints of heavy data parallelism, thereby energizing a new era in massive-scale generative modeling.