The Fusion Generation Paradigm of Diffusion Models and Autoregressive Models

From Discrete to Continuous: Deep Analysis of the Fusion Generation Paradigm Combining Diffusion Models and Autoregressive Models

1. Background

In the evolution of generative AI, two mainstream paradigms have long dominated: autoregressive models and diffusion models. The former, represented by GPT and DALL-E, generates content by progressively predicting discrete tokens; the latter, represented by Stable Diffusion and Imagen, produces high-quality images through stepwise denoising in continuous space. For a long time, these two technical routes developed independently with little overlap.

However, with the emergence of DiT (Diffusion Transformer) in 2023 and the MAR (Masked Autoregressive) series in 2024, an exciting trend has become increasingly clear: combining the continuous denoising of diffusion processes with the discrete prediction of autoregression is becoming a new mainstream direction in text-to-image generation. This fusion is not a simple stacking of techniques but achieves a profound unification at the level of probabilistic modeling.

The core challenge faced by traditional autoregressive models is that discrete token prediction naturally lacks the ability to model global consistency, making long-distance dependencies difficult to capture. Although diffusion models excel in image quality, their continuous denoising process lacks explicit structural constraints, making flexible local control difficult to achieve. The fusion paradigm is designed to leverage the strengths of both—using the causal structure of autoregression to provide a generation framework and the continuous denoising of diffusion to ensure visual quality.

From an application perspective, this fusion paradigm demonstrates significant advantages across multiple dimensions: generation quality reaches or even surpasses pure diffusion models, inference speed is several times faster than pure autoregressive models, and it supports advanced functions such as conditional control and local editing. In the field of video generation, this paradigm shows unique value—utilizing the temporal structure of autoregression combined with the spatial modeling of diffusion to generate video content that is both coherent and high-quality.

2. Technical Principles

2.1 Core Idea: Discrete Skeleton and Continuous Texture

The core insight of the fusion paradigm is that visual generation can be decomposed into two stages: discrete “skeleton” prediction and continuous “texture” filling. Autoregressive models excel at capturing the intrinsic structural relationships between discrete tokens, which corresponds to the semantic skeleton of an image; diffusion models excel at recovering continuous details from noise, which corresponds to the texture and material feel of an image.

Specifically, fusion models typically adopt a two-stage architecture:

  1. Discrete Encoding Stage: Use VQ-VAE or similar methods to encode an image into a sequence of discrete tokens.
  2. Hybrid Generation Stage: An autoregressive model predicts the token sequence, and a diffusion model performs denoising in the continuous space corresponding to the tokens.

This design cleverly combines the advantages of both paradigms: the autoregressive part provides causal constraints and flexible conditional control, while the diffusion part ensures that the visual region corresponding to each token has high-quality local details.

2.2 Mathematical Foundation: From Cross-Entropy to Diffusion Loss

The key to understanding the fusion paradigm lies in unifying the two loss functions. The autoregressive model uses cross-entropy loss:

L_ar = -Σ log p(x_i | x_{<i})

The diffusion model uses noise prediction loss:

L_diff = E[||ε - ε_θ(x_t, t)||²]
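To make the two objectives concrete, the following Go sketch evaluates each of them on toy inputs. The function names crossEntropy and noisePredictionLoss are illustrative and not taken from any library, and the diffusion term is computed for a single sample, whereas L_diff is an expectation over data, noise, and timesteps.

```go
package main

import (
    "fmt"
    "math"
)

// crossEntropy returns -Σ log p(x_i | x_{<i}), given the probability the
// model assigned to each ground-truth token.
func crossEntropy(probOfTarget []float64) float64 {
    loss := 0.0
    for _, p := range probOfTarget {
        loss -= math.Log(p)
    }
    return loss
}

// noisePredictionLoss returns ||ε - ε_θ||² for one sample.
func noisePredictionLoss(eps, epsPred []float64) float64 {
    loss := 0.0
    for i := range eps {
        d := eps[i] - epsPred[i]
        loss += d * d
    }
    return loss
}

func main() {
    // Two tokens whose true classes received probability 0.5 and 0.25.
    fmt.Println(crossEntropy([]float64{0.5, 0.25}))
    // True noise versus predicted noise for a two-dimensional latent.
    fmt.Println(noisePredictionLoss([]float64{1, -1}, []float64{0.5, -0.5})) // prints 0.5
}
```

A combined objective would sum or weight these two terms; the weighting is a design choice that this sketch does not cover.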

In the fusion paradigm, these two losses are cleverly combined. Taking MAR (Masked Autoregressive) as an example, its core innovation is introducing a “masked autoregressive” mechanism:

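  1. Randomly mask a subset of the tokens.
  2. Use the autoregressive network to predict a conditioning vector for each masked token from the visible tokens.
  3. Apply a diffusion loss so that a small denoising network models each masked token's continuous value given its conditioning vector.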
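The first step, random masking, can be sketched in a few lines of Go. The helper names randomMask and countMasked, the masking ratio, and the seed are assumptions made for illustration; real training code typically samples a new ratio and pattern for every batch.

```go
package main

import (
    "fmt"
    "math/rand"
)

// randomMask returns a mask of length n in which round(ratio*n) randomly
// chosen positions are true (masked). The seed makes the choice repeatable.
func randomMask(n int, ratio float64, seed int64) []bool {
    numMasked := int(ratio*float64(n) + 0.5)
    if numMasked < 0 {
        numMasked = 0
    }
    if numMasked > n {
        numMasked = n
    }
    rng := rand.New(rand.NewSource(seed))
    mask := make([]bool, n)
    for _, idx := range rng.Perm(n)[:numMasked] {
        mask[idx] = true
    }
    return mask
}

// countMasked reports how many positions are masked.
func countMasked(mask []bool) int {
    count := 0
    for _, m := range mask {
        if m {
            count++
        }
    }
    return count
}

func main() {
    mask := randomMask(16, 0.75, 42)
    fmt.Println(countMasked(mask), "of", len(mask), "tokens are masked") // prints "12 of 16 tokens are masked"
}
```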

Schematically, this can be written as a hybrid probability model:

p(x) = Σ_m p(m) · p_ar(x_m | x_{¬m}) · p_diff(x_{¬m} | x_m)

Here, m is the masking pattern, p_ar is the autoregressive prediction distribution, and p_diff is the conditional diffusion distribution.

2.3 Key Innovation: Continuous Token Representation

Traditional autoregressive models map each token to a discrete category, while the fusion paradigm introduces continuous token representation. Each token corresponds to a continuous vector, and the diffusion process performs denoising in this continuous space. This design brings several key advantages:

  • Increased Information Density: Continuous representations can encode richer visual information.
  • Friendly Gradient Propagation: Avoids gradient truncation caused by discretization.
  • Natural Support for Interpolation: Linear interpolation in continuous space corresponds to smooth visual transitions.

In specific implementations, a “quantization-dequantization” strategy is typically adopted: an encoder maps the image to continuous vectors, which undergo vector quantization to obtain discrete indices, and a decoder maps the discrete indices back to continuous space. The diffusion model operates on the continuous representation output by the decoder.
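A minimal sketch of the quantization and dequantization round trip is shown here, assuming a tiny hand-written codebook. The names quantize and dequantize are illustrative; the nearest-neighbor search uses squared Euclidean distance, which selects the same entry as Euclidean distance.

```go
package main

import "fmt"

// quantize returns the index of the codebook entry closest to v
// (squared Euclidean distance).
func quantize(v []float32, codebook [][]float32) int {
    best, bestDist := 0, float32(-1)
    for idx, code := range codebook {
        var dist float32
        for k := range v {
            d := v[k] - code[k]
            dist += d * d
        }
        if bestDist < 0 || dist < bestDist {
            best, bestDist = idx, dist
        }
    }
    return best
}

// dequantize maps a discrete index back to its continuous codebook vector.
func dequantize(index int, codebook [][]float32) []float32 {
    return codebook[index]
}

func main() {
    codebook := [][]float32{{0, 0}, {1, 1}, {2, 0}}
    idx := quantize([]float32{0.9, 1.2}, codebook)
    fmt.Println(idx, dequantize(idx, codebook)) // prints "1 [1 1]"
}
```

The gap between the input vector and the returned codebook vector is the quantization error that the continuous diffusion stage is meant to compensate for.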

3. System Architecture Design

3.1 Overall Architecture

[Figure: overall system architecture]

The system adopts a layered architecture design, from top to bottom:

  1. Control Layer: Receives inputs such as text prompts and image conditions.
  2. Generation Layer: Contains the autoregressive module and diffusion module.
  3. Representation Layer: Responsible for conversion between images and tokens.
  4. Optimization Layer: Provides inference acceleration and memory management.

3.2 Module Detailed Design

VQ-VAE Encoder:

  • Input: RGB image (H x W x 3)
  • Output: Discrete token sequence (h x w)
  • Compression Ratio: Typically 8x or 16x downsampling per spatial dimension
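The relationship between image size, downsampling factor, and token count can be checked with a small Go helper. The function tokenGrid is a hypothetical name, and the sketch assumes the factor applies equally to height and width.

```go
package main

import "fmt"

// tokenGrid returns the token-grid height and width for an image of the
// given size and per-dimension downsampling factor.
func tokenGrid(height, width, factor int) (int, int, error) {
    if factor <= 0 || height%factor != 0 || width%factor != 0 {
        return 0, 0, fmt.Errorf("image %dx%d is not divisible by factor %d", height, width, factor)
    }
    return height / factor, width / factor, nil
}

func main() {
    for _, factor := range []int{8, 16} {
        h, w, err := tokenGrid(256, 256, factor)
        if err != nil {
            fmt.Println("error:", err)
            continue
        }
        fmt.Printf("factor %d: %dx%d grid, %d tokens\n", factor, h, w, h*w)
    }
}
```

For a 256 x 256 image, a factor of 16 gives a 16 x 16 grid of 256 tokens, and a factor of 8 gives a 32 x 32 grid of 1024 tokens.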

Autoregressive Transformer:

  • Architecture: Causal Transformer Decoder
  • Input: Partially visible token sequence
  • Output: Predicted next token distribution
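The causal constraint of the decoder can be illustrated with an attention mask in which position i may only attend to positions j <= i. This is a minimal sketch; causalMask is an illustrative helper, not part of any specific Transformer library.

```go
package main

import "fmt"

// causalMask returns an n x n matrix where mask[i][j] is true when
// position i is allowed to attend to position j (that is, j <= i).
func causalMask(n int) [][]bool {
    mask := make([][]bool, n)
    for i := range mask {
        mask[i] = make([]bool, n)
        for j := 0; j <= i; j++ {
            mask[i][j] = true
        }
    }
    return mask
}

func main() {
    // Print the lower-triangular pattern for a 4-token sequence.
    for _, row := range causalMask(4) {
        for _, allowed := range row {
            if allowed {
                fmt.Print("1 ")
            } else {
                fmt.Print("0 ")
            }
        }
        fmt.Println()
    }
}
```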

Diffusion Denoiser:

  • Architecture: U-Net or DiT
  • Input: Noisy continuous representation + timestep
  • Output: Predicted noise
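The noisy input of the denoiser is usually produced by the standard forward diffusion step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1 − β). The sketch below applies this formula to a toy latent; addNoise is an illustrative name, and the fixed noise values stand in for samples from a Gaussian.

```go
package main

import (
    "fmt"
    "math"
)

// addNoise applies the forward diffusion step
// x_t = sqrt(alphaBar)*x0 + sqrt(1-alphaBar)*eps.
func addNoise(x0, eps []float64, alphaBar float64) []float64 {
    xt := make([]float64, len(x0))
    signal, noise := math.Sqrt(alphaBar), math.Sqrt(1-alphaBar)
    for i := range x0 {
        xt[i] = signal*x0[i] + noise*eps[i]
    }
    return xt
}

func main() {
    x0 := []float64{2, -2}
    eps := []float64{0.5, 1.5} // stand-in for Gaussian noise samples
    // alphaBar = 1 keeps the clean latent; alphaBar = 0 leaves only noise.
    for _, alphaBar := range []float64{1, 0.25, 0} {
        fmt.Println(alphaBar, addNoise(x0, eps, alphaBar))
    }
}
```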

Condition Fusion Module:

  • Cross-attention between text embeddings and visual features
  • Supports multiple condition forms (text, image, mask)

3.3 Data Flow Design

The data flow during generation is divided into three stages:

Stage One: Skeleton Generation

Text → Text Encoder → Autoregressive Transformer → Discrete Token Sequence

Stage Two: Continuous Mapping

Discrete Tokens → Embedding Table → Continuous Vector Sequence

Stage Three: Detail Refinement

Continuous Vectors + Noise → Diffusion Denoiser → Refined Continuous Representation → VQ-VAE Decoder → Image

4. Core Implementation (Golang Code)

4.1 Basic Data Structures

// Token represents a discrete token in the image
type Token struct {
    Index  int       // Discrete index
    Embed  []float32 // Corresponding continuous embedding vector
    Masked bool      // Whether it is in masked state
}

// ImageRepresentation contains both discrete and continuous forms
type ImageRepresentation struct {
    DiscreteTokens   []Token   // Discrete token sequence
    ContinuousLatent []float32 // Continuous latent representation
    Height, Width    int       // Spatial dimensions
}

// DiffusionConfig parameters for diffusion
type DiffusionConfig struct {
    Timesteps    int     // Total denoising steps
    BetaStart    float32 // Noise schedule start
    BetaEnd      float32 // Noise schedule end
    ScheduleType string  // Schedule type: linear/cosine
}

// ARConfig parameters for autoregression
type ARConfig struct {
    MaxSeqLen     int    // Maximum sequence length
    NumLayers     int    // Number of Transformer layers
    NumHeads      int    // Number of attention heads
    EmbedDim      int    // Embedding dimension
    VocabSize     int    // Vocabulary size
}

4.2 VQ-VAE Encoder Implementation

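import (
    "fmt"
    "math"
)

// VQVAEEncoder encodes an image into discrete tokens
type VQVAEEncoder struct {
    ConvLayers []ConvLayer
    Codebook   []float32 // Codebook vectors, flattened as numCodes x EmbedDim
    EmbedDim   int
}

func (e *VQVAEEncoder) Encode(image []float32) (*ImageRepresentation, error) {
    if e.EmbedDim <= 0 || len(e.Codebook) == 0 || len(e.Codebook)%e.EmbedDim != 0 {
        return nil, fmt.Errorf("invalid codebook: %d values for embed dim %d", len(e.Codebook), e.EmbedDim)
    }

    // 1. Convolutional downsampling
    latent := image
    for _, layer := range e.ConvLayers {
        latent = layer.Forward(latent)
    }
    if len(latent)%e.EmbedDim != 0 {
        return nil, fmt.Errorf("latent length %d is not a multiple of embed dim %d", len(latent), e.EmbedDim)
    }

    // 2. Vector quantization: find nearest codebook vector
    numTokens := len(latent) / e.EmbedDim
    numCodes := len(e.Codebook) / e.EmbedDim
    tokens := make([]Token, numTokens)

    for i := 0; i < numTokens; i++ {
        vec := latent[i*e.EmbedDim : (i+1)*e.EmbedDim]

        // Calculate distance between current vector and all codebook vectors
        minDist := float32(math.MaxFloat32)
        bestIdx := 0
        for j := 0; j < numCodes; j++ {
            code := e.Codebook[j*e.EmbedDim : (j+1)*e.EmbedDim]
            if dist := euclideanDistance(vec, code); dist < minDist {
                minDist = dist
                bestIdx = j
            }
        }

        // Record discrete index and continuous embedding
        tokens[i] = Token{
            Index:  bestIdx,
            Embed:  e.Codebook[bestIdx*e.EmbedDim : (bestIdx+1)*e.EmbedDim],
            Masked: false,
        }
    }

    // The flat latent carries no spatial layout, so a square token grid
    // (for example 16 x 16) is assumed here
    side := int(math.Round(math.Sqrt(float64(numTokens))))

    return &ImageRepresentation{
        DiscreteTokens:   tokens,
        ContinuousLatent: latent,
        Height:           side,
        Width:            side,
    }, nil
}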

4.3 Autoregressive Generator

import (
    "fmt"
    "math/rand"
)

// ARGenerator autoregressive token predictor
type ARGenerator struct {
    Transformer *CausalTransformer
    Config      *ARConfig
}

// Generate autoregressively generates a token sequence
func (g *ARGenerator) Generate(condEmbedding []float32, numTokens int) ([]Token, error) {
    if numTokens <= 0 {
        return nil, fmt.Errorf("numTokens must be positive, got %d", numTokens)
    }
    tokens := make([]Token, 0, numTokens)

    // Initialize start token
    tokens = append(tokens, Token{Index: BOS_TOKEN, Embed: make([]float32, g.Config.EmbedDim)})

    for len(tokens) < numTokens {
        // Build current context
        context := g.buildContext(tokens, condEmbedding)

        // Transformer forward pass
        logits := g.Transformer.Forward(context)

        // Sample next token from the logits at the last position
        nextToken := g.sampleToken(logits[len(tokens)-1])

        // Stop without appending if the end token is generated
        if nextToken.Index == EOS_TOKEN {
            break
        }

        // Update token sequence
        tokens = append(tokens, nextToken)
    }

    return tokens, nil
}

// sampleToken samples the next token based on logits
func (g *ARGenerator) sampleToken(logits []float32) Token {
    // Temperature scaling: dividing the logits by T before softmax is equivalent
    // to raising the probabilities to the power 1/T and renormalizing
    const temperature = 0.8
    scaled := make([]float32, len(logits))
    for i, l := range logits {
        scaled[i] = l / temperature
    }

    // Apply softmax to get a normalized probability distribution
    probs := softmax(scaled)

    // Random sampling via the cumulative distribution
    r := rand.Float32()
    cumulative := float32(0)
    selectedIdx := len(probs) - 1 // Fallback if rounding leaves the sum slightly below r
    for i, p := range probs {
        cumulative += p
        if r <= cumulative {
            selectedIdx = i
            break
        }
    }

    return Token{
        Index:  selectedIdx,
        Embed:  g.getEmbedding(selectedIdx),
        Masked: false,
    }
}

4.4 Diffusion Denoiser

import (
    "math"
    "math/rand"
)

// DiffusionDenoiser continuous space diffusion denoiser
type DiffusionDenoiser struct {
    UNet   *DiTUNet
    Config *DiffusionConfig
}

// Denoise performs the complete denoising process
func (d *DiffusionDenoiser) Denoise(noisyLatent []float32, condEmbedding []float32) ([]float32, error) {
    // Cumulative noise schedule: alphaBars[t] is the product of (1 - beta_s) for s <= t
    alphaBars := d.getAlphaSchedule()

    // eta = 0 gives deterministic DDIM sampling; eta = 1 gives DDPM-like stochastic
    // sampling. It is a local constant because DiffusionConfig has no such field.
    const eta = 0.0

    // Current latent representation
    current := make([]float32, len(noisyLatent))
    copy(current, noisyLatent)

    // Step-by-step denoising
    for t := d.Config.Timesteps - 1; t >= 0; t-- {
        // Construct timestep embedding
        tEmbed := d.getTimestepEmbedding(t)

        // UNet predicts noise
        predictedNoise := d.UNet.Forward(current, tEmbed, condEmbedding)

        alphaBar := float64(alphaBars[t])
        alphaBarPrev := 1.0 // By convention, the cumulative alpha before step 0 is 1
        if t > 0 {
            alphaBarPrev = float64(alphaBars[t-1])
        }

        // Noise level of this step (zero when eta = 0 and at the final step)
        sigma := eta * math.Sqrt((1-alphaBarPrev)/(1-alphaBar)*(1-alphaBar/alphaBarPrev))

        // Coefficients of the DDIM update formula
        sqrtAlphaBar := math.Sqrt(alphaBar)
        sqrtOneMinusAlphaBar := math.Sqrt(1 - alphaBar)
        sqrtAlphaBarPrev := math.Sqrt(alphaBarPrev)
        dirCoef := math.Sqrt(math.Max(1-alphaBarPrev-sigma*sigma, 0))

        for i := range current {
            eps := float64(predictedNoise[i])

            // Predict original data
            x0 := (float64(current[i]) - sqrtOneMinusAlphaBar*eps) / sqrtAlphaBar

            // Move to the previous timestep
            next := sqrtAlphaBarPrev*x0 + dirCoef*eps

            // Add fresh noise only in stochastic mode
            if sigma > 0 {
                next += sigma * rand.NormFloat64()
            }
            current[i] = float32(next)
        }
    }

    return current, nil
}

// getAlphaSchedule returns the cumulative products of (1 - beta_t)
func (d *DiffusionDenoiser) getAlphaSchedule() []float32 {
    steps := d.Config.Timesteps
    betas := make([]float64, steps)

    switch d.Config.ScheduleType {
    case "cosine":
        // Cosine schedule: alphaBar(t) = f(t)/f(0) with
        // f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), and
        // beta_t = 1 - f(t+1)/f(t), clipped to 0.999
        const s = 0.008
        f := func(step int) float64 {
            c := math.Cos((float64(step)/float64(steps) + s) / (1 + s) * math.Pi / 2)
            return c * c
        }
        for i := range betas {
            betas[i] = math.Min(1-f(i+1)/f(i), 0.999)
        }
    default:
        // Linear schedule (also used when ScheduleType is unset or unknown)
        start, end := float64(d.Config.BetaStart), float64(d.Config.BetaEnd)
        for i := range betas {
            frac := 0.0
            if steps > 1 {
                frac = float64(i) / float64(steps-1)
            }
            betas[i] = start + frac*(end-start)
        }
    }

    // Calculate cumulative alpha
    alphaBars := make([]float32, steps)
    cum := 1.0
    for i, beta := range betas {
        cum *= 1 - beta
        alphaBars[i] = float32(cum)
    }

    return alphaBars
}

4.5 Fusion Generator

// FusionGenerator main fusion generator class
type FusionGenerator struct {
    VAE      *VQVAEEncoder
    AR       *ARGenerator
    Diff     *DiffusionDenoiser
    Decoder  *VQVAEDecoder
}

// Generate complete generation process
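func (f *FusionGenerator) Generate(prompt string, opts *GenerateOptions) ([]float32, error) {
    // Fall back to default options when none are supplied
    if opts == nil {
        opts = &GenerateOptions{}
    }

    // 1. Encode text condition
    textEmbed := f.encodeText(prompt)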
    
    // 2. Autoregressive generation of discrete skeleton
    numTokens := opts.NumTokens
    if numTokens == 0 {
        numTokens = 16 * 16 // Default 256 tokens
    }
    
    tokens, err := f.AR.Generate(textEmbed, numTokens)
    if err != nil {
        return nil, fmt.Errorf("autoregressive generation failed: %w", err)
    }
    
    // 3. Convert to continuous representation
    continuousLatent := f.tokensToContinuous(tokens)
    
    // 4. Add noise to start diffusion
    noiseScale := opts.NoiseScale
    if noiseScale == 0 {
        noiseScale = 0.5 // Default medium noise
    }
    
    noisyLatent := make([]float32, len(continuousLatent))
    for i, v := range continuousLatent {
        noise := float32(rand.NormFloat64()) * noiseScale
        noisyLatent[i] = v + noise
    }
    
    // 5. Diffusion denoising refinement
    refinedLatent, err := f.Diff.Denoise(noisyLatent, textEmbed)
    if err != nil {
        return nil, fmt.Errorf("diffusion denoising failed: %w", err)
    }
    
    // 6. VQ-VAE decode to image
    image, err := f.Decoder.Decode(refinedLatent)
    if err != nil {
        return nil, fmt.Errorf("image decoding failed: %w", err)
    }
    
    return image, nil
}

// tokensToContinuous converts discrete tokens to continuous representation
func (f *FusionGenerator) tokensToContinuous(tokens []Token) []float32 {
    latent := make([]float32, len(tokens)*f.VAE.EmbedDim)
    
    for i, token := range tokens {
        copy(latent[i*f.VAE.EmbedDim:(i+1)*f.VAE.EmbedDim], token.Embed)
    }
    
    return latent
}

5. Performance Optimization

5.1 Inference Acceleration Strategies

KV Cache Optimization: During autoregressive generation, recomputing the attention keys and values of all previous tokens at every step is a major bottleneck. Caching the Key-Value (KV) tensors and reusing them across steps avoids this repeated computation. Measurements show that for generating 256 tokens, KV caching can reduce computation by approximately 70%.
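The effect of reusing keys and values can be illustrated by counting per-token K/V projections during generation. This is a simplified sketch: projectionCounts is a hypothetical helper, and it counts only K/V projections per layer, not attention scores or feed-forward work, so its ratio is not the same quantity as the overall saving quoted above.

```go
package main

import "fmt"

// projectionCounts simulates generating numTokens tokens and counts how many
// per-token key/value projections are computed with and without a KV cache.
func projectionCounts(numTokens int) (withoutCache, withCache int) {
    for step := 1; step <= numTokens; step++ {
        withoutCache += step // re-project every token currently in the context
        withCache++          // project only the newest token and reuse the rest
    }
    return withoutCache, withCache
}

func main() {
    without, with := projectionCounts(256)
    fmt.Println(without, with) // prints "32896 256"
}
```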

Parallel Decoding: Traditional autoregressive models can only generate tokens one by one, while fusion models allow parallel prediction of multiple masked tokens. By introducing a “block-level autoregressive” strategy, the sequence is divided into blocks, with parallel prediction within blocks and causal dependencies maintained between blocks. This strategy improves inference speed by 3-5 times with almost no loss in quality.
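A block-level schedule can be sketched as follows: positions are grouped into consecutive blocks, each block is predicted in one forward pass, and blocks are processed in order. The helper blockSchedule and the block size of 4 are assumptions for illustration; the pass count is an idealized measure and does not by itself account for the measured speedup.

```go
package main

import "fmt"

// blockSchedule splits positions 0..numTokens-1 into consecutive blocks of at
// most blockSize positions. Each block corresponds to one forward pass.
func blockSchedule(numTokens, blockSize int) [][]int {
    if blockSize <= 0 {
        return nil
    }
    var blocks [][]int
    for start := 0; start < numTokens; start += blockSize {
        end := start + blockSize
        if end > numTokens {
            end = numTokens
        }
        block := make([]int, 0, end-start)
        for pos := start; pos < end; pos++ {
            block = append(block, pos)
        }
        blocks = append(blocks, block)
    }
    return blocks
}

func main() {
    blocks := blockSchedule(256, 4)
    fmt.Println(len(blocks), "forward passes instead of 256") // prints "64 forward passes instead of 256"
}
```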

Diffusion Step Compression: Using a DDIM sampler can compress diffusion steps from 1000 to 50 while maintaining visual quality. Further employing “knowledge distillation” to train a lightweight few-step diffusion model can compress steps to 4-8.
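Step compression with DDIM amounts to running the sampler on a subsequence of the training timesteps. The sketch below picks evenly spaced timesteps, which is one common choice among several; ddimTimesteps is an illustrative name.

```go
package main

import "fmt"

// ddimTimesteps picks numSteps evenly spaced timesteps out of trainSteps
// training timesteps, returned in descending order for sampling.
func ddimTimesteps(trainSteps, numSteps int) []int {
    if numSteps <= 0 || numSteps > trainSteps {
        return nil
    }
    stride := trainSteps / numSteps
    steps := make([]int, numSteps)
    for i := range steps {
        steps[i] = (numSteps - 1 - i) * stride
    }
    return steps
}

func main() {
    ts := ddimTimesteps(1000, 50)
    fmt.Println(len(ts), ts[0], ts[len(ts)-1]) // prints "50 980 0"
}
```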

5.2 Memory Optimization

Gradient Checkpointing: During training, gradient checkpointing trades computation for memory. Only some intermediate results from the forward pass are retained, and discarded results are recomputed during the backward pass. For large models like DiT, this can reduce GPU memory usage by 40%.

Mixed Precision Training: Using FP16/BF16 mixed precision, with dynamic loss scaling for FP16 (BF16 usually does not need it because it keeps the FP32 exponent range). On NVIDIA A100, mixed precision training can double throughput while reducing memory usage by approximately 30%.
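Dynamic loss scaling can be sketched as a small state machine: the scale is halved when gradients overflow and doubled after a run of overflow-free steps. The type lossScaler, the factor of 2, and the growth interval used here are illustrative assumptions; training frameworks expose their own configurable versions of this logic.

```go
package main

import "fmt"

// lossScaler tracks the current loss scale for FP16 training.
type lossScaler struct {
    scale          float64 // current multiplier applied to the loss
    growthInterval int     // overflow-free steps required before growing
    goodSteps      int     // consecutive overflow-free steps so far
}

// update adjusts the scale after one training step.
func (s *lossScaler) update(overflow bool) {
    if overflow {
        // Gradients overflowed: back off and restart the counter.
        s.scale /= 2
        s.goodSteps = 0
        return
    }
    s.goodSteps++
    if s.goodSteps >= s.growthInterval {
        // A stable run: try a larger scale to keep small gradients representable.
        s.scale *= 2
        s.goodSteps = 0
    }
}

func main() {
    scaler := &lossScaler{scale: 1024, growthInterval: 2}
    for _, overflow := range []bool{false, false, true, false} {
        scaler.update(overflow)
        fmt.Println(scaler.scale) // prints 1024, 2048, 1024, 1024 on successive lines
    }
}
```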

Model Parallelism: For very large models (over 10B parameters), a combination of tensor parallelism and pipeline parallelism is used. Transformer layers are split across multiple GPUs, and asynchronous communication overlaps with computation to hide transfer latency.

5.3 Code-Level Optimization

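import "sync"

// Use memory pool to reduce allocations
var latentPool = sync.Pool{
    New: func() interface{} {
        return make([]float32, 0, 256*256)
    },
}

func (d *DiffusionDenoiser) DenoiseOptimized(noisyLatent []float32, condEmbedding []float32) ([]float32, error) {
    n := len(noisyLatent)

    // Get a scratch buffer from the memory pool, growing it if it is too small
    scratch := latentPool.Get().([]float32)
    if cap(scratch) < n {
        scratch = make([]float32, n)
    }
    scratch = scratch[:n]
    // Return the buffer actually used (possibly the grown one) when done
    defer func() { latentPool.Put(scratch[:0]) }()

    // Work on a private copy so the caller's slice is left untouched and the
    // returned slice never aliases pooled memory
    result := make([]float32, n)
    copy(result, noisyLatent)

    // Ping-pong between the two buffers: each step reads src and writes dst
    src, dst := result, scratch
    for t := d.Config.Timesteps - 1; t >= 0; t-- {
        // Process the whole latent in one call to reduce function call overhead
        d.processTimestep(src, dst, t)

        // Swap roles instead of copying
        src, dst = dst, src
    }

    // After an odd number of steps the latest values sit in the pooled buffer,
    // so copy them out before the buffer goes back to the pool
    if d.Config.Timesteps > 0 && d.Config.Timesteps%2 == 1 {
        copy(result, scratch)
    }

    return result, nil
}

// vectorAddSIMD adds two vectors element-wise. This portable loop is a fallback:
// a real SIMD version would be an assembly-backed declaration without a Go body
// (the //go:noescape directive is only valid on such body-less declarations).
func vectorAddSIMD(a, b, result []float32) {
    for i := range a {
        result[i] = a[i] + b[i]
    }
}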

6. Production Practice

6.1 Deployment Architecture

In a real production environment, the fusion model is typically deployed as a microservice architecture:

Client → API Gateway → Load Balancer → Inference Node Cluster
                                        ↓
                                   Model Management Service
                                        ↓
                                   Cache Layer (Redis)
                                        ↓
                                   Object Storage (MinIO)

Inference nodes use GPU instances, each running one model replica. Kubernetes is used for auto-scaling, dynamically adjusting the number of nodes based on request volume.

6.2 Service Implementation

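import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/redis/go-redis/v9" // Assumed client; the Get/Set calls below match its API
)

// serviceMetrics groups the Prometheus collectors used by the handler
type serviceMetrics struct {
    Requests  prometheus.Counter
    Errors    prometheus.Counter
    CacheHits prometheus.Counter
    Latency   prometheus.Histogram
}

// FusionService fusion generation service
type FusionService struct {
    generator *FusionGenerator
    cache     *redis.Client
    metrics   *serviceMetrics
}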

// GenerateHandler HTTP handler function
func (s *FusionService) GenerateHandler(w http.ResponseWriter, r *http.Request) {
    // Parse request
    var req GenerateRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "invalid request", http.StatusBadRequest)
        return
    }
    
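    // Check cache (the key covers every request parameter that affects the output)
    cacheKey := fmt.Sprintf("gen:%s:%d:%v:%v:%vx%v",
        req.Prompt, req.Seed, req.NumTokens, req.NoiseScale, req.Width, req.Height)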
    if cached, err := s.cache.Get(r.Context(), cacheKey).Bytes(); err == nil {
        w.Header().Set("Content-Type", "image/png")
        w.Write(cached)
        s.metrics.CacheHits.Inc()
        return
    }
    
    // Generate image
    start := time.Now()
    image, err := s.generator.Generate(req.Prompt, &GenerateOptions{
        NumTokens:  req.NumTokens,
        NoiseScale: req.NoiseScale,
    })
    
    if err != nil {
        s.metrics.Errors.Inc()
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    
    // Encode to PNG
    var buf bytes.Buffer
    if err := encodePNG(&buf, image, req.Width, req.Height); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    
    // Write to cache
    s.cache.Set(r.Context(), cacheKey, buf.Bytes(), 1*time.Hour)
    
    // Record metrics
    s.metrics.Latency.Observe(time.Since(start).Seconds())
    s.metrics.Requests.Inc()
    
    // Return image
    w.Header().Set("Content-Type", "image/png")
    w.Write(buf.Bytes())
}

6.3 Monitoring and Alerting

A production environment requires a comprehensive monitoring system:

Model Metrics:

  • Generation quality (FID, CLIP Score)
  • Inference latency (P50, P99)
  • GPU utilization
  • GPU memory usage

Business Metrics:

  • QPS (Queries Per Second)
  • Success rate
  • Cache hit rate
  • Average response time

Alert Rules:

  • Latency exceeds 5 seconds triggers a warning
  • Error rate exceeds 1% triggers a critical alert
  • GPU memory exceeds 90% triggers scaling
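The alert rules above can be expressed as a small evaluation function. This is a sketch only: the snapshot type and the alert strings are hypothetical, and in practice such rules usually live in the monitoring system's own rule configuration rather than in service code.

```go
package main

import "fmt"

// snapshot holds one reading of the monitored values.
type snapshot struct {
    latencySeconds float64 // request latency in seconds
    errorRate      float64 // fraction of failed requests, 0..1
    gpuMemoryUsed  float64 // fraction of GPU memory in use, 0..1
}

// evaluateAlerts applies the three thresholds and returns the triggered alerts.
func evaluateAlerts(s snapshot) []string {
    var alerts []string
    if s.latencySeconds > 5 {
        alerts = append(alerts, "warning: latency above 5s")
    }
    if s.errorRate > 0.01 {
        alerts = append(alerts, "critical: error rate above 1%")
    }
    if s.gpuMemoryUsed > 0.90 {
        alerts = append(alerts, "scale-out: GPU memory above 90%")
    }
    return alerts
}

func main() {
    reading := snapshot{latencySeconds: 6.2, errorRate: 0.004, gpuMemoryUsed: 0.93}
    for _, alert := range evaluateAlerts(reading) {
        fmt.Println(alert)
    }
}
```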

6.4 Common Issues and Solutions

Issue 1: Artifacts in Generated Results

Solution: Adjust the diffusion noise schedule, increase denoising steps, or use a finer VQ-VAE codebook.

Issue 2: Stuttering in Autoregressive Generation

Solution: Check the KV cache implementation to confirm that cached keys and values are actually reused; consider using Flash Attention to optimize attention computation.

Issue 3: GPU Memory Overflow

Solution: Enable gradient checkpointing (during training), use model sharding, or reduce batch size.

7. Conclusion

The fusion generation paradigm combining diffusion models and autoregressive models represents an important technical direction in the field of visual generation. By combining the structural advantages of discrete prediction with the detail expressiveness of continuous denoising, this paradigm demonstrates outstanding performance in image/video generation tasks.

From a technical perspective, the core innovations are:

  1. Unified Probabilistic Framework: Unifies the causal constraints of autoregression with the continuous modeling of diffusion within the same generation process.
  2. Flexible Representation: Discrete tokens provide a semantic skeleton, while continuous vectors carry visual details.
  3. Efficient Inference Strategies: Parallel decoding and step compression make practical deployment feasible.

From an engineering practice perspective, we have verified:

  • Golang implementation can meet the performance requirements of a production environment.
  • Memory pools and SIMD optimizations can significantly improve throughput.
  • Caching and load balancing are key to ensuring service stability.

Looking ahead, there are several worthwhile questions to explore in this direction:

  • How to further reduce diffusion steps to achieve real-time generation?
  • How to unify multiple modalities such as text, images, and video into the same fusion framework?
  • How to design more efficient autoregressive architectures to avoid quadratic complexity?

The fusion generation paradigm is rapidly evolving. We believe that in the near future, we will see more applications based on this idea, from text-to-image to video generation, from creative design to scientific visualization. AI-generated content will reach new heights. As technical practitioners, we must keep up with cutting-edge research while focusing on engineering practice, transforming theoretical innovations into reliable production systems.