Optimizing Mixture-of-Experts (MoE) Model Deployment on Edge Devices

1. Background

1.1 Edge Computing Challenges in the Era of Large Models

In recent years, deep learning models have grown rapidly in scale. Large models, some with hundreds of billions of parameters, have achieved breakthroughs in natural language processing, computer vision, and other domains; GPT-4 and Gemini are prominent examples. However, their high computational cost and memory footprint largely confine them to cloud GPU clusters. Meanwhile, edge computing scenarios such as smart cameras, IoT devices, and mobile terminals have an increasingly urgent need for real-time processing, privacy preservation, and offline capability.

Edge devices typically suffer from the following limitations:

  • Limited compute power: CPU/GPU performance is far inferior to the cloud; some devices lack a GPU entirely.
  • Constrained memory: Common edge devices have 512MB to 8GB of RAM.
  • Power sensitivity: Battery-powered devices must control energy consumption.
  • Unstable network: Low-latency cloud communication cannot be guaranteed.

Mixture-of-Experts (MoE), as a sparsely activated architecture, theoretically offers new possibilities for edge deployment—each inference activates only a subset of experts, not the entire model. However, in practice, MoE still faces challenges such as large total parameter counts, routing computation overhead, and expert load imbalance.

1.2 Practical Significance of MoE on Edge

According to published research (e.g., Shazeer et al., 2017; Fedus et al., 2022), MoE architectures can significantly improve model quality under the same computational budget. For edge scenarios, the sparse nature of MoE implies:

  • Inference computation: Only 10%–30% of parameters are activated, reducing latency.
  • Memory footprint: Dynamic expert loading can reduce resident memory.
  • Task adaptability: Different experts can be fine-tuned for different tasks.

Nevertheless, the stringent latency and memory requirements of edge devices make direct deployment of raw MoE models infeasible. This article examines how to deploy MoE efficiently on edge devices using quantization, expert caching, and prefetching.

2. Technical Principle Analysis

2.1 Core Components of MoE Architecture

An MoE layer consists of three key parts:

graph LR
    A[Input Tensor] --> B[Gating Network Router]
    B --> C{Expert Selection}
    C -->|Top-K Experts| D[Expert 1]
    C -->|Top-K Experts| E[Expert 2]
    C -->|...| F[Expert N]
    D --> G[Weighted Fusion]
    E --> G
    F --> G
    G --> H[Output Tensor]

Gating Network (Router): Typically a single linear layer or small MLP that computes a weight distribution over the experts from the input and selects the Top-K experts.

Expert: Each expert is an independent feed-forward network (FFN) that tends to specialize in particular input patterns. The number of experts N is usually 8–64, and the number of activated experts K is usually 1–4.

Weighted Fusion: The outputs of the selected experts are combined as a weighted sum, using their gating weights as coefficients.
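The three components can be illustrated with a minimal sketch. The function names topKGate and fuse are hypothetical, and renormalizing the softmax over only the selected experts is one common choice assumed here, not something the article prescribes.

```go
package main

import (
	"math"
	"sort"
)

// topKGate returns the indices of the k highest router scores and their
// softmax weights, normalized over the selected experts only (an assumption).
func topKGate(scores []float32, k int) ([]int, []float32) {
	if k > len(scores) {
		k = len(scores)
	}
	if k <= 0 {
		return nil, nil
	}
	idx := make([]int, len(scores))
	for i := range idx {
		idx[i] = i
	}
	// Order expert indices by descending score; ties keep index order.
	sort.SliceStable(idx, func(a, b int) bool { return scores[idx[a]] > scores[idx[b]] })
	idx = idx[:k]

	// Softmax over the selected scores; subtracting the max avoids overflow.
	maxScore := scores[idx[0]]
	exps := make([]float64, k)
	var sum float64
	for i, e := range idx {
		exps[i] = math.Exp(float64(scores[e] - maxScore))
		sum += exps[i]
	}
	weights := make([]float32, k)
	for i := range weights {
		weights[i] = float32(exps[i] / sum)
	}
	return idx, weights
}

// fuse computes the weighted sum of the selected experts' outputs.
func fuse(outputs [][]float32, weights []float32) []float32 {
	if len(outputs) == 0 {
		return nil
	}
	result := make([]float32, len(outputs[0]))
	for j, out := range outputs {
		for i, v := range out {
			result[i] += weights[j] * v
		}
	}
	return result
}
```

Calling topKGate on the router scores yields the experts to run; fuse then combines their outputs in the same order as the returned weights.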

2.2 Computational Characteristics of Sparse Activation

The computational complexity of MoE can be expressed as:

O_total = O_router + K * O_expert + O_fusion

Where O_expert is the computation of a single expert. When K ≪ N, the expert computation is approximately K/N of that of a dense model with the same total parameter count. However, note:

  • Although the gating network is small, it must compute weights for all experts (O(N)), which is non-negligible when N is large.
  • Data distribution after expert selection incurs communication overhead (multi-GPU scenarios) or memory copy overhead (single device).
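To make the cost formula concrete, the sketch below evaluates it in abstract cost units. The function names and the idea of expressing router and fusion cost as multiples of one expert's cost are assumptions made for illustration.

```go
package main

// layerCost mirrors O_total = O_router + K*O_expert + O_fusion.
// All costs are in the same abstract unit (for example, multiply-accumulates).
func layerCost(routerCost, expertCost, fusionCost float64, topK int) float64 {
	return routerCost + float64(topK)*expertCost + fusionCost
}

// costRatio compares the sparse layer against running all numExperts experts.
func costRatio(routerCost, expertCost, fusionCost float64, numExperts, topK int) float64 {
	dense := float64(numExperts) * expertCost
	return layerCost(routerCost, expertCost, fusionCost, topK) / dense
}
```

With N = 8 and K = 2, and router and fusion cost ignored, the ratio is 2/8 = 0.25. If router and fusion each cost half as much as one expert, the ratio rises to 3/8 = 0.375, which shows why the overheads listed above matter on small devices.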

2.3 Key Bottlenecks on Edge Devices

| Bottleneck Type | Specific Manifestation | Severity |
|---|---|---|
| Total Parameters | Model file can reach tens of GB and cannot fit in edge storage | Fatal |
| Dynamic Sparsity | Each inference activates different experts, causing irregular memory access patterns | Moderate |
| Quantization Precision | Edge devices often require INT8 quantization, but MoE is more sensitive to quantization | High |
| Expert Load | Some experts are activated frequently, creating computation hotspots | Moderate |

3. System Architecture Design

3.1 Layered Deployment Architecture

For edge devices, we design a three-layer architecture:

graph TB
    subgraph Cloud
        A[Full-Precision MoE Model] --> B[Quantization Compression]
        B --> C[Expert Library Generation]
    end
    
    subgraph Edge Device
        D[Lightweight Router] --> E{Expert Cache Pool}
        E --> F[Expert 1]
        E --> G[Expert 2]
        E --> H[...]
        F --> I[Inference Engine]
        G --> I
        H --> I
    end
    
    subgraph Offline Optimization
        J[Expert Clustering] --> K[Quantization Calibration]
        K --> L[Cache Strategy]
    end

Cloud: Responsible for model training, quantization compression, expert clustering, and cache strategy generation.

Edge: Runs a lightweight gating network and expert cache pool, loading experts on demand.

Offline Optimization: Analyzes expert usage frequency through profiling to optimize cache strategies.

3.2 Expert Cache Pool Design

Edge devices have limited memory and cannot host all experts simultaneously. We introduce an LRU cache pool:

// ExpertCache manages a pool of cached experts (simplified sketch; see 4.3 for the full type)
type ExpertCache struct {
    mu       sync.RWMutex
    maxSize  int                                // Maximum number of cached experts
    experts  map[string]*Expert                 // Expert name -> Expert object
    lruList  *list.List                         // LRU linked list
    loadFunc func(name string) (*Expert, error) // Function to load expert from storage
}

// Expert holds the parameters of a single expert
type Expert struct {
    Name    string
    Weights []float32 // Weights, dequantized to float32 when loaded
    Biases  []float32
    Freq    int64 // Access frequency counter
}

Cache Strategy:

  1. Initial load: Preload the Top-20% most frequently used experts based on offline profiling results.
  2. Dynamic replacement: Use the LRU algorithm; when the cache is full, evict the least recently used expert.
  3. Prefetch mechanism: Predict the next batch of potentially activated experts based on the gating network’s historical selection patterns.
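The prefetch step above could be approximated with a sliding-window frequency count, as in the following sketch. The type selectionHistory and its methods are hypothetical names, the window-based heuristic is an assumption, and the type is not safe for concurrent use without an external lock.

```go
package main

import "sort"

// selectionHistory counts how often each expert was selected within the
// most recent `limit` routing decisions.
type selectionHistory struct {
	window []int       // recent expert indices, oldest first
	limit  int         // maximum window length
	counts map[int]int // expert index -> occurrences in the window
}

func newSelectionHistory(limit int) *selectionHistory {
	return &selectionHistory{limit: limit, counts: make(map[int]int)}
}

// record adds one selection and drops the oldest once the window is full.
func (h *selectionHistory) record(expert int) {
	h.window = append(h.window, expert)
	h.counts[expert]++
	if len(h.window) > h.limit {
		old := h.window[0]
		h.window = h.window[1:]
		h.counts[old]--
		if h.counts[old] == 0 {
			delete(h.counts, old)
		}
	}
}

// predict returns up to n experts, most frequently selected first.
func (h *selectionHistory) predict(n int) []int {
	if n < 0 {
		n = 0
	}
	experts := make([]int, 0, len(h.counts))
	for e := range h.counts {
		experts = append(experts, e)
	}
	sort.Slice(experts, func(a, b int) bool {
		ca, cb := h.counts[experts[a]], h.counts[experts[b]]
		if ca != cb {
			return ca > cb
		}
		return experts[a] < experts[b] // deterministic tie-break
	})
	if n < len(experts) {
		experts = experts[:n]
	}
	return experts
}
```

After each routing decision the engine would call record for every selected expert, then pass the result of predict to the cache's prefetch path.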

3.3 Quantization-Aware Routing

Quantization introduces accuracy loss, especially affecting the weight distribution of the gating network. We design quantization-aware routing:

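// QuantizedRouter implements a quantization-aware gating network
type QuantizedRouter struct {
    // INT8 quantization parameters
    inputScale  float32 // Scale used to quantize input activations
    weightScale float32 // Scale of the symmetric INT8 weights
    zeroPoint   int32   // Zero point of the quantized input

    // Quantized weight matrix (INT8), row-major: numExperts x inputDim
    weightQ []int8
    bias    []float32

    numExperts int // Number of experts
    topK       int // Number of experts to activate
}

// Forward performs forward propagation, returning Top-K expert indices and weights.
// quantize and topKSoftmax are helpers assumed to be defined elsewhere in the package.
func (r *QuantizedRouter) Forward(input []float32) ([]int, []float32) {
    // 1. Quantize input to INT8
    inputQ := quantize(input, r.inputScale, r.zeroPoint)

    // 2. INT8 matrix multiplication (input x weights), accumulated in INT32
    logits := make([]int32, r.numExperts)
    for i := 0; i < r.numExperts; i++ {
        row := r.weightQ[i*len(inputQ) : (i+1)*len(inputQ)]
        for j, q := range inputQ {
            // Remove the input zero point before multiplying
            logits[i] += (int32(q) - r.zeroPoint) * int32(row[j])
        }
    }

    // 3. Dequantize (the product carries both scales) and add bias
    scores := make([]float32, r.numExperts)
    for i := 0; i < r.numExperts; i++ {
        scores[i] = float32(logits[i])*r.inputScale*r.weightScale + r.bias[i]
    }

    // 4. Softmax + Top-K selection
    return topKSoftmax(scores, r.topK)
}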
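The router relies on a quantize helper that the article does not show. A minimal sketch of what it might look like follows, assuming affine quantization q = round(x / scale) + zeroPoint with saturation to the INT8 range, a positive scale, and finite inputs.

```go
package main

import "math"

// quantize maps float32 activations to INT8 using
// q = round(x/scale) + zeroPoint, saturated to [-128, 127].
// It assumes scale > 0 and finite inputs.
func quantize(input []float32, scale float32, zeroPoint int32) []int8 {
	out := make([]int8, len(input))
	for i, x := range input {
		v := math.Round(float64(x)/float64(scale)) + float64(zeroPoint)
		// Saturate in floating point so the integer conversion is always in range.
		if v > 127 {
			v = 127
		} else if v < -128 {
			v = -128
		}
		out[i] = int8(v)
	}
	return out
}
```

Saturating before the integer conversion matters because converting an out-of-range floating-point value to an integer type in Go gives an implementation-dependent result.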

4. Complete Golang Example Code

4.1 Core Implementation of MoE Inference Engine

package moe

import (
    "container/list"
    "encoding/binary"
    "fmt"
    "math"
    "os"
    "sync"
    "time"
)

// MoEConfig configures the MoE inference engine
type MoEConfig struct {
    NumExperts    int     // Total number of experts
    TopK          int     // Number of activated experts
    HiddenSize    int     // Hidden layer dimension
    QuantBits     int     // Quantization bits (8 = INT8; other values load raw float32 weights)
    CacheSize     int     // Expert cache pool size
    PrefetchRatio float64 // Prefetch ratio (0.0~1.0)
}

// MoEEngine is the MoE inference engine
type MoEEngine struct {
    config    MoEConfig
    router    *QuantizedRouter
    cache     *ExpertCache
    stats     *EngineStats
}

// EngineStats collects performance statistics
type EngineStats struct {
    TotalInferences   int64
    CacheHits         int64
    CacheMisses       int64
    AvgInferenceTime  time.Duration
    mu                sync.Mutex
}

// NewMoEEngine creates a new MoE inference engine
func NewMoEEngine(cfg MoEConfig, routerPath string) (*MoEEngine, error) {
    // Load the quantized gating network
    router, err := LoadQuantizedRouter(routerPath, cfg.NumExperts, cfg.HiddenSize, cfg.QuantBits)
    if err != nil {
        return nil, fmt.Errorf("failed to load gating network: %w", err)
    }

    // Create the expert cache pool
    cache := NewExpertCache(cfg.CacheSize, func(name string) (*Expert, error) {
        // Load expert weights from the filesystem
        return loadExpertFromDisk(name, cfg.HiddenSize, cfg.QuantBits)
    })

    return &MoEEngine{
        config: cfg,
        router: router,
        cache:  cache,
        stats:  &EngineStats{},
    }, nil
}

// Infer performs a single inference
func (e *MoEEngine) Infer(input []float32) ([]float32, error) {
    start := time.Now()
    defer func() {
        e.stats.mu.Lock()
        e.stats.TotalInferences++
        e.stats.AvgInferenceTime = time.Duration(
            (int64(e.stats.AvgInferenceTime)*int64(e.stats.TotalInferences-1) +
                int64(time.Since(start))) / int64(e.stats.TotalInferences))
        e.stats.mu.Unlock()
    }()

    // 1. Gating network selects experts
    expertIndices, expertWeights := e.router.Forward(input)
    if len(expertIndices) != e.config.TopK {
        return nil, fmt.Errorf("gating network returned %d experts, expected %d", len(expertIndices), e.config.TopK)
    }

    // 2. Retrieve experts from cache (including prefetch)
    experts := make([]*Expert, e.config.TopK)
    for i, idx := range expertIndices {
        expertName := fmt.Sprintf("expert_%d", idx)
        expert, err := e.cache.Get(expertName)
        if err != nil {
            return nil, fmt.Errorf("failed to retrieve expert %s: %w", expertName, err)
        }
        experts[i] = expert
        e.stats.mu.Lock()
        if e.cache.wasMiss(expertName) {
            e.stats.CacheMisses++
        } else {
            e.stats.CacheHits++
        }
        e.stats.mu.Unlock()
    }

    // 3. Asynchronously prefetch the next batch of potential experts
    go e.prefetchExperts(input)

    // 4. Execute expert forward propagation
    outputs := make([][]float32, e.config.TopK)
    var wg sync.WaitGroup
    for i, expert := range experts {
        wg.Add(1)
        go func(idx int, exp *Expert) {
            defer wg.Done()
            outputs[idx] = exp.Forward(input)
        }(i, expert)
    }
    wg.Wait()

    // 5. Weighted fusion
    result := make([]float32, len(outputs[0]))
    for i := 0; i < len(result); i++ {
        var sum float32
        for j := 0; j < e.config.TopK; j++ {
            sum += expertWeights[j] * outputs[j][i]
        }
        result[i] = sum
    }

    return result, nil
}

// prefetchExperts prefetches experts based on historical patterns
func (e *MoEEngine) prefetchExperts(input []float32) {
    // Use a simplified prediction model: frequency of recently activated experts
    // In production, a more sophisticated sequence prediction model can be deployed
    predicted := e.router.PredictNextExperts(input, int(float64(e.config.CacheSize)*e.config.PrefetchRatio))
    for _, name := range predicted {
        e.cache.Prefetch(name)
    }
}

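// GetStats returns a snapshot of the performance statistics.
// Fields are copied individually so that the mutex itself is never copied.
func (e *MoEEngine) GetStats() EngineStats {
    e.stats.mu.Lock()
    defer e.stats.mu.Unlock()
    return EngineStats{
        TotalInferences:  e.stats.TotalInferences,
        CacheHits:        e.stats.CacheHits,
        CacheMisses:      e.stats.CacheMisses,
        AvgInferenceTime: e.stats.AvgInferenceTime,
    }
}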

4.2 Quantization Utility Functions

// QuantizeWeights quantizes float32 weights to INT8
func QuantizeWeights(weights []float32, bits int) ([]int8, float32, int32, error) {
    if bits != 8 {
        return nil, 0, 0, fmt.Errorf("only INT8 quantization is currently supported")
    }
    if len(weights) == 0 {
        return nil, 0, 0, fmt.Errorf("no weights to quantize")
    }

    // Compute quantization parameters
    var minVal, maxVal float32 = math.MaxFloat32, -math.MaxFloat32
    for _, w := range weights {
        if w < minVal {
            minVal = w
        }
        if w > maxVal {
            maxVal = w
        }
    }

    // Symmetric quantization
    scale := max(maxVal, -minVal) / 127.0
    if scale == 0 {
        scale = 1 // All weights are zero; avoid dividing by zero below
    }
    zeroPoint := int32(0) // Zero point is 0 for symmetric quantization

    // Quantize
    quantized := make([]int8, len(weights))
    for i, w := range weights {
        q := int32(math.Round(float64(w / scale)))
        if q > 127 {
            q = 127
        } else if q < -128 {
            q = -128
        }
        quantized[i] = int8(q)
    }

    return quantized, scale, zeroPoint, nil
}

// Dequantize converts INT8 quantized values back to float32
func Dequantize(quantized []int8, scale float32, zeroPoint int32) []float32 {
    result := make([]float32, len(quantized))
    for i, q := range quantized {
        result[i] = float32(int32(q)-zeroPoint) * scale
    }
    return result
}

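// max returns the larger of two float32 values. Go 1.21+ has a built-in max;
// this helper shadows it and keeps older toolchains working.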
func max(a, b float32) float32 {
    if a > b {
        return a
    }
    return b
}

4.3 Expert Loading and Cache Implementation

// loadExpertFromDisk loads expert weights from disk
func loadExpertFromDisk(name string, hiddenSize int, quantBits int) (*Expert, error) {
    // Expert weight file naming convention: experts/{name}.bin
    filename := fmt.Sprintf("experts/%s.bin", name)
    f, err := os.Open(filename)
    if err != nil {
        return nil, fmt.Errorf("failed to open expert file %s: %w", filename, err)
    }
    defer f.Close()

    // Read metadata
    var numWeights int32
    if err := binary.Read(f, binary.LittleEndian, &numWeights); err != nil {
        return nil, fmt.Errorf("failed to read weight count: %w", err)
    }
    if numWeights < 0 {
        return nil, fmt.Errorf("invalid weight count %d in %s", numWeights, filename)
    }

    // Read weights
    var weights []float32
    if quantBits == 8 {
        // INT8 quantized weights: scale and zero point precede the data
        var scale float32
        var zeroPoint int32
        if err := binary.Read(f, binary.LittleEndian, &scale); err != nil {
            return nil, fmt.Errorf("failed to read scale: %w", err)
        }
        if err := binary.Read(f, binary.LittleEndian, &zeroPoint); err != nil {
            return nil, fmt.Errorf("failed to read zero point: %w", err)
        }

        quantized := make([]int8, numWeights)
        if err := binary.Read(f, binary.LittleEndian, quantized); err != nil {
            return nil, fmt.Errorf("failed to read quantized weights: %w", err)
        }
        weights = Dequantize(quantized, scale, zeroPoint)
    } else {
        // float32 raw weights
        weights = make([]float32, numWeights)
        if err := binary.Read(f, binary.LittleEndian, weights); err != nil {
            return nil, fmt.Errorf("failed to read raw weights: %w", err)
        }
    }

    // Read biases
    biases := make([]float32, hiddenSize)
    if err := binary.Read(f, binary.LittleEndian, biases); err != nil {
        return nil, fmt.Errorf("failed to read biases: %w", err)
    }

    return &Expert{
        Name:    name,
        Weights: weights,
        Biases:  biases,
    }, nil
}

// ExpertCache implements the expert cache pool
type ExpertCache struct {
    mu       sync.RWMutex
    maxSize  int
    experts  map[string]*list.Element // Expert name -> linked list node
    lruList  *list.List               // LRU linked list
    loadFunc func(name string) (*Expert, error)
    missSet  map[string]bool          // Tracks recent misses
}

// cacheEntry represents a cache entry
type cacheEntry struct {
    name   string
    expert *Expert
}

// NewExpertCache creates a new expert cache pool
func NewExpertCache(maxSize int, loadFunc func(string) (*Expert, error)) *ExpertCache {
    return &ExpertCache{
        maxSize:  maxSize,
        experts:  make(map[string]*list.Element),
        lruList:  list.New(),
        loadFunc: loadFunc,
        missSet:  make(map[string]bool),
    }
}

// Get retrieves an expert; loads from disk if not cached
func (c *ExpertCache) Get(name string) (*Expert, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    if elem, ok := c.experts[name]; ok {
        // Cache hit, move to front of list
        c.lruList.MoveToFront(elem)
        c.missSet[name] = false
        return elem.Value.(*cacheEntry).expert, nil
    }

    // Cache miss, load from disk
    expert, err := c.loadFunc(name)
    if err != nil {
        return nil, fmt.Errorf("failed to load expert %s: %w", name, err)
    }

    // If cache is full, evict the least recently used expert
    if c.lruList.Len() >= c.maxSize {
        backElem := c.lruList.Back()
        if backElem != nil {
            entry := backElem.Value.(*cacheEntry)
            delete(c.experts, entry.name)
            c.lruList.Remove(backElem)
        }
    }

    // Insert new expert
    entry := &cacheEntry{name: name, expert: expert}
    elem := c.lruList.PushFront(entry)
    c.experts[name] = elem
    c.missSet[name] = true

    return expert, nil
}

// Prefetch prefetches an expert into the cache
func (c *ExpertCache) Prefetch(name string) {
    c.mu.Lock()
    defer c.mu.Unlock()

    // Skip if already present
    if _, ok := c.experts[name]; ok {
        return
    }

    // Load asynchronously (in production, use a worker pool)
    go func() {
        expert, err := c.loadFunc(name)
        if err != nil {
            fmt.Printf("failed to prefetch expert %s: %v\n", name, err)
            return
        }

        c.mu.Lock()
        defer c.mu.Unlock()

        // Re-check if loaded by another goroutine
        if _, ok := c.experts[name]; ok {
            return
        }

        // Eviction policy same as Get
        if c.lruList.Len() >= c.maxSize {
            backElem := c.lruList.Back()
            if backElem != nil {
                entry := backElem.Value.(*cacheEntry)
                delete(c.experts, entry.name)
                c.lruList.Remove(backElem)
            }
        }

        entry := &cacheEntry{name: name, expert: expert}
        elem := c.lruList.PushFront(entry)
        c.experts[name] = elem
    }()
}

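// wasMiss reports whether the most recent Get for name was a cache miss.
// It takes the lock because Get modifies missSet and may run concurrently.
func (c *ExpertCache) wasMiss(name string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()

    miss, ok := c.missSet[name]
    if ok {
        delete(c.missSet, name)
    }
    return miss
}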

5. Performance Optimization Recommendations

5.1 Quantization Strategy Selection

| Quantization Scheme | Accuracy Loss | Memory Savings | Recommended Scenario |
|---|---|---|---|
| INT8 Symmetric | 1%–3% | 4x | Most edge devices |
| INT4 Asymmetric | 3%–8% | 8x | Extremely memory-constrained scenarios |
| Mixed Precision (FP16+INT8) | <1% | 2x | Devices with FP16 support |

Best Practices:

  • The gating network is more sensitive to precision; consider using FP16 or retaining higher quantization bit-widths.
  • Expert weights can tolerate more aggressive quantization, as errors from individual experts are averaged during fusion.
  • Where accuracy matters, prefer quantization-aware training (QAT) over simple post-training quantization (PTQ); if PTQ is used, calibrate it on a representative dataset.
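For the INT4 asymmetric scheme in the table, the following sketch shows one way to store two 4-bit codes per byte. The function names are hypothetical, and storing the minimum value as the offset (equivalent to a zero point) is an assumption; production schemes usually quantize per group or per channel rather than per tensor.

```go
package main

import "math"

// quantizeInt4 maps weights to 4-bit codes (0..15) with an asymmetric
// scale and offset, packing two codes per byte (low nibble first).
// It returns the packed bytes, the scale, and the offset (minimum weight).
func quantizeInt4(weights []float32) ([]byte, float32, float32) {
	if len(weights) == 0 {
		return nil, 1, 0
	}
	minVal, maxVal := weights[0], weights[0]
	for _, w := range weights[1:] {
		if w < minVal {
			minVal = w
		}
		if w > maxVal {
			maxVal = w
		}
	}
	scale := (maxVal - minVal) / 15
	if scale == 0 {
		scale = 1 // all weights equal; any positive scale works
	}
	packed := make([]byte, (len(weights)+1)/2)
	for i, w := range weights {
		code := math.Round(float64((w - minVal) / scale))
		if code < 0 {
			code = 0
		} else if code > 15 {
			code = 15
		}
		if i%2 == 0 {
			packed[i/2] |= byte(code)
		} else {
			packed[i/2] |= byte(code) << 4
		}
	}
	return packed, scale, minVal
}

// dequantizeInt4 reverses quantizeInt4 for the first n weights.
func dequantizeInt4(packed []byte, n int, scale, minVal float32) []float32 {
	out := make([]float32, n)
	for i := range out {
		c := packed[i/2]
		if i%2 == 1 {
			c >>= 4
		} else {
			c &= 0x0F
		}
		out[i] = float32(c)*scale + minVal
	}
	return out
}
```

Each weight occupies half a byte instead of four bytes, which is where the 8x figure relative to float32 comes from (ignoring the small per-tensor metadata).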

5.2 Expert Cache Optimization

// WarmupCache loads high-frequency experts at startup
func (e *MoEEngine) WarmupCache(topN int) {
    // Retrieve the list of hot experts from offline profiling results
    hotExperts := e.router.GetHotExperts(topN)
    for _, name := range hotExperts {
        _, err := e.cache.Get(name)
        if err != nil {
            fmt.Printf("failed to warmup expert %s: %v\n", name, err)
        }
    }
}

Cache Parameter Tuning:

  • Suggested cache size is 2–3 times the number of active experts.
  • When the cache hit rate falls below 80%, increase the cache size or optimize the prefetch strategy.
  • Use memory-mapped files (mmap) for loading experts to reduce memory copies.

5.3 Computation Graph Optimization

// Vectorized expert forward propagation (using SIMD)
func (e *Expert) Forward(input []float32) []float32 {
    output := make([]float32, len(e.Biases))
    
    // Use Go assembly or CGO to call SIMD libraries
    // This is a placeholder; in practice, call BLAS or implement manually
    vectorMatMul(input, e.Weights, output, len(input), len(e.Biases))
    
    // Add biases
    for i := 0; i < len(output); i++ {
        output[i] += e.Biases[i]
    }
    
    // Activation function (e.g., ReLU)
    for i := 0; i < len(output); i++ {
        if output[i] < 0 {
            output[i] = 0
        }
    }
    
    return output
}
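The vectorMatMul call above is a placeholder. A plain scalar reference version might look like the sketch below, assuming the weights are stored row-major as outDim rows of inDim values; that layout is an assumption, and an optimized build would replace the inner loop with a BLAS or SIMD routine.

```go
package main

// vectorMatMul computes output[o] = sum_i weights[o*inDim+i] * input[i].
// weights is assumed to be row-major with outDim rows and inDim columns.
func vectorMatMul(input, weights, output []float32, inDim, outDim int) {
	for o := 0; o < outDim; o++ {
		row := weights[o*inDim : (o+1)*inDim]
		var acc float32
		for i, x := range input[:inDim] {
			acc += row[i] * x
		}
		output[o] = acc
	}
}
```

A scalar version like this is also useful as a correctness baseline when validating an accelerated implementation on the target device.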

Optimization Points:

  1. Use BLAS libraries for matrix multiplication.
  2. Merge multiple small matrix multiplications into one large operation (for batch inference).
  3. Pad expert inputs for alignment to leverage cache lines.
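For the second point, batch inference first needs to know which tokens go to which expert, so that each expert runs once on all of its tokens. A minimal sketch of that grouping step follows; the function name groupByExpert and the input layout are assumptions.

```go
package main

// groupByExpert inverts per-token routing decisions. assignments[t] lists
// the experts selected for token t; the result maps each expert index to
// the token positions routed to it, in ascending token order.
func groupByExpert(assignments [][]int) map[int][]int {
	groups := make(map[int][]int)
	for token, experts := range assignments {
		for _, e := range experts {
			groups[e] = append(groups[e], token)
		}
	}
	return groups
}
```

The inputs of each group can then be stacked into one matrix, turning several matrix-vector products into a single matrix-matrix product per expert.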

5.4 Memory Management Techniques

// Use object pools to reduce memory allocations
var expertOutputPool = sync.Pool{
    New: func() interface{} {
        return make([]float32, 0, 1024) // Pre-allocate capacity
    },
}

func (e *Expert) ForwardWithPool(input []float32) []float32 {
    output := expertOutputPool.Get().([]float32)
    if cap(output) < len(e.Biases) {
        // Pooled buffer is too small for this expert; allocate a larger one
        output = make([]float32, len(e.Biases))
    }
    output = output[:len(e.Biases)]
    for i := range output {
        output[i] = 0 // Clear values left over from a previous use
    }
    // ... computation logic
    return output
}

// Return the output to the pool after use
func returnOutput(output []float32) {
    expertOutputPool.Put(output[:0])
}

6. Production Environment Best Practices

6.1 Model Compression and Distribution

Compression Pipeline:

  1. Train the original MoE model (FP32).
  2. Perform INT8 quantization using a calibration dataset.
  3. Cluster experts and generate an expert index table.
  4. Split the model into: gating network (small file) + expert file collection.
  5. Compress the resulting files (e.g., with zstd) to further reduce size.

Distribution Strategy:

  • Initial deployment: Full download of all experts (can be batched).
  • Incremental updates: Download only new or updated experts.
  • On-demand loading: Based on the device’s usage scenario, download only domain-relevant experts.
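Incremental updates need a way to decide which expert files changed. One possible approach, sketched below, compares a local manifest with a remote one, where each manifest maps an expert name to a content hash. The manifest format and the function name are assumptions, not part of the article's design.

```go
package main

import "sort"

// expertsToDownload returns the experts that are missing locally or whose
// hash differs from the remote manifest, sorted by name.
func expertsToDownload(local, remote map[string]string) []string {
	var names []string
	for name, hash := range remote {
		if h, ok := local[name]; !ok || h != hash {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}
```

The same comparison supports on-demand loading if the remote manifest is first filtered down to the experts relevant to the device's domain.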

6.2 Monitoring and Adaptation

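// Adaptive cache strategy
type AdaptiveCache struct {
    base      *ExpertCache
    stats     *EngineStats // Statistics of the engine that owns base
    threshold float64      // Hit rate threshold (0.0~1.0)
}

// Adjust re-evaluates the cache size; call it after each inference
func (a *AdaptiveCache) Adjust() {
    a.stats.mu.Lock()
    total, hits, misses := a.stats.TotalInferences, a.stats.CacheHits, a.stats.CacheMisses
    a.stats.mu.Unlock()

    // Adjust every 100 inferences, once at least one lookup was recorded
    if total == 0 || total%100 != 0 || hits+misses == 0 {
        return
    }
    // Hit rate is measured per cache lookup, not per inference
    currentHitRate := float64(hits) / float64(hits+misses)
    if currentHitRate < a.threshold {
        // Hit rate too low: grow the cache by 20%, and by at least one slot
        a.base.mu.Lock()
        a.base.maxSize = int(float64(a.base.maxSize)*1.2) + 1
        newSize := a.base.maxSize
        a.base.mu.Unlock()
        fmt.Printf("Cache hit rate %.2f%% below threshold, expanding to %d\n", currentHitRate*100, newSize)
    }
}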

Key Metrics:

  • Inference latency P50/P95/P99
  • Cache hit rate
  • Expert load count
  • Peak memory usage
  • Power consumption (for battery-powered devices)
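Latency percentiles can be computed from recorded samples with the nearest-rank method, as in this sketch. The function name and the use of milliseconds as float64 are assumptions; long-running services would more likely use a streaming histogram than keep every sample.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of the latency
// samples using the nearest-rank method. The input slice is not modified.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)
	// Nearest rank: the smallest rank covering at least p percent of samples.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}
```

Calling percentile with p set to 50, 95, and 99 on the same sample set gives the three latency figures listed above.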

6.3 Security Considerations

  1. Model Protection: Encrypt expert weights at rest and decrypt at runtime.
  2. Tamper Prevention: Use digital signatures to verify expert file integrity.
  3. Privacy Isolation: Process sensitive data locally only; do not transmit to the cloud.
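Tamper prevention could be implemented with Ed25519 signatures from Go's standard library, as sketched below. The function names are hypothetical, and the choice of Ed25519 is an assumption; in a real deployment signing would happen in the cloud pipeline and only the public key would ship with the device.

```go
package main

import "crypto/ed25519"

// generateKeyPair creates a signing key pair using the system's secure
// random source. In practice this runs once, off-device.
func generateKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(nil)
}

// signExpert signs the raw bytes of an expert file.
func signExpert(priv ed25519.PrivateKey, data []byte) []byte {
	return ed25519.Sign(priv, data)
}

// verifyExpert reports whether sig is a valid signature of data.
// The length check avoids the panic ed25519.Verify raises for malformed keys.
func verifyExpert(pub ed25519.PublicKey, data, sig []byte) bool {
	if len(pub) != ed25519.PublicKeySize {
		return false
	}
	return ed25519.Verify(pub, data, sig)
}
```

The device would call verifyExpert on each expert file before loading it into the cache and reject any file that fails the check.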

6.4 Failure Handling

// Fallback strategy when expert loading fails.
// Assumes MoEEngine carries a fallbackModel field (a lightweight dense model whose
// Forward returns ([]float32, error)); the struct in section 4.1 does not show it.
func (e *MoEEngine) InferWithFallback(input []float32) ([]float32, error) {
    result, err := e.Infer(input)
    if err != nil {
        fmt.Printf("MoE inference failed, using fallback: %v\n", err)
        // Fallback 1: Use the most recent expert from cache
        // Fallback 2: Use a lightweight dense model
        return e.fallbackModel.Forward(input)
    }
    return result, nil
}

7. Conclusion

7.1 Key Findings

Through this exploration, we draw the following conclusions:

  1. MoE is feasible on edge devices but requires deep optimization: The raw MoE architecture cannot be deployed directly. With quantization (4x–8x compression), expert caching (lower resident memory), and prefetching, the design targets inference latency within 50ms and memory usage under 500MB; the figures actually achieved depend on the model and the device.

  2. Quantization is the core bottleneck: The gating network is sensitive to quantization precision; mixed precision (FP16+INT8) is recommended. Expert weights can tolerate more aggressive quantization (INT4), but attention must be paid to outlier handling.

  3. Cache strategy determines performance: An LRU cache combined with history-based prefetching can achieve hit rates above 90%, avoiding frequent disk I/O.

  4. Engineering practices require continuous iteration: In production environments, metrics such as cache hit rate and inference latency must be monitored, and configurations must be dynamically adjusted based on actual workloads.

7.2 Future Outlook

  • Hardware-Software Co-Design: Edge NPUs can be customized with hardware accelerators tailored to the sparse activation characteristics of MoE.
  • On-Device Training: Support fine-tuning of specific experts on edge devices for personalization.
  • Federated MoE: Multiple edge devices share an expert library, protecting privacy through federated learning.

7.3 Practical Recommendations

For teams planning to deploy MoE on edge devices, we recommend the following steps:

  1. Prototype Validation: Run benchmarks on the target device using the Golang code in this article.
  2. Quantization Experiments: Compare the impact of different quantization schemes on accuracy and performance.
  3. Cache Tuning: Adjust cache size and prefetch strategy based on actual access patterns.
  4. Gradual Rollout: Deploy on a subset of devices first, collect data for optimization, then launch fully.

References:

  • Shazeer et al. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (2017)
  • Fedus et al. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” (2022)
  • Yi et al. “EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models” (2023)