The Technical Secrets Behind Chinese LLMs' Counter-Trend Price Cuts — From MoE Architecture to Domestic AI Chip Adaptation

Abstract: In May 2026, DeepSeek announced a permanent 75% price cut, Xiaomi MiMo slashed prices by 99%, while OpenAI raised its prices to $5/$30 per million tokens — the LLM market has entered an unprecedented “K-shaped divergence.” These price cuts are far from “selling at a loss for market share.” Behind them lie three hardcore technical engines: MoE sparse architecture, tiered KV cache optimization, and domestic AI chip adaptation. This article dives deep into these technologies from an engineering perspective, using Go and Python code to demystify the cost-reduction playbook.


1. Introduction: The Logic Behind K-Shaped Divergence

1.1 A Tale of Two Markets

Architecture Diagram

The LLM pricing landscape in June 2026 reveals an unprecedented divergence:

CampModelInput Price (CNY/M tokens)Output Price (CNY/M tokens)Trend
Chinese AffordableDeepSeek V4-Pro3.0 (cache hit: 0.025)6.0⬇️ -75%
Chinese AffordableXiaomi MiMo V2.5 Pro3.0 (cache hit: 0.025)6.0⬇️ -99%
Chinese MainstreamQwen-Plus2.06.0➡️ Stable
Chinese PremiumGLM-525.050.0⬆️ +60%
Overseas PremiumGPT-5.5$5 (≈¥36)$30 (≈¥218)⬆️ Hike
Overseas PremiumClaude Opus 4$15 (≈¥109)$75 (≈¥544)⬆️ Hike

A staggering fact: DeepSeek’s cache-hit price of ¥0.025/million tokens is 725x cheaper than GPT-5.5’s cache-miss input price. If this isn’t a subsidy war, how is it technically possible?

1.2 This Is Not a Price War — It’s a Technology War

Outsiders see a “price war,” but industry insiders recognize a clear technology-driven cost curve:

Cost Reduction Lever Multiplier:
├── MoE Sparse Architecture    → Computation reduced to 5-10% of dense models
├── Attention Optimization(CSA)→ Computation reduced to 27%
├── Tiered Cache Scheduling    → Cache-hit costs approach zero
└── Domestic AI Chip Adaptation→ Hardware costs reduced by 60%+

These four technological levers compound to allow Chinese LLM providers to offer services at 1/50th to 1/100th of overseas model prices — while remaining profitable. This article dissects each layer.


2. MoE Sparse Architecture: The Cost-Reduction Foundation

Architecture Diagram

2.1 From Dense to Sparse: Same Parameters, Fraction of Computation

Traditional Transformers are dense models — every forward pass activates all parameters. For a 1600B-parameter model, each inference requires matrix operations on all 1600B parameters. It’s like having an entire 2000-person company process a single email — wildly inefficient.

MoE (Mixture of Experts) introduces the principle of “specialization”: splitting the model into N “expert” sub-networks and activating only K of them per inference.

Using DeepSeek V4-Pro as an example:

  • Total Parameters: 1600B (256 experts, ~6.25B each)
  • Active Parameters: ~48B (8 experts activated, only 3.125%)
  • Effective Computation: ~5% of a comparable dense model
# ============ MoE Routing Core Algorithm ============
import numpy as np
import math

class MoEConfig:
    """MoE Configuration - Reference: DeepSeek V4-Pro"""
    def __init__(self):
        self.hidden_dim = 7168
        self.num_experts = 256       # 256 experts
        self.top_k = 8               # activate 8 per token
        self.expert_hidden_dim = 2560
        self.activation_ratio = 0.03125  # 3.125% activation rate
        self.capacity_factor = 1.25

class DynamicMoERouter:
    """
    Dynamic MoE Router - The Core Algorithm
    
    Routing is the most critical component in MoE:
    1. Input -> Router network computes expert scores
    2. Top-K selection -> Choose K highest-scoring experts
    3. Weighted combination -> Fuse expert outputs via softmax weights
    4. Load balancing -> Auxiliary loss prevents "expert collapse"
    """
    
    def __init__(self, config: MoEConfig):
        self.config = config
        
        # Router network: maps input to expert affinity scores
        # Input dim: [hidden_dim] -> Output dim: [num_experts]
        self.router_weights = np.random.randn(
            config.hidden_dim, config.num_experts
        ).astype(np.float32) / math.sqrt(config.hidden_dim)
        
        self.router_bias = np.zeros(config.num_experts, dtype=np.float32)
        
        # 256 expert networks (simplified: 2-layer FFN)
        self.experts = []
        for i in range(config.num_experts):
            expert = {
                "w1": np.random.randn(config.hidden_dim, config.expert_hidden_dim)
                      .astype(np.float32) / math.sqrt(config.hidden_dim),
                "w2": np.random.randn(config.expert_hidden_dim, config.hidden_dim)
                      .astype(np.float32) / math.sqrt(config.expert_hidden_dim),
                "load_count": 0,
            }
            self.experts.append(expert)
        
        self.expert_loads = np.zeros(config.num_experts)
        self.total_routes = 0
    
    def route(self, x: np.ndarray) -> tuple:
        """
        Dynamic routing - the most critical MoE step
        
        Args:
            x: [batch_size, hidden_dim] input tensor
        
        Returns:
            output: [batch_size, hidden_dim] weighted ensemble output
            routing_weights: routing weight matrix
            expert_indices: selected expert indices
        """
        batch_size = x.shape[0]
        num_experts = self.config.num_experts
        top_k = self.config.top_k
        
        # Step 1: Compute routing scores
        # Each token -> 256 experts get a score
        scores = x @ self.router_weights + self.router_bias  # [batch, 256]
        
        # Step 2: Softmax normalization
        scores_softmax = np.exp(scores - scores.max(axis=-1, keepdims=True))
        scores_softmax = scores_softmax / scores_softmax.sum(axis=-1, keepdims=True)
        
        # Step 3: Top-K selection - keep only the top 8 experts
        top_k_scores = np.partition(scores, -top_k, axis=-1)[:, -top_k:]
        threshold = top_k_scores[:, :1]
        mask = scores >= threshold
        
        routing_weights = scores_softmax * mask
        routing_weights = routing_weights / (routing_weights.sum(axis=-1, keepdims=True) + 1e-10)
        
        expert_indices = np.argsort(-scores, axis=-1)[:, :top_k]
        
        # Step 4: Weighted ensemble of expert outputs
        output = np.zeros_like(x)
        for b in range(batch_size):
            for k in range(top_k):
                expert_idx = expert_indices[b, k]
                weight = routing_weights[b, expert_idx]
                
                expert = self.experts[expert_idx]
                # 2-layer FFN: expand -> ReLU -> compress
                hidden = x[b] @ expert["w1"]
                hidden = np.maximum(hidden, 0)  # ReLU
                expert_output = hidden @ expert["w2"]
                
                output[b] += weight * expert_output
                expert["load_count"] += 1
        
        return output, routing_weights, expert_indices
    
    def compute_load_balancing_loss(self) -> float:
        """
        Load balancing loss - prevents "expert collapse"
        
        Expert Collapse: A few "popular" experts handle most tokens,
        while niche experts are rarely selected. This wastes model capacity.
        
        Solution: Add an auxiliary loss during training that penalizes
        uneven load distribution across experts.
        """
        if self.total_routes == 0:
            return 0.0
        
        load_ratio = self.expert_loads / (self.total_routes + 1e-10)
        variance = np.var(load_ratio)
        loss = variance * self.config.num_experts
        return float(loss)

2.2 DeepSeek V4-Pro’s MoE Engineering Innovations

DeepSeek V4-Pro achieves extreme cost reduction through several key engineering innovations:

(1) Fine-Grained Expert Splitting

Traditional MoE typically uses 8-64 experts; DeepSeek scales this to 256. More experts mean finer-grained knowledge specialization — each expert can focus on a narrower domain, reducing inter-expert knowledge redundancy.

MetricTraditional MoE (Mixtral 8x7B)DeepSeek V4-Pro
Expert count8256
Top-K28
Activation rate25%3.125%
Parameter efficiencyLowExtremely high

(2) Capacity Factor

MoE faces a tricky problem: if too many tokens select the same expert, that expert’s compute load exceeds capacity. DeepSeek introduces a capacity factor (=1.25), allowing each expert to process 25% more tokens than its “fair share.” Excess tokens are dropped and routed through a residual connection.

# Capacity factor in action
capacity = int((tokens_per_batch / num_experts) * capacity_factor)
# Each expert processes at most `capacity` tokens
# Excess tokens take the residual path

(3) Load Balancing Auxiliary Loss

During training, the router tends to “lazy route” — assigning tokens to a few high-performing experts. This causes expert collapse. DeepSeek adds a load-balancing term to the loss function:

def compute_expert_balance_loss(expert_loads, total_routes, num_experts):
    """
    Load balancing loss computation
    
    Core idea: Penalize uneven load distribution,
    forcing the router to distribute tokens uniformly.
    """
    importance = expert_loads / (total_routes + 1e-10)
    uniform = 1.0 / num_experts
    kl_div = importance * np.log(importance / uniform + 1e-10)
    return np.sum(kl_div) * num_experts

2.3 Qwen MoE’s Differentiated Approach

Alibaba Cloud’s Qwen-Plus chose a different MoE path: fewer experts, higher activation rate.

  • Total params: 720B
  • Active params: 72B (10% activation)
  • 16 experts, Top-K=2

This means Qwen’s “granularity” sits between dense models and DeepSeek — computation is ~20% of a dense model, less extreme than DeepSeek, but the model is simpler with lower routing overhead.

# MoE efficiency comparison across different approaches
def compute_moe_efficiency(total_params_b, activation_params_b, num_experts, top_k):
    activation_ratio = activation_params_b / total_params_b
    compute_reduction = 1 - activation_ratio
    expert_utilization = top_k / num_experts
    
    return {
        "activation_ratio": activation_ratio,
        "compute_reduction": compute_reduction,
        "expert_utilization": expert_utilization,
        "efficiency_score": activation_ratio * expert_utilization,
    }

# DeepSeek V4-Pro
ds = compute_moe_efficiency(1600, 48, 256, 8)
print(f"DeepSeek: activation={ds['activation_ratio']*100:.2f}%, reduction={ds['compute_reduction']*100:.1f}%")
# Output: DeepSeek: activation=3.00%, reduction=97.0%

# Qwen-Plus
qw = compute_moe_efficiency(720, 72, 16, 2)
print(f"Qwen: activation={qw['activation_ratio']*100:.1f}%, reduction={qw['compute_reduction']*100:.1f}%")
# Output: Qwen: activation=10.0%, reduction=90.0%

Both approaches have their merits: DeepSeek pursues extreme sparsity for maximum cost reduction, while Qwen balances model simplicity with inference efficiency.


3. Inference Optimization: The Three-Pronged Approach

MoE solved training costs, but inference optimization is equally critical. Here are three key techniques:

3.1 Prong One: KV Cache Optimization

Architecture Diagram

3.1.1 Why KV Cache Is the Inference Bottleneck

In autoregressive generation, generating each new token requires computing attention against all previous tokens. The naive approach recalculates all KV (Key-Value) matrices each time, resulting in O(n²) time complexity.

KV Cache stores the K and V matrices of previous tokens. When generating a new token, only the current token’s Q needs computing — all KV values are read from cache.

But this creates a new issue: KV cache memory grows linearly with sequence length.

For a 131K-context model, a single inference’s KV cache can consume tens of GB. With dozens of concurrent users, GPU memory is quickly exhausted.

3.1.2 Compressed Sparse Attention (CSA)

CSA, proposed by DeepSeek V4-Pro, is the key innovation to address this:

class CompressedSparseAttention:
    """
    Compressed Sparse Attention (CSA)
    
    Core idea:
    - For 10% "important" tokens: keep full KV -> full attention
    - For 90% "unimportant" tokens: compress KV -> compressed attention
    
    Results:
    - Computation reduced to ~27% of full attention
    - KV cache reduced to ~10% of full cache
    """
    
    def __init__(self, hidden_dim=7168, num_heads=64, head_dim=112):
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.compression_factor = 8
        self.sparse_ratio = 0.1  # 10% tokens get full attention
        
        # Compression matrix: head_dim -> head_dim/8
        self.compression_matrix = np.random.randn(
            head_dim, head_dim // self.compression_factor
        ).astype(np.float32) / math.sqrt(head_dim)
    
    def compute_attention(self, query, key, value):
        """
        Sparse attention computation
        
        CSA's key question: "How to judge token importance?"
        Use the dot product between query and key as importance scores.
        """
        seq_len = query.shape[0]
        
        # Step 1: Compute token importance
        importance = np.sum(query * key, axis=-1)
        importance = np.abs(importance)
        
        # Top-k tokens retain full KV
        k = max(1, int(seq_len * self.sparse_ratio))
        top_k_indices = np.argsort(importance)[-k:]
        
        # Step 2: Compress unimportant tokens' KV
        compressed_k = key @ self.compression_matrix
        compressed_v = value @ self.compression_matrix
        
        # Step 3: Mixed attention computation
        output = np.zeros_like(query)
        
        for i in range(seq_len):
            if i in top_k_indices:
                # Important: full attention
                scores = query[i:i+1] @ key.T
                scores = scores / math.sqrt(self.head_dim)
                attn_weights = np.exp(scores - scores.max())
                attn_weights = attn_weights / attn_weights.sum()
                output[i] = attn_weights @ value
            else:
                # Unimportant: compressed attention
                compressed_scores = query[i:i+1] @ compressed_k.T
                compressed_scores = compressed_scores / math.sqrt(
                    self.head_dim // self.compression_factor
                )
                attn_weights = np.exp(compressed_scores - compressed_scores.max())
                attn_weights = attn_weights / attn_weights.sum()
                output[i] = attn_weights @ compressed_v
        
        return output

Computation comparison for a 128K sequence: full attention requires 128K² × 112 matrix multiplies, while CSA requires only 12.8K × 128K × 112 (full part) + 115.2K × 128K × 14 (compressed part) — total reduced to ~27% of full attention.

3.2 Prong Two: Speculative Decoding

Speculative decoding is a “use a small model to boost a big model” technique widely adopted by DeepSeek, Xiaomi, and others.

Core Idea: A smaller, faster “draft model” generates multiple candidate tokens, then the large model validates them in a single parallel forward pass.

class SpeculativeDecoder:
    """
    Speculative Decoder - Use a small draft model to accelerate large model inference
    
    Principle:
    1. Draft model (~100M params) quickly generates K candidate tokens auto-regressively
    2. Target model (~1.6T params) validates all K tokens in one parallel forward pass
    3. Accept the longest valid prefix of candidates
    4. Reject the first invalid token position and continue from there
    
    This strategy yields 2-3x inference speedup, directly reducing per-token cost.
    """
    
    def __init__(self, draft_model, target_model, max_spec_tokens=5):
        self.draft_model = draft_model
        self.target_model = target_model
        self.max_spec_tokens = max_spec_tokens
    
    def speculative_generate(self, prompt, max_new_tokens=100):
        generated = prompt.copy()
        acceptance_rates = []
        
        while len(generated) < max_new_tokens:
            # Step 1: Draft model quickly predicts K tokens
            draft_tokens = self._draft_predict(
                generated, num_tokens=self.max_spec_tokens
            )
            
            # Step 2: Concatenate drafts and do one target model forward pass
            candidate_sequence = generated + draft_tokens
            target_logits = self.target_model.forward(candidate_sequence)
            
            # Step 3: Token-by-token validation - rejection sampling
            accepted_count = 0
            for i, draft_token in enumerate(draft_tokens):
                target_prob = target_logits[len(generated) + i][draft_token]
                draft_prob = self.draft_model.get_prob(
                    generated + draft_tokens[:i], draft_token
                )
                
                r = target_prob / (draft_prob + 1e-10)
                if r >= 1.0 or np.random.random() < r:
                    accepted_count += 1
                else:
                    break
            
            # Step 4: Append accepted tokens
            generated.extend(draft_tokens[:accepted_count])
            acceptance_rates.append(accepted_count / self.max_spec_tokens)
            
            # If all accepted, generate one extra token to fix distribution bias
            if accepted_count == self.max_spec_tokens:
                extra_token = self._sample_from_target(target_logits[-1])
                generated.append(extra_token)
        
        speedup = 1.0 / (1 - np.mean(acceptance_rates) + 
                         np.mean(acceptance_rates) / self.max_spec_tokens)
        
        return generated, {
            "avg_acceptance_rate": np.mean(acceptance_rates),
            "speedup_ratio": speedup,
        }
    
    def _draft_predict(self, context, num_tokens):
        """Draft model fast auto-regressive generation"""
        draft_tokens = []
        for _ in range(num_tokens):
            logits = self.draft_model.forward(context + draft_tokens)
            next_token = np.argmax(logits[-1])
            draft_tokens.append(next_token)
        return draft_tokens
    
    def _sample_from_target(self, logits):
        """Sample from target model distribution"""
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()
        return np.random.choice(len(probs), p=probs)

3.3 Prong Three: PD Separation (Prefill-Decode Separation)

PD Separation is a “hidden optimization” discovered by DeepSeek in production.

LLM inference has two phases:

  1. Prefill: Process input tokens, compute KV cache for all inputs. Compute-bound.
  2. Decode: Generate tokens one by one. Memory-bound.

These phases have fundamentally different resource requirements:

MetricPrefillDecode
CharacteristicCompute-boundMemory-bound
BottleneckGPU computeMemory bandwidth
Batch sizeLargeSmall
ParallelismHighLow

PD Separation: Assign Prefill and Decode to different GPUs, each optimized for their specific workload.

// PD Separation Scheduler - Assign Prefill and Decode to Different GPUs
package main

type PDStage int

const (
	StagePrefill PDStage = iota
	StageDecode
)

type PDScheduler struct {
	PrefillNodes  []*ComputeNode
	DecodeNodes   []*ComputeNode
	KVCacheChan   chan *KVCacheTransfer
	PrefillLB     *LoadBalancer
	DecodeLB      *LoadBalancer
}

type ComputeNode struct {
	ID        string
	GPUType   string
	MemoryGB  float64
	IsPrefill bool
}

type KVCacheTransfer struct {
	RequestID    string
	CacheData    []byte
	SourceNode   string
	TargetNode   string
	TransferSize int64
}

func NewPDScheduler(numPrefill, numDecode int) *PDScheduler {
	s := &PDScheduler{
		PrefillNodes: make([]*ComputeNode, 0, numPrefill),
		DecodeNodes:  make([]*ComputeNode, 0, numDecode),
		KVCacheChan:  make(chan *KVCacheTransfer, 1000),
		PrefillLB:    NewLoadBalancer(),
		DecodeLB:     NewLoadBalancer(),
	}
	
	for i := 0; i < numPrefill; i++ {
		s.PrefillNodes = append(s.PrefillNodes, &ComputeNode{
			ID: fmt.Sprintf("prefill-%d", i),
			GPUType: "A100-80G", MemoryGB: 80, IsPrefill: true,
		})
	}
	for i := 0; i < numDecode; i++ {
		s.DecodeNodes = append(s.DecodeNodes, &ComputeNode{
			ID: fmt.Sprintf("decode-%d", i),
			GPUType: "A100-80G", MemoryGB: 80, IsPrefill: false,
		})
	}
	return s
}

func (s *PDScheduler) ScheduleRequest(req *InferenceRequest) *SchedulePlan {
	plan := &SchedulePlan{RequestID: req.ID}
	
	// Long input -> Prefill node (has large batch compute capability)
	if req.InputLength > 4096 {
		plan.PrefillNode = s.PrefillLB.Next(s.PrefillNodes)
	} else {
		plan.PrefillNode = s.DecodeLB.Next(s.DecodeNodes)
	}
	
	// Transfer KV cache from Prefill to Decode node
	if plan.PrefillNode.IsPrefill {
		plan.DecodeNode = s.DecodeLB.Next(s.DecodeNodes)
		s.KVCacheChan <- &KVCacheTransfer{
			RequestID:  req.ID,
			SourceNode: plan.PrefillNode.ID,
			TargetNode: plan.DecodeNode.ID,
		}
	} else {
		plan.DecodeNode = plan.PrefillNode
	}
	
	return plan
}

PD Separation optimization effect: 2-3x throughput improvement per GPU, effectively halving inference costs.

3.4 Tiered Cache: Xiaomi’s HiCache Solution

Xiaomi MiMo’s HiCache is the culmination of KV cache optimization. It implements a GPU memory (L1) → CPU RAM (L2) → SSD (L3) three-tier storage architecture, combined with Sliding Window Attention (SWA) for extreme cost reduction.

// kv_cache_scheduler.go - Three-Tier Cache Scheduling
// Reference: Xiaomi HiCache Solution
// GPU_HBM(L1) → CPU_DDR5(L2) → SSD_NVMe(L3) tiered storage migration

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type CacheLevel int

const (
	L1_GPU CacheLevel = iota
	L2_CPU
	L3_SSD
)

func (l CacheLevel) String() string {
	switch l {
	case L1_GPU:
		return "GPU_HBM"
	case L2_CPU:
		return "CPU_DDR5"
	case L3_SSD:
		return "SSD_NVMe"
	default:
		return "UNKNOWN"
	}
}

type TierConfig struct {
	Level     CacheLevel
	Capacity  int64   // bytes
	Bandwidth float64 // GB/s
	Latency   float64 // ns
	CostPerGB float64 // USD/GB
}

type CacheBlock struct {
	ID          string
	SeqLen      int
	Level       CacheLevel
	SizeBytes   int64
	AccessCount int64
	LastAccess  time.Time
	IsHot       bool
}

type TieredCacheManager struct {
	Tiers          map[CacheLevel]*TierConfig
	Blocks         map[string]*CacheBlock
	GpuMemoryUsed  int64
	GpuMemoryTotal int64
	CpuMemoryUsed  int64
	CpuMemoryTotal int64
	SsdUsed        int64
	SsdTotal       int64
	TotalAccesses  int64
	L1Hits         int64
	L2Hits         int64
	L3Hits         int64
	Misses         int64
	TotalDataMoved int64
	WindowSize     int
	IsSWAEnabled   bool
}

func NewTieredCacheManager() *TieredCacheManager {
	m := &TieredCacheManager{
		Tiers:          make(map[CacheLevel]*TierConfig),
		Blocks:         make(map[string]*CacheBlock),
		GpuMemoryTotal: 80 * 1024 * 1024 * 1024,   // 80GB HBM
		CpuMemoryTotal: 1024 * 1024 * 1024 * 1024,  // 1TB DDR5
		SsdTotal:       30 * 1024 * 1024 * 1024 * 1024, // 30TB NVMe
		WindowSize:     4096,
		IsSWAEnabled:   true,
	}
	
	m.Tiers[L1_GPU] = &TierConfig{Level: L1_GPU, Capacity: m.GpuMemoryTotal,
		Bandwidth: 2000, Latency: 80, CostPerGB: 15.0}
	m.Tiers[L2_CPU] = &TierConfig{Level: L2_CPU, Capacity: m.CpuMemoryTotal,
		Bandwidth: 50, Latency: 100000, CostPerGB: 0.5}
	m.Tiers[L3_SSD] = &TierConfig{Level: L3_SSD, Capacity: m.SsdTotal,
		Bandwidth: 7, Latency: 1000000, CostPerGB: 0.1}
	
	return m
}

// Access three-tier cache - core algorithm
func (m *TieredCacheManager) Access(blockID string, seqLen int) (*CacheBlock, float64) {
	m.TotalAccesses++
	
	if block, exists := m.Blocks[blockID]; exists {
		block.AccessCount++
		block.LastAccess = time.Now()
		
		switch block.Level {
		case L1_GPU:
			m.L1Hits++
			return block, 1.0 // 1微秒
		case L2_CPU:
			m.L2Hits++
			if m.GpuMemoryUsed+block.SizeBytes <= m.GpuMemoryTotal {
				m.PromoteToL1(block)
			}
			return block, 100.0 // 100微秒
		case L3_SSD:
			m.L3Hits++
			if m.CpuMemoryUsed+block.SizeBytes <= m.CpuMemoryTotal {
				m.PromoteToL2(block)
			}
			return block, 1000.0 // 1ms
		}
	}
	
	// Miss: Load from computation source
	m.Misses++
	hiddenDim := 7168
	numLayers := 56
	blockSize := ComputeBlockSize(seqLen, hiddenDim, numLayers,
		m.IsSWAEnabled, m.WindowSize)
	
	newBlock := &CacheBlock{
		ID: blockID, SeqLen: seqLen, SizeBytes: blockSize,
		AccessCount: 1, LastAccess: time.Now(), IsHot: true,
	}
	
	switch {
	case m.GpuMemoryUsed+blockSize <= m.GpuMemoryTotal:
		newBlock.Level = L1_GPU
		m.GpuMemoryUsed += blockSize
	case m.CpuMemoryUsed+blockSize <= m.CpuMemoryTotal:
		newBlock.Level = L2_CPU
		m.CpuMemoryUsed += blockSize
	default:
		newBlock.Level = L3_SSD
		m.SsdUsed += blockSize
	}
	
	m.Blocks[blockID] = newBlock
	return newBlock, 5000.0
}

func (m *TieredCacheManager) PromoteToL1(block *CacheBlock) {
	for m.GpuMemoryUsed+block.SizeBytes > m.GpuMemoryTotal {
		m.EvictFromL1()
	}
	m.ReleaseFromLevel(block)
	block.Level = L1_GPU
	m.GpuMemoryUsed += block.SizeBytes
	m.TotalDataMoved += block.SizeBytes
}

func (m *TieredCacheManager) EvictFromL1() {
	var coldestBlock *CacheBlock
	var oldestTime time.Time = time.Now()
	
	for _, block := range m.Blocks {
		if block.Level == L1_GPU && block.LastAccess.Before(oldestTime) {
			coldestBlock = block
			oldestTime = block.LastAccess
		}
	}
	
	if coldestBlock != nil {
		m.ReleaseFromLevel(coldestBlock)
		if m.CpuMemoryUsed+coldestBlock.SizeBytes <= m.CpuMemoryTotal {
			coldestBlock.Level = L2_CPU
			m.CpuMemoryUsed += coldestBlock.SizeBytes
		}
	}
}

func ComputeBlockSize(tokenCount, hiddenDim, numLayers int,
	isSWA bool, windowSize int) int64 {
	bytesPerToken := int64(hiddenDim * 2 * 4) // float32
	if isSWA && tokenCount > windowSize {
		return int64(windowSize) * bytesPerToken * int64(numLayers)
	}
	return int64(tokenCount) * bytesPerToken * int64(numLayers)
}

The combined effect of tiered caching and SWA:

MetricNo Tiered CacheTiered CacheTiered + SWA
Concurrent requests per GPU1-24-810-20
Cache hit rate-85%+95%+
Effective cached tokens128K500K2M+
Per-token cache cost$0.05/GB$0.005/GB$0.0005/GB

This explains how Xiaomi MiMo can offer cache-hit pricing at ¥0.025/million tokens — not a loss leader, but technology reducing cache costs to near-negligible levels.


4. Domestic AI Chip Adaptation

4.1 Huawei Ascend 910B + CANN Ecosystem

The third pillar of the cost reduction structure is domestic AI chips. Huawei’s Ascend 910B, representing China’s AI chip efforts, is evolving from “usable” to “highly efficient.”

# Cost comparison: domestic vs overseas AI computing (8-GPU server)
compute_configs = {
    "NVIDIA_A100": {
        "gpu_price": 150000,
        "server_price": 1300000,
        "tflops_fp16": 312,
        "power_w": 6500,
        "maintenance_yearly": 130000,
    },
    "NVIDIA_H100": {
        "gpu_price": 300000,
        "server_price": 2600000,
        "tflops_fp16": 989,
        "power_w": 14000,
        "maintenance_yearly": 300000,
    },
    "Ascend_910B": {
        "gpu_price": 80000,
        "server_price": 720000,
        "tflops_fp16": 256,
        "power_w": 5200,
        "maintenance_yearly": 72000,
    },
}

# Cost per TFLOPS (3-year depreciation)
for name, cfg in compute_configs.items():
    total_cost = cfg["server_price"] + cfg["maintenance_yearly"] * 3
    total_tflops = cfg["tflops_fp16"] * 3 * 365 * 24 * 0.7
    cost_per_tflop = total_cost / (total_tflops * 3600)
    print(f"{name}: ¥{cost_per_tflop:.4f} per TFLOPS-hour")

# Output:
# NVIDIA_A100: ¥0.0023 per TFLOPS-hour
# NVIDIA_H100: ¥0.0018 per TFLOPS-hour
# Ascend_910B: ¥0.0014 per TFLOPS-hour

The Ascend 910B already surpasses the A100 in cost per TFLOPS, approaching H100 levels. For domestic LLM providers like DeepSeek and Xiaomi, large-scale deployment of domestic chips means hardware costs reduced by 40-60%.

4.2 CANN Full-Stack Adaptation Pipeline

Huawei’s CANN (Compute Architecture for Neural Networks) provides the software stack for Ascend, covering the entire pipeline from model compilation to runtime inference.

class AscendCANNAdapter:
    """
    Ascend CANN Adapter - Adapt generic models to domestic AI hardware
    
    Adaptation pipeline:
    1. Model conversion: PyTorch/TensorFlow -> ONNX -> Ascend IR
    2. Operator compilation: Map generic ops to Ascend AI Core
    3. Memory optimization: Leverage HCCS interconnect for multi-card comms
    4. Runtime scheduling: CANN Runtime manages inference pipeline
    """
    
    def __init__(self, model_path, device_id=0):
        self.model_path = model_path
        self.device_id = device_id
        self.operator_map = self._build_operator_map()
    
    def _build_operator_map(self):
        """Ascend supports 2000+ operators"""
        return {
            "torch.nn.MultiheadAttention": "ascend.MultiHeadAttention",
            "F.scaled_dot_product_attention": "ascend.FlashAttentionScore",
            "torch.nn.Linear": "ascend.MatMul",
            "torch.nn.GELU": "ascend.Gelu",
            "torch.nn.LayerNorm": "ascend.LayerNorm",
            "torch.nn.RMSNorm": "ascend.RmsNorm",
            "torch.topk": "ascend.TopK",
            "F.softmax": "ascend.SoftmaxV2",
        }
    
    def convert_model(self):
        print("Step 1: Parsing original model graph...")
        print("Step 2: Operator mapping and fusion...")
        print("  - Fuse MatMul+BiasAdd operators")
        print("  - Fuse multi-op attention into FlashAttention")
        print("  - Optimize MoE TopK+Softmax pipeline")
        print("Step 3: Memory planning & HCCS topology config...")
        print("  - Configure HCCS 8-card full interconnect")
        print("  - Plan MoE expert sharding across NPUs")
        print("  - Optimize KV cache cross-card transfer path")
        print("Step 4: Generate Ascend binary instructions...")
        
        return {
            "status": "success",
            "ops_mapped": 1243,
            "ops_fused": 386,
            "memory_saved_gb": 12.5,
            "compile_time_s": 342,
        }
    
    def estimate_cost_saving(self, total_params_b, daily_inference_queries):
        overseas_cost = {
            "hardware": 1300000,
            "maintenance_3yr": 390000,
            "electricity_3yr": 170000,
            "total_3yr": 1860000,
        }
        domestic_cost = {
            "hardware": 720000,
            "maintenance_3yr": 216000,
            "electricity_3yr": 136000,
            "total_3yr": 1072000,
        }
        
        saving = overseas_cost["total_3yr"] - domestic_cost["total_3yr"]
        saving_ratio = saving / overseas_cost["total_3yr"]
        
        return {
            "overseas_3yr": overseas_cost["total_3yr"],
            "domestic_3yr": domestic_cost["total_3yr"],
            "absolute_saving": saving,
            "saving_ratio": saving_ratio,
        }

adapter = AscendCANNAdapter("deepseek_v4_pro.pt")
result = adapter.estimate_cost_saving(1600, 10000000)
print(f"3-year savings for 1000 servers: ¥{result['absolute_saving']*1000/1e8:.1f} billion")
# Output: 3-year savings for 1000 servers: ¥7.88亿

5. Cost Comparison & Market Landscape

5.1 API Cost Comparison Across Four Scenarios

# ============ Multi-Model API Cost Comparison ============

PRICING_DATA = {
    "deepseek_v4_pro": {
        "name": "DeepSeek V4-Pro", "region": "China",
        "input_cache_hit_cny": 0.025, "input_cache_miss_cny": 3.0, "output_cny": 6.0,
    },
    "mimo_v2_5_pro": {
        "name": "MiMo V2.5 Pro", "region": "China",
        "input_cache_hit_cny": 0.025, "input_cache_miss_cny": 3.0, "output_cny": 6.0,
    },
    "qwen_plus": {
        "name": "Qwen-Plus", "region": "China",
        "input_cache_hit_cny": 0.5, "input_cache_miss_cny": 2.0, "output_cny": 6.0,
    },
    "gpt_5_5": {
        "name": "GPT-5.5", "region": "US",
        "input_cache_hit_cny": 18.13, "input_cache_miss_cny": 36.25, "output_cny": 217.50,
    },
    "claude_opus_4": {
        "name": "Claude Opus 4", "region": "US",
        "input_cache_hit_cny": 21.75, "input_cache_miss_cny": 108.75, "output_cny": 543.75,
    },
}

def calculate_cost(model, input_tokens, output_tokens, cache_hit_ratio, calls=1):
    input_cost = (
        input_tokens / 1e6 * model["input_cache_hit_cny"] * cache_hit_ratio +
        input_tokens / 1e6 * model["input_cache_miss_cny"] * (1 - cache_hit_ratio)
    )
    output_cost = output_tokens / 1e6 * model["output_cny"]
    return (input_cost + output_cost) * calls

scenarios = {
    "Short Chat (1K in/500 out)": {"input": 1000, "output": 500, "cache_hit": 0.2, "calls": 1},
    "Doc Analysis (100K in/5K out)": {"input": 100000, "output": 5000, "cache_hit": 0.6, "calls": 1},
    "Agent Multi-Round (32K in/8K×10)": {"input": 32768, "output": 8192, "cache_hit": 0.5, "calls": 10},
    "Batch Processing (8K in/2K×1000)": {"input": 8192, "output": 2048, "cache_hit": 0.7, "calls": 1000},
}

for sname, params in scenarios.items():
    ds = calculate_cost(PRICING_DATA["deepseek_v4_pro"], **params)
    gpt = calculate_cost(PRICING_DATA["gpt_5_5"], **params)
    claude = calculate_cost(PRICING_DATA["claude_opus_4"], **params)
    print(f"{sname}: DeepSeek=¥{ds:.4f}, GPT-5.5=${gpt/7.25:.2f}, Ratio={gpt/ds:.0f}x")

Results reveal cost ratios ranging from 12x to 87x in DeepSeek’s favor.

5.2 K-Shaped Market Structure

Price Tier
  ^
  |  Overseas Premium (GPT-5.5, Claude Opus 4)
  |     $5-15/M input, $30-75/M output
  |     Hiking: +30-100% on average
  |
  |  Chinese Premium (GLM-5)
  |     ¥25/M input, ¥50/M output
  |     +60% but demand surges
  |
  |  Chinese Mainstream (Qwen-Plus)
  |     ¥2/M input, ¥6/M output
  |     Stable pricing
  |
  |  Chinese Affordable (DeepSeek, MiMo)
  |     ¥3/M input, ¥6/M output
  |     Cache hit: ¥0.025 only!
  |     Permanent cuts of 75-99%!
  +-----------------------------------------> Time

5.3 Why Overseas Models Are Raising Prices

  1. Dense models have linear compute costs: GPT-5.5’s 2000B parameter dense model activates all parameters every inference — 10-20x the compute of MoE models.

  2. Hardware costs remain high: H100/B200 GPUs remain in short supply.

  3. Different pricing strategies: Overseas providers pursue “high margin + high price,” while Chinese players pursue “scale + low price.”


6. Conclusion: Where Does the Cost-Reduction Journey End?

6.1 Room for Further Cost Reduction

  1. More extreme sparsity: DeepSeek already achieves 3.125% activation; theoretically, 1-2% is possible.

  2. Hardware-model co-design: Huawei has Ascend+MindSpore, Xiaomi has Surge+MiMo. Co-optimization can yield 10-20% more efficiency.

  3. Dynamic inference optimization: RL-based dynamic resource allocation for “compute-on-demand.”

  4. Distributed inference architecture: Shard ultra-large models across multiple data centers.

6.2 But Not Unlimited

There are physical limits:

  • Compute floor: Even extreme MoE requires basic embedding and routing computation.
  • Memory floor: Model weights must be stored — this cost is unavoidable.
  • Communication floor: Distributed inference has fundamental latency and bandwidth limits.

6.3 Implications for Developers

For AI application developers, this is the best of times:

  1. Embrace Chinese models: DeepSeek V4-Pro and MiMo offer unmatched cost-performance ratios with 128K-1M context windows.

  2. Leverage cache hits: For Agent applications (multi-turn conversations, RAG), design prompt reuse strategies to trigger 90%+ cost savings.

  3. Multi-model strategy: Use affordable models for simple tasks, premium models for complex reasoning.

  4. Consider domestic chips: For self-hosted deployments, the Ascend 910B offers 40-60% lower total cost than the A100.

Final Thoughts

Chinese LLMs’ counter-trend price cuts represent a systematic, technology-driven cost-reduction revolution. MoE sparse architecture reduces computation to 5%, sparse attention and tiered caching reduce inference costs to 10%, and domestic chip adaptation reduces hardware costs to 60% — three compounding levers resulting not in “burning money for market share,” but in “technology delivering on its promise.”

When DeepSeek offers cache-hit pricing at ¥0.025/million tokens, it’s not fighting a price war — it’s demonstrating the peak of Chinese AI engineering capability.