Unified Architecture of Multimodal Large Models: From LLaVA-NeXT to Gemini 2.0
Background: Why Unified Multimodal Architecture Is a Must-Have for AI Infrastructure
In 2023, when GPT-4V first demonstrated image understanding capabilities, the industry was still immersed in the narrative of “multimodal alignment.” By the end of 2024, LLaVA-NeXT achieved video-level understanding in an open-source format, while Gemini 2.0 natively supported multimodal joint reasoning across audio, image, video, and 3D point clouds. The technological leap behind this represents a paradigm shift in AI architecture from “perceptual stitching” to “cognitive unification.”
Traditional multimodal systems suffer from three critical flaws:
- Modal Isolation: Text, image, and audio each use independent encoders, with cross-modal interaction relying on shallow feature alignment.
- Latency Explosion: Video processing requires frame-by-frame feeding into the vision model, with a 30-second video on V100 taking over 10 minutes for inference.
- Training Fragmentation: The three-stage separation of pretraining, alignment fine-tuning, and instruction fine-tuning leads to knowledge forgetting.
LLaVA-NeXT and Gemini 2.0 provide a unified solution: convert all modalities into a unified token sequence and perform end-to-end reasoning within the Transformer architecture. This is not just a technical route choice but a prerequisite for building general-purpose AI agents—only by eliminating modal boundaries can models simultaneously understand dialogue, background music, and facial expressions in a video, just like humans.
Technical Principles: Three Leaps from Alignment to Unification
First Generation: Feature Alignment Paradigm (CLIP Era)
Early multimodal models (e.g., CLIP, ALIGN) adopted a dual-tower architecture, projecting images and text into a shared semantic space via contrastive learning. The essence of this method is “finding similarity,” not “true understanding.” For example, a model could recognize that an image of a “cat” corresponds to the text “cat,” but could not answer “why is this cat smiling.”
Second Generation: Bridging Paradigm (LLaVA 1.5)
The LLaVA series introduced a bridging architecture of “vision encoder + projection layer + LLM.” The core innovations include:
- Using CLIP ViT-L/14 as the vision encoder.
- Mapping image patch tokens to the LLM’s embedding space via a learnable Linear projection layer.
- During LLM inference, visual tokens and text tokens participate together in self-attention computation.
However, this architecture has significant bottlenecks: video processing requires frame-by-frame feature extraction and cannot handle audio, 3D, or other modalities.
Third Generation: Unified Tokenization Paradigm (LLaVA-NeXT and Gemini 2.0)
This is the core technical focus of this article. The central idea of unified tokenization is: encode any modal data into a token sequence with the same dimension and semantic structure. The specific implementation includes three key components:
Modality Encoder Family: Design dedicated encoders for each modality, but with a unified output format of a 3D tensor [B, L, D] (B=batch, L=sequence length, D=hidden dimension).
Dynamic Sequence Compression: Modalities like video and 3D point clouds can generate extremely long token sequences (e.g., a 30-second video at 1fps produces 900 patch tokens). These need to be compressed to a manageable length (typically 256-1024 tokens) via downsampling or attention pooling.
Unified Attention Mechanism: All modality tokens are mixed and computed inside the LLM using Rotary Position Embedding (RoPE) and causal attention, enabling cross-modal reasoning.
System Architecture Design: Distributed Inference System for Multimodal Fusion
Overall Architecture Layers
+-------------------+ +-------------------+ +-------------------+
| Access Layer | | Orchestration | | Inference |
| (Multimodal Data | --> | Layer | --> | Engine Layer |
| Reception) | | (Tokenization & | | (Unified |
+-------------------+ | Scheduling) | | Transformer) |
| +-------------------+ +-------------------+
| | |
v v v
+-------------------+ +-------------------+ +-------------------+
| Image Encoder | | Dynamic Sequence | | KVCache Manager |
| Cluster | | Compressor | | Distributed |
| Audio Encoder | | Modality Routing | | Attention |
| Cluster | | Table | | Computation |
| Video Encoder | | Priority Queue | | Mixed Precision |
| Cluster | | | | Scheduler |
+-------------------+ +-------------------+ +-------------------+
Core Design Principles
Principle 1: Modality-Agnostic Token Representation All modality encoder outputs must satisfy:
- Dimension Consistency:
hidden_dim = 4096(consistent with LLM hidden layer dimension). - Unified Position Encoding: Use absolute position encoding + modality identifier.
- Unified Attention Mask: Support causal/bidirectional mixed masks across modalities.
Principle 2: Dynamic Resource Allocation Video encoding requires GPUs, while text encoding only needs CPUs. The system must dynamically allocate computing resources based on input modalities. Adopt a “decode-then-infer” strategy: after all modalities are encoded, they are sent together to the Transformer for inference.
Principle 3: Streaming Processing Support For real-time video or audio streams, use a sliding window mechanism, processing fixed-duration data (e.g., 2 seconds) at a time, and reduce redundant computation through state reuse.
Key Data Structures
// Multimodal Unified Token Structure
type UnifiedToken struct {
TokenID uint64 // Token index in LLM vocabulary
Embedding []float32 // Encoded embedding vector (dimension=hidden_dim)
Modality string // "text", "image", "audio", "video", "pointcloud"
Position int // Absolute position in the sequence
IsPadding bool // Whether it is a padding token
ModalityID int // Modality identifier (to distinguish encoding spaces)
}
// Multimodal Input Request
type MultimodalRequest struct {
RequestID string
TextInput string // Text prompt
Images []ImageData // Image list (supports multiple images)
AudioClip []byte // Audio data (PCM format)
VideoStream []VideoFrame // Video frame sequence
PointCloud []Point3D // 3D point cloud data
Config InferenceConfig // Inference parameters
}
// Modality Encoder Interface
type ModalityEncoder interface {
Encode(data interface{}) ([]UnifiedToken, error)
ModalityType() string
InputCost() ResourceCost // Computing resources required for encoding
}
Core Implementation: Golang Multimodal Inference Engine
Modality Encoder Registration and Factory Pattern
package engine
import (
"context"
"fmt"
"sync"
)
// Modality Encoder Registry
var encoderRegistry = struct {
sync.RWMutex
encoders map[string]ModalityEncoder
}{
encoders: make(map[string]ModalityEncoder),
}
// Register Modality Encoder
func RegisterEncoder(encoder ModalityEncoder) {
encoderRegistry.Lock()
defer encoderRegistry.Unlock()
typ := encoder.ModalityType()
if _, exists := encoderRegistry.encoders[typ]; exists {
panic(fmt.Sprintf("encoder for modality %s already registered", typ))
}
encoderRegistry.encoders[typ] = encoder
}
// Get Modality Encoder
func GetEncoder(modality string) (ModalityEncoder, error) {
encoderRegistry.RLock()
defer encoderRegistry.RUnlock()
enc, ok := encoderRegistry.encoders[modality]
if !ok {
return nil, fmt.Errorf("unsupported modality: %s", modality)
}
return enc, nil
}
// Image Encoder Implementation (Based on ViT Architecture)
type ImageEncoder struct {
vitModel *ViTModel // Vision Transformer model
projector *LinearProjector // Dimension projection layer
}
func (e *ImageEncoder) Encode(data interface{}) ([]UnifiedToken, error) {
imageData, ok := data.(ImageData)
if !ok {
return nil, fmt.Errorf("invalid image data type")
}
// 1. Image preprocessing: resize to 224x224, normalize
processed, err := preprocessImage(imageData, 224, 224)
if err != nil {
return nil, fmt.Errorf("image preprocessing failed: %w", err)
}
// 2. ViT encoding: split image into patches and encode
patchTokens, err := e.vitModel.Encode(processed)
if err != nil {
return nil, fmt.Errorf("ViT encoding failed: %w", err)
}
// 3. Dimension projection: project from ViT output dimension to LLM hidden dimension
unifiedTokens := make([]UnifiedToken, len(patchTokens))
for i, token := range patchTokens {
projected := e.projector.Forward(token.Embedding)
unifiedTokens[i] = UnifiedToken{
TokenID: uint64(i + 1), // Start from 1, 0 reserved for special tokens
Embedding: projected,
Modality: "image",
Position: i,
IsPadding: false,
ModalityID: 1, // Image modality identifier is 1
}
}
return unifiedTokens, nil
}
// Video Encoder Implementation (Frame-by-frame encoding + temporal compression)
type VideoEncoder struct {
frameEncoder *ImageEncoder // Reuse image encoder
temporalPool *TemporalPooler // Temporal dimension pooling
}
func (e *VideoEncoder) Encode(data interface{}) ([]UnifiedToken, error) {
frames, ok := data.([]VideoFrame)
if !ok {
return nil, fmt.Errorf("invalid video data type")
}
// 1. Frame-by-frame encoding (parallel processing)
var wg sync.WaitGroup
frameTokens := make([][]UnifiedToken, len(frames))
errCh := make(chan error, len(frames))
for i, frame := range frames {
wg.Add(1)
go func(idx int, f VideoFrame) {
defer wg.Done()
tokens, err := e.frameEncoder.Encode(f)
if err != nil {
errCh <- fmt.Errorf("frame %d encoding failed: %w", idx, err)
return
}
frameTokens[idx] = tokens
}(i, frame)
}
wg.Wait()
close(errCh)
if err := <-errCh; err != nil {
return nil, err
}
// 2. Temporal dimension pooling: merge frame tokens into video token sequence
// Use Attention Pooling to compress temporal dimension
videoTokens := e.temporalPool.Pool(frameTokens)
// 3. Add modality identifier
for i := range videoTokens {
videoTokens[i].Modality = "video"
videoTokens[i].ModalityID = 2
videoTokens[i].Position = i
}
return videoTokens, nil
}
Unified Inference Engine Core
// Multimodal Inference Engine
type MultimodalEngine struct {
llmModel *TransformerModel // Base LLM
tokenizer *Tokenizer // Text tokenizer
maxTokenLength int // Maximum token length
kvCache *KVCache // KVCache manager
}
// Inference Entry
func (e *MultimodalEngine) Infer(ctx context.Context, req MultimodalRequest) (string, error) {
// 1. Collect encoding results from all modalities
allTokens := make([]UnifiedToken, 0)
// Text encoding
textTokens, err := e.encodeText(req.TextInput)
if err != nil {
return "", fmt.Errorf("text encoding failed: %w", err)
}
allTokens = append(allTokens, textTokens...)
// Image encoding (supports multiple images)
for _, img := range req.Images {
enc, _ := GetEncoder("image")
imgTokens, err := enc.Encode(img)
if err != nil {
return "", fmt.Errorf("image encoding failed: %w", err)
}
allTokens = append(allTokens, imgTokens...)
}
// Audio encoding
if len(req.AudioClip) > 0 {
enc, _ := GetEncoder("audio")
audioTokens, err := enc.Encode(req.AudioClip)
if err != nil {
return "", fmt.Errorf("audio encoding failed: %w", err)
}
allTokens = append(allTokens, audioTokens...)
}
// Video encoding (supports streaming)
if len(req.VideoStream) > 0 {
enc, _ := GetEncoder("video")
videoTokens, err := enc.Encode(req.VideoStream)
if err != nil {
return "", fmt.Errorf("video encoding failed: %w", err)
}
allTokens = append(allTokens, videoTokens...)
}
// 2. Sequence assembly and position encoding
assembled := e.assembleSequence(allTokens)
if len(assembled) > e.maxTokenLength {
return "", fmt.Errorf("sequence length %d exceeds limit %d",
len(assembled), e.maxTokenLength)
}
// 3. Autoregressive inference
outputTokens, err := e.llmModel.Generate(ctx, assembled, req.Config)
if err != nil {
return "", fmt.Errorf("generation failed: %w", err)
}
// 4. Decode output
result := e.tokenizer.Decode(outputTokens)
return result, nil
}
// Sequence assembly: add special tokens and reorder positions
func (e *MultimodalEngine) assembleSequence(tokens []UnifiedToken) []int64 {
// Add system prompt and modality separators
seq := make([]int64, 0, len(tokens)+4)
// Start token
seq = append(seq, e.tokenizer.BosTokenID)
// Add tokens in original order (preserve modality information)
for _, t := range tokens {
if t.IsPadding {
continue
}
seq = append(seq, int64(t.TokenID))
}
// End token
seq = append(seq, e.tokenizer.EosTokenID)
return seq
}
Streaming Video Processing Optimization
// Video Stream Processor: Supports sliding window
type VideoStreamProcessor struct {
windowSize int // Sliding window size (frames)
overlap int // Window overlap frames
frameBuffer []VideoFrame // Frame buffer
encoder *VideoEncoder
lastOutput []UnifiedToken // Last output cache
}
// Process video stream chunk
func (p *VideoStreamProcessor) ProcessChunk(chunk []VideoFrame) ([]UnifiedToken, error) {
// 1. Update buffer
p.frameBuffer = append(p.frameBuffer, chunk...)
// 2. If buffer is not full, return empty
if len(p.frameBuffer) < p.windowSize {
return nil, nil
}
// 3. Extract sliding window
window := p.frameBuffer[:p.windowSize]
// 4. Remove processed frames from buffer (retain overlap)
p.frameBuffer = p.frameBuffer[p.windowSize-p.overlap:]
// 5. Encode window
tokens, err := p.encoder.Encode(window)
if err != nil {
return nil, fmt.Errorf("window encoding failed: %w", err)
}
// 6. Deduplicate: remove tokens overlapping with last output
if p.lastOutput != nil {
tokens = p.deduplicate(tokens, p.lastOutput)
}
p.lastOutput = tokens
return tokens, nil
}
// Token deduplication: based on timestamp and position
func (p *VideoStreamProcessor) deduplicate(current, previous []UnifiedToken) []UnifiedToken {
// Simple implementation: assume overlapping tokens have the same position
seen := make(map[int]bool)
for _, t := range previous {
seen[t.Position] = true
}
result := make([]UnifiedToken, 0)
for _, t := range current {
if !seen[t.Position] {
result = append(result, t)
}
}
return result
}
Performance Optimization: From Inference Latency to Training Efficiency
Inference Optimization Strategies
Strategy 1: Parallel Modality Encoding In the LLaVA-NeXT architecture, image encoding is the main bottleneck. We adopt a “pre-encoding + caching” strategy: pre-encode user-uploaded images and cache the tokens to avoid redundant computation. This reduces image encoding time from 150ms to 5ms (when cached).
Strategy 2: Dynamic Sequence Compression Video token sequence length is proportional to the number of frames. Use Temporal Pooling to compress 196 patch tokens per frame into 4 video tokens (compression ratio 49:1) while maintaining semantic integrity. The specific implementation uses a Cross-Attention mechanism:
// Temporal Attention Pooling
type TemporalPooler struct {
queryTokens int // Number of compressed tokens
attention *MultiHeadAttention
}
func (p *TemporalPooler) Pool(frameTokens [][]UnifiedToken) []UnifiedToken {
// Input: [num_frames, patches_per_frame, hidden_dim]
// Output: [queryTokens, hidden_dim]
// 1. Flatten to 2D sequence
flattened := flatten(frameTokens)
// 2. Initialize learnable queries
queries := make([][]float32, p.queryTokens)
for i := range queries {
queries[i] = randomInit(hiddenDim)
}
// 3. Cross-Attention: queries extract information from flattened
pooled := p.attention.Forward(queries, flattened, flattened)
// 4. Convert to UnifiedToken
result := make([]UnifiedToken, p.queryTokens)
for i, emb := range pooled {
result[i] = UnifiedToken{
Embedding: emb,
Position: i,
}
}
return result
}
Strategy 3: KVCache Reuse In dialogue scenarios, visual tokens remain unchanged across multiple turns. By freezing the KVCache of visual tokens, redundant computation is avoided. Gemini 2.0’s paper shows this optimization can increase inference speed for multi-turn video dialogues by 4x.
Training Optimization
Mixed Modality Training Data Generation Use an automated annotation pipeline: extract key frames from video, transcribe audio to text, generate descriptions. During training, adopt a “modality random dropout” strategy to enhance model robustness when some modalities are missing.
Gradient Accumulation and Mixed Precision When training LLaVA-NeXT-7B on 8 A100 GPUs, adopt:
- Gradient accumulation steps: 8 (equivalent batch size=64)
- Mixed precision: FP16+BF16 mixed
- ZeRO-3 optimizer state sharding
- Training throughput: approximately 1200 tokens/s/GPU
System-Level Optimization
Compute Resource Scheduling Design a “modality-aware scheduler” that allocates GPU resources based on input modality type:
- Pure text: CPU inference (using ONNX Runtime quantized model)
- Text + image: 1 GPU (encoding + inference)
- Video + audio: 2 GPUs (parallel encoding + inference)
// Resource Scheduler
type ResourceScheduler struct {
gpuPool *GPUResourcePool
cpuPool *CPUResourcePool
modalityCost map[string]ResourceCost
}
func (s *ResourceScheduler) Schedule(req MultimodalRequest) (Allocation, error) {
totalCost := ResourceCost{GPU: 0, CPU: 0, Memory: 0}
// Calculate resource requirements for each modality
if len(req.Images) > 0 {
totalCost.Add(s.modalityCost["image"])
}
if len(req.VideoStream) > 0 {
totalCost.Add(s.modalityCost["video"])
}
if len(req.AudioClip) > 0 {
totalCost.Add(s.modalityCost["audio"])
}
// Add LLM inference cost
totalCost.Add(s.modalityCost["llm"])
// Allocate resources
allocation, err := s.gpuPool.Allocate(totalCost.GPU, totalCost.Memory)
if err != nil {
return Allocation{}, fmt.Errorf("resource allocation failed: %w", err)
}
return allocation, nil
}
Production Practice: From Laboratory to Industrial Deployment
Typical Case: Intelligent Video Surveillance System
We deployed a multimodal analysis system based on the LLaVA-NeXT architecture in a smart city project. The system simultaneously processes 16 video streams at 30fps, performing real-time analysis of pedestrian behavior, vehicle trajectories, and abnormal events.
Challenge 1: Real-Time Requirements Video stream processing latency must be less than 200ms (end-to-end). Solutions:
- Use a sliding window (2-second window, 0.5-second sliding step).
- Video encoding and LLM inference are asynchronous.
- Use NVIDIA Triton Inference Server for model deployment.
Challenge 2: Multi-Stream Concurrency 16 video streams require 16 encoder instances. By sharing LLM model weights (independent KVCache), GPU memory consumption is reduced from 16×24GB to 24GB+16×2GB=56GB (using A100 80GB).
Challenge 3: Modality Fusion In real-world scenarios, audio (dialogue, alarm sounds) and video need joint analysis. For example, “hearing glass breaking + seeing a person running” triggers an intrusion alert. We designed modality fusion rules:
- Audio events trigger video frame caching.
- Video analysis results are concatenated with audio features.
- Jointly sent to LLM for inference.
Monitoring and Tuning
Key Metrics
- Encoding latency: <50ms/frame (video), <30ms/image (image)
- Inference latency: <150ms/request (including 5 token outputs)
- Throughput: >1000 inferences/second (8-card A100 cluster)
- Memory usage: <72GB/card (LLaVA-NeXT-13B)
Common Fault Handling
- OOM (Out of Memory): Enable dynamic sequence compression, reducing video tokens from 4096 to 256.
- Repeated inference results: Adjust repetition penalty coefficient (repetition_penalty=1.15).
- Modality mismatch: Add modality validation at the input, such as automatically generating default descriptions when an image is detected without a text prompt.
Cost Optimization
In production environments, multimodal inference computing costs are 5-10 times higher than pure text. We adopt the following optimization methods:
Model Distillation Distill LLaVA-NeXT-13B into a 7B version, maintaining 90% accuracy while doubling inference speed. The distillation process uses KL divergence loss, with training data from real user requests.
Quantization Deployment Use INT8 quantization for LLM and encoders, reducing memory usage by 60% and increasing inference speed by 30%. On A100, the quantized 7B model reduces inference latency from 120ms to 80ms.
Caching Strategy
- Image caching: User-uploaded images are reused within 1 hour, directly returning cached tokens.
- Video keyframe caching: Video clips of the same scene (e.g., fixed-view surveillance cameras) are encoded only once.
Future Outlook and Summary
From LLaVA-NeXT to Gemini 2.0, multimodal models are undergoing a qualitative change from “alignment stitching” to “native unification.” This architectural evolution brings not only improved technical capabilities but also an exponential expansion of AI application scenarios.
Current Limitations
- Long video processing still needs optimization: a 30-minute video produces over 100,000 token sequences, which existing Transformer architectures struggle to handle.
- Limited modality count: 3D, tactile, olfactory, and other modalities are not yet mature.
- Scarce training data: High-quality multimodal alignment data is difficult to obtain.
Future Directions
- Infinite Context: Combine with state-space models (e.g., Mamba) to achieve video-level long sequence processing.
- Modal Generation: From “understanding” to “generation,” models can not only analyze videos but also generate corresponding audio descriptions.
- Agent Integration: Multimodal models serve as the perceptual core of agents, deeply integrated with tool invocation and memory systems.
Recommendations for the Technical Community For teams planning to build multimodal systems, my advice is:
- Start with the LLaVA-NeXT open-source solution to understand the core idea of unified tokenization.
- Pay attention to Gemini 2.0 technical reports, especially the design of dynamic sequence compression and modality routing.
- In engineering implementation, prioritize encoding parallelization and KVCache reuse, as these are the performance bottlenecks.
Unified multimodal architecture is not the endpoint but the necessary path to general-purpose AI agents. When models can simultaneously “see, hear, speak, and understand” like humans, the era of true intelligent interaction will truly arrive.
