Sapient Intelligence HRM-Text: The $1,500 1B-Parameter Reasoning Revolution

Monday, June 15, 2026

On May 18, 2026, Sapient Intelligence released HRM-Text, a 1B-parameter model trained from scratch for approximately $1,500 (16 H100 GPUs, under 2 days) on just 40B tokens. It scores 56.2 on MATH, 84.5 on GSM8K, and 81.9 on ARC-Challenge, and its MATH score beats models 2-7× its size. It has been endorsed by HuggingFace's CEO, and Turing Award winner Yoshua Bengio co-authored a follow-up paper that extends its approach. This is not a fine-tune of a larger model: the architecture is new and the weights are trained from scratch.

Introduction: An Impossible Number

A ~1B parameter model scores 56.2 on MATH, 84.5 on GSM8K, 81.9 on ARC-Challenge. Training cost: ~$1,500. Sixteen H100 GPUs for under two days.

Your first instinct: this must be fine-tuned from a larger model. It can’t be trained from scratch at that budget.

But HRM-Text is indeed trained from scratch, using only ~40B unique tokens: roughly 1/225 of the 9T tokens used for Llama 3.2 3B and 1/900 of the 36T tokens used for Qwen3.5 72B.
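As a rough sanity check on the budget, the quoted figures can be turned into an implied GPU-hour price. The sketch below assumes the full two days were used (the run took under two days, so this is an upper bound on GPU-hours and a lower bound on the hourly rate). The constant names are illustrative and not taken from the paper.

```python
# Back-of-the-envelope check of the quoted training budget.
# Assumption: the run used the full 48 hours (the article says "under 2 days"),
# so gpu_hours is an upper bound and implied_price is a lower bound.
NUM_GPUS = 16
MAX_HOURS = 48
TOTAL_COST_USD = 1500

gpu_hours = NUM_GPUS * MAX_HOURS              # at most 768 GPU-hours
implied_price = TOTAL_COST_USD / gpu_hours    # USD per H100-hour

print(f"GPU-hours (upper bound): {gpu_hours}")
print(f"Implied price: at least ${implied_price:.2f} per GPU-hour")
```

A rate of about two dollars per H100-hour is in the range of discounted cloud pricing, so the $1,500 figure is at least arithmetically plausible for this hardware and duration.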

Paper: HRM-Text: Efficient Pretraining Beyond Scaling, arXiv:2605.20613

The HRM Architecture: Dual-Timescale Hierarchical Recurrence

How Standard Transformers Work

A standard Transformer consists of sequentially arranged, independently parameterized layers. Input propagates forward: layer 1 → layer 2 → … → output. To increase capability, you stack more layers, increase hidden dimensions, or train more parameters.

Analogy: passing a document to multiple different editors, each editing it once before passing it on.

HRM: “Two Editing Teams Revising the Same Draft”

HRM-Text introduces two modules operating at different timescales:

H-module (High-level/Strategic Brain): Updates slowly, maintains stable semantic context, provides long-term constraints. Think of a chief editor overseeing strategic direction.
L-module (Low-level/Execution Brain): Updates quickly, handles local computation and iterative refinement. Think of copy editors polishing details word by word.

Key Distinction: Not “Two Brains” Taped Together

This is critically different from the common “big brain + small brain” approach where two separately trained models communicate via text interfaces. HRM’s H and L modules belong to the same network. They don’t exchange messages through text space—they iterate the same internal state in the same latent space. What information flows between modules and how labor is divided are jointly determined by a unified optimization process.

HRM doesn’t bolt a planner onto an executor—it builds hierarchical computation into a single model.

8 Recurrent Updates Per Token

Per the paper’s configuration, each forward pass executes 2 high-level cycles, each containing:

Three L-module updates (fast detail refinement)
One H-module update (strategic adjustment)

Total per token: 8 recurrent updates—6 low-level + 2 high-level.

Effective depth analysis shows that deeper steps continue to produce incremental representational changes. The recurrence doesn’t simply repeat—it continuously modifies internal states.

Source: HRM-Text paper, arXiv:2605.20613, Section 3.1

MagicNorm and Progressive Training

Recurrent architectures face well-known stability challenges. Deeper recurrence offers more opportunity for self-correction, but activation variance accumulates across steps and gradients tend to vanish or explode. These problems are one reason Transformers displaced RNNs in the late 2010s.

HRM-Text introduces two key innovations:

MagicNorm: Stabilizing Both Forward and Backward Passes

MagicNorm simultaneously stabilizes forward activation growth and backward gradient flow. The module retains PreNorm structure internally (good for gradients), but adds an extra normalization at each recurrent module’s exit point. This constrains activation variance growth across cycles while preserving gradient pathways.

h' = γ · (h - μ) / σ + β

Here γ and β are learnable affine parameters, and μ and σ are the mean and standard deviation of the current state. The key difference from standard LayerNorm is placement: MagicNorm sits at the exit point of each recurrent step, not between layers.
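The effect of normalizing at the exit of each recurrent step can be seen in a toy recurrence. The sketch below is not the paper's implementation: `exit_norm`, `toy_module`, and the scalar γ and β are hypothetical stand-ins, and the only detail taken from the text is that the normalization is applied once at the end of every recurrent step.

```python
import math

def exit_norm(h, gamma=1.0, beta=0.0, eps=1e-6):
    """h' = gamma * (h - mu) / sigma + beta, computed over one state vector."""
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    sigma = math.sqrt(var + eps)
    return [gamma * (x - mu) / sigma + beta for x in h]

def toy_module(h):
    """Hypothetical recurrent step: a residual update that doubles the state."""
    return [x + x for x in h]

def run(steps, normalize):
    """Apply the toy module `steps` times, optionally normalizing at each exit."""
    h = [1.0, -2.0, 3.0, 0.5]
    for _ in range(steps):
        h = toy_module(h)
        if normalize:
            h = exit_norm(h)  # normalization at the exit of the recurrent step
    return h

def rms(h):
    """Root-mean-square magnitude of a state vector."""
    return math.sqrt(sum(x * x for x in h) / len(h))

print(f"RMS after 8 steps without exit norm: {rms(run(8, False)):.1f}")
print(f"RMS after 8 steps with exit norm:    {rms(run(8, True)):.3f}")
```

Without the exit normalization the state's magnitude grows geometrically with the number of steps. With it, the magnitude stays close to 1 regardless of depth, which is the forward-pass half of the stability argument.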

Warmup Deep Credit Assignment

This controls how far back gradients can propagate. Early in training, gradients only flow back through the last 2 recurrent steps. As training stabilizes, the range linearly increases to the last 5 steps.

Think of it as a progressive “accountability mechanism”: early training makes the model responsible only for its most recent computations; as stability improves, earlier computations are gradually held accountable.
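A schedule of this kind might look like the sketch below. The article gives only the endpoints (2 steps early, 5 steps later) and says the growth is linear. The warmup length, the function names, and the reading of "steps" as the 8 recurrent updates of one forward pass are all assumptions made for illustration.

```python
TOTAL_UPDATES = 8  # recurrent updates per forward pass, from the article

def backprop_window(step, warmup_steps, start=2, end=5):
    """Number of trailing recurrent updates that receive gradients at a training step.

    Grows linearly (in integer increments) from `start` to `end` over
    `warmup_steps` training steps, then stays at `end`.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return end
    return start + (step * (end - start)) // warmup_steps

def grad_mask(window, total=TOTAL_UPDATES):
    """True for the recurrent updates that gradients flow through (the last `window`)."""
    return [i >= total - window for i in range(total)]

# warmup_steps=1000 is a hypothetical value; the article does not state one.
for step in (0, 500, 1000):
    window = backprop_window(step, warmup_steps=1000)
    print(step, window, grad_mask(window))
```

In a real training loop the updates marked False would typically run without gradient tracking, so they still shape the forward state but contribute nothing to the backward pass.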

Source: HRM-Text paper, Section 3.3

Training Objective: Task Completion + PrefixLM

Beyond Next-Token Prediction

Standard language models use autoregressive next-token prediction: given text, predict each subsequent token. Every word carries equal weight—filler phrases, function words, reasoning steps, all treated identically.

HRM-Text takes a radically different approach: it skips large-scale raw-text pretraining entirely and trains directly on instruction-response pairs from scratch. The model computes token-level loss only on the response portion.

Analogy: a teacher grading only the answer section, not the “copy the question” part.
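The difference between the two objectives comes down to which positions contribute to the loss. The sketch below is a minimal illustration with made-up probabilities. `response_only_loss` and the six-token example are hypothetical and do not come from the HRM-Text code.

```python
import math

def response_only_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response:     True where the target token belongs to the response.
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("sequence has no response tokens")
    return sum(picked) / len(picked)

# Hypothetical sequence: 3 instruction tokens followed by 3 response tokens.
log_probs = [math.log(p) for p in (0.1, 0.2, 0.1, 0.5, 0.25, 0.5)]
mask = [False, False, False, True, True, True]

full = -sum(log_probs) / len(log_probs)      # standard next-token loss over all tokens
resp = response_only_loss(log_probs, mask)   # loss over the response portion only
print(f"all tokens: {full:.3f}  response only: {resp:.3f}")
```

Instruction tokens are often hard to predict (the model cannot know what the user will ask), so averaging over them adds gradient signal that does not help the model answer. Masking them out spends the whole training signal on the response.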

PrefixLM: Bidirectional Instructions, Causal Responses

PrefixLM mask allows instruction tokens to attend bidirectionally to each other, while response tokens follow standard causal generation. This gives the model a quasi-encoder-decoder workflow within a decoder-only architecture: instructions function as encoding, responses as decoding.

Attention analysis reveals that PrefixLM produces higher attention entropy and more global, diverse attention patterns compared to pure causal masks—it fundamentally changes how the model utilizes instruction information.
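The masking rule described above can be written down directly. This is a generic PrefixLM mask, not code from the HRM-Text release, and the function name and boolean-matrix representation are choices made for this sketch.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Build a PrefixLM attention mask.

    mask[i][j] is True when position i may attend to position j.
    Positions [0, prefix_len) are the instruction; the rest are the response.
    """
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            both_in_prefix = i < prefix_len and j < prefix_len  # bidirectional block
            causal = j <= i                                      # standard causal rule
            row.append(both_in_prefix or causal)
        mask.append(row)
    return mask

# Three instruction tokens followed by two response tokens.
for row in prefix_lm_mask(prefix_len=3, total_len=5):
    print("".join("x" if allowed else "." for allowed in row))
```

For this input the first three rows print `xxx..`, because every instruction token sees the whole instruction, and the last two rows print `xxxx.` and `xxxxx`, because response tokens see the instruction plus earlier response tokens only.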

Ablation Study: Three Components, One Result

Configuration	ARC-C	MATH	GSM8K
1B Transformer + full causal	51.91	35.44	48.37
+ Response-only prediction	62.88	47.04	69.75
+ PrefixLM	74.32	48.36	75.06
+ HRM architecture	81.91	56.16	84.53

The efficiency gain comes from the synergistic combination of all three innovations—not any single one.
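The per-step contributions are easier to compare as differences between consecutive rows. The sketch below uses only the scores from the table above, and the helper name is illustrative.

```python
# Scores copied from the ablation table, in the order (ARC-C, MATH, GSM8K).
ablation = [
    ("1B Transformer + full causal", (51.91, 35.44, 48.37)),
    ("+ Response-only prediction",   (62.88, 47.04, 69.75)),
    ("+ PrefixLM",                   (74.32, 48.36, 75.06)),
    ("+ HRM architecture",           (81.91, 56.16, 84.53)),
]

def step_gains(rows):
    """Per-benchmark gain of each configuration over the row before it."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        gains.append((name, tuple(round(c - p, 2) for c, p in zip(cur, prev))))
    return gains

for name, (arc, math_gain, gsm) in step_gains(ablation):
    print(f"{name:28s} ARC-C {arc:+6.2f}  MATH {math_gain:+6.2f}  GSM8K {gsm:+6.2f}")
```

Because the components are added cumulatively, these are gains in that particular order rather than isolated contributions. Even so, the pattern is uneven: PrefixLM adds 11.44 points on ARC-C but only 1.32 on MATH, while the HRM architecture adds between 7.59 and 9.47 points on all three benchmarks.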

Code Implementation 1: Dual-Timescale Recurrence Simulation

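// Full code: outputs/代码/hrm_dual_timescale.go
// Excerpt: fuse, updateLatent, finalOutput, LUpdate, and HUpdate are not shown here.
type HRMNetwork struct {
    HWeights [][][]float64  // High-level (strategic) weights
    LWeights [][][]float64  // Low-level (execution) weights
    HState   []float64      // Slow state, updated once per high-level cycle
    LState   []float64      // Fast state, updated three times per high-level cycle
    Latent   []float64      // Shared latent space
}

// Forward runs 2 high-level cycles of (3 L updates + 1 H update) on one input.
func (net *HRMNetwork) Forward(input []float64) []float64 {
    // copy writes min(len(dst), len(src)) elements, so the three state slices
    // must already be allocated with len(input) before Forward is called.
    copy(net.LState, input)
    copy(net.HState, input)
    copy(net.Latent, input)

    for hCycle := 0; hCycle < 2; hCycle++ {      // 2 H cycles
        for lStep := 0; lStep < 3; lStep++ {     // 3 L updates each
            combined := fuse(net.LState, net.HState)  // L sees the current H context
            net.LState = net.LUpdate(combined)
            net.Latent = updateLatent(net.Latent, net.LState)
        }
        hInput := fuse(net.HState, net.LState)
        net.HState = net.HUpdate(hInput)          // 1 H update
        net.Latent = updateLatent(net.Latent, net.HState)
    }
    // 2 cycles × (3 L + 1 H) = 8 total recurrent updates per token
    return finalOutput(net.Latent, input)
}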

Code Implementation 2: Training Cost Comparison

# Full code: outputs/代码/hrm_cost_scaling_analysis.py
# Units: params in billions, tokens in trillions, cost in USD (article's estimates).
models = [
    {"name": "HRM-Text 1B",  "params": 1.0,  "tokens": 0.04,  "cost": 1500},
    {"name": "Llama 3.2 3B", "params": 3.0,  "tokens": 9.0,   "cost": 70000},
    {"name": "Qwen3.5 2B",   "params": 2.0,  "tokens": 18.5,  "cost": 120000},
    {"name": "Qwen3.5 72B",  "params": 72.0, "tokens": 36.0,  "cost": 1000000},
]

baseline = models[0]  # HRM-Text is the reference point for every ratio
for m in models[1:]:
    # Print the full model name so the two Qwen3.5 rows stay distinguishable.
    print(f"HRM vs {m['name']}: "
          f"Cost={m['cost'] / baseline['cost']:.0f}x, "
          f"Data={m['tokens'] / baseline['tokens']:.0f}x, "
          f"Params={m['params'] / baseline['params']:.1f}x")

Output: Cost ratios range from 47× (Llama 3.2 3B) to 667× (Qwen3.5 72B). Data ratios range from 225× to 900×.

Code Implementation 3: Scaling Law Counter-Example

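# Full code: outputs/代码/hrm_cost_scaling_analysis.py
class ScalingLawCounterExample:
    """Toy power-law extrapolation anchored at Llama 3.2 3B (3B params, 9T tokens)."""

    def __init__(self):
        self.param_exponent = 0.28  # Parameter scaling exponent (assumed by this analysis)
        self.data_exponent = 0.15   # Data scaling exponent (assumed by this analysis)

    def expected_performance(self, params_b, tokens_t):
        # params_b is in billions of parameters, tokens_t in trillions of tokens
        scale = ((params_b / 3.0) ** self.param_exponent *
                 (tokens_t / 9.0) ** self.data_exponent)
        return 70.0 * scale  # Llama 3.2 3B baseline = 70

law = ScalingLawCounterExample()
predicted = law.expected_performance(params_b=1.0, tokens_t=0.04)
actual = 70.8  # HRM actual average
print(f"Scaling Law prediction: {predicted:.1f}")                     # 22.8
print(f"Exceeds expectation: {(actual / predicted - 1) * 100:.0f}%")  # 210%
print(f"Breakthrough factor: {actual / predicted:.2f}x")              # 3.10x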

Training Cost Comparison: Orders of Magnitude

Model	Params	Training Tokens	Cost (USD)	Cost Ratio	Data Ratio
HRM-Text 1B	1B	40B	$1,500	1×	1×
Llama 3.2 3B	3B	9T	~$70,000	47×	225×
Qwen3.5 2B	2B	18.5T	~$120,000	80×	462×
Gemma3 4B	4B	12T	~$90,000	60×	300×
OLMo3 7B	7B	15T	~$150,000	100×	375×
Qwen3.5 72B	72B	36T	~$1,000,000	667×	900×

Notably, HRM-Text’s MATH score (56.2) exceeds Qwen3.5 2B, Llama 3.2 3B, Gemma3 4B, and OLMo3 7B—all 2-7× larger and 47-100× more expensive to train.

Source: HRM-Text paper, Table 1

Challenging Scaling Law Systematically

The Core Assumption of Scaling Laws

Scaling Laws (Kaplan et al., 2020; Hoffmann et al., 2022) describe a power-law relationship between loss and model size, data, and compute: larger models, more data, and more compute yield predictably better performance. This has driven the entire field for five years.

HRM’s Counter-Example: Compute Depth as a New Axis

HRM-Text is a counter-example suggesting that Scaling Laws leave out a critical dimension: computational structure.

Standard Transformers stack parameters. HRM reuses parameters through recurrence, letting limited parameters participate in deeper effective computation.

Effective compute depth = Parameters × Recurrence per parameter

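A standard Transformer applies each layer's parameters once per forward pass. Under the schedule described earlier (2 high-level cycles, each with 3 L updates and 1 H update), HRM applies the L-module's parameters 6 times and the H-module's parameters 2 times, for 8 module applications per token. The "1B HRM parameters ≈ 8B standard Transformer parameters" equivalence is therefore an upper bound: it would hold only if every parameter took part in all 8 updates. The saving is in parameter count and memory rather than in per-token compute, since every recurrent update still costs a pass through that module.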
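Counting module applications under the schedule from the paper's configuration (2 high-level cycles, each with 3 L updates and 1 H update) makes the reuse explicit. The per-module layer counts in the sketch below are hypothetical placeholders, since the article does not state them, and the function names are illustrative.

```python
H_CYCLES = 2             # high-level cycles per forward pass (from the article)
L_UPDATES_PER_CYCLE = 3  # L-module updates inside each cycle
H_UPDATES_PER_CYCLE = 1  # H-module updates at the end of each cycle

def module_applications(h_cycles=H_CYCLES,
                        l_per_cycle=L_UPDATES_PER_CYCLE,
                        h_per_cycle=H_UPDATES_PER_CYCLE):
    """How many times each module's weights are applied in one forward pass."""
    return {"L": h_cycles * l_per_cycle, "H": h_cycles * h_per_cycle}

def effective_depth(layers_l, layers_h):
    """Layer applications per forward pass for given per-module layer counts."""
    apps = module_applications()
    return layers_l * apps["L"] + layers_h * apps["H"]

apps = module_applications()
print(apps, "total updates:", sum(apps.values()))

# Hypothetical layer counts, chosen only to illustrate the arithmetic.
print("stored layers:", 4 + 4, "effective depth:", effective_depth(4, 4))
```

With these placeholder counts, 8 stored layers behave like a 32-layer stack in depth. The reuse factor is 6 for L-module weights and 2 for H-module weights, so the overall multiplier depends on how parameters are split between the two modules.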

This is why HRM achieves competitive performance at a fraction of Scaling Law-predicted costs: it’s not fighting Scaling Law—it’s opening a new growth axis called compute depth.

Limitations

HRM-Text is not a complete replacement for Scaling Law:

Limited knowledge coverage: MMLU score (60.7) lags behind Qwen3.5 2B (64.5)
Higher per-token inference cost: 8 recurrent updates per forward pass increase per-token computation
Fixed recurrence depth: No adaptive computation time yet
Unknown scalability: Larger-scale HRM behavior remains to be explored

Industry Impact: Democratizing AI

HRM-Text’s implications extend far beyond academic benchmarks.

Breaking the “Scale Religion”

The AI industry has operated under an almost religious belief: only bigger models, more GPUs, and more expensive training produce better AI. Two consequences: AI research concentrates in a few mega-corporations, and “no 1000 H100s, no foundation model” becomes unwritten law.

HRM-Text shatters this myth with $1,500. It proves that architectural innovation beats scale-stacking on efficiency.

Democratizing Foundation Model Research

When training a competitive foundation model drops from millions to $1,500, AI research is no longer the exclusive domain of tech giants. Research institutes, startups, and university labs can participate. As Sapient CEO Guan Wang put it: “When training a capable foundation model costs $1,500, AI stops being an infrastructure question and becomes a strategy question.”

Edge Inference Becomes Real

At int4 quantization, HRM-Text occupies just 0.6 GiB—fitting on modern smartphones. This enables privacy-sensitive applications (healthcare, finance, legal) where data never leaves the device.
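The footprint figure can be checked with simple arithmetic. The sketch below counts weight storage only and ignores quantization scales, higher-precision embeddings, and runtime buffers, all of which would add to the total. The function names are illustrative.

```python
GIB = 2 ** 30  # bytes per GiB

def weight_footprint_gib(num_params, bits_per_weight):
    """Storage for the weights alone, ignoring quantization scales and runtime buffers."""
    return num_params * bits_per_weight / 8 / GIB

def params_for_footprint(gib, bits_per_weight):
    """Parameter count that would fill `gib` GiB at the given precision."""
    return gib * GIB * 8 / bits_per_weight

print(f"1e9 params at int4: {weight_footprint_gib(1e9, 4):.2f} GiB")
print(f"1e9 params at fp16: {weight_footprint_gib(1e9, 16):.2f} GiB")
print(f"0.6 GiB at int4 holds about {params_for_footprint(0.6, 4) / 1e9:.2f}B params")
```

Exactly one billion 4-bit weights take a little under 0.5 GiB, so the quoted 0.6 GiB is consistent with a parameter count somewhat above 1B, with some tensors kept at higher precision, or with per-group quantization metadata. The article does not say which.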

Academic Reception

HuggingFace CEO Endorsement

The model weights and code are fully open-sourced on GitHub and HuggingFace.

Bengio’s GRAM Follow-up

On May 19, 2026, Turing Award winner Yoshua Bengio co-authored “Generative Recursive Reasoning” (GRAM), which directly extends HRM’s hierarchical recurrent approach with probabilistic multi-trajectory reasoning.

Reasoning-Knowledge Decoupling

The team hints at early results in decoupling reasoning from knowledge storage—compact recurrent models focusing on computation while external retrieval handles facts. If successful, this could yield models with fewer parameters, stronger reasoning, and more controllable knowledge.

Conclusion: The Road Beyond Scaling

HRM-Text’s value isn’t just its benchmark scores or its training budget. It provides a reproducible, verifiable, and falsifiable case study: beyond scaling model size, redesigning computational structure can fundamentally change the relationship between performance, cost, and capability.

If the last decade’s growth axis was parameter scale, data scale, and compute scale, HRM explores a deeper question: Can the computational process itself become a new growth axis?

In an industry profoundly shaped by Scaling Law, this possibility alone is significant. The next generation of intelligent systems may grow not just from more parameters, more data, and more compute, but from answering a more fundamental question: how should a model think?

References:

Wang et al., “HRM-Text: Efficient Pretraining Beyond Scaling”, arXiv:2605.20613, 2026
Sapient Intelligence: sapient.inc/introducing-hrm-text
Machine Heart (Jiqizhixin) coverage, June 9, 2026
36Kr in-depth report, June 9, 2026
Bengio et al., “Generative Recursive Reasoning”, arXiv, May 2026
Kaplan et al., “Scaling Laws for Neural Language Models”, 2020
Hoffmann et al., “Training Compute-Optimal Large Language Models”, 2022