Analysis · April 10, 2026 · MuseSpark Community

Muse Spark Benchmarks Deep Dive: Strengths, Weaknesses & What They Mean

Muse Spark arrived with bold claims. Alexandr Wang said it represents “a new approach to intelligence.” The benchmarks tell a more nuanced story: dominant in some areas, trailing significantly in others. Here’s what actually matters.


The Full Benchmark Table

Benchmark                Muse Spark   GPT-5.4   Gemini 3.1 Pro   Claude Opus 4.6
AA Intelligence Index    52           57        57               53
HealthBench Hard         42.8 (#1)    40.1      20.6             n/a
MMMU-Pro (vision)        80.5%        n/a       n/a              n/a
TerminalBench (coding)   59.0         75.1      68.5             n/a
ARC-AGI-2 (reasoning)    42.5         76.1      n/a              n/a
GDPval-AA (agentic)      1427         1676      1320             1648
Output tokens used       58M          120M      57M              157M

Source: Meta AI / Artificial Analysis, April 2026. n/a = score not reported in the source; the best competing MMMU-Pro score is 82.4%.

Where Muse Spark Leads

1. Medical & Health Intelligence

Muse Spark scores 42.8 on HealthBench Hard — first place, ahead of GPT-5.4 at 40.1. Gemini barely registers at 20.6. This isn’t luck. Meta worked with over 1,000 physicians on the training data. The result is a model that genuinely understands clinical reasoning, not just medical vocabulary. For any application touching health decisions, Muse Spark is currently the strongest model available.

2. Token Efficiency

Muse Spark achieves an AA Intelligence Index of 52 using only 58M output tokens. Claude Opus 4.6 uses 157M tokens for a score of 53: nearly three times the output tokens for one extra index point. When Muse Spark’s API launches, this efficiency advantage could translate directly into lower costs for developers at scale.
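A quick way to sanity-check that claim is to normalize each score by the tokens spent earning it. A minimal sketch, using only the figures from the benchmark table above:

    # Output tokens (millions) and AA Intelligence Index per model,
    # taken directly from the benchmark table above.
    models = {
        "Muse Spark":      {"tokens_m": 58,  "index": 52},
        "GPT-5.4":         {"tokens_m": 120, "index": 57},
        "Gemini 3.1 Pro":  {"tokens_m": 57,  "index": 57},
        "Claude Opus 4.6": {"tokens_m": 157, "index": 53},
    }

    # Rank by output tokens spent per index point (lower = leaner).
    for name, m in sorted(models.items(),
                          key=lambda kv: kv[1]["tokens_m"] / kv[1]["index"]):
        print(f"{name:<16} {m['tokens_m'] / m['index']:.2f}M tokens per index point")

By this ratio Gemini 3.1 Pro (1.00M tokens per point) is comparably lean; the stark contrast is with Claude Opus 4.6, which spends roughly 2.7× Muse Spark’s tokens for one extra index point.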

Where Muse Spark Falls Short

1. Abstract Reasoning (ARC-AGI-2)

This is the hardest gap to ignore. 42.5 vs GPT-5.4’s 76.1 is a 33.6-point deficit. ARC-AGI-2 specifically tests novel, out-of-distribution reasoning: tasks designed so that pattern memorization alone cannot work. Closing a gap like this typically requires architectural changes, not just more data, which also makes it the hardest kind of gap to close through fine-tuning.

2. Coding (TerminalBench)

At 59.0, Muse Spark trails GPT-5.4 (75.1) and Gemini 3.1 Pro (68.5) by a meaningful margin. TerminalBench tests real terminal-based coding tasks — the kind of work software engineers actually do. For developer tooling and code generation, Muse Spark is not the right choice today.

3. Agentic Tasks (GDPval-AA)

GDPval-AA measures performance on real-world work tasks delegated to AI agents. Muse Spark scores 1427 vs Claude Opus 4.6’s 1648 and GPT-5.4’s 1676. If you’re building autonomous agents that handle multi-step work tasks, the gap matters: that 221- to 249-point deficit represents real failures on complex pipelines.
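To see why a gap of a couple hundred points can matter on multi-step work, consider how per-step reliability compounds. The sketch below uses illustrative per-step success rates (0.97 vs 0.94 are assumptions for the example, not figures derived from GDPval-AA):

    def pipeline_success(per_step: float, n_steps: int) -> float:
        """Probability an n-step agent pipeline completes, assuming
        independent steps that each succeed with probability per_step."""
        return per_step ** n_steps

    # 0.97 vs 0.94 per-step success are illustrative assumptions,
    # not rates derived from the GDPval-AA scores above.
    for steps in (5, 10, 20):
        a, b = pipeline_success(0.97, steps), pipeline_success(0.94, steps)
        print(f"{steps:>2} steps: {a:.0%} vs {b:.0%} end-to-end success")

A 3-point gap per step becomes a 25-point gap in end-to-end completion by 20 steps (54% vs 29%); that is the kind of difference a few hundred benchmark points can hide.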

4. Overall Intelligence Index

An AA Index of 52 puts Muse Spark fourth overall — behind GPT-5.4 (57), Gemini 3.1 Pro (57), and Claude Opus 4.6 (53). It’s competitive, but it’s not the top general-purpose model. The headline “#1 HealthBench” is real; the broader “best model overall” claim is not.

What This Means for You

Use Case                          Best Choice
Healthcare / medical AI           Muse Spark
Vision / multimodal tasks         Muse Spark (competitive, #2 overall)
Coding / software development     GPT-5.4 or Gemini 3.1 Pro
Novel reasoning / logic puzzles   GPT-5.4 (far ahead)
General chat / writing            All models perform comparably
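If you want the table as a default in application code, a minimal routing sketch follows. The use-case labels and the route() helper are hypothetical scaffolding, not any vendor’s API:

    # Encodes the use-case table above as a routing default. The labels
    # and route() helper are hypothetical, not any vendor's API.
    BEST_CHOICE = {
        "healthcare":      "Muse Spark",
        "vision":          "Muse Spark",   # competitive, #2 overall
        "coding":          "GPT-5.4",      # or Gemini 3.1 Pro
        "novel_reasoning": "GPT-5.4",      # far ahead on ARC-AGI-2
        "general":         "any",          # models perform comparably
    }

    def route(use_case: str) -> str:
        """Return the benchmark-backed default model for a use case."""
        return BEST_CHOICE.get(use_case, "any")

    assert route("healthcare") == "Muse Spark"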

The Token Efficiency Angle

At 58M output tokens for an index score of 52, Muse Spark sits among the most computationally efficient frontier models evaluated. Compare this to Claude Opus 4.6 at 157M tokens for 53, or GPT-5.4 at 120M tokens for 57. Efficiency doesn’t show up in leaderboard rankings, but it shows up in invoices. If Meta prices its API to reflect this efficiency, Muse Spark could become the default cost-optimized choice for high-volume applications — particularly in the healthcare space where it already leads on quality.
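Here is what that invoice math looks like. Muse Spark’s API pricing has not been announced, so the per-million-token prices below are placeholder assumptions; only the token counts come from the table:

    # Back-of-the-envelope cost of the benchmark run. Prices are
    # placeholder assumptions (Muse Spark has no public API pricing yet);
    # token counts are the totals from the table above.
    price_per_m_output = {"Muse Spark": 10.00, "Claude Opus 4.6": 15.00}  # assumed $/1M tokens
    tokens_used_m      = {"Muse Spark": 58,    "Claude Opus 4.6": 157}    # from the table

    for model, tokens in tokens_used_m.items():
        print(f"{model:<16} ${price_per_m_output[model] * tokens:>8,.2f}")

Even if both models were priced identically per token, the 58M-vs-157M gap alone would make the same evaluation workload roughly 2.7× cheaper on Muse Spark.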

Contemplating Mode: The Unknown Variable

Muse Spark’s Contemplating mode, its extended-reasoning setting, has not launched yet. This is the mode most likely to move the needle on ARC-AGI-2. The Thinking mode already available is a step up from Instant, but Contemplating represents a qualitatively different compute budget: the model is expected to reason through problems over extended chains before producing an answer.

GPT-5.4’s 76.1 on ARC-AGI-2 was achieved with its reasoning mode active. If Contemplating provides a comparable boost to Muse Spark’s baseline, the 33.6-point gap could narrow significantly. This is speculative, but it’s the single most important unknown in the current benchmark picture.


See how the features stack up beyond benchmarks, try Muse Spark for free, or read the full launch article.
