Artificial Intelligence Tuesday, 17 February 2026

Alibaba's 397 Billion Parameter Model Uses Only 17 Billion at a Time

Alibaba just released Qwen3.5, and the numbers tell an interesting story about where large language models are heading. The headline figure is 397 billion parameters - massive by any measure. But here's what makes this different: only 17 billion parameters activate for any given token.

This isn't just clever engineering for its own sake. It's a direct response to one of the biggest problems in deploying large models: they're phenomenally expensive to run at scale.

What Sparse Activation Actually Means

Traditional dense models activate every parameter for every calculation. A 100-billion-parameter model uses all 100 billion parameters, all the time. Qwen3.5 takes a different approach - it routes each token through only the parts of the network that matter for that specific calculation.

Think of it like a massive reference library where you don't need to consult every book for every question. You go to the section that's relevant. The model learns which parts of itself to activate based on what you're asking it to do.
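The routing idea can be sketched as a tiny gating function. This is a hypothetical illustration of Mixture-of-Experts-style token routing, not Qwen3.5's actual implementation - the expert count, `top_k` value, and function names are all illustrative assumptions:

```python
# Illustrative sketch of Mixture-of-Experts token routing.
# All details (8 experts, top_k=2) are assumptions for the example,
# not Qwen3.5's published architecture.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and weight their outputs.

    router_logits: one score per expert, produced by a small learned
    gating network. Only the chosen experts' parameters run for this
    token; the rest of the network stays idle.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Re-normalise gate weights over just the selected experts.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

# One token, eight experts: the router strongly prefers experts 2 and 5,
# so only those two run - the other six stay idle for this token.
logits = [0.1, -1.2, 2.3, 0.0, -0.5, 1.9, 0.2, -2.0]
print(route_token(logits))  # experts 2 and 5, weights ~0.60 and ~0.40
```

The gating network itself is learned during training, which is how the model "learns which parts of itself to activate".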

The result? Alibaba reports throughput improvements of 8.6 to 19 times over comparable dense models. That's not a marginal gain - it's the difference between a model that's practically deployable and one that sits in a research lab.

Native Multimodal and Extended Context

Qwen3.5 handles text, images, and other modalities natively, rather than bolting vision capabilities onto a text model after the fact. This matters because models that learn multiple modalities together tend to develop better representations of both.

The one-million-token context window is worth noting too. Extended context has become table stakes for frontier models, but actually making it work efficiently at that scale - especially with sparse activation - is non-trivial engineering.

For practical applications, this means you can feed in entire codebases, long documents, or extended conversations without constantly summarising or losing important details.

Why This Matters Beyond the Benchmarks

The interesting bit isn't just that Alibaba built a massive model. It's that they built a massive model that businesses might actually be able to afford to run.

Deployment costs have been the elephant in the room for large models. You can build something that scores brilliantly on benchmarks, but if it costs thousands per hour to serve, its real-world utility is limited to very specific high-value applications.

Sparse activation changes that equation. You get model capacity that scales with complexity - the network can be huge when it needs to be, efficient when it doesn't. For businesses evaluating whether to deploy larger models, this shifts the cost-benefit analysis significantly.
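A back-of-envelope calculation shows why. Using the common rule of thumb that a transformer forward pass costs roughly 2 FLOPs per active parameter per token (an estimate, not a measurement), the parameter counts from the article imply:

```python
# Rough inference-cost estimate using the common ~2 FLOPs per
# parameter per token approximation for a transformer forward pass.
# The parameter counts come from the article; the rest is a
# back-of-envelope estimate, not a benchmark.

def flops_per_token(params):
    return 2 * params

dense = flops_per_token(397e9)   # all 397B parameters active
sparse = flops_per_token(17e9)   # only 17B active per token

print(f"dense:  {dense / 1e9:.0f} GFLOPs/token")
print(f"sparse: {sparse / 1e9:.0f} GFLOPs/token")
print(f"ratio:  {dense / sparse:.1f}x")  # ~23x fewer FLOPs, ideally
```

The ideal ~23x compute reduction is the same order of magnitude as the reported 8.6-19x throughput gains; real throughput falls short of the raw FLOPs ratio because memory bandwidth, routing overhead, and batching all take their cut.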

There's also a broader pattern here. We're seeing multiple approaches to the same problem: making large models practical. Mixture of Experts, sparse attention, quantisation, distillation - the field is converging on the idea that bigger isn't always better, but selective bigness might be.

Qwen3.5 sits in that conversation as a credible technical contribution. The proof will be in real-world deployment, but the engineering here is sound and the performance claims are backed by published benchmarks. Worth watching how this plays out over the next few months.

Today's Sources

Alibaba Qwen3.5-397B: Massive Models, Efficient Performance
Qwen3.5 Plus Available on Multiple AI Platforms
César de la Fuente: AI for Antibiotic Discovery
NVIDIA GB300 Blackwell Ultra Reduces AI Token Costs by 35x
Human-in-the-Loop AI Agents with Explicit Approval
Agentic AI for Insurance Underwriting with Self-Critique
Iceberg Quantum's Pinnacle Architecture: RSA-2048 in 100k Qubits
Majorana Qubits Successfully Demonstrated and Manipulated
Entanglement in Spin Chains Remains Finite at Any Temperature
No-Go Theorem on Fault-Tolerant Clifford Gadgets
Light-Matter Coupling Creates New Quasiparticles
Optical Switch Protocol Verifies Entanglement Without Destroying States
Running Local LLMs with Ollama and NeuroLink
O(n) Methods to Check if List is Sorted
Building Your Own Circuit Breaker in Spring Boot
Database Transactions Explained: Isolation, Concurrency, and Locking
Password Hashing After Bcrypt: Learning Argon2 and Scrypt
WebMCP: Standard Protocol for AI Agents to Access Web Tools

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.


© 2026 MEM Digital Ltd t/a Marbl Codes