Sub‑Level Intelligence: Using Cheap Reasoning Models To Quietly Upgrade Your Stack

Table Of Contents
- TL;DR
- The Latency‑for‑Accuracy Trade‑Off That Everyone Is Ignoring
- Frontier Versus Sub‑Level: Two Tiers Of Intelligence
- Why Kimi K2 Thinking Keeps Showing Up In The Benchmarks
- Cheap Intelligence Is A New Primitive
- Concrete Sub‑Level Use Cases
- Entropy, Intelligence And Controlled Chaos
- Designing A Stack Where Not Everything Is Real‑Time
- Where This Is Going
- Related posts
- Sources
TL;DR
Cheap but smart reasoning models like Kimi K2 Thinking, DeepSeek V3.2 and Gemini Flash belong in the background of your stack, not the front. Treat them as a sub‑level intelligence layer that quietly handles routing, chunking, segmentation and batch analysis while your expensive frontier models stay focused on user-facing work. You trade latency for accuracy (where nobody is waiting), use cheap intelligence as a new primitive, and design your architecture so not everything has to be real-time.
Most teams still design AI systems as if every request deserves their smartest, most expensive model. You wire everything to GPT‑5.x, Claude Sonnet 4.5, Gemini 3 Pro or whatever your current "frontier" favorite is, then start shaving costs at the edges.
That’s backwards.
If intelligence has become cheap in the right places, the real question is not "Where can we afford to use smart models?" but "Where are we wasting frontier capacity on work that doesn’t need it?" The answer, in most modern AI stacks, is: everywhere beneath the user-facing layer.
This is where sub‑level intelligence comes in.
Sub‑level intelligence is a strategy for pushing reasoning down into the background of your system, and doing it with models that are radically cheaper, slower and more specialized than your flagship frontier models. Instead of treating "background" as dumb glue and ETL, you treat it as a quiet layer of thought: group chat routing, file chunking, podcast segmentation, nightly cleanups and batch analysis, all handled by models that are "slow but smart enough."
It’s a different way of thinking about where reasoning lives in your stack.

See also: DeepSeek V3.2 Speciale: The Open Source Thinking Model That Can’t Use Tools (And Why That’s a Feature) and Kimi K2 Tool Calling: The Config Traps That Break Multi‑Tool Agents (and How to Fix Them).
The Latency‑for‑Accuracy Trade‑Off That Everyone Is Ignoring
When we talk about model performance, we mostly obsess over user-visible latency. First token time. Tokens per second. That’s valid for the top of the stack: a 30‑second pause before a chat reply is unacceptable if a human is staring at the screen.
But most of your system is not a live chat.
Nightly jobs don’t care if something takes 3 seconds or 30. File preprocessing can soak in latency while the user is asleep. Podcast segmentation doesn’t need to stream in real time; it needs to be right. Group chat speaker selection needs to be good enough that the right agent picks up the baton, not instantly perfect on the very first token.
Once you separate "human waiting" from "system thinking," a different trade‑off emerges: you can spend latency to buy accuracy, depth of reasoning and better tool use at orders of magnitude lower cost.
Cheap thinking models like Kimi K2 Thinking, DeepSeek V3.2, Gemini 2.5 Flash and now Gemini 3 Flash live exactly in that pocket. They’re not optimized to be your primary concierge. They’re optimized to be your back‑office analyst who can sit with a problem for a while.
In other words: when nobody is looking, you can afford to let your system think.

Frontier Versus Sub‑Level: Two Tiers Of Intelligence
A useful way to picture this is as a two‑tier architecture.
At the top, you have your expensive, frontier models: GPT‑5.2, Claude Opus, Gemini 3 Pro, Sonnet 4.5. They handle user‑facing interactions, complex orchestration and any real‑time critical path. They need to be fast, coherent and reliable under pressure, and you pay accordingly. In benchmarks like GPQA Diamond, ARC‑AGI‑2, CharXiv and HLE with tools, they sit at or near the top of the charts.
Underneath that layer, you can carve out a sub‑level: cheap, smart, background intelligence. These are models like Kimi K2 Thinking, DeepSeek V3.2 and the Flash family that are dramatically cheaper per million tokens and still very strong on reasoning, especially when they can use tools.
This sub‑level is where you offload tasks such as group chat speaker selection, breaking a file into useful chunks, turning a raw transcript into segments, running nightly batch processing, doing file analysis and preprocessing, and making "minor but intelligent" decisions that glue the rest of the system together.
The crucial shift is conceptual. You stop thinking of these tasks as dumb pre/post‑processing and treat them as reasoning work that just doesn’t need to be real‑time or user‑visible.
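The two tiers can be sketched as a tiny router. The model identifiers and the task flags below are illustrative assumptions, not a real API; the point is only that the routing decision keys on "is a human waiting," not on task difficulty.

```python
from dataclasses import dataclass

# Illustrative model identifiers -- swap in whatever your providers expose.
FRONTIER_MODEL = "gemini-3-pro"
SUB_LEVEL_MODEL = "kimi-k2-thinking"

@dataclass
class Task:
    name: str
    is_user_facing: bool   # a human is actively waiting on the result
    needs_realtime: bool   # the result sits on a latency-critical path

def pick_tier(task: Task) -> str:
    """Route user-facing or real-time work to the frontier tier,
    and everything else to cheap background reasoning."""
    if task.is_user_facing or task.needs_realtime:
        return FRONTIER_MODEL
    return SUB_LEVEL_MODEL

# Chat replies stay on the frontier; nightly chunking drops a tier.
assert pick_tier(Task("chat_reply", True, True)) == FRONTIER_MODEL
assert pick_tier(Task("nightly_chunking", False, False)) == SUB_LEVEL_MODEL
```

In practice the predicate grows (budget caps, escalation on failure), but the shape stays the same: the tier is chosen by who is waiting, not by how hard the task is.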

See also: Reasoning Podcasts: AI Debates Where You Can Hear Them Think for a concrete UX built on this layered pattern.
Why Kimi K2 Thinking Keeps Showing Up In The Benchmarks
Kimi K2 Thinking is a good example of what a sub‑level specialist looks like in practice.
On paper, it’s not the fastest model on the market. Artificial Analysis reports around 81 tokens per second with a substantial "thinking" delay before the first token for non‑streaming output. That looks bad if you’re optimizing for chat snappiness.
But that delay is where the model is doing its work.
When you look at the benchmarks that actually exercise reasoning and tool use, K2 Thinking starts to shine. Moonshot’s own reported results, aggregated by FelloAI, show it outperforming or matching frontier models like GPT‑5 and Claude 4.5 on benchmarks like Humanity’s Last Exam, HMMT 2025, GPQA‑Diamond and BrowseComp, while staying competitive on code benchmarks like SWE‑Bench Verified and LiveCodeBench.
Third‑party evaluations reinforce this. Artificial Analysis’s τ²‑Bench Telecom, which specifically stress‑tests agentic tool use, has Kimi K2 Thinking at the top, with a 93% score versus 87% for GPT‑5. That gap might not sound huge until you remember this is a benchmark designed to mirror complex, multi‑step, tool‑heavy workflows. Those are exactly the workflows you want handled by sub‑level intelligence.
NIST’s CAISI evaluation further confirms that K2 Thinking is not just a "cheap toy," but a model that can hold its own in serious reasoning tasks alongside incumbents like GPT‑5 and Opus 4.
Taken together, the picture is clear: K2 Thinking is slower to speak, but very good at what it says—especially when it’s allowed to reason quietly and call tools.
That is exactly what you want in background tasks.
Cheap Intelligence Is A New Primitive
Why does this matter? Because the cost curves have shifted.
Look at the updated pricing landscape. Gemini 3 Flash arrives with input at roughly $0.50 and output at $3.00 per million tokens, with a 1‑million‑token context window. Gemini 2.5 Flash lands even lower on input, around $0.30, while Kimi K2 Thinking sits around $0.60 input and $2.50 output with 256K context and open weights. DeepSeek V3.2 drops even further in price, around $0.27 input and $1.10 output for 128K context.
Compare that to frontier models like Gemini 3 Pro, GPT‑5.2 Extra High, Claude Sonnet 4.5. You are paying an order of magnitude more for each million tokens, often for capabilities that you only truly need on the top layer.
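A back-of-the-envelope calculation makes the gap concrete. The sub-level prices are the ones quoted above; the frontier rate is a hypothetical placeholder standing in for "an order of magnitude more," not a quoted price.

```python
def batch_cost(input_mtok: float, output_mtok: float,
               price_in: float, price_out: float) -> float:
    """Dollar cost of a batch: token counts in millions,
    prices in $ per million tokens."""
    return input_mtok * price_in + output_mtok * price_out

# Example nightly batch: 100M input tokens, 20M output tokens.
deepseek = batch_cost(100, 20, 0.27, 1.10)   # DeepSeek V3.2 rates from above
k2 = batch_cost(100, 20, 0.60, 2.50)         # Kimi K2 Thinking rates from above
frontier = batch_cost(100, 20, 5.00, 25.00)  # hypothetical frontier-tier rates

print(f"DeepSeek: ${deepseek:.2f}, K2: ${k2:.2f}, frontier: ${frontier:.2f}")
```

At these assumed rates the same overnight batch costs about $49 on DeepSeek, $110 on K2 Thinking, and $1,000 on the frontier tier, which is the difference between "run it every night" and "run it when someone approves the spend."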
The point is not that you should never use GPT‑5.2 or Claude 4.5. The point is that intelligence has become cheap enough in the mid‑tier that it is rational to sprinkle it liberally in places you previously treated as dumb logic.
If a sub‑level model can think for you overnight, why are you still hard‑coding heuristics?
If a background model can handle group chat speaker selection based on context, roles and goals, why is that logic still a bunch of brittle if‑else trees?
If a "slow" model can chunk files more intelligently, preserving semantic boundaries and task relevance, why are you still splitting on token counts?
The moment you accept that cheap reasoning is a primitive—something you can call as casually as a database query—the architecture you design starts to change.
Concrete Sub‑Level Use Cases
The founder’s quotes in the original strategy document sketch a clear set of jobs for this sub‑level layer.
One is group chat speaker selection. In a multi‑agent system, the hardest problem is often not what to say, but who should speak next and why. You want a model that can sit above the agents, read the conversation, understand the user’s goal and decide which agent is best suited to take the next turn. That is reasoning work, but it doesn’t need to return in under a second. It just needs to be consistently good, and it benefits heavily from tool use.
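A minimal sketch of that pattern: build a routing prompt over the roster and recent turns, send it to a sub-level model, and validate the answer against the known agents. The agent names, roles, and fallback choice here are all invented for illustration; the model call itself is omitted.

```python
# Hypothetical agent roster -- names and roles are illustrative.
AGENTS = {
    "researcher": "digs into sources and citations",
    "coder": "writes and reviews code",
    "planner": "breaks goals into next steps",
}

def build_routing_prompt(history: list[str], goal: str) -> str:
    """Assemble the context a sub-level model needs to pick the next speaker."""
    roster = "\n".join(f"- {name}: {role}" for name, role in AGENTS.items())
    transcript = "\n".join(history[-10:])  # recent turns are usually enough
    return (
        f"Agents:\n{roster}\n\nUser goal: {goal}\n\n"
        f"Conversation so far:\n{transcript}\n\n"
        "Reply with exactly one agent name that should speak next."
    )

def parse_choice(raw: str, default: str = "planner") -> str:
    """Accept the model's pick only if it names a known agent;
    otherwise fall back to a safe default."""
    choice = raw.strip().lower()
    return choice if choice in AGENTS else default

assert parse_choice("coder") == "coder"
assert parse_choice("some rambling answer") == "planner"
```

The guardrail in `parse_choice` is the important part: the sub-level model proposes, but only a validated name ever reaches the orchestrator.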
Another is content chunking. When a user uploads a long file—a transcript, a PDF, an internal spec—you want to break it into segments that are both semantically meaningful and task‑aware. Naive splitting by length wastes context and breaks apart ideas. Sub‑level intelligence can read the entire document, think about how it maps to future tasks, and generate chunks that match the work you’ll actually do later.
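Even the deterministic half of chunking can respect boundaries. The sketch below packs whole paragraphs into a character budget and never splits inside one; in the sub-level pattern described above, a model would propose the boundaries instead of the blank-line heuristic used here, and the budget would be in tokens rather than characters.

```python
def chunk_by_paragraphs(text: str, budget: int = 400) -> list[str]:
    """Pack whole paragraphs into chunks of roughly `budget` characters,
    never splitting inside a paragraph. A sub-level model would replace
    the blank-line split with proposed semantic boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > budget:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nA much longer body paragraph " + "x" * 400 + "\n\nClosing note."
chunks = chunk_by_paragraphs(doc)
assert len(chunks) == 3
assert chunks[0] == "Intro paragraph."
```

Naive token-count splitting would have cut the long middle paragraph in half; boundary-respecting packing keeps each idea whole, and a model-proposed boundary list slots into the same loop.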
The same goes for podcast processing. You might use a local stack that combines K2 Thinking and DeepSeek to turn a raw audio transcript into segments, titles and show notes. The listener doesn’t care how long it took your system to think overnight. They care that the next morning, their podcast is sliced into coherent parts that feel like a human editor touched them.
Nightly batch processing is another huge zone. This is where you clean up logs, reconcile states, backfill embeddings, regenerate summaries, re‑rank content and fix subtle mis‑routes in your agent graph. None of this needs frontier latency. All of it benefits from cheap, deep reasoning.
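The nightly zone can be as simple as draining a queue, where each handler is free to call a slow sub-level model as many times as it needs. The job kinds and payloads below are illustrative stand-ins for whatever your pipeline actually runs.

```python
from collections import deque

# Illustrative nightly job queue -- kinds and payloads are assumptions.
NIGHTLY_JOBS = deque([
    {"kind": "backfill_embeddings", "payload": "docs/2024-q4"},
    {"kind": "regenerate_summaries", "payload": "podcast-ep-118"},
    {"kind": "rerank_content", "payload": "homepage-feed"},
])

def drain(jobs: deque, handlers: dict) -> list[str]:
    """Process every queued job; nobody is waiting, so each handler
    can take as long as its sub-level model needs to think."""
    done = []
    while jobs:
        job = jobs.popleft()
        handler = handlers.get(job["kind"])
        if handler is None:
            continue  # unknown kinds are skipped, not fatal, in a nightly run
        done.append(handler(job["payload"]))
    return done

# Stub handlers; real ones would invoke a cheap reasoning model per payload.
handlers = {k: (lambda p, k=k: f"{k}:{p}") for k in
            ("backfill_embeddings", "regenerate_summaries", "rerank_content")}
results = drain(NIGHTLY_JOBS, handlers)
assert len(results) == 3
```

A production version adds retries and budgets, but the structural point holds: the queue absorbs latency so the models can spend it on accuracy.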
File analysis and preprocessing lives in the same bucket. Instead of handing raw documents directly to the high‑level orchestrator, you can have a sub‑level model annotate them, classify them, extract schemas and pre‑compute the scaffolding that makes later tasks easier and cheaper.
You can treat all of this as one category: background task intelligence. It’s the brain you run at night and between user interactions, constantly reshaping the substrate your main agents operate on.
Entropy, Intelligence And Controlled Chaos
There is a real concern here: every time you add more intelligent components to your system, you add entropy. More models, more variability, more places for things to go weird.
The founder’s framing is helpful: you are trading entropy for intelligence. The bet is that if you craft the sub‑level properly, the intelligence you gain more than compensates for the randomness you introduce.
The key is constraint.
You do not give your sub‑level models full control over the user experience. You give them specific, well‑defined jobs with clear inputs and outputs. You let them propose chunk boundaries, routing decisions, nightly cleanup actions, but you still wrap them in guard rails and monitoring.
You also lean into their strengths. For example, Kimi K2 Thinking’s tool‑use profile is unusual: Moonshot recommends a specific configuration (temperature around 1.0, streaming enabled, and a high max_tokens) to get the best from its interleaved thinking and tool calling. That means you don’t just drop it into a generic framework and hope for the best. You talk to the provider directly, parse out the thinking tokens and the tool calls, and wire them into your agent stack intentionally.
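In request terms, that guidance looks roughly like the sketch below. The model id, the `reasoning_content` field name, and the max_tokens value are assumptions about one provider's wire format, not official values; only the temperature, streaming, and generous-token-budget recommendations come from the guidance cited above.

```python
# Request settings following the provider guidance described above.
# Model id and exact values are illustrative assumptions.
K2_THINKING_CONFIG = {
    "model": "kimi-k2-thinking",
    "temperature": 1.0,    # recommended sampling temperature
    "stream": True,        # stream so thinking tokens arrive as they are produced
    "max_tokens": 32_000,  # generous room for interleaved thinking + tool calls
}

def split_stream_event(event: dict) -> tuple[str, str]:
    """Separate thinking tokens from answer tokens so the agent stack can
    log reasoning without surfacing it to the user. The 'reasoning_content'
    key is an assumption about the streaming payload shape."""
    return event.get("reasoning_content", ""), event.get("content", "")

thinking, answer = split_stream_event(
    {"reasoning_content": "comparing tool options...", "content": ""}
)
assert thinking and not answer
```

The split function is where "respecting the native interaction model" becomes concrete: thinking goes to your logs and evaluators, while only the final content and tool calls flow onward.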
This is why it doesn’t always perform optimally inside off‑the‑shelf frameworks. It needs an environment that respects its native interaction model. When you give it that environment, you get a sub‑level actor that is remarkably good at orchestrating tools in the background.
The entropy you introduce is controlled. The intelligence you unlock is compounding.
Designing A Stack Where Not Everything Is Real‑Time
Once you accept that some parts of your system are allowed to be slow, you can design flows that explicitly separate "thinking time" from "reply time."
For user‑facing interactions, you still route to frontier models and keep your latencies tight. That’s the high‑level layer.
For everything else, you deliberately schedule work where latency is a non‑issue: nightlies, idle periods, background jobs. You queue up tasks that benefit from deep reasoning—re‑chunking content based on emerging usage patterns, re‑evaluating agent routing rules, re‑ranking actions based on new data—and let your sub‑level models grind through them at their own pace.
The cost is low enough that you can be generous. Running Gemini 3 Flash or Kimi K2 Thinking over your corpus overnight is no longer an extravagance. It’s a standard operating procedure.
You can even imagine a world where every major state change in your system kicks off a background "reflection" job: a cheap reasoning model looks at what just happened, how agents behaved, which tools were called, what worked and what didn’t, and writes back suggestions or corrections for the next day. That’s not science fiction anymore. It’s just a matter of wiring.
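The wiring for such a reflection job is mostly prompt assembly. The sketch below turns a day's event log into a review prompt for a cheap reasoning model; the event field names are invented for illustration, and the actual model call is omitted.

```python
import json

def build_reflection_prompt(events: list[dict]) -> str:
    """Turn a day's agent events into a review prompt for a sub-level model.
    Event field names here are illustrative, not a fixed schema."""
    log = "\n".join(json.dumps(e, sort_keys=True) for e in events)
    return (
        "You are a background reviewer. Below is today's agent activity.\n"
        "For each mis-step, suggest one concrete correction for tomorrow.\n\n"
        f"{log}"
    )

events = [
    {"agent": "router", "action": "picked coder", "outcome": "user wanted research"},
    {"agent": "chunker", "action": "split mid-table", "outcome": "broken context"},
]
prompt = build_reflection_prompt(events)
assert "background reviewer" in prompt and "split mid-table" in prompt
```

The model's suggestions would then land in a review queue or a config diff for the next day, keeping the reflection loop advisory rather than self-modifying.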
The frontier models then become the sharp edge of a much larger blade, one that is constantly honed by cheaper thinking underneath.
Where This Is Going
The release of Gemini 3 Flash, the continued evolution of DeepSeek, and the strong performance of Kimi K2 Thinking are all pointing in the same direction: intelligence is no longer a single monolith at the top of your stack. It’s a layered fabric.
At the top, you have your expensive, opinionated, highly capable frontier models.
Beneath them, you can assemble a mesh of cheap reasoning models, each tuned for different parts of the substrate: tool use, orchestration, chunking, evaluation, background analysis. Some will be proprietary. Many will be open weights you can run locally. All of them will be good enough that it feels irresponsible to leave that layer as dumb glue.
The teams that win won’t be the ones who simply "upgrade" to the latest GPT or Gemini. They’ll be the ones who redesign their systems so that sub‑level intelligence is everywhere, quietly thinking in the background, making small but compounding decisions that the user never sees directly—but always feels.
If frontier models are the face of your AI product, sub‑level intelligence is the nervous system.
And the nervous system is where evolution does its best work.
Related posts
- DeepSeek V3.2 Speciale: The Open Source Thinking Model That Can’t Use Tools (And Why That’s a Feature)
- Kimi K2 Tool Calling: The Config Traps That Break Multi‑Tool Agents (and How to Fix Them)
- Reasoning Podcasts: AI Debates Where You Can Hear Them Think
Sources
- Artificial Analysis – τ²‑Bench Telecom: benchmarking agentic tool use across models.
- Moonshot / Kimi K2 Thinking provider docs and benchmark summaries.
- DeepSeek V3.2 and V3.2 Speciale technical reports and benchmark disclosures.
- NIST CAISI evaluation results on reasoning models.
- FelloAI consolidations of GPQA, HMMT, Humanity’s Last Exam and tool‑use benchmarks.
