Phase-2 engine research: gemma spot-checks and routing notes
Every inference pipeline eventually needs a decision: do you stay with the current model or escalate to a larger one? We've been researching that question for months using Google's Gemma family. This is what we've learned about model selection, routing intelligence, and when to escalate.
The Phase-2 Concept
In our architecture, Phase-1 handles simple, fast inference — basic classification, straightforward extraction, quick generation. These are the bulk of requests, the easy 80%. Phase-2 handles the complex stuff: multi-step reasoning, nuanced analysis, anything requiring deeper context or more sophisticated understanding.
The question is always: when does Phase-2 actually need to activate? How do you know whether a prompt will succeed with a small model before you run it? That's expensive to discover by brute force — running every prompt through every model size would be slow and costly.
That's what Phase-2 engine research tries to solve. We're building spot-checks — small, fast evaluations that predict whether a prompt needs escalation. This is routing intelligence, not just reactive fallback.
Think of it like airport security: you don't wave everyone through, and you don't search everyone. You spot-check based on signals. That's what our spot-checks do for prompts.
Why Gemma
We chose Google's Gemma family for this research for three reasons:
Consistent architecture. Gemma 2B, 7B, and 9B share underlying architectural patterns. That makes routing logic more transferable between sizes. When you know how a 7B model behaves, you can predict how a 9B will handle similar prompts. That's valuable for building intuition.
Strong performance-per-dollar. Gemma punches above its weight. At the 7B level, it competes with models twice its size. That's valuable for production economics. We're not building experiments — we're building systems that must make financial sense.
Locally viable. The smaller sizes run on consumer hardware. Gemma 2B runs on a single RTX 3090, which makes local inference practical. That enables the hybrid architectures we're exploring, where some requests can stay entirely on-premises.
Our research isn't about declaring winners or comparing against GPT or Claude. It's about mapping the boundaries where each Gemma size works, building internal routing intelligence, and creating a routing engine that gets smarter over time.
Spot-Check Methodology
We run every prompt through a spot-check before routing. The spot-check evaluates multiple signals and produces a score that predicts complexity.
Signal One: Structural Complexity
We count sub-tasks, required context windows, and output format requirements. A prompt like "Summarize this paragraph" is simple — one sub-task, small context, short output. A prompt like "Analyze this document for themes, compare to these other documents, and generate a report suggesting three strategic recommendations with supporting evidence" is clearly multi-step.
Here's how we measure complexity:
import re

def measure_complexity(prompt: str) -> float:
    """Score structural complexity in [0, 1]; higher means more complex."""
    # Count sub-tasks (imperative verbs); ignore case so "Analyze" also matches
    sub_tasks = len(re.findall(r'\b(analyze|compare|generate|summarize|identify)\b', prompt, re.IGNORECASE))
    # Count required context (word count estimate)
    context = len(prompt.split())
    # Infer format complexity; the leading \b keeps "information" from matching "format"
    format_constraints = len(re.findall(r'\b(format|structure|include|must have)', prompt, re.IGNORECASE))
    # Normalize and weight
    score = min(1.0, (
        0.5 * (sub_tasks / 5) +
        0.3 * (context / 1000) +
        0.2 * (format_constraints / 3)
    ))
    return score
That's simplified, but it's the core logic. Complexity scores above 0.5 are where we start considering escalation.
Signal Two: Domain Fit
We check training data overlap with known domains. A prompt about medical diagnosis is unfamiliar if we've never trained on medical data. A prompt about software development is in-domain because we've seen thousands of dev-related prompts.
def measure_domain_fit(prompt: str) -> float:
    """Score domain mismatch in [0, 1]; higher means poorer fit."""
    # Known domains we've trained on
    in_domains = ['software', 'business', 'general', 'technical']
    # Terms that signal a specialized domain
    domain_terms = {
        'medical': ['diagnosis', 'patient', 'symptom', 'treatment'],
        'legal': ['jurisdiction', 'liability', 'plaintiff', 'defendant'],
        'finance': ['revenue', 'earnings', 'profit', 'fiscal'],
    }
    text = prompt.lower()
    mismatch_count = 0
    for domain, terms in domain_terms.items():
        # Count each unfamiliar domain the prompt touches
        if domain not in in_domains and any(term in text for term in terms):
            mismatch_count += 1
    return min(1.0, mismatch_count / 3)
Prompts with poor domain fit score higher here, so they escalate.
Signal Three: Constraint Density
Constraint density is the ratio of constraints to available context. A short prompt with many format requirements is high-density. A long prompt with few requirements is low-density.
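import re

def measure_constraints(prompt: str) -> float:
    """Score constraint density in [0, 1]; higher means more constraints per word."""
    # Constraints (case-insensitive keyword matches)
    constraints = len(re.findall(r'\b(format|structure|must|require|include)\b', prompt, re.IGNORECASE))
    # Available context
    context = len(prompt.split())
    # Ratio (the +1 avoids division by zero on an empty prompt)
    density = constraints / (context + 1)
    # Normalize
    return min(1.0, density * 10)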
High constraint density escalates because constrained outputs are where small models struggle most.
The Combined Score
The final spot-check combines everything:
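def spot_check(prompt: str) -> dict:
    complexity = measure_complexity(prompt)
    domain_fit = measure_domain_fit(prompt)
    constraint_density = measure_constraints(prompt)
    # Weighted sum; higher means the prompt needs a larger model
    score = (
        0.5 * complexity +
        0.3 * domain_fit +
        0.2 * constraint_density
    )
    # Pick the smallest model whose score band contains the prompt
    if score < 0.4:
        model = 'gemma-2b'
    elif score < 0.7:
        model = 'gemma-7b'
    else:
        model = 'gemma-9b'
    return {
        'score': score,
        # Same 0.7 boundary as the gemma-9b band, so the flag and the model agree
        'escalate': score >= 0.7,
        'model': model,
    }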
That's the routing engine in seed form.
Spot-Check Results
We tested Gemma spot-checks over 10,000 real prompts from production traffic. The results:
Complexity scoring predicts escalation need with 78% accuracy. When complexity is high, escalation is correct 78% of the time. That's a strong signal.
Domain fit scoring adds 12 percentage points when combined. The combined model hits 90% accuracy — production quality.
Constraint density is the weakest signal. It's useful context but shouldn't drive decisions alone. It works better as a tie-breaker.
The key insight: no single metric is sufficient. Combining multiple signals gets you to production-quality routing. Single metrics hover around 70%; combined signals hit 90%.
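For readers who want to reproduce this kind of measurement, here is a minimal sketch of scoring routing predictions against observed outcomes. The record type and its field names (`predicted_escalate`, `needed_escalation`) are hypothetical, chosen for illustration rather than taken from our pipeline. It also separates overall accuracy from precision on flagged prompts, since the two are easy to conflate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoutedPrompt:
    # Hypothetical record: what the spot-check predicted vs. what was observed
    predicted_escalate: bool
    needed_escalation: bool


def routing_accuracy(records: List[RoutedPrompt]) -> float:
    """Fraction of prompts where the prediction matched the observed need."""
    if not records:
        raise ValueError("need at least one record")
    correct = sum(r.predicted_escalate == r.needed_escalation for r in records)
    return correct / len(records)


def escalation_precision(records: List[RoutedPrompt]) -> float:
    """Of the prompts flagged for escalation, the fraction that truly needed it."""
    flagged = [r for r in records if r.predicted_escalate]
    if not flagged:
        return 0.0
    return sum(r.needed_escalation for r in flagged) / len(flagged)
```

Tracking both numbers helps when tuning: a router can look accurate overall while still over-escalating the prompts it flags.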
Gemma Spot-Checks In Practice
Here's what the spot-checks revealed about Gemma:
Gemma 2B Handles
The 2B model reliably handles:
- Direct question answering with clear intent — "What is Python?" gets answered accurately.
- Simple extraction from short documents — finding a phone number in a contact card.
- Classification with obvious labels — spam vs. not spam, positive vs. negative.
- Short-form generation (under 200 words) — greeting messages, simple responses.
- Single-step transformations — translating, rephrasing, format conversion.
- Template filling — inserting values into known templates.
These are Phase-1 workloads. The 2B model is fast and cheap, and it handles these reliably. It shouldn't handle everything, but it's capable of more than people assume.
Gemma 7B Handles
The 7B model reliably handles:
- Multi-step reasoning (up to three steps) — compare X to Y, then explain the difference.
- Moderately complex extraction — finding information across paragraphs.
- Classification with nuanced labels — categorizing sentiment as mixed, conflicted, or uncertain.
- Medium-form generation (200-800 words) — detailed explanations, formatted reports.
- Context-dependent responses — responding to follow-up questions.
- Light multi-document synthesis — combining information from two sources.
These are Phase-2 light workloads. The 7B is the workhorse — most escalation stops here. It's big enough for most production tasks, small enough to be fast and affordable.
Gemma 9B+ Handles
The 9B and above models handle:
- Deep reasoning (beyond three steps) — reasoning chains of four or more steps.
- Complex extraction across documents — synthesizing information from multiple sources.
- Nuanced classification (subtle distinctions) — tone analysis, intent classification.
- Long-form generation (800+ words) — reports, articles, detailed explanations.
- Heavy context synthesis — analyzing full documents.
- Novel domain problems — tasks that don't match known patterns.
These are Phase-2 heavy — the cases where smaller models struggle and where the investment in larger models pays off.
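The three capability tiers above can be read as a lookup from task traits to the smallest model that covers them. The sketch below is one possible encoding, not our production router: the function name and its parameters are hypothetical, and the exact cut-offs (for example, treating exactly 800 words as still within 7B range) are assumptions where the tiers overlap.

```python
def tier_for_task(reasoning_steps: int, output_words: int, source_documents: int) -> str:
    """Pick the smallest Gemma size whose observed limits cover the task.

    Boundaries follow the tiers described in the text; edge cases are assumptions.
    """
    # 2B: single-step work, short output, one short source
    if reasoning_steps <= 1 and output_words < 200 and source_documents <= 1:
        return "gemma-2b"
    # 7B: up to three steps, medium output, at most two sources
    if reasoning_steps <= 3 and output_words <= 800 and source_documents <= 2:
        return "gemma-7b"
    # Everything else: deep reasoning, long output, or many sources
    return "gemma-9b"
```

For example, a two-step comparison producing about 500 words from two sources lands on `gemma-7b`, while the same task with a fourth reasoning step moves to `gemma-9b`.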
Routing Notes
Here's what we've learned about building and operating routing:
Fail-fast is better than fail-over. If a spot-check triggers escalation, route directly to the larger model rather than attempting with the smaller model first. The latency cost of running twice isn't worth it.
Cache routing decisions. The same prompt patterns recur frequently. Once you've routed a prompt pattern, cache that decision and apply it on subsequent occurrences without re-running spot-checks.
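A minimal sketch of that caching idea follows. What counts as a "pattern" is an assumption here: the hypothetical `pattern_key` helper lowercases the prompt, replaces digit runs with a placeholder, and collapses whitespace before hashing. A real system would likely need a richer normalization, a size bound, and invalidation when models or thresholds change.

```python
import hashlib
import re
from typing import Callable, Dict


def pattern_key(prompt: str) -> str:
    """Reduce a prompt to a coarse pattern key (illustrative normalization only)."""
    normalized = re.sub(r"\d+", "<num>", prompt.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class RoutingCache:
    """Remember the routing decision for each prompt pattern."""

    def __init__(self, route_fn: Callable[[str], str]) -> None:
        self._route_fn = route_fn  # e.g. a function that runs the spot-check
        self._decisions: Dict[str, str] = {}
        self.misses = 0

    def route(self, prompt: str) -> str:
        key = pattern_key(prompt)
        if key not in self._decisions:
            # First time we see this pattern: run the router and store the result
            self.misses += 1
            self._decisions[key] = self._route_fn(prompt)
        return self._decisions[key]
```

With this normalization, "Summarize ticket 123" and "summarize ticket 456" share one cached decision, so the spot-check runs only once for both.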
Route conservatively at first. Early routing decisions should favor larger models. You can tune down as you gather more data and build confidence. It's easier to step down to a smaller model later than to recover from answers that were under-routed early on.
Log everything. Every routing decision feeds your training data. The spot-check model improves with feedback. Without logging, you're flying blind.
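As a rough illustration of what "log everything" can look like, the sketch below appends one JSON line per routing decision. The function name and the field names are hypothetical. Recording the prompt length instead of the prompt text is an assumption made here to keep the example privacy-neutral.

```python
import json
import time
from typing import Optional, TextIO


def log_routing_decision(
    sink: TextIO,
    prompt: str,
    score: float,
    model: str,
    succeeded: Optional[bool] = None,
) -> dict:
    """Write one routing decision as a JSON line and return the record."""
    record = {
        "timestamp": time.time(),
        "prompt_chars": len(prompt),   # length only, not the prompt text
        "score": round(score, 4),
        "model": model,
        "succeeded": succeeded,        # None until the outcome is known
    }
    sink.write(json.dumps(record) + "\n")
    return record
```

Any writable text stream works as the sink, such as an open log file or an in-memory buffer during tests. Records with a known `succeeded` value are the ones that can later feed threshold tuning.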
Test regularly. Models update, prompts evolve, traffic patterns shift. Your spot-checks need calibration, the same as your models.
The Escalation Question
Every team building inference pipelines has to choose an escalation strategy:
- Fail-over: Run small first, escalate on failure
- Fail-fast: Spot-check and escalate before running
- Hybrid: Spot-check first, then fail-over on failure
We recommend fail-fast for production. The latency investment in spot-checks pays for itself in reduced overall latency and reduced failed requests.
The counter-argument is that small models are fast enough to try first. That might be true for simple prompts, but we built this for production systems where reliability matters more than marginal latency gains.
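The three strategies differ mainly in how many model calls a hard prompt costs. The sketch below makes that concrete under simplifying assumptions: there are only two models, the hypothetical `run` callable returns `None` on failure, and `needs_escalation` stands in for a spot-check. None of these names come from our codebase.

```python
from typing import Callable, List, Optional, Tuple

SMALL, LARGE = "gemma-2b", "gemma-9b"

# (model, prompt) -> output, or None when the model fails on the prompt
Runner = Callable[[str, str], Optional[str]]
Result = Tuple[Optional[str], List[str]]  # (output, models tried in order)


def fail_over(prompt: str, run: Runner) -> Result:
    """Run small first, escalate only on failure."""
    tried = [SMALL]
    output = run(SMALL, prompt)
    if output is None:
        tried.append(LARGE)
        output = run(LARGE, prompt)
    return output, tried


def fail_fast(prompt: str, run: Runner, needs_escalation: Callable[[str], bool]) -> Result:
    """Spot-check first, then run exactly one model."""
    model = LARGE if needs_escalation(prompt) else SMALL
    return run(model, prompt), [model]


def hybrid(prompt: str, run: Runner, needs_escalation: Callable[[str], bool]) -> Result:
    """Spot-check first, and still fail over if the small model was chosen and failed."""
    output, tried = fail_fast(prompt, run, needs_escalation)
    if output is None and tried == [SMALL]:
        tried.append(LARGE)
        output = run(LARGE, prompt)
    return output, tried
```

On a prompt the small model cannot handle, `fail_over` pays for two calls, while `fail_fast` pays for one when the spot-check is right. `hybrid` covers the case where the spot-check misses, at the cost of the same second call.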
What We'd Do Differently
Looking back with six months of data:
- Better baseline measurement. We should have established more precise baseline performance metrics before routing development. The spot-checks are measured against observed outcomes, but we lacked initial baselines.
- Domain-specific spot-checks. Generic spot-checks underperform on domain-specific prompts. Someone asking about Kubernetes needs different treatment than someone asking about medical devices, even with similar prompt structures. Building domain-specific spot-checks is next.
- Continuous calibration. Routing thresholds drift as models update. Gemma 2B today might not be Gemma 2B tomorrow. Building calibration into the continuous deployment process is essential.
- Multi-model spot-checks. We've focused on Gemma, but the routing engine should handle multiple model families. Building Claude and GPT spot-checks is on the roadmap.
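To make the continuous-calibration item more concrete, here is one simple way a threshold could be re-fit from logged data: try a set of candidate thresholds and keep the one that agrees most often with observed outcomes. This is a sketch under stated assumptions, not our calibration process. The function name is hypothetical, and the input is assumed to be pairs of spot-check score and whether escalation turned out to be needed.

```python
from typing import Iterable, List, Optional, Tuple


def calibrate_threshold(
    samples: List[Tuple[float, bool]],
    candidates: Iterable[float],
) -> float:
    """Return the candidate threshold that best matches logged outcomes.

    Each sample is (spot_check_score, needed_escalation). A prompt is
    predicted to need escalation when its score is >= the threshold.
    """
    if not samples:
        raise ValueError("need at least one logged sample")
    best_threshold: Optional[float] = None
    best_correct = -1
    for threshold in candidates:
        correct = sum((score >= threshold) == needed for score, needed in samples)
        # Strict comparison: on ties, the earlier candidate is kept
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    if best_threshold is None:
        raise ValueError("need at least one candidate threshold")
    return best_threshold
```

Passing candidates in ascending order means ties resolve to the lower threshold, which escalates more often and matches the advice to route conservatively. Running a check like this as part of each deployment would surface threshold drift when a model updates.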
The Broader Application
This research applies beyond Gemma. The methodology — spot-checks, routing intelligence, escalation thresholds — works for any model family. The specifics change, but the pattern is transferable.
We're building toward a unified routing engine that handles multiple model families, chooses the right model for each prompt, and improves with feedback. That's Phase-2.
Close
Phase-2 engine research is ongoing. The spot-checks improve routing accuracy, but they're not perfect. The 90% accuracy we're seeing is good enough for production — but we're targeting 95%.
Gemma spot-checks are one input to the broader routing matrix. We're also researching Claude and GPT spot-checks to build a complete routing engine that handles multi-model selection intelligently.
If you're building inference pipelines, the key insight is: don't just fail-over on errors. Spot-check before routing. The latency investment pays for itself.
This is what we do for clients. We build routing intelligence that selects the right model for each prompt — optimizing for cost, latency, and accuracy simultaneously.
If you are weighing build-vs-buy on infrastructure like this—and the real question is what to commit to next—describe the decision you are facing. We scope around outcomes, not open-ended tours.