
When One AI Isn't Enough

Why we use multiple AI models to review each other's work, and what we learned from 47 architecture reviews.

  • 47 architecture reviews analyzed
  • 67% of findings confirmed by 2+ models
  • 34% improvement in quality scores

The single-model problem

If you ask one AI model to review a system architecture, you'll get an answer. It'll be coherent, detailed, and confident. It will also reflect that particular model's biases, training data gaps, and reasoning quirks.

Claude tends toward cautious, nuanced responses. GPT demonstrates creative breadth but occasional inconsistency. Gemini excels at technical depth but may miss broader context. Each model has blind spots that users selecting any single model will inherit.

We wanted to know: if we asked multiple models the same question and compared their answers, would we catch things that any single model missed?

The multi-model approach

We developed a process we call MARP (Multi-AI Review Process) that queries five models with identical prompts and synthesizes their responses:

1. Initial Query: the same prompt is sent to 5 models independently
2. Cross-Examine: models review each other's responses
3. Synthesize: agreements and disagreements are identified
4. Score: findings are weighted by consensus
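The steps above can be sketched in a few lines. The stand-in model functions, finding labels, and the `marp_review` name below are all illustrative (the article doesn't publish its pipeline), and the cross-examination step is omitted for brevity; only the query, synthesize, and score steps are shown:

```python
from collections import Counter

# Stand-in "models": each returns a set of finding labels for a prompt.
# In a real pipeline these would be API calls to five different providers.
def model_a(prompt): return {"missing rate limiting", "pii logging"}
def model_b(prompt): return {"missing rate limiting", "no circuit breaker"}
def model_c(prompt): return {"missing rate limiting", "no circuit breaker", "pii logging"}
def model_d(prompt): return {"missing rate limiting", "prefer graphql"}
def model_e(prompt): return {"missing rate limiting", "no circuit breaker", "pii logging"}

MODELS = [model_a, model_b, model_c, model_d, model_e]

def marp_review(prompt):
    # Step 1: query every model independently with the identical prompt.
    responses = [m(prompt) for m in MODELS]
    # Steps 3-4: synthesize by counting how many models flagged each
    # finding, then score each finding by its consensus count.
    votes = Counter(f for findings in responses for f in findings)
    return sorted(votes.items(), key=lambda kv: -kv[1])

for finding, n in marp_review("Review this architecture ..."):
    print(f"{n}/5  {finding}")
```

With these stubs, "missing rate limiting" comes back as a 5/5 consensus finding while "prefer graphql" stays a 1/5 outlier, mirroring the scoring described above.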

The key insight is that when multiple independent models identify the same issue, it's more likely to be real. When only one model flags something, it might be a model-specific bias rather than a genuine concern.

What we found

Across 47 architecture reviews, the pattern was consistent:

High consensus (4-5 models agree)

~35% of findings

These almost always represent real issues: authentication gaps, missing error handling, and scalability concerns that multiple perspectives independently identified.

Medium consensus (2-3 models agree)

~32% of findings

Worth investigating. Often represent legitimate concerns that require human judgment to assess relevance to the specific context.

Single model only

~33% of findings

Frequently model-specific biases or overly cautious suggestions. About two-thirds don't hold up under scrutiny.

The consensus-based scoring gave us a natural way to prioritize: high-consensus issues first, then medium, with single-model findings triaged last.
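The triage rule can be written down directly. The function name and band labels are ours, but the thresholds mirror the three consensus bands above (assuming five models):

```python
def triage(consensus: int, n_models: int = 5) -> str:
    """Map a finding's consensus count to a review priority band."""
    if consensus >= n_models - 1:   # 4-5 of 5 models agree
        return "high"               # almost always real; address first
    if consensus >= 2:              # 2-3 models agree
        return "medium"             # worth investigating with human judgment
    return "low"                    # single-model; likely bias, triage last

assert triage(5) == "high"
assert triage(3) == "medium"
assert triage(1) == "low"
```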

A real example

One architecture review for a financial services API illustrates the pattern:

High consensus findings

  • 5/5 Missing rate limiting on authentication endpoints
  • 4/5 No circuit breaker for downstream service failures
  • 4/5 PII logging without redaction

Single-model findings (examples)

  • 1/5 "Consider implementing GraphQL instead of REST" — style preference, not security issue
  • 1/5 "Database connection pooling parameters may need tuning" — generic advice without evidence

The high-consensus findings were all genuine issues that needed addressing. The single-model findings were mostly noise.

Where this doesn't work

We want to be clear about the limitations:

It's not a replacement for human review

AI models can't know your organizational context, your team's capabilities, your risk tolerance, or your business constraints. They flag potential issues; humans decide which matter.

It generates false positives

Especially from single-model findings. About two-thirds of unique observations don't hold up under scrutiny. You need to budget time to triage.

Very new technologies get incomplete analysis

Models have training cutoffs. If you're using something released last month, the models may not have useful opinions.

Complex architectures may exceed context limits

If your architecture documentation is 50 pages, you'll need to decompose it into reviewable chunks, which introduces its own challenges.
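One workable decomposition, sketched below under our own assumptions (the article doesn't specify its chunking method), is to split the document on blank-line-separated sections and pack sections into chunks under a size budget. Note the cost this section warns about: a finding that spans two chunks can be missed entirely.

```python
def chunk_doc(text: str, max_chars: int = 8000) -> list[str]:
    """Split a long architecture document into reviewable chunks.

    Splits on blank-line-separated sections, then packs consecutive
    sections into chunks no larger than max_chars (a rough proxy for
    a model's context budget).
    """
    sections = [s for s in text.split("\n\n") if s.strip()]
    chunks, current = [], ""
    for sec in sections:
        if current and len(current) + len(sec) + 2 > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = sec
        else:
            current = current + "\n\n" + sec if current else sec
    if current:
        chunks.append(current)
    return chunks
```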

Questions we get asked

Why five models? Why not three, or ten?

Five gives enough diversity to identify consensus patterns without becoming unwieldy. With three, you get less signal. With ten, you get diminishing returns and higher costs.
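The diminishing returns can be made concrete with a toy calculation: if each model independently flags a genuine issue with probability p, the chance that at least k of n models reach consensus is a binomial tail. Real models are correlated, so this is illustrative only, and the p = 0.7 figure is assumed, not measured:

```python
from math import comb

def p_consensus(n: int, k: int, p: float) -> float:
    """P(at least k of n independent models flag an issue each flags w.p. p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With p = 0.7 per model, going from 3 to 5 models noticeably raises the
# chance of a 2+ model consensus on a real issue; 5 -> 10 adds far less.
for n in (3, 5, 10):
    print(n, round(p_consensus(n, 2, 0.7), 3))
```

Under these assumptions the 2+ consensus probability rises from roughly 0.78 at three models to about 0.97 at five, with only marginal gains beyond that.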

Isn't this expensive?

Yes—roughly 4-5x the cost of a single-model query. The question is whether the improved quality justifies the cost. For architecture reviews where mistakes are expensive, it usually does.

What if all five models are wrong about the same thing?

It happens. If all models share the same knowledge gap or bias, consensus reinforces the error rather than catching it. This is why human review remains essential.

Which models do you use?

Currently Claude 3.5, GPT-4o, Gemini 1.5, Grok-2, and DeepSeek V3. We chose these for diversity in training data, corporate culture, and optimization objectives.

The takeaway

Single-model AI responses inherit that model's biases and blind spots. Multi-model review with consensus-based scoring produces measurably better results: 34% improvement in quality scores, 67% reduction in detectable bias.

Many minds, properly coordinated, produce better answers than any single mind alone. This principle, long established in human collective intelligence, applies equally to artificial intelligence systems.

Try multi-AI deliberation yourself

Onyx Legion queries Claude, GPT, Gemini, Grok, and DeepSeek simultaneously.
