When One AI Isn't Enough
Why we use multiple AI models to review each other's work, and what we learned from 47 architecture reviews.
The single-model problem
If you ask one AI model to review a system architecture, you'll get an answer. It'll be coherent, detailed, and confident. It will also reflect that particular model's biases, training data gaps, and reasoning quirks.
Claude tends toward cautious, nuanced responses. GPT demonstrates creative breadth but occasional inconsistency. Gemini excels at technical depth but may miss broader context. Each model has blind spots that users selecting any single model will inherit.
We wanted to know: if we asked multiple models the same question and compared their answers, would we catch things that any single model missed?
The multi-model approach
We developed a process we call MARP (Multi-AI Review Process) that queries five models with identical prompts and synthesizes their responses.
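The fan-out step can be sketched like this. Note the assumptions: `query_model` is a hypothetical stand-in for whatever provider SDKs you actually call, and the model names are illustrative labels, not real API identifiers.

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["claude", "gpt", "gemini", "grok", "deepseek"]

def query_model(model: str, prompt: str) -> list[str]:
    """Hypothetical stand-in for a real provider API call.

    Each vendor's SDK differs; a real implementation would dispatch
    to the right client and parse findings out of the response.
    """
    raise NotImplementedError

def fan_out(prompt: str, query=query_model) -> dict[str, list[str]]:
    """Send the identical prompt to every model in parallel and
    collect each model's reported findings, keyed by model name."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(query, m, prompt) for m in MODELS}
        return {m: f.result() for m, f in futures.items()}
```

Running the queries in parallel matters in practice: five sequential model calls would dominate review latency.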
The key insight is that when multiple independent models identify the same issue, it's more likely to be real. When only one model flags something, it might be a model-specific bias rather than a genuine concern.
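The consensus tally itself is simple: count how many models independently reported each finding. A minimal sketch, assuming findings have already been normalized into comparable strings (in practice that normalization is the hard part):

```python
from collections import Counter

def consensus_counts(findings_per_model: dict[str, set[str]]) -> Counter:
    """Tally how many models independently flagged each finding.

    Assumes findings are deduplicated strings; a set per model
    ensures one model can't vote for the same finding twice.
    """
    counts = Counter()
    for findings in findings_per_model.values():
        counts.update(set(findings))
    return counts
```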
What we found
Across 47 architecture reviews, the pattern was consistent:
High consensus (4-5 models agree)
~35% of findings. These almost always represent real issues: authentication gaps, missing error handling, scalability concerns that multiple perspectives independently identified.
Medium consensus (2-3 models agree)
~32% of findings. Worth investigating; these often represent legitimate concerns that require human judgment to assess relevance to the specific context.
Single model only
~33% of findings. Frequently model-specific biases or overly cautious suggestions; about two-thirds don't hold up under scrutiny.
The consensus-based scoring gave us a natural way to prioritize: high-consensus issues first, then medium, with single-model findings triaged last.
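That triage maps directly to the thresholds above (4-5 models agreeing = high, 2-3 = medium, 1 = single-model). A small sketch of the prioritization step:

```python
def tier(n_models: int) -> str:
    """Map a consensus count to the triage tiers used above."""
    if n_models >= 4:
        return "high"
    if n_models >= 2:
        return "medium"
    return "single"

def prioritize(counts: dict[str, int]) -> list[tuple[str, int, str]]:
    """Order findings so high-consensus issues surface first.

    Returns (finding, consensus_count, tier) tuples, most-agreed first.
    """
    return sorted(
        ((f, n, tier(n)) for f, n in counts.items()),
        key=lambda t: -t[1],
    )
```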
A real example
One architecture review for a financial services API illustrates the pattern:
High consensus findings
- 5/5 Missing rate limiting on authentication endpoints
- 4/5 No circuit breaker for downstream service failures
- 4/5 PII logging without redaction
Single-model findings (examples)
- 1/5 "Consider implementing GraphQL instead of REST" — style preference, not security issue
- 1/5 "Database connection pooling parameters may need tuning" — generic advice without evidence
The high-consensus findings were all genuine issues that needed addressing. The single-model findings were mostly noise.
Where this doesn't work
We want to be clear about the limitations:
It's not a replacement for human review
AI models can't know your organizational context, your team's capabilities, your risk tolerance, or your business constraints. They flag potential issues; humans decide which matter.
It generates false positives
Especially from single-model findings. About two-thirds of unique observations don't hold up under scrutiny. You need to budget time to triage.
Very new technologies get incomplete analysis
Models have training cutoffs. If you're using something released last month, the models may not have useful opinions.
Complex architectures may exceed context limits
If your architecture documentation is 50 pages, you'll need to decompose it into reviewable chunks, which introduces its own challenges.
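One simple decomposition strategy is to split the document on section boundaries and pack sections into chunks that fit a context budget. A rough sketch; the character budget is an illustrative stand-in for a real token limit, and splitting on blank lines is an assumption about how the document is structured:

```python
def chunk_sections(doc: str, max_chars: int = 8000) -> list[str]:
    """Split a long doc on blank-line section boundaries, packing
    consecutive sections into chunks no larger than max_chars.

    A single section longer than max_chars still becomes its own
    (oversized) chunk; real pipelines need a finer fallback split.
    """
    chunks, current = [], ""
    for section in doc.split("\n\n"):
        if current and len(current) + len(section) + 2 > max_chars:
            chunks.append(current)
            current = section
        else:
            current = current + "\n\n" + section if current else section
    if current:
        chunks.append(current)
    return chunks
```

The challenge the text alludes to remains: findings that span chunks (say, an auth flow described in one chunk and consumed in another) can be missed entirely.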
Questions we get asked
Why five models? Why not three, or ten?
Five gives enough diversity to identify consensus patterns without becoming unwieldy. With three, you get less signal. With ten, you get diminishing returns and higher costs.
Isn't this expensive?
Yes—roughly 4-5x the cost of a single-model query. The question is whether the improved quality justifies the cost. For architecture reviews where mistakes are expensive, it usually does.
What if all five models are wrong about the same thing?
It happens. If all models share the same knowledge gap or bias, consensus reinforces the error rather than catching it. This is why human review remains essential.
Which models do you use?
Currently Claude 3.5, GPT-4o, Gemini 1.5, Grok-2, and DeepSeek V3. We chose these for diversity in training data, corporate culture, and optimization objectives.
The takeaway
Single-model AI responses inherit that model's biases and blind spots. Multi-model review with consensus-based scoring produces measurably better results: 34% improvement in quality scores, 67% reduction in detectable bias.
Many minds, properly coordinated, produce better answers than any single mind alone. This principle, long established in human collective intelligence, applies equally to artificial intelligence systems.
Try multi-AI deliberation yourself
Onyx Legion queries Claude, GPT, Gemini, Grok, and DeepSeek simultaneously.