Research · Feb 22, 2026 · Abcas Security Research

What a Single Score Hides: Decomposing a 10-Server MCP Sample by Verification Perspective

We decomposed a 10-server MCP sample by verification perspective and found that the most useful outcome was not a single score, but a traceable explanation of where risk came from.

Terminology

Term                     | Meaning
Verification perspective | One inspection angle, such as provenance, behavior, or historical records
Inspection verdict       | PASS / WARN / BLOCK based on the collected evidence
Behavioral contradiction | A mismatch between declared behavior and observed behavior
Partial verification     | A case where only some evidence sources were available

Lead

Compressing MCP server review into a single score is convenient, but it removes the most operationally useful information: why the verdict was reached.

To see what gets lost, we decomposed a 10-server MCP sample by verification perspective. The central finding is that the real value of multi-perspective inspection is not cosmetic scoring precision. It is verdict traceability.

Key Findings

  1. In the 10-server sample, 7 servers required review or blocking.
  2. 5 of 10 servers were judged with only partial evidence available.
  3. Behavioral contradictions appeared in 2 servers and materially changed the final outcome.
  4. Historical inspection status split evenly: 5 servers were previously lower-risk and 5 were previously flagged for caution.
  5. The final verdict often depended on the interaction between multiple perspectives, not one dominant score.

Dataset

Item        | Value
Sample size | 10 MCP servers
Method      | Verdicts decomposed by verification perspective
Focus       | Traceability, contradiction impact, and partial-evidence handling
Scope       | Exploratory analysis

What the Single Score Would Hide

A server rated "70" can hide multiple very different realities:

  • a verifiable publisher but contradictory behavior
  • a consistent tool definition but unknown provenance
  • incomplete connectivity data but strong historical inspection evidence

Those are not variations of the same risk. They call for different responses.
That is why a single score is weak as an operational explanation layer.
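The lossy step is easy to see in code. The sketch below is purely illustrative: the perspective names and sub-scores are hypothetical (the article does not disclose the real inspection layers), but it shows how two servers with opposite risk profiles collapse to the same aggregate "70".

```python
from dataclasses import dataclass

# Hypothetical perspective names; the real inspection layers are not disclosed.
@dataclass(frozen=True)
class PerspectiveReport:
    """Per-perspective sub-scores (0-100) for one MCP server."""
    scores: dict

    def aggregate(self) -> float:
        """Collapse everything into one number -- the lossy step."""
        return sum(self.scores.values()) / len(self.scores)

# Verifiable publisher, but behavior that contradicts the declared contract.
verified_publisher_odd_behavior = PerspectiveReport(
    {"provenance": 95, "behavior": 40, "history": 75}
)
# Consistent behavior, but unknown provenance.
unknown_origin_clean_behavior = PerspectiveReport(
    {"provenance": 40, "behavior": 95, "history": 75}
)

# Both collapse to the same aggregate score.
assert verified_publisher_odd_behavior.aggregate() == 70.0
assert unknown_origin_clean_behavior.aggregate() == 70.0
```

The two reports demand different follow-up (behavioral review versus provenance review), yet the aggregate cannot distinguish them.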

What We Observed in the 10-Server Sample

Final verdict distribution

Verdict | Count | Ratio
PASS    | 3     | 30%
WARN    | 6     | 60%
BLOCK   | 1     | 10%

The sample shows that review pressure concentrated outside PASS. In practice, the question was not "how do we rank all servers on a single scale?" but "what kind of evidence pushed each one into review?"

Partial verification was common

Mode                 | Count | Ratio
Full verification    | 5     | 50%
Partial verification | 5     | 50%

Half of the sample lacked some expected evidence source. That matters because real-world inspection often operates under incomplete information. A model that assumes full evidence all the time is operationally unrealistic.
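One way to keep partial evidence usable is a conservative combiner: missing evidence never upgrades a verdict, and an otherwise-clean server with a gap stays in review rather than silently passing. This is a minimal sketch, not the disclosed inspection logic; the perspective names and the specific policy (cap clean-but-partial servers at WARN) are assumptions.

```python
# Order encodes severity: a later verdict dominates when combined.
SEVERITY = {"PASS": 0, "WARN": 1, "BLOCK": 2}

def combine(verdicts: dict) -> str:
    """Combine per-perspective verdicts into a final verdict.

    A value of None marks a perspective whose evidence could not be
    collected (partial verification).
    """
    worst = "PASS"
    partial = False
    for verdict in verdicts.values():
        if verdict is None:
            partial = True
        elif SEVERITY[verdict] > SEVERITY[worst]:
            worst = verdict
    # Assumed policy: missing evidence keeps a clean server in review
    # instead of letting it pass on incomplete information.
    if partial and worst == "PASS":
        worst = "WARN"
    return worst

# A gap in one evidence source downgrades PASS to WARN...
assert combine({"provenance": "PASS", "behavior": "PASS", "history": None}) == "WARN"
# ...but never softens a BLOCK that other evidence already justifies.
assert combine({"provenance": "PASS", "behavior": "BLOCK", "history": None}) == "BLOCK"
```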

Contradictions were rare but high-impact

Metric                                      | Value
Servers with behavioral contradictions      | 2/10
Previously lower-risk in historical records | 5
Previously caution in historical records    | 5

Only 2 servers showed contradictions, but both materially affected the final verdict. This makes contradiction detection an impact-heavy signal, even when it is not the most frequent one.
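A contradiction check of this kind can be sketched as a set comparison between the declared tool contract and observed calls, with an escalation step when they diverge. The tool names and the one-level escalation policy below are illustrative assumptions, not the actual detection mechanism.

```python
def final_verdict(base: str, declared_tools: set, observed_calls: set) -> str:
    """Escalate when observed behavior is not covered by the declared
    tool contract -- a behavioral contradiction."""
    contradiction = bool(observed_calls - declared_tools)
    if not contradiction:
        return base
    # Assumed policy: escalate one level; real policies may differ.
    return {"PASS": "WARN", "WARN": "BLOCK", "BLOCK": "BLOCK"}[base]

# Matching behavior leaves the verdict alone.
assert final_verdict("PASS", {"read_file"}, {"read_file"}) == "PASS"
# An undeclared call flips an otherwise-clean server into review.
assert final_verdict("PASS", {"read_file"}, {"read_file", "send_http"}) == "WARN"
```

This is why a low-frequency signal can still be impact-heavy: it does not need to fire often to change the final outcome when it does.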

Why Multi-Perspective Inspection Matters

1. It separates different risk types

Two WARN verdicts are not interchangeable. One may come from unclear provenance. Another may come from behavior that does not match the declared tool contract.

Without perspective-level decomposition, those two cases look similar in the UI while requiring different follow-up.

2. It keeps partial evidence usable

If a server cannot be fully inspected at one layer, a perspective-based model still allows a conservative verdict from the evidence that is available. That prevents operational dead zones where the only answer is "unknown."

3. It makes correction faster

When a verdict turns out to be wrong, the most useful question is not "was the score off?" It is "which evidence layer drove the wrong outcome?" Multi-perspective inspection shortens that diagnosis path.
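If perspective-level verdicts are preserved, the diagnosis step reduces to asking which layer carried the most severe verdict. A minimal sketch, again with hypothetical perspective names:

```python
def driving_layer(verdicts: dict) -> str:
    """Return the perspective whose verdict determined the final outcome,
    so a misclassification can be corrected at that layer."""
    severity = {"PASS": 0, "WARN": 1, "BLOCK": 2}
    return max(verdicts, key=lambda p: severity[verdicts[p]])

# Here the WARN came from behavior, so that is the layer to re-examine.
assert driving_layer({"provenance": "PASS", "behavior": "WARN", "history": "PASS"}) == "behavior"
```

With a single aggregated score, the same diagnosis requires re-running the whole inspection instead of inspecting one layer.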

Operational Implications

The practical value of this model is not abstract. It changes how teams respond:

  1. For WARN handling: teams can distinguish review for provenance from review for runtime behavior.
  2. For audits: approval decisions can cite the actual evidence layer that justified them.
  3. For tuning: misclassifications can be corrected at the layer that caused them, instead of reworking the entire verdict model.

In other words, explainability is not a reporting luxury. It is part of operational quality.

What This Article Does Not Claim

This sample is too small to prove that every deployment program should use exactly the same weighting or exactly the same number of inspection layers.

The narrower conclusion is enough:

  • in this sample, important verdict changes were not well explained by a single aggregated number
  • perspective-level evidence made the outcomes easier to defend and easier to correct

Limitations

  1. The sample size is 10, so population-level statistical claims are limited.
  2. This article does not attempt to optimize weighting across perspectives.
  3. Inspection implementation details remain undisclosed.
  4. Time-series change tracking is outside the scope of this article.

Conclusion

The main weakness of a single score is not that it is imprecise. It is that it hides the structure of the decision.

In this 10-server MCP sample, the useful output was not one score per server. It was a verdict that could be traced back to provenance, behavior, historical evidence, and partial-information handling. That traceability is what makes review, correction, and organizational trust possible.


MCP Guard keeps verdicts explainable by preserving evidence across multiple verification perspectives.