The 3,601-Server Full Re-Scan: What 99.9% Completion Actually Means
We analyze the operational results of a full 3,601-server re-scan and examine what a 99.9% completion rate means — and what remains to be improved.
Terminology
| Term | Meaning |
|---|---|
| Full re-scan | Re-executing inspections on all registered MCP servers |
| Completion rate | The percentage of inspections that finished successfully |
| Processing failure | Cases where inspection did not complete due to timeout or source-fetch errors |
| Size-based diversion | A safety mechanism that routes oversized packages away from the standard processing pipeline |
| Continuous monitoring | Ongoing observation during inspection execution to verify the absence of hangs or stalls |
Lead
The reliability of an inspection platform is not measured by whether one scan works — it is measured by whether thousands run to completion without failure. A single successful inspection is a prerequisite, not an achievement. The operational question is: when thousands of inspections are executed continuously, what completion rate is achieved, what exceptions occur, and are those exceptions predictable and manageable?
This report analyzes the operational results of a full re-scan across 3,601 MCP servers and examines what 99.9% completion actually means — including what that number conceals.
Key Findings
- 3,597 out of 3,601 servers completed successfully (99.9% completion rate).
- Processing failures numbered 2 (0.06%): one timeout and one temporary source-fetch failure.
- Size-based routing diverted 2 servers (0.06%), functioning as designed.
- Across 31.2 hours of continuous monitoring, no global queue freeze or hang was observed.
- Even with high completion, the next quality bar is exception explainability — can every failure be explained and classified?
Background: Why Full Re-Scans Are Necessary
The Limitations of Point-in-Time Inspection
Software safety inspections are often performed once — at deployment time. This "point-in-time" approach has a fundamental limitation: it cannot reflect changes that occur after the inspection date. Package updates, dependency changes, maintainer transitions, and vulnerability database updates all invalidate the original assessment.
In the MCP ecosystem, this problem is particularly acute. Server update frequency is high, the ecosystem itself is evolving rapidly, and the shelf life of any single safety verdict is inherently limited. Periodic re-inspection is not a luxury — it is an essential operational practice that complements the initial deployment-time check.
The Operational Significance of "Full" Re-Scans
Scanning all servers, rather than only recently updated ones, carries specific technical and operational meaning. Selective re-scanning (e.g., only servers with detected changes) is more efficient but depends on the reliability of change detection itself. Full re-scanning is a comprehensive approach that does not rely on change-detection accuracy, structurally reducing the risk of missed updates.
However, full re-scanning is only operationally viable if three conditions are met: processing time stays within acceptable bounds, completion rate is sufficiently high, and exceptions are manageable. This report evaluates whether these conditions hold using real operational data.
Methodology
| Item | Value |
|---|---|
| Scope | 3,601 MCP servers |
| Window | 2026-03-26 to 2026-03-27 |
| Monitoring duration | 31.2 hours |
| Metrics | State distribution, completion rate, failure rate, size-diversion count |
All registered servers were inspected in a continuous run. Each inspection's terminal state (completed / size-diverted / failed) was recorded, along with failure cause classification, processing time distribution, and queue backlog patterns.
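The per-inspection record-keeping described above can be sketched as a small classification and aggregation step. This is an illustrative reconstruction, not the platform's actual code; the type and field names (`TerminalState`, `InspectionRecord`, `summarize`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TerminalState(Enum):
    COMPLETED = "completed"
    SIZE_DIVERTED = "size_diverted"
    FAILED = "failed"

@dataclass
class InspectionRecord:
    server_id: str
    state: TerminalState
    failure_cause: Optional[str] = None  # e.g. "timeout", "fetch_failure"

def summarize(records):
    """Aggregate terminal states into (count, percentage) pairs."""
    counts = {s: 0 for s in TerminalState}
    for r in records:
        counts[r.state] += 1
    total = len(records)
    return {s.value: (n, round(100 * n / total, 2)) for s, n in counts.items()}

# Reproducing the reported distribution: 3,597 completed, 2 diverted, 2 failed.
records = (
    [InspectionRecord(f"s{i}", TerminalState.COMPLETED) for i in range(3597)]
    + [InspectionRecord("d1", TerminalState.SIZE_DIVERTED),
       InspectionRecord("d2", TerminalState.SIZE_DIVERTED)]
    + [InspectionRecord("f1", TerminalState.FAILED, "timeout"),
       InspectionRecord("f2", TerminalState.FAILED, "fetch_failure")]
)
print(summarize(records))  # completed maps to (3597, 99.89)
```

Aggregating from explicit terminal states, rather than inferring success from the absence of errors, is what makes the ratios in the Results tables directly auditable.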
Results
Overall Summary
| State | Count | Ratio |
|---|---|---|
| Completed successfully | 3,597 | 99.89% |
| Size-based routing diversion | 2 | 0.06% |
| Processing failure | 2 | 0.06% |
| Total | 3,601 | 100% |
Failure Analysis
Both processing failures were caused by external factors:
| Failure Cause | Count | Description |
|---|---|---|
| Timeout | 1 | The package source did not respond within the time limit |
| Fetch failure | 1 | The package source was temporarily unavailable |
Neither failure was caused by an internal issue within the inspection infrastructure. Critically, the system correctly detected both timeout and fetch-failure conditions, skipped the affected inspections, and continued processing subsequent servers without interruption.
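The skip-and-continue behavior can be sketched as follows. The exception types and the `inspect_one` callback are hypothetical stand-ins for whatever the real pipeline raises on timeout or fetch failure; the point is that a per-server failure is classified and logged without halting the run.

```python
import logging

class FetchTimeout(Exception):
    """Package source did not respond within the time limit (assumed name)."""

class FetchUnavailable(Exception):
    """Package source was temporarily unavailable (assumed name)."""

def run_scan(servers, inspect_one):
    """Inspect each server; classify and skip failures, never stop the run."""
    failures = []
    for server in servers:
        try:
            inspect_one(server)
        except FetchTimeout:
            failures.append((server, "timeout"))
            logging.warning("timeout on %s; skipping", server)
        except FetchUnavailable:
            failures.append((server, "fetch_failure"))
            logging.warning("source unavailable for %s; skipping", server)
    return failures
```

Isolating each inspection in its own try/except block is what prevents the two external-cause failures from cascading into the other 3,599 inspections.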
Size-Based Diversions
The 2 size-based diversions occurred when packages exceeded the processing size limit. This diversion is a deliberate safety measure designed to prevent oversized packages from affecting the stability of other inspections. Diverted servers are logged for separate, individual handling.
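A minimal sketch of the routing decision, assuming a hypothetical byte limit and state names (the actual threshold is not stated in this report):

```python
SIZE_LIMIT_BYTES = 200 * 1024 * 1024  # hypothetical limit, not the real value

def route(server_id, package_size_bytes):
    """Divert oversized packages away from the standard pipeline."""
    if package_size_bytes > SIZE_LIMIT_BYTES:
        # Logged for separate, individual handling; counted apart from failures.
        return "size_diverted"
    return "standard_pipeline"
```

Because the decision is made before inspection begins, an oversized package never consumes standard-pipeline resources, which is what protects the other inspections' stability.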
Continuous Monitoring Results
Throughout the 31.2-hour monitoring window, the following was confirmed:
- No processing hangs or stalls occurred
- No significant queue backlog buildup was observed
- No substantial throughput degradation was detected
- No abnormal memory usage increase was recorded
These results demonstrate that the inspection infrastructure can sustain continuous processing at the 3,601-server scale without stability issues.
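A continuous-monitoring check along the dimensions above might look like the following sketch. The metric names and limits are illustrative assumptions, not the platform's actual thresholds.

```python
def check_health(snapshot, limits):
    """Return alerts for any monitored dimension outside its limit.

    snapshot: current readings, e.g. {"queue_backlog": 12,
              "throughput_per_h": 115, "rss_mb": 900}
    limits:   alerting thresholds for each dimension (assumed names).
    """
    alerts = []
    if snapshot["queue_backlog"] > limits["max_backlog"]:
        alerts.append("queue backlog buildup")
    if snapshot["throughput_per_h"] < limits["min_throughput_per_h"]:
        alerts.append("throughput degradation")
    if snapshot["rss_mb"] > limits["max_rss_mb"]:
        alerts.append("abnormal memory usage")
    return alerts
```

A run like the one reported here would correspond to `check_health` returning an empty list at every sampling interval across the 31.2-hour window.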
Discussion
1. 99.9% Is Evidence of Robustness, Not Perfection
A 99.9% completion rate is a strong result, but it is not 100%. From an operational quality perspective, what matters is not the failure count itself, but three questions: Can we explain why each failure occurred? Are failure patterns predictable? Do failures cascade to affect other inspections?
In this run, both failures were identified, classified as external-cause events, and confirmed to have no impact on other inspections. This satisfies the quality standard of "failures occurred, but they are explained, classified, and non-cascading."
The next quality milestone is automatic retry of failed inspections with success-rate tracking, and trend analysis across multiple re-scan cycles to enable preventive action.
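The retry-with-tracking milestone could be sketched as follows, assuming hypothetical names (`retry_failed`, `inspect_one`) and exponential backoff; the real retry policy would be tuned to the observed failure causes.

```python
import time

def retry_failed(failed_servers, inspect_one, max_attempts=3, backoff_s=1.0):
    """Retry each failed inspection and report the recovery rate.

    Transient external issues (timeouts, brief source outages) should
    mostly succeed on retry; a low recovery rate signals a deeper problem.
    """
    recovered = 0
    for server in failed_servers:
        for attempt in range(max_attempts):
            try:
                inspect_one(server)
                recovered += 1
                break
            except Exception:
                # Exponential backoff before the next attempt.
                time.sleep(backoff_s * (2 ** attempt))
    return recovered / len(failed_servers) if failed_servers else 1.0
```

Tracking this recovery rate across re-scan cycles is what turns isolated failure events into the trend data needed for preventive action.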
2. Size-Based Diversion Is a Safety Measure, Not a Failure
When designing operational metrics, the decision of whether to classify size-based diversions as "failures" or "normal routing" has significant implications. This report treats them as distinct from failures.
The rationale is straightforward: refusing to process oversized packages through the standard pipeline is a safety-first choice. Forcing processing of oversized packages could cause secondary problems — increased processing time, memory exhaustion, or degraded performance for other inspections. Setting a size limit and diverting to an alternative handling path is a design decision that protects overall system stability.
Reflecting this distinction in operational reports prevents false alarms: when the completion rate drops, operators need to know whether the drop reflects a real problem or normal behavior within design parameters.
3. 31.2 Hours Is a Starting Point for Stability Assessment
31.2 hours is sufficient for a single full-scan run, but as a stability assessment, it represents only one data point. The monitoring window was long enough to capture patterns invisible in short tests — queue backlog distribution, throughput consistency, and latency variance — but several dimensions require ongoing measurement:
- Day-of-week and time-of-day effects: Package source response times may vary by day or hour
- Cumulative processing effects: Whether repeated re-scans cause performance degradation over time
- Scale effects: How processing time and completion rate change as the server count grows to 4,000, 5,000, or beyond
4. Operational Viability of Periodic Re-Scans
Processing 3,601 servers in 31.2 hours demonstrates that weekly full re-scans are realistic. At a weekly cadence, each server's safety verdict would be based on information no more than one week old. While not real-time, this is a substantial improvement over monthly or quarterly inspection cycles, significantly reducing the time window during which changes go undetected.
Compared to industry standards in software supply-chain monitoring, weekly full re-scans represent a high inspection frequency. This cadence strikes a reasonable balance between detection speed and operational cost.
Operational Recommendations
- Separate failure metrics from diversion metrics: In operational reports, clearly distinguish processing failures from size-based diversions, tracking counts and trends for each independently.
- Standardize failure cause classification: Categorize failures into "timeout," "fetch failure," and "other," with defined response procedures for each category.
- Implement automatic retry: Introduce automatic retry for failed inspections to minimize the impact of transient external issues.
- Track long-term trends: Continue weekly full re-scans and monitor completion rate, failure rate, and processing time over time to detect gradual degradation.
Limitations
- Based on a single execution window (31.2 hours); reproducibility across multiple runs is not yet validated.
- Root-cause analysis of individual failures is covered in a separate report.
- Individual server information is excluded for publication; only aggregated data is reported.
- The impact of package-source response-time variation on overall processing time has not been quantitatively isolated.
Conclusion
The re-scan operation has moved past the "does it run?" stage and into the "can every exception be explained and managed?" stage. Processing 3,601 servers in 31.2 hours with 99.9% completion demonstrates that weekly periodic re-scans are operationally viable.
The next improvement axis is not marginal completion-rate gains, but establishing a reproducible improvement cycle for failure events and validating scalability as the server population grows. Inspection infrastructure reliability should be evaluated not just by the headline number, but by whether exceptions can be rapidly identified, classified, and resolved when they occur.
MCP Guard's inspection platform has demonstrated full re-scan completion across 3,601 servers in 31.2 hours.
