Most engineering organizations discover their fragile services the hard way — through production incidents, weekend pages, and post-mortems that end with "we knew this was a problem." The knowledge exists, scattered across incident channels, stale ADRs, and the instincts of senior engineers who have been burned before. But nobody synthesizes it.
A technical risk heat map changes this. It is a monthly automated report that pulls four distinct signal layers — incident history, codebase health metrics, architecture decision record (ADR) audits, and PR review friction — and produces a ranked fragility register for every service in your portfolio. The output is not a dashboard with vanity metrics. It is a prioritized list of where your next outage will probably come from, scored by signals that correlate with breakage rather than with size or complexity alone.
Why Fragility Matters More Than Complexity
Large codebases are not inherently fragile. Small services with accumulated neglect are.
Engineering teams habitually confuse size with risk. The 200,000-line monolith gets all the attention, while a 3,000-line billing service with zero tests, four unresolved TODO comments referencing a deprecated API, and a dependency on a library three major versions behind quietly accumulates brittleness.
Meta's Diff Risk Score research, published in August 2025[1], demonstrated that predictive models trained on historical incident correlation outperform those based on code complexity metrics alone. The signals that predict breakage are not lines of code or cyclomatic complexity — they are patterns of neglect, uncertainty, and unresolved decisions.
A technical risk heat map captures exactly these signals. It asks: which services have the highest concentration of deferred maintenance, unreviewed architectural choices, and review-cycle friction? Those are the services that will surprise you next.
Layer 1: Incident History Per Service
Past breakage is the strongest predictor of future breakage.
The first signal layer is the most straightforward: pull 90 days of incident data from your ITSM platform (PagerDuty, Opsgenie, Rootly, FireHydrant) and aggregate by affected service. But raw incident counts are misleading. A service with ten P4 informational alerts is not more fragile than one with two P1 outages.
The scoring formula weights severity and recency. P1 incidents in the last 30 days score 10 points each. P2s score 5. P3s score 2. P4s score 1. Every score halves for each additional 30-day period of age — an incident from five weeks ago counts half as much as one from last week. The result is a recency-weighted severity score that captures both the magnitude and the trend of service instability.
Critically, this layer also captures incident adjacency — when Service A's incident was caused by Service B's failure, both services accumulate score, but with different tags: "origin" versus "blast-radius victim." This distinction matters for prioritization.
| Severity | Base Score | 0-30 Days | 30-60 Days | 60-90 Days |
|---|---|---|---|---|
| P1 — Full outage | 10 | 10.0 | 5.0 | 2.5 |
| P2 — Degraded service | 5 | 5.0 | 2.5 | 1.25 |
| P3 — Minor impact | 2 | 2.0 | 1.0 | 0.5 |
| P4 — Informational | 1 | 1.0 | 0.5 | 0.25 |
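As a sketch, the decay rule in the table maps to a small scoring helper. The `Incident` shape here is an illustrative assumption, not the payload of any specific ITSM platform:

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Base scores from the severity table above.
const BASE_SCORE: Record<Severity, number> = { P1: 10, P2: 5, P3: 2, P4: 1 };

interface Incident {
  severity: Severity;
  ageInDays: number; // days since the incident occurred, 0-90
}

// Each full 30-day period halves the score:
// 0-30 days => 1.0x, 30-60 days => 0.5x, 60-90 days => 0.25x.
function scoreIncident(incident: Incident): number {
  const periods = Math.floor(incident.ageInDays / 30);
  return BASE_SCORE[incident.severity] * Math.pow(0.5, periods);
}

// A service's incident-history score is the sum over its 90-day window.
function incidentHistoryScore(incidents: Incident[]): number {
  return incidents.reduce((sum, i) => sum + scoreIncident(i), 0);
}
```

In practice you would feed this from your incident platform's export, tagged by affected service.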
Layer 2: Codebase Health — TODOs, Test Gaps, Dependency Age
Static signals extracted from the code itself.
The second layer mines the codebase directly. Three sub-signals combine into a codebase health score:
TODO/FIXME/HACK Concentration: Count annotated debt markers per thousand lines of code. A service with 0.5 TODOs per KLOC is maintained. One with 8 per KLOC is accumulating shortcuts nobody returns to fix. Weight TODO comments that reference specific issues or dates more heavily — they indicate known problems with deferred resolutions.
Test Coverage Gaps: Overall coverage percentage is a blunt instrument. Instead, measure coverage specifically on files that changed in the last 90 days. A service with 85% overall coverage but 20% coverage on recently-modified files is more fragile than one with 60% overall coverage that is consistently tested where it changes. Focus on the delta between overall and active-file coverage.
Dependency Staleness: For each service, calculate the average age of its direct dependencies compared to their latest available versions. A service pinned to a version 18 months behind current, especially for security-critical dependencies, carries compounding risk. Weight dependencies with known CVEs in the gap between pinned and current versions at 3x.
A minimal scorer combines the four sub-signals, each capped and weighted:

```typescript
// codebase-health-scorer.ts
interface CodebaseHealthSignals {
  todoConcentration: number; // TODOs per KLOC
  testCoverageGap: number; // overall% - active_file%
  staleDependencies: number; // avg months behind latest
  cveExposure: number; // count of CVEs in dep gap
}

function scoreCodebaseHealth(signals: CodebaseHealthSignals): number {
  const todoScore = Math.min(signals.todoConcentration / 10, 1) * 25;
  const coverageScore = Math.min(signals.testCoverageGap / 50, 1) * 30;
  const staleScore = Math.min(signals.staleDependencies / 24, 1) * 20;
  const cveScore = Math.min(signals.cveExposure / 5, 1) * 25;
  return Math.round(todoScore + coverageScore + staleScore + cveScore);
}
```

Layer 3: ADR Audit — Decisions That Were Never Revisited
Architecture Decision Records with "revisit in Q3" that nobody ever revisited.
Architecture Decision Records are powerful when maintained[2]. They become dangerous when abandoned. The third signal layer parses your ADR repository and flags two categories of risk:
Unresolved Revisit Markers: ADRs frequently include language like "revisit after migration completes," "temporary until we evaluate alternatives in Q3," or "accepted risk — reassess in 6 months." The agent scans for these temporal markers and checks whether any follow-up ADR or PR addressed them. An ADR from 18 months ago that says "revisit in Q2 2025" with no follow-up is an unresolved architectural bet that may no longer be valid.
Superseded-But-Not-Updated ADRs: When a newer ADR partially contradicts an older one without explicitly superseding it, teams operate on conflicting assumptions. The agent detects overlapping decision scopes across ADRs and flags pairs where the older record was never marked as superseded.
Each unresolved ADR tagged to a specific service increments that service's fragility score. The weight increases with the age of the unresolved decision — a 6-month-old "revisit" is a mild concern; a 2-year-old one is a landmine.
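One way to sketch the revisit-marker scan — the regex patterns and the age-based weighting curve below are illustrative assumptions, not a fixed specification:

```typescript
// Temporal markers that signal a deferred architectural decision.
// Patterns are assumptions; tune them against your own ADR corpus.
const REVISIT_PATTERNS: RegExp[] = [
  /revisit (?:in|after|once|by)/i,
  /temporary until/i,
  /reassess in/i,
  /accepted risk/i,
];

interface AdrFinding {
  file: string;
  ageInMonths: number;
  weight: number; // contribution to the owning service's fragility score
}

// An unresolved marker weighs more the longer it sits unaddressed:
// roughly 1 point at 6 months, capped at 4 points at 24+ months.
function scanAdr(file: string, text: string, ageInMonths: number): AdrFinding | null {
  const hasMarker = REVISIT_PATTERNS.some((p) => p.test(text));
  if (!hasMarker) return null;
  const weight = Math.min(ageInMonths / 6, 4);
  return { file, ageInMonths, weight };
}
```

A real implementation would also check for follow-up ADRs or PRs that resolve the marker before counting it.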
Layer 4: PR Review Signals — Friction as a Risk Indicator
The fourth and most subtle signal layer comes from analyzing PR review patterns. Not the content of the code — the dynamics of the review process itself. Three sub-signals indicate areas where the team is uncertain, confused, or struggling:
Review Cycle Count: PRs that go through more than five review cycles before merging indicate areas where requirements, implementation approach, or team understanding are misaligned. High cycle counts per service directory correlate with future defects because they reveal areas where the team lacks shared mental models.
Uncertainty Language in Reviews: Natural language analysis of review comments for phrases like "I think this is right but…", "not sure about this approach", "this might break", "we should probably", "let's revisit", and "I don't fully understand." Concentration of uncertainty language in reviews for a specific service is a leading indicator of knowledge gaps that produce bugs.
Time-to-First-Review: PRs that sit unreviewed for days in specific service directories often indicate that nobody feels confident reviewing that code. This ownership vacuum is itself a fragility signal — when the one person who understands the service goes on vacation, changes accumulate without meaningful review.
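The uncertainty-language sub-signal can be sketched as a simple phrase-match ratio. The phrase list is an assumption drawn from the examples above; a production version would use a proper NLP pass:

```typescript
// Hedging phrases that indicate reviewer uncertainty (illustrative list).
const UNCERTAINTY_PHRASES: string[] = [
  "i think this",
  "not sure",
  "might break",
  "we should probably",
  "let's revisit",
  "don't fully understand",
];

// Fraction of review comments containing at least one hedging phrase.
// A rising ratio for a service directory is a leading knowledge-gap signal.
function uncertaintyRatio(comments: string[]): number {
  if (comments.length === 0) return 0;
  const hedged = comments.filter((c) => {
    const lower = c.toLowerCase();
    return UNCERTAINTY_PHRASES.some((p) => lower.includes(p));
  }).length;
  return hedged / comments.length;
}
```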
| Healthy Review Signals | Fragile Review Signals |
|---|---|
| 1-2 review cycles per PR on average | 5+ review cycles, frequent request-changes loops |
| First review within 4 hours | First review delayed 2+ days |
| Confident language: "LGTM", "clean approach" | Hedging language: "I think this works…", "not sure" |
| Multiple qualified reviewers available | Single reviewer bottleneck, others decline |
| Consistent review quality across team members | Review depth varies wildly by who reviews |
Building the Fragility Register Output
The output of the monthly agent is a fragility register — a ranked table of every service with its composite score and per-layer breakdown. This is not a dashboard that lives on a TV screen nobody watches. It is a document delivered to engineering leadership with three sections:
- Red Zone (score 70-100): Services requiring immediate attention. Schedule dedicated remediation sprints or reduce deploy frequency until stabilized.
- Yellow Zone (score 40-69): Services with accumulating risk. Add to the next quarter's technical debt budget. Assign specific owners for the highest-contributing signal layer.
- Green Zone (score 0-39): Services operating within acceptable risk parameters. Monitor for trend changes.
The register includes a trend indicator for each service — whether its fragility score increased, decreased, or stayed flat compared to last month. A service that moved from 35 to 52 in one month deserves more attention than one that has been stable at 55 for six months.
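The zone bands and the month-over-month trend rule can be sketched together. The 5-point dead band for "flat" is an assumed threshold, not part of the register definition:

```typescript
type Zone = "red" | "yellow" | "green";
type Trend = "rising" | "falling" | "flat";

// Zone bands from the register: 70-100 red, 40-69 yellow, 0-39 green.
function zone(score: number): Zone {
  if (score >= 70) return "red";
  if (score >= 40) return "yellow";
  return "green";
}

// Month-over-month movement; a small dead band avoids flagging noise.
function trend(previous: number, current: number, deadBand = 5): Trend {
  const delta = current - previous;
  if (delta > deadBand) return "rising";
  if (delta < -deadBand) return "falling";
  return "flat";
}
```

A service at `zone(52) === "yellow"` with `trend(35, 52) === "rising"` is exactly the early-degradation case the register is built to catch.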
Data Sources Required
- PagerDuty / Opsgenie / Rootly — incident history with service tagging
- GitHub / GitLab — PR review metadata, cycle counts, comment text
- SonarQube / custom scripts — TODO/FIXME counts, test coverage per directory
- Dependabot / Renovate — dependency staleness and CVE exposure
- ADR repository — decision records with temporal markers
Signal Layer Weights (Recommended Starting Point)
- Incident History: 35% — strongest direct predictor of future incidents
- Codebase Health: 25% — captures accumulated neglect and maintenance debt
- PR Review Friction: 25% — leading indicator of knowledge gaps
- ADR Audit: 15% — captures strategic risk from unresolved decisions
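Under these weights, the composite score is a straight weighted sum. This sketch assumes each layer has already been normalized to a 0-100 scale:

```typescript
interface LayerScores {
  incidentHistory: number; // each layer normalized to 0-100
  codebaseHealth: number;
  prReviewFriction: number;
  adrAudit: number;
}

// Recommended starting weights: 35 / 25 / 25 / 15.
function compositeFragility(s: LayerScores): number {
  return Math.round(
    s.incidentHistory * 0.35 +
      s.codebaseHealth * 0.25 +
      s.prReviewFriction * 0.25 +
      s.adrAudit * 0.15
  );
}
```

After a few months of snapshots, these constants become the tunable parameters of your validation loop.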
Fragility Register Operating Rules
- Red Zone services cannot accept new feature work until the fragility score drops below 70. Adding features to fragile services compounds instability; stabilize first.
- Every service that moves from Green to Yellow requires an owner assignment within 5 business days. Trend direction matters more than absolute score; catch degradation early.
- The register must be reviewed in the monthly engineering leadership sync. A report nobody reads provides zero value; build it into existing cadences.
- Override requests for Red Zone feature work require VP-level approval with a written remediation plan. Exceptions should be deliberate and documented, not quietly normalized.
**How do you handle services with no incident history?**
No incidents does not mean no risk — it may mean insufficient monitoring. For services with zero incident history, increase the weight of codebase health and PR friction signals by 1.5x. Also flag services with no alerts configured as a separate monitoring gap category.
**What if teams game the TODO count by removing markers without fixing the underlying issue?**
Cross-reference TODO removal commits with actual code changes. If a commit only removes comment markers without modifying the surrounding code, flag it as cosmetic cleanup rather than genuine remediation. Track this as a separate integrity signal.
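A simple heuristic for that cross-reference can be sketched from the commit diff. The regex assumes C-style `//` markers; a real implementation would handle each language's comment syntax:

```typescript
// A commit that only deletes TODO/FIXME/HACK comment lines, with no
// other additions or removals, is cosmetic cleanup, not remediation.
function isCosmeticCleanup(removedLines: string[], addedLines: string[]): boolean {
  if (addedLines.length > 0 || removedLines.length === 0) return false;
  return removedLines.every((line) => /\/\/\s*(TODO|FIXME|HACK)/i.test(line));
}
```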
**How long until the heat map becomes predictive?**
You need at least three months of monthly snapshots to establish meaningful trends. After six months, you can run correlation analysis between fragility scores and subsequent incidents to validate and refine your signal weights. Most teams see strong predictive correlation by month four.
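That validation step can be as simple as a Pearson correlation between each month's fragility scores and the following month's incident counts. A sketch, assuming equal-length, non-constant series:

```typescript
// Pearson correlation coefficient between two equal-length series,
// e.g. last month's fragility scores vs. this month's incident counts.
// Returns NaN if either series is constant (zero variance).
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0;
  let dx2 = 0;
  let dy2 = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    num += dx * dy;
    dx2 += dx * dx;
    dy2 += dy * dy;
  }
  return num / Math.sqrt(dx2 * dy2);
}
```

A consistently positive coefficient over several monthly snapshots is the evidence that your weights are tuned to your organization's actual failure patterns.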
> The first monthly report flagged our payments service at 78 — the highest in the register. Nobody was surprised, but nobody had quantified it before. Having the number made it impossible to keep deprioritizing the remediation work. We got budget for a dedicated stabilization sprint within a week.
The technical risk heat map succeeds because it makes invisible fragility visible and quantified. Teams already know intuitively which services are brittle — the heat map gives that intuition a number, a trend line, and a framework for action. Start with incident history and codebase health signals, which require the least integration work. Add ADR audit and PR friction analysis in month two once you have baseline data. Within a quarter, you will have a fragility model tuned to your organization's actual failure patterns.
- [1] Diff Risk Score (DRS): AI-Aware Software Development — Meta Engineering (engineering.fb.com)
- [2] Architecture Decision Records (ADR) Process — AWS Prescriptive Guidance (docs.aws.amazon.com)
- [3] Master Architecture Decision Records: Best Practices for Effective Decision Making — AWS (aws.amazon.com)
- [4] The Modern Risk Prioritization Framework for 2026 — Safe Security (safe.security)