AI Native Builders

The Pre-Deploy Risk Score: AI That Reads the Room Before Every Release

Build an automated pre-deploy risk scoring agent that evaluates blast radius, incident history, on-call load, and timing signals to produce HOLD/PROCEED/WATCH verdicts with confidence scores before every production release.

Governance & Adoption · Intermediate · Dec 17, 2025 · 5 min read
[Image: A pre-deploy risk score dashboard visualizing blast radius, incident load, and timing signals]

Every engineering team has a deploy horror story. The Friday afternoon push that cascaded through three dependent services. The release that landed while two P1 incidents were already burning. The config change nobody realized touched a shared dependency used by fourteen microservices.

The pre-deploy risk score fixes this by building an automated checkpoint that fires the moment a deploy enters the queue. Instead of relying on gut checks and tribal knowledge about whether "now is a good time," this agent systematically pulls signals from across your infrastructure and team state to produce a single verdict: HOLD, PROCEED, or WATCH — each backed by a numerical confidence score and a plain-English explanation of what it found.

Why Deploys Fail Contextually, Not Just Technically

The code passes CI. The tests are green. And it still breaks production.

Most deployment failures in mature systems are not caused by broken code in isolation. They are caused by context collisions — the intersection of a technically valid change with an environment that was not ready for it. Consider a database migration that runs fine in staging but collides with an active A/B experiment that doubled write traffic on the affected table. Or a feature flag rollout that touches the same API surface as an ongoing incident remediation.

According to a 2025 analysis from Overmind, a substantial share of deployment-related incidents involved infrastructure dependencies that the deploying engineer was unaware of[1] — with estimates suggesting over half of contextual failures could be anticipated with better dependency visibility.[4] The code worked. The timing did not. A pre-deploy risk score addresses this directly by evaluating the deployment context, not just the deployment artifact.

  • 7-day: incident rate window
  • A/B tests: active experiment count
  • Blast radius: dependency graph depth
  • Time-to-Friday: weekend proximity score
  • P2 bugs: outstanding bug count
  • On-call load: engineer incident burden

The Six Signal Layers of a Pre-Deploy Risk Score

Each layer contributes a weighted sub-score to the final verdict.

  1. Active A/B Experiments

    The agent queries your experimentation platform (LaunchDarkly, Split, Statsig) for all running experiments that touch the same services or feature areas as the pending deploy. Each overlapping experiment increases the risk multiplier because deploy-induced variance corrupts experiment results and experiment-induced traffic patterns can amplify deploy side effects.

  2. On-Call Engineer Incident Load

    A deploy landing while the on-call engineer is already managing two active incidents means slower response if something goes wrong. The agent checks PagerDuty or Opsgenie for the current on-call roster and their open incident count over the past 48 hours. High fatigue scores trigger an automatic WATCH or HOLD.

  3. Blast Radius from Dependency Graph

    This is the heaviest signal. The agent parses your Terraform state files, CloudFormation stacks, or Kubernetes manifests to build a runtime dependency graph. It then traces which downstream services, databases, and queues are affected by the resources being modified. A change to a shared VPC security group has a fundamentally different blast radius than a change to a single Lambda function.[2]

  4. Seven-Day Incident Rate

    If the target service has experienced multiple incidents in the past week, deploying additional changes compounds instability. The agent pulls incident history from your ITSM tool and weights recent incidents by severity. A service with two P2s in the last three days gets a dramatically higher risk score than one that has been stable for months.

  5. Outstanding P2 Bugs

    Unresolved high-priority bugs indicate existing instability in the codebase. The agent checks Jira or Linear for open P2 bugs tagged against the services being deployed. Each outstanding bug signals unresolved brittleness that a new deploy could aggravate.

  6. Time-to-Friday and Calendar Signals

    Deploying at 4:47 PM on a Friday before a holiday weekend is a different proposition than deploying Tuesday at 10 AM. The agent calculates hours until end-of-business Friday, checks for company holidays, and factors in the geographic distribution of your on-call team. This is not superstition — it is staffing reality.
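Signal 4 above lends itself to a compact sketch. The severity weights, the linear recency decay, and the cap at 10 are illustrative assumptions, not values any particular ITSM tool prescribes:

```typescript
type Severity = "P1" | "P2" | "P3";

interface Incident {
  severity: Severity;
  ageDays: number; // days since the incident opened
}

// Illustrative weights: a P1 counts four times as much as a P3
const SEVERITY_WEIGHT: Record<Severity, number> = { P1: 4, P2: 2, P3: 1 };

/** Returns a 0-10 sub-score; recent, severe incidents dominate. */
function incidentRateSubScore(incidents: Incident[]): number {
  let raw = 0;
  for (const inc of incidents) {
    if (inc.ageDays > 7) continue; // outside the seven-day window
    // Linear recency decay: today = 1.0, seven days ago ≈ 0.0
    const recency = 1 - inc.ageDays / 7;
    raw += SEVERITY_WEIGHT[inc.severity] * recency;
  }
  // Cap at 10 so one terrible week saturates the signal
  return Math.min(10, Math.round(raw * 100) / 100);
}
```

The recency decay is what makes "two P2s in the last three days" score dramatically higher than the same incidents spread over months of history.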

[Figure: Pre-Deploy Risk Score agent workflow DAG. The agent fires on a deploy queue event, gathers six signal layers in parallel, then aggregates them into a weighted verdict.]

Building the Blast Radius Estimator from IaC Topology

How to extract dependency graphs from Terraform and CloudFormation for real-time risk scoring.

The blast radius estimator is the most technically involved component of the pre-deploy risk score, and it deserves its own treatment. The core idea is straightforward: parse your Infrastructure-as-Code state to build a directed acyclic graph of resource dependencies, then calculate how many nodes are reachable from the set of resources being modified.

For Terraform, the starting point is terraform graph or parsing the state file directly. The open-source blast-radius tool pioneered interactive visualization of these dependency graphs using d3.js.[2] For production risk scoring, you want the graph data without the visualization — pipe terraform graph -type=plan into a parser that extracts nodes and edges, then run a breadth-first traversal from the changed resources.[3]
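For illustration, a minimal parser for that DOT text might look like the following. The regex assumes the default quoted "a" -> "b" edge syntax Terraform emits, and the edges are stored reversed, since a blast-radius traversal walks from a resource to its dependents:

```typescript
interface IacGraph {
  nodes: Set<string>;
  // dependents[x] = resources that depend on x (reversed DOT edges),
  // which is the direction a blast-radius traversal needs
  dependents: Map<string, string[]>;
}

function parseTerraformDot(dot: string): IacGraph {
  const nodes = new Set<string>();
  const dependents = new Map<string, string[]>();
  // In `terraform graph` output, `"a" -> "b"` means a depends on b
  const edgeRe = /"([^"]+)"\s*->\s*"([^"]+)"/g;
  for (const match of dot.matchAll(edgeRe)) {
    const [, from, to] = match;
    nodes.add(from);
    nodes.add(to);
    const list = dependents.get(to) ?? [];
    list.push(from);
    dependents.set(to, list);
  }
  return { nodes, dependents };
}
```

Feed the resulting dependents map into the breadth-first traversal described next, seeded with the resources your plan modifies.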

For CloudFormation, the approach differs slightly. CloudFormation stacks expose DependsOn relationships explicitly, and you can query the stack's resource list via the AWS API. The key addition is cross-stack references — when one stack exports a value that another imports, the dependency is implicit but the blast radius is real. Your parser needs to follow Fn::ImportValue references across stack boundaries.[5]
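The cross-stack resolution step can be sketched as a pure function over simplified stack records. The Stack shape here is a hypothetical simplification of what you would assemble from ListExports/ListImports calls or from scanning templates for Fn::ImportValue:

```typescript
interface Stack {
  name: string;
  exports: Record<string, string>; // export name -> exported value
  imports: string[];               // export names this stack consumes
}

/** Returns edges [importerStack, exporterStack] for every resolved import. */
function crossStackEdges(stacks: Stack[]): [string, string][] {
  // Index every export name by the stack that owns it
  const exporterByName = new Map<string, string>();
  for (const s of stacks) {
    for (const exportName of Object.keys(s.exports)) {
      exporterByName.set(exportName, s.name);
    }
  }
  const edges: [string, string][] = [];
  for (const s of stacks) {
    for (const imp of s.imports) {
      const exporter = exporterByName.get(imp);
      // An import with no matching export would fail at deploy time;
      // here we simply skip it
      if (exporter && exporter !== s.name) edges.push([s.name, exporter]);
    }
  }
  return edges;
}
```

These implicit edges get merged into the same dependency graph as the intra-stack DependsOn relationships before scoring.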

The scoring formula weights direct dependents more heavily than transitive ones, with a decay factor at each hop. A change that directly affects 3 services and transitively touches 12 more scores differently than one that directly affects 12 services with no transitive reach.

blast-radius-scorer.ts
interface DependencyNode {
  resourceId: string;
  resourceType: string;
  directDependents: string[]; // transitive reach is derived by the BFS below
}

function calculateBlastRadius(
  changedResources: string[],
  graph: Map<string, DependencyNode>
): { score: number; affectedServices: string[]; depth: number } {
  const visited = new Set<string>();
  const queue: { id: string; depth: number }[] = [];
  let maxDepth = 0;

  // Seed with changed resources
  for (const id of changedResources) {
    queue.push({ id, depth: 0 });
    visited.add(id);
  }

  // BFS through dependency graph
  let score = 0;
  while (queue.length > 0) {
    const { id, depth } = queue.shift()!;
    const node = graph.get(id);
    if (!node) continue;

    // Decay: weight = 0.6^depth. Changed resources count at full
    // weight (1.0); each hop retains 60% of the previous hop's weight.
    const depthWeight = Math.pow(0.6, depth);
    score += depthWeight;
    maxDepth = Math.max(maxDepth, depth);

    for (const dep of node.directDependents) {
      if (!visited.has(dep)) {
        visited.add(dep);
        queue.push({ id: dep, depth: depth + 1 });
      }
    }
  }

  return {
    score: Math.round(score * 100) / 100,
    affectedServices: [...visited],
    depth: maxDepth,
  };
}

The Weighted Scoring Formula and Verdict Thresholds

How sub-scores combine into HOLD, PROCEED, or WATCH.

| Signal              | Weight | Low (0-3)        | Medium (4-6)       | High (7-10)        |
|---------------------|--------|------------------|--------------------|--------------------|
| Blast Radius        | 0.25   | ≤2 direct deps   | 3-8 direct deps    | >8 or cross-region |
| 7-Day Incident Rate | 0.20   | 0-1 incidents    | 2-3 incidents      | 4+ or any P1       |
| On-Call Load        | 0.15   | 0 open incidents | 1-2 open incidents | 3+ or recent P1    |
| A/B Experiments     | 0.15   | 0 overlapping    | 1-2 overlapping    | 3+ overlapping     |
| Outstanding P2 Bugs | 0.10   | 0-1 open         | 2-4 open           | 5+ open            |
| Time-to-Friday      | 0.15   | >24 hours        | 8-24 hours         | <8 hours           |

The final risk score is a weighted sum of normalized sub-scores, producing a value between 0 and 10. The verdict mapping below is a starting point — calibrate thresholds against your own incident history after 60–90 days of data:

  • PROCEED (score 0.0 – 3.9, confidence ≥ 70%, as a starting point): All signals are within acceptable ranges. The deploy can proceed with standard monitoring.
  • WATCH (score 4.0 – 6.4, or confidence 50–69%): Elevated risk detected. Deploy is allowed but the agent triggers enhanced monitoring — shorter canary windows, tighter rollback thresholds, and a Slack alert to the on-call channel.
  • HOLD (score 6.5 – 10.0, or any single signal at 9+): The agent blocks the deploy pipeline and pages the deploy author with a summary of which signals triggered the hold. A manual override requires two approvals.

The confidence score reflects data completeness. If the agent could not reach the experimentation API or the incident tracker timed out, confidence drops — and a low-confidence PROCEED can become a WATCH purely on uncertainty. All thresholds should be treated as initial heuristics and adjusted based on your false-positive and false-negative rates.
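Putting the weights and thresholds together, the aggregation step might be sketched like this. The signal names, the null convention for an unreachable source, and the neutral-high substitute of 5 are assumptions consistent with the degradation policy in the FAQ below:

```typescript
type Verdict = "PROCEED" | "WATCH" | "HOLD";

// Illustrative weights from the table above; sum to 1.0
const WEIGHTS: Record<string, number> = {
  blastRadius: 0.25,
  incidentRate: 0.2,
  onCallLoad: 0.15,
  abExperiments: 0.15,
  p2Bugs: 0.1,
  timeToFriday: 0.15,
};

function aggregate(
  subScores: Record<string, number | null>
): { score: number; confidence: number; verdict: Verdict } {
  const signals = Object.keys(WEIGHTS);
  let score = 0;
  let present = 0;
  for (const signal of signals) {
    const s = subScores[signal];
    // Missing signal: substitute a neutral-high 5 and dock confidence
    score += WEIGHTS[signal] * (s ?? 5);
    if (s !== null && s !== undefined) present++;
  }
  const confidence = Math.round((present / signals.length) * 100);

  // Any single signal at 9+ forces a HOLD regardless of the weighted sum
  const spike = signals.some((k) => (subScores[k] ?? 0) >= 9);

  let verdict: Verdict;
  if (spike || score >= 6.5) verdict = "HOLD";
  else if (score >= 4.0 || confidence < 70) verdict = "WATCH";
  else verdict = "PROCEED";

  return { score: Math.round(score * 100) / 100, confidence, verdict };
}
```

Note how a low-confidence run is promoted to WATCH even when the weighted score alone would permit a PROCEED.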

Calibration is required — these weights are starting points

The signal weights and verdict thresholds above are illustrative defaults based on common deployment failure patterns. Your infrastructure, team size, deployment cadence, and incident history will require different calibration. Teams that skip the 60–90 day calibration phase often find that blast radius is underweighted (needs to increase from 0.25 toward 0.30–0.35) and time-to-Friday is overweighted for teams with non-Friday incident patterns. Run in advisory mode first.

Without Pre-Deploy Risk Score
  • Engineers rely on gut feel about deploy timing

  • Blast radius unknown until something breaks

  • On-call fatigue is invisible to the deployer

  • Friday deploys depend on social pressure, not data

  • Experiment contamination discovered weeks later

  • Incident-during-deploy response is reactive

With Pre-Deploy Risk Score
  • Every deploy gets an objective risk assessment

  • Blast radius quantified from IaC dependency graphs

  • On-call state is a first-class deployment signal

  • Time-based risk is calculated, not debated

  • Experiment overlap flagged before code reaches prod

  • High-risk windows identified and held proactively

Wiring the Pre-Deploy Risk Score Into Your CI/CD Pipeline

Integration points for GitHub Actions, ArgoCD, and custom pipelines.

.github/workflows/deploy-gate.yml
name: Pre-Deploy Risk Gate
on:
  deployment:
    types: [created]

jobs:
  risk-score:
    runs-on: ubuntu-latest
    outputs:
      verdict: ${{ steps.score.outputs.verdict }}
      confidence: ${{ steps.score.outputs.confidence }}
    steps:
      - uses: actions/checkout@v4

      - name: Gather deploy context
        id: context
        run: |
          # Join filenames with commas so the value stays on one line in $GITHUB_OUTPUT
          echo "changed_services=$(gh api "repos/$REPO/pulls/$PR/files" --paginate | jq -r '.[].filename' | sort -u | paste -sd, -)" >> "$GITHUB_OUTPUT"

      - name: Calculate risk score
        id: score
        run: |
          npx deploy-risk-agent \
            --services "${{ steps.context.outputs.changed_services }}" \
            --pagerduty-token "${{ secrets.PD_TOKEN }}" \
            --launchdarkly-token "${{ secrets.LD_TOKEN }}" \
            --terraform-state "s3://infra-state/prod" \
            --jira-project "ENG" \
            --output json > risk-report.json

          echo "verdict=$(jq -r .verdict risk-report.json)" >> $GITHUB_OUTPUT
          echo "confidence=$(jq -r .confidence risk-report.json)" >> $GITHUB_OUTPUT

      - name: Gate decision
        if: steps.score.outputs.verdict == 'HOLD'
        run: |
          echo "::error::Deploy HELD — risk score exceeded threshold"
          exit 1

Calibrating the Score With a Feedback Loop

A risk scoring system is only as good as its calibration. After every deploy — whether it was scored PROCEED, WATCH, or an overridden HOLD — track the outcome. Did the deploy cause an incident within 24 hours? Was a rollback required? Did any experiment results get invalidated?

Store these outcomes alongside the original risk scores in a tracking table. After 60–90 days of data, you can run a logistic regression to validate whether your weights are predictive. Common findings from teams who have done this calibration:

  • Blast radius is almost always underweighted initially. Teams typically need to increase it from 0.25 to 0.30-0.35.[4]
  • Time-to-Friday is overweighted by teams with Monday-heavy incident patterns. Adjust based on your actual incident distribution by day-of-week.
  • On-call load becomes more predictive when you include the engineer's sleep schedule (derived from activity timestamps), not just their incident count.
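Before reaching for logistic regression, a simpler bucketed calibration check often surfaces the same problems. The band width and outcome shape here are illustrative; a well-calibrated score shows incident rates rising monotonically across bands, and inversions point at miscalibrated weights:

```typescript
interface DeployOutcome {
  riskScore: number;       // 0-10 score assigned at deploy time
  causedIncident: boolean; // incident within 24h, rollback required, etc.
}

/** Groups historical deploys into score bands and reports observed incident rates. */
function incidentRateByBand(
  outcomes: DeployOutcome[],
  bandWidth = 2.5
): { band: string; deploys: number; incidentRate: number }[] {
  const bands = new Map<number, { total: number; incidents: number }>();
  for (const o of outcomes) {
    // Clamp to the top band so a perfect 10 does not spill over
    const idx = Math.min(Math.floor(o.riskScore / bandWidth), 3);
    const b = bands.get(idx) ?? { total: 0, incidents: 0 };
    b.total++;
    if (o.causedIncident) b.incidents++;
    bands.set(idx, b);
  }
  return [...bands.entries()]
    .sort(([a], [b]) => a - b)
    .map(([idx, b]) => ({
      band: `${idx * bandWidth}-${(idx + 1) * bandWidth}`,
      deploys: b.total,
      incidentRate: b.incidents / b.total,
    }));
}
```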

Pre-Deploy Risk Score Implementation Checklist

  • Set up webhook listener for deploy queue events

  • Integrate experimentation platform API (LaunchDarkly/Split/Statsig)

  • Connect PagerDuty or Opsgenie for on-call state

  • Build Terraform/CloudFormation dependency graph parser

  • Wire incident history query from ITSM tool

  • Connect issue tracker for P2 bug counts

  • Implement time-to-Friday calculator with holiday awareness

  • Define initial signal weights and verdict thresholds

  • Set up CI/CD gate with override mechanism

  • Create outcome tracking for calibration data collection

  • Schedule monthly weight recalibration review
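The time-to-Friday item in the checklist above can be sketched as follows. Holiday awareness is reduced to a caller-supplied list of dates, timezones are simplified to UTC, and the band-to-score mapping mirrors the thresholds in the signal table:

```typescript
/** Hours until 18:00 UTC on the next Friday (0 if already past it). */
function hoursToFridayEob(now: Date): number {
  const FRIDAY = 5;
  const d = new Date(now);
  const daysAhead = (FRIDAY - d.getUTCDay() + 7) % 7;
  d.setUTCDate(d.getUTCDate() + daysAhead);
  d.setUTCHours(18, 0, 0, 0); // end-of-business, UTC for simplicity
  return Math.max(0, (d.getTime() - now.getTime()) / 3_600_000);
}

/** 0-10 sub-score: <8h to Friday EOB is high risk, >24h is low. */
function timeToFridaySubScore(now: Date, holidays: string[] = []): number {
  // Deploying right before a company holiday behaves like a Friday
  const tomorrow = new Date(now.getTime() + 24 * 3_600_000)
    .toISOString()
    .slice(0, 10);
  if (holidays.includes(tomorrow)) return 9;
  const h = hoursToFridayEob(now);
  if (h > 24) return 2;
  if (h >= 8) return 5;
  return 9;
}
```

A production version would also consult the on-call roster's timezones, as the article notes, since "end of business" is a staffing concept, not a clock concept.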

What if one of the signal sources is unavailable during scoring?

The agent should degrade gracefully. If a signal source times out or returns an error, that signal gets a score of 5 (neutral-high) and the overall confidence score drops proportionally. A low-confidence PROCEED is automatically promoted to WATCH. Never let a missing signal result in an unconditional PROCEED.

How do you handle monorepo deploys where everything changes at once?

For monorepos, scope the blast radius analysis to the build targets that actually changed, not the entire repository. Tools like Bazel, Nx, and Turborepo expose affected-project graphs that map directly to dependency analysis. The risk score should evaluate per-deployable-unit, not per-commit.

Should the risk score block deploys automatically or just advise?

Start advisory-only for the first 30 days. Let the team see the scores, disagree with them, and build intuition about what the system flags. Once the false-positive rate drops below 10%, switch HOLD verdicts to blocking with a two-person override. WATCH verdicts should always be advisory.

How does this interact with feature flags and progressive delivery?

Feature flags reduce blast radius by limiting exposure, which should lower the blast radius sub-score. If a deploy is behind a feature flag that starts at 1% rollout, the effective blast radius is 1% of the calculated value. The agent should query your feature flag configuration to apply this multiplier.
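That multiplier is a one-liner in practice. The FlagRollout shape is hypothetical; your feature-flag platform's API will differ:

```typescript
interface FlagRollout {
  enabled: boolean;
  rolloutPercent: number; // 0-100
}

/** Scales the raw blast radius by the fraction of traffic actually exposed. */
function effectiveBlastRadius(raw: number, flag: FlagRollout | null): number {
  if (!flag) return raw; // not flag-gated: fully exposed on release
  // A disabled flag means no traffic is exposed yet
  const exposure = flag.enabled ? flag.rolloutPercent / 100 : 0;
  return Math.round(raw * exposure * 100) / 100;
}
```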

We went from two deploy-related incidents per month to zero in the first quarter after rolling out the risk score. The HOLD verdicts caught three deploys that would have hit us during active experiments — each one would have cost us weeks of invalid experiment data.

Marcus Chen, Staff Platform Engineer, Series C Fintech

The pre-deploy risk score is not about slowing down your release velocity. Teams that implement it consistently report higher deploy frequency because engineers feel confident deploying more often when they know the system will catch bad timing. The score replaces anxiety with data, and replaces post-incident "we should have known" with pre-incident "the system flagged it."

Start with blast radius and incident rate — those two signals alone catch the majority of contextual deploy failures.[4] Add the remaining signals incrementally as you build API integrations with your toolchain. Within 90 days of calibration, you will have a risk model tuned to your specific infrastructure and team patterns.

Key terms in this piece
pre-deploy risk score, deployment risk assessment, blast radius analysis, CI/CD safety, deploy gate, infrastructure dependency graph, on-call fatigue, deployment automation
Sources
  1. The Register, "Overmind: The Tool That Maps Your Infrastructure's Blast Radius Before You Break It" (theregister.com)
  2. 28mm, "blast-radius: Interactive visualizations of Terraform dependency graphs" (github.com)
  3. IBM, "Blast Radius: Review the Impact of Changes in Your Terraform Files" (ibm.com)
  4. Overmind, "The Difference Between Terraform Plan and Overmind Blast Radius" (overmind.tech)
  5. Firefly, "Terraform Module Blast Radius: Methods for Resilient IaC in Platform Engineering" (firefly.ai)