Risk management - The illusions of intuition in software engineering
Our engineering decisions feel logical. They're mostly vibes. When you pick a new framework, decide to refactor that crumbling service, or try to debug a 3am production fire, you're usually following a gut feeling dressed up as technical judgment. For low-stakes stuff - pizza toppings for the team lunch - that's fine. For decisions that determine your system's architecture for the next three years, gut feeling is just organized guessing. Here's what to use instead.
Expected value
Every technical decision has possible outcomes, each with a probability and a payoff - positive or negative. Multiply them, sum them up, and you get expected value (EV). That's it. That's the whole thing.
Say your team wants to rewrite an old microservice because the code is embarrassing to look at. Intuition says: obviously rewrite it. Let's do the math instead.
Estimate the probability the new implementation actually works out. Estimate how many hours of maintenance it saves over a year. Then estimate the probability it blows up, and what cleaning up that mess costs in engineering time. People avoid mathematically profitable refactors all the time - loss aversion makes the fear of wasted time about twice as heavy as the desire for cleaner code. Writing EV on a whiteboard strips a lot of that emotion away. You're no longer arguing about aesthetics. You're arguing about numbers.
Concrete example. Your notification service eats roughly 6 hours of dev time every two-week sprint - debugging flaky retries, manually fixing stuck jobs, explaining its quirks to new people. That's 6h × 26 sprints = 156 hours a year, or around $18,700 at a blended $120/h rate. You estimate the rewrite takes 3 weeks of one engineer's time - about 120 hours, $14,400. Your test coverage is decent, so you put the probability of a clean migration at 75%. If it goes sideways, you're looking at a week of rollback and firefighting - another 40 hours, $4,800.
Run it:
EV = (0.75 × $18,700) − (0.25 × $4,800) − $14,400 = $14,025 − $1,200 − $14,400 = −$1,575

Negative. Your gut said "obviously rewrite it." The math says wait. Maybe invest in better monitoring first to reduce those 6 hours. Maybe improve test coverage to push the success probability to 90% before attempting the migration. Either way, you now have a conversation instead of a coin flip.
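The arithmetic is trivial to wrap in a helper so the team can plug in its own numbers. A minimal sketch (the function name is mine; the figures are the notification-service example from above):

```python
def expected_value(p_success, gain, failure_cost, investment):
    """Expected value of a risky engineering bet, in dollars.

    p_success: estimated probability the work pays off (0..1)
    gain: annual savings if it succeeds
    failure_cost: cleanup cost if it fails
    investment: upfront cost of doing the work
    """
    p_failure = 1 - p_success
    return p_success * gain - p_failure * failure_cost - investment

# The notification-service rewrite from the text:
ev = expected_value(p_success=0.75, gain=18_700,
                    failure_cost=4_800, investment=14_400)
print(round(ev))  # -1575
```

Negative EV doesn't end the conversation; it tells you which input to attack first, usually p_success or the upfront cost.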
Base rate neglect when choosing technology
You read a blog post about a graph database that solves all your performance problems. A company ten times your size uses it. Your gut says the migration will go fine - maybe 80% chance of success by your deadline. Almost certainly, you're ignoring the base rate.
The actual question is: how often do complete database paradigm shifts at companies the size of yours succeed on schedule? If you dug through industry post-mortems, you'd find the number is much lower than what your optimism suggests. Every time someone on your team advocates for a new technology based on one success story at one company, ask for data from cases similar to yours - similar team size, similar codebase, similar runway. Without that, you're betting big on anecdote.
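One rough way to operationalize this is reference-class shrinkage: blend your inside-view estimate with the base rate, giving the base rate more weight when your evidence is a single blog post. This is a heuristic sketch, not a formal method, and the 0.3 weight below is an illustrative choice:

```python
def base_rate_adjusted(gut_estimate, base_rate, weight_on_evidence=0.3):
    """Shrink a gut-feel probability toward the reference-class base rate.

    weight_on_evidence is how much credence you give your inside view;
    0.3 is an illustrative default, not a derived constant.
    """
    return weight_on_evidence * gut_estimate + (1 - weight_on_evidence) * base_rate

# Gut says 80% the migration lands on schedule; suppose the reference
# class of similar migrations succeeds on time about 25% of the time.
print(round(base_rate_adjusted(0.80, 0.25), 3))  # 0.415
```

Even a crude blend like this drags an 80% gut call down toward the 40% range, which changes the EV math considerably.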
The sunk cost trap
Your team built a proprietary deployment tool. Six months of work. It's slow, it's fragile, and every new hire needs a week just to get it running. Then someone finds an open-source alternative that does the same job in a fraction of the time, with actual documentation.
The economic case is obvious. Those six months are gone whether you switch or not. The only real question is whether you want to spend the next year maintaining something you already don't like. Most teams stay with the homegrown tool anyway - not because it's better, but because throwing it away feels like admitting failure. That's the sunk cost fallacy. In engineering, it often comes bundled with NIH syndrome (Not Invented Here): the belief that abandoning your own code is somehow a defeat.
The fix is simple, if uncomfortable. Pretend you're starting today with no legacy constraints. Given both options, would you build this tool from scratch? If the answer is no, the right move is also obvious.
Bayesian thinking during outages
When something breaks in production, most teams converge on the first plausible explanation someone mentions on Slack. "It's the database again." The rest of the team follows that thread, ignoring metrics that point elsewhere, and wastes time they don't have.
Bayes' theorem gives you a better structure. Your working hypothesis should always carry a probability, and that probability updates as new evidence comes in. How often does the database actually go down? Does the CPU spike pattern match a DB problem or a bad deployment? A single error in the logs shouldn't push your probability of "database failure" from 10% to 100%. You move from 10 to 20, then maybe to 50, as you see connection timeouts rather than API timeouts. The mode shifts away from group panic and toward methodical hypothesis elimination. It's slower to describe and faster in practice.
Here's the math behind one update. Your priors: database failures cause roughly 15% of your outages, bad deployments roughly 55%. New evidence: the on-call engineer notices query latency is normal but the background job queue has grown to 50,000 items. If this were truly a database problem, you'd expect latency spikes, and a backed-up queue with normal latency would be unusual - call it 10% likely. If it were a bad deployment (the other candidate), a backed-up job queue fits perfectly - call it 80%.
P(bad deployment | backed-up queue)
= P(queue backs up | bad deployment) × P(bad deployment) / P(queue backs up)
= (0.80 × 0.55) / ((0.80 × 0.55) + (0.10 × 0.15))
= 0.44 / (0.44 + 0.015) ≈ 97%
One metric moved the probability from "we don't know" to "almost certainly the deployment." The team rolls back before anyone has touched the database. Incident resolved in 14 minutes instead of the usual hour of chasing ghosts.
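The same update can be written as a two-hypothesis helper (the function name is mine; the numbers are from the example above):

```python
def posterior(likelihood_h, prior_h, likelihood_alt, prior_alt):
    """P(H | evidence) via Bayes' theorem, for two competing hypotheses.

    likelihood_h:   P(evidence | H)
    prior_h:        P(H)
    likelihood_alt: P(evidence | alternative)
    prior_alt:      P(alternative)
    """
    numerator = likelihood_h * prior_h
    return numerator / (numerator + likelihood_alt * prior_alt)

# Bad deployment vs. database failure, given a backed-up job queue:
p = posterior(likelihood_h=0.80, prior_h=0.55,
              likelihood_alt=0.10, prior_alt=0.15)
print(round(p, 2))  # 0.97
```

The helper generalizes: feed it the next piece of evidence with the 0.97 as the new prior, and the estimate keeps sharpening or collapsing as metrics come in.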
Survivorship bias on engineering blogs
A streaming platform runs 1,000 microservices. You read the blog post. You start thinking your e-commerce app with 15,000 users needs the same architecture.
What you're seeing is survivorship bias in its purest form. You're reading about a team that had hundreds of engineers, a dedicated platform org, and years of gradual migration. You're not reading about the startup that tried the same thing, burned through its AWS budget on cross-service networking, and couldn't ship features fast enough to survive. That story doesn't get a tech blog post. The graveyard of failed architectural overhauls is completely invisible.
When someone presents a pattern as a guaranteed path to success, ask about the denominator. How many teams at our size, with our headcount, trying this approach, actually made it work? That number is almost always smaller than the blog posts suggest.
Kelly's criterion and technical debt
Probability understood, cognitive biases accounted for. Now here's the question every tech lead actually struggles with: how much sprint capacity do we allocate to paying down debt?
The two failure modes are common. One team ignores it entirely under business pressure until a production incident forces a week-long scramble. Another team pushes to freeze all feature work for a month and "fix everything" - which kills business momentum and usually doesn't work anyway.
Kelly's Criterion comes from gambling and investing. The key insight is that "full Kelly" - going all-in when you have an edge - is mathematically optimal in theory but catastrophically risky in practice. One bad outcome and you're out. Engineers who demand a full feature freeze to fix tech debt are playing full Kelly. It's too aggressive.
The formula is f* = (p × b − q) / b, where p is the probability your debt-reduction work actually pays off, q is the probability it doesn't (1 − p), and b is the ratio of gain to cost. Say you've done the EV math and concluded there's a 70% chance the refactor succeeds and saves the team 3x its investment in hours over the next year. b = 3, p = 0.70, q = 0.30.
f* = (0.70 × 3 − 0.30) / 3 = (2.1 − 0.30) / 3 = 1.80 / 3 = 60%

Full Kelly says commit 60% of your sprint capacity to the refactor. That's obviously insane - the product would stall. Professionals use a quarter to half of the Kelly number. Quarter Kelly here is 15%, half Kelly is 30%. That brackets the 15–20% range experienced tech leads usually land on, and now you have a derivation for it instead of an assertion.
The fractional version is more useful: concentrate effort, but always maintain a delivery margin. You keep shipping, you keep the codebase from rotting further, and you give yourself a defensible position in conversations with management - because you have math, not just feelings.
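As a sketch, the fractional-Kelly allocation from the derivation above (the function name and the quarter-Kelly default are my choices):

```python
def kelly_fraction(p, b, fraction=0.25):
    """Fractional Kelly stake: f* = (p*b - q) / b, scaled down for safety.

    p: probability the debt-reduction work pays off
    b: ratio of gain to cost (e.g. 3 means it returns 3x the hours invested)
    fraction: how much of full Kelly to actually commit (quarter Kelly default)
    """
    q = 1 - p
    full_kelly = (p * b - q) / b
    return full_kelly * fraction

# 70% success odds, 3x payoff ratio, quarter Kelly:
print(round(kelly_fraction(p=0.70, b=3), 2))  # 0.15
```

Note that full Kelly goes negative when p × b < q, i.e. when you have no edge - in that case the formula itself is telling you the refactor isn't worth any capacity yet.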
From there, actually making the case requires a few concrete steps.
Step one: price the bleeding
Before proposing any refactor, put a number on the current pain. Engineers describe code as ugly or confusing. That's not a business argument. Time and money are.
Count how many hours per sprint your team actually loses to that module - fighting it, working around it, debugging it. Add the time spent on production incidents it's caused. Multiply by your sprints per year - 26, if you run two-week sprints. That's the annual cost of leaving it alone. It's also the upper bound on what you can justify spending to fix it.
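A back-of-the-envelope version of this step, assuming two-week sprints and the blended $120/h rate used earlier:

```python
def annual_bleed_cost(hours_per_sprint, sprints_per_year=26, hourly_rate=120):
    """Annual dollar cost of living with a painful module.

    Defaults assume two-week sprints and a blended $120/h rate - swap in
    your own cadence and rate.
    """
    return hours_per_sprint * sprints_per_year * hourly_rate

print(annual_bleed_cost(6))  # 18720
```

That 6-hours-per-sprint module costs roughly $18,700 a year, which is the same ceiling the notification-service example used.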
Step two: honest estimation
How long will the fix actually take? Your first guess is too low. It always is.
Go back and look at three similar initiatives at your company. Check how far they overran their original estimates. If you typically miss by 100%, double your current number. That's your realistic cost. The point isn't to demoralize anyone - it's to avoid committing to a refactor that runs out of budget halfway through and leaves the codebase in a worse state than before.
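One way to mechanize the historical check: scale the raw estimate by your team's aggregate overrun ratio across those past initiatives. The past numbers below are made up for illustration:

```python
def calibrated_estimate(raw_hours, past_estimates, past_actuals):
    """Scale a raw estimate by the team's historical overrun factor.

    past_estimates / past_actuals: hours estimated vs. actually spent
    on a few comparable past initiatives.
    """
    overrun = sum(past_actuals) / sum(past_estimates)
    return raw_hours * overrun

# Three past initiatives, estimated vs. actual hours (illustrative):
print(calibrated_estimate(120, [80, 100, 60], [160, 210, 110]))  # 240.0
```

If your team historically ships at 2x its estimates, a "3-week" refactor is a 6-week refactor, and the EV calculation should use the 6-week number.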
Step three: assign a probability to the risk
Nothing is zero-risk. There's always a chance the refactor breaks something that was quietly holding everything together. What's the realistic probability of a clean migration? 80% if your test coverage is solid, maybe 50% if it isn't. The remaining 20–50% - a rollback, a bad deploy, several days of emergency patching - has a real cost in hours and team stress. Write that down too.
Step four: calculate
$EV = (P_{\text{success}} \times \text{Gain}) - (P_{\text{failure}} \times \text{Cost}_{\text{bugs}}) - \text{Cost}_{\text{investment}}$

If the result is negative, the math says leave it. If it's strongly positive, you have an argument that's much harder to dismiss than "the code is messy."
Prioritization in practice
Run your three or four biggest pain points through this process. Each ends up with an EV. Divide by the time required to execute. You now have a priority-ordered list based on return per hour of engineering effort.
The painful part: it rarely ranks what programmers find most aesthetically offensive. It ranks what's actually slowing down delivery and costing the most money. That's the whole point. Developer intuition optimizes for what bothers developers. This optimizes for what matters to the system.
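Putting the ranking together - a sketch with hypothetical pain points and precomputed EVs (all names and numbers below are illustrative):

```python
# Candidates as (name, expected value in dollars, effort in hours),
# each EV computed with the four-step process above.
candidates = [
    ("notification service", 9_000, 120),
    ("deploy tool", 22_000, 200),
    ("flaky test suite", 6_000, 40),
]

# Rank by return per hour of engineering effort, best first.
ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
for name, ev, hours in ranked:
    print(f"{name}: ${ev / hours:.0f}/hour")
# flaky test suite: $150/hour
# deploy tool: $110/hour
# notification service: $75/hour
```

Notice the biggest-EV item isn't first: the small, cheap fix to the test suite returns the most per hour, which is exactly the counterintuitive result the ranking is designed to surface.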
Happy calculating!