## How to upper-bound the probability of something bad

Scott Alexander has a new post decrying how rarely experts encode their knowledge in the form of detailed guidelines with conditional statements and loops—or what one could also call flowcharts or expert systems—rather than just blanket recommendations.  He gives, as an illustration of what he’s looking for, an algorithm that a psychiatrist might use to figure out which antidepressants or other treatments will work for a specific patient—with the huge proviso that you shouldn’t try his algorithm at home, or (most importantly) sue him if it doesn’t work.

Compared to a psychiatrist, I have the huge advantage that if my professional advice fails, normally no one gets hurt or gets sued for malpractice or commits suicide or anything like that.  OK, but what do I actually know that can be encoded in if-thens?

Well, one of the commonest tasks in the day-to-day life of any theoretical computer scientist, or mathematician of the Erdős flavor, is to upper bound the probability that something bad will happen: for example, that your randomized algorithm or protocol will fail, or that your randomly constructed graph or code or whatever it is won’t have the properties needed for your proof.

So without further ado, here are my secrets revealed, my ten-step plan to probability-bounding and computer-science-theorizing success.

Step 1. “1” is definitely an upper bound on the probability of your bad event happening.  Check whether that upper bound is good enough.  (Sometimes, as when this is an inner step in a larger summation over probabilities, the answer will actually be yes.)

Step 2. Try using Markov’s inequality (a nonnegative random variable exceeds its mean by a factor of k at most a 1/k fraction of the time), combined with its close cousin in indispensable obviousness, the union bound (the probability that any of several bad events will happen, is at most the sum of the probabilities of each bad event individually).  About half the time, you can stop right here.
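A quick empirical illustration of both tools (a sketch only; the exponential distribution and all the numbers here are arbitrary choices):

```python
import random

# Markov's inequality: for nonnegative X with mean mu,
# Pr[X >= k*mu] <= 1/k.  Check it empirically on an
# exponential distribution with mean 1.
random.seed(0)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
k = 5
empirical = sum(x >= k for x in samples) / n  # true value is about e^-5
markov_bound = 1 / k
assert empirical <= markov_bound

# Union bound: Pr[A_1 or ... or A_m] <= sum_i Pr[A_i],
# with no independence assumption needed.  E.g., m bad events
# of probability p each fail together with probability <= m*p.
m, p = 10, 0.01
union_bound = m * p  # = 0.1, however the events are correlated
```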

Step 3. See if the bad event you’re worried about involves a sum of independent random variables exceeding some threshold. If it does, hit that sucker with a Chernoff or Hoeffding bound.
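A minimal sketch of Step 3 in action, comparing the empirical tail of a sum of fair coin flips against the Hoeffding bound (all parameters are illustrative):

```python
import math
import random

# Hoeffding's bound: if X_1..X_n are independent and each lies in
# [0, 1], then Pr[sum - E[sum] >= t] <= exp(-2 t^2 / n).
# Check it on n fair coin flips.
random.seed(0)
n, t, trials = 500, 35, 5000
mean_sum = n * 0.5
exceed = 0
for _ in range(trials):
    s = sum(random.random() < 0.5 for _ in range(n))
    if s - mean_sum >= t:
        exceed += 1
empirical = exceed / trials
hoeffding = math.exp(-2 * t ** 2 / n)  # exp(-4.9), about 0.0075
assert empirical <= hoeffding
```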

Step 4. If your random variables aren’t independent, see if they at least form a martingale: a fancy word for a sum of terms, each of which has a mean of 0 conditioned on all the earlier terms, even though it might depend on the earlier terms in subtler ways.  If so, Azuma your problem into submission.
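A toy sketch of Step 4: the step sizes below depend on the history (so the increments aren’t i.i.d.), but each increment has conditional mean zero, so the sum is a martingale and Azuma applies (the parameters are arbitrary):

```python
import math
import random

# Azuma's inequality: if X_0, X_1, ... is a martingale with
# |X_i - X_{i-1}| <= c, then Pr[X_n - X_0 >= t] <= exp(-t^2 / (2 n c^2)).
# Here the step SIZE depends on the history, but each increment
# still has conditional mean 0: a martingale.
random.seed(0)
n, c, t, trials = 400, 1.0, 40.0, 5000
exceed = 0
for _ in range(trials):
    x = 0.0
    for _ in range(n):
        size = c if x >= 0 else c / 2  # history-dependent step size
        x += size * random.choice([-1, 1])
    if x >= t:
        exceed += 1
empirical = exceed / trials
azuma = math.exp(-t ** 2 / (2 * n * c ** 2))  # exp(-2), about 0.135
assert empirical <= azuma
```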

Step 5. If you don’t have a martingale, but you still feel like your random variables are only weakly correlated, try calculating the variance of whatever combination of variables you care about, and then using Chebyshev’s inequality: the probability that a random variable differs from its mean by at least k times the standard deviation (i.e., the square root of the variance) is at most 1/k².  If the variance doesn’t work, you can try calculating some higher moments too—just beware that, around the 6th or 8th moment, you and your notebook paper will likely both be exhausted.
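A small sanity check of Chebyshev on a uniform random variable (the distribution and the value of k are arbitrary choices for illustration):

```python
import math
import random

# Chebyshev: Pr[|X - mu| >= k*sigma] <= 1/k^2.  Only the variance is
# needed; for sums, pairwise uncorrelated terms already suffice.
random.seed(0)
trials, k = 100_000, 1.5
samples = [random.uniform(-1, 1) for _ in range(trials)]
mu = 0.0
sigma = math.sqrt(1 / 3)  # Var of Uniform(-1, 1) is 1/3
empirical = sum(abs(x - mu) >= k * sigma for x in samples) / trials
chebyshev = 1 / k ** 2  # about 0.444; the true tail here is about 0.134
assert empirical <= chebyshev
```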

Step 6. OK, umm … see if you can upper-bound the variation distance between your probability distribution and a different distribution for which it’s already known (or is easy to see) that it’s unlikely that anything bad happens. A good example of a tool you can use to upper-bound variation distance is Pinsker’s inequality.
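A minimal numeric check of Pinsker’s inequality on a pair of made-up three-outcome distributions:

```python
import math

# Pinsker's inequality: TV(P, Q) <= sqrt(KL(P || Q) / 2).  So if a bad
# event has probability p under Q, it has probability at most
# p + TV(P, Q) under P.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))             # = 0.1
kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
pinsker = math.sqrt(kl / 2)                                  # about 0.112
assert tv <= pinsker
```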

Step 7. Now is the time when you start ransacking Google and Wikipedia for things like the Lovász Local Lemma, and concentration bounds for low-degree polynomials, and Hölder’s inequality, and Talagrand’s inequality, and other isoperimetric-type inequalities, and hypercontractive inequalities, and other stuff that you’ve heard your friends rave about, and have even seen successfully used at least twice, but there’s no way you’d remember off the top of your head under what conditions any of this stuff applies, or whether any of it is good enough for your application. (Just between you and me: you may have already visited Wikipedia to refresh your memory about the earlier items in this list, like the Chernoff bound.) “Try a hypercontractive inequality” is surely the analogue of the psychiatrist’s “try electroconvulsive therapy,” for a patient on whom all milder treatments have failed.

Step 8. So, these bad events … how bad are they, anyway? Any chance you can live with them?  (See also: Step 1.)

Step 9. You can’t live with them? Then back up in your proof search tree, and look for a whole different approach or algorithm, which would make the bad events less likely or even kill them off altogether.

Step 10. Consider the possibility that the statement you’re trying to prove is false—or if true, is far beyond any existing tools.  (This might be the analogue of the psychiatrist’s: consider the possibility that evil conspirators really are out to get your patient.)

### 43 Responses to “How to upper-bound the probability of something bad”

1. michel Says:

This is my favorite Shtetl post in a long time! (Especially Steps 2–7.)

2. jonathan Says:

When I first saw the title of this post, I thought it was about upper bounding the probability of something bad happening to humanity, i.e. a continuation of the discussion of your skepticism of Steven Pinker’s optimism.

Then, after reading your introduction about Scott Alexander’s plea for expert system pseudocode, I thought you were going to try to encode your abilities as a Cassandra-like infallible oracle of catastrophe in a usable algorithm.

Now I’m left wondering about the connection between your day job and your side gig as a prophet of doom. Maybe they really do both draw heavily on the same underlying skill of estimating probabilities of bad things happening?

3. Person Says:

Haha…this is a great funny post; but it’s also actually helpful!

4. Scott Says:

jonathan #2: I’m always amused—as maybe the other Scott is too—when I see people on Reddit or Twitter or whatever who know me only from popular writings, or polemical blog posts about some issue, and are unaware that I even do anything else. I suspect that, apart from lack of time, this is precisely what prevents more scientists from writing for a broad audience: the fear that everyone will switch to seeing you as just a cable news personality, and will judge your life and work solely by the least guarded things you’ve ever said (which are, after all, the easiest things for anyone else to understand and to form opinions about). I can’t say this fear is unjustified. It’s just that, for some of us, papers and grant proposals and lecture notes don’t fully scratch the writing itch; we still feel an inexplicable urge to write in English and to have people read it.

I could write more blog posts related to my day job, but I always face a dilemma: once I’m writing something careful and mathematical that only my colleagues will understand, why not go all the way and make it into a paper, so that at least it “officially counts” as a contribution to knowledge? But then, of course, writing a paper is hard, so the less-than-optimal result is a huge stack of half-written and quarter-written papers.

As for the connection between my day job and my “side gig as a prophet of doom”: well, I’ve always said that theoretical computer science is the perfect field for worriers, since it’s all about the worst thing that an adversary could possibly do to you—the hardest input (for those designing algorithms), the craziest algorithm (for those proving lower bounds), etc. And the quest for reassurance about what could be done to you never terminates at anything short of mathematical proof—ultimately, we hope, even of enormities like P≠NP. I’ve even speculated that this character of our field, as a sort of mathematicized paranoia, might partly explain the extreme overrepresentation of Israelis in TCS (more, I think, than in any other part of math, CS, or physics), though of course contingent historical factors could also play a role.

Anyway, other than that, no connection.

5. Itai Bar-Natan Says:

Actually, the methods in this post can be applied to giving upper bounds for the probabilities of catastrophic climate change, malignant intelligence explosion, collapse of democracy, and so on. Specifically, step 1 is always applicable. Hopefully that’s some reassurance.

6. Lowerbounds enthusiast Says:

This is amazing and will be the flowchart I follow when I need to upper bound probabilities of bad events!

If you’re taking requests, I would love to see a flowchart for how to prove quantum query complexity lower bounds. I imagine it would start with easy ideas like reductions from Search and Parity, and end with “Step 9: Use the negative weights adversary method. Weep in despair.”

7. Carl Lumma Says:

Great post! Would be awesome if you could add an example of something falling through to, and being solved at, #6. It would be a much longer post but also quite illuminating.

8. Max Says:

Is there still a Tricki (or was it Triki? or something else?) The wiki of mathematical tricks. It seems not. This is exactly the sort of content that should have been on there, and since it no longer exists, would it be appropriate to post this as a question/answer on math.stackexchange.com or MathOverflow?

9. Scott Says:

Max #8: I don’t mind at all if anyone wants to repost this somewhere with attribution, but I don’t know the rules about posting already-existing content on MO or Math StackExchange, and for my part I think I’ll just leave it here.

10. jonas Says:

Here’s a probability puzzle that you might want to solve, although it’s not really phrased as a small probability you want to bound from above.

You play the following card game. At the start of the game, you get ten cards, the spades from 4 to king inclusive, and the dealer gets ten cards, the hearts from 4 to king inclusive. Each round, first you discard a card from your hand at your choice, then the dealer discards a card from his hand face up randomly. If your card is higher rank than the dealer’s card, then you win this round; if your card is lower rank than the dealer’s card, then the dealer wins this round; if they’re of the same rank, nobody gets a point. After ten rounds, when all the cards have been used up, the game ends. If you have won more rounds than the dealer during this game, then you win 10 cents; if the dealer has won more rounds than you, then you lose 10 cents; if you and the dealer won the same number of rounds, then nobody pays anything. Is there a strategy you can use to make the expected value of money you win positive?
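(For anyone who wants to experiment: here is a minimal Monte Carlo harness for this game. The “always discard your lowest card” strategy is only a placeholder, not a claimed solution to the puzzle.)

```python
import random

# Monte Carlo harness for the card game above.  `choose` picks your
# discard given your remaining hand and the dealer's remaining hand;
# the dealer then discards uniformly at random.  Payoff is in cents.
def play_once(rng, choose):
    mine = list(range(4, 14))    # ranks 4..13, with 11-13 = J, Q, K
    dealer = list(range(4, 14))
    my_rounds = dealer_rounds = 0
    while mine:
        c = choose(mine, dealer)
        mine.remove(c)
        d = rng.choice(dealer)
        dealer.remove(d)
        if c > d:
            my_rounds += 1
        elif c < d:
            dealer_rounds += 1
    if my_rounds > dealer_rounds:
        return 10
    if my_rounds < dealer_rounds:
        return -10
    return 0

def lowest_card(mine, dealer):   # placeholder strategy, not a solution
    return min(mine)

rng = random.Random(0)
trials = 20_000
ev = sum(play_once(rng, lowest_card) for _ in range(trials)) / trials
# `ev` is a noisy estimate; plug in other strategies to explore.
```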

11. Paul Says:

A minor tweak – before jumping to Azuma you might want to check whether the variables are “negatively associated” as occurs in balls-in-bins problems or choices of fixed numbers of elements without replacement (hypergeometric distribution).

12. Jacob Steinhardt Says:

Shouldn’t Step 10 really be steps 1b, 2b, 3b, 4b, etc.?

13. Tristram Says:

This is a great idea! My only suggestion would be to put Step 10 a bit sooner: not give up, but at least *consider* that what I’m trying to prove might be false or out of reach before trying so many things.

14. mjgeddes Says:

Yes, I’m convinced that the ‘rationality’ gig was being done all wrong in the past.

I think detailed algorithms on *what* and *how* to do things should be front and center! Theory and abstractions should be *secondary* to problem solving and practical *how-to* principles. David Deutsch (following Popper) had always made the point that explanatory knowledge should focus on *problem-solving*. It’s only now that I really understand what he meant.

The correct way to learn and understand is to start with the algorithm! Start with practical algorithmic steps. Theory and abstractions (ontology) are only tools for the algorithm and they should only arise or be used as a *consequence* of problem solving. Put the heuristics front-and-center and only turn to the abstract theory later and connect it to the heuristics!

What I’m suggesting is that *all* expert knowledge should be communicated in the way this post does…as a series of heuristic steps.

Take any field of knowledge. You’ll find you can always split it into three parts. Two theoretical and one practical. The two theoretical parts are the abstractions split into two levels (low-level and high-level abstractions). The practical one is the algorithmic or heuristic part.

Take mathematical logic for example. The 3-part split is as follows:
Set Theory (Low-level abstractions)
Category Theory (High-level abstractions)
Rules of Logic (PROBLEM SOLVING)

The point is that there’s always a PROBLEM SOLVING part and that’s the part one should always focus on! The problem solving part is always interventional (if one does *this*, *that* happens) and algorithmic (steps one takes)

This could be start of a revolution in learning, teaching and general rationality!

15. srp Says:

The idea of Scott as a cable-news personality is one of the most dada things I’ve read in a while.

16. Sasho Says:

I feel like chaining — basically Chernoff plus union bound, but while being quite clever how to use the union bound — should be in here. I’ve personally found it much more useful than say the local lemma (as pretty as the local lemma is), and it has had some cool applications in TCS.

17. Scott Says:

Jacob #12 and Tristram #13: I confess that the linear structure of my list isn’t nearly as important as it is in the psychiatrist’s case, where you’re (normally) trying only one antidepressant at a time and where each thing you try could take months. The order is a very rough guideline, but typically you’ll be jumping around in the list (and yes, sometimes straight to #10) based on your feelings about a specific problem. Also, of course, a solution will often involve a clever combination of as many as 5 or 6 items from the list, which is probably rarer in psychiatry.

18. Michael Says:

Shouldn’t there be step ~ 9.5: «see if you can change your algorithm to make the probability of bad events satisfy some weird recurrence that can be solved after reading ‘Concrete Mathematics’»?

19. Boaz Barak Says:

Very nice! I agree with Jacob that step 10 should be performed between each step, with the time devoted to seeing if your initial guess was false growing as more and more techniques fail to do it.

Of course my own procedure to solve these problems is very different:

Step 2: Ask them to explain to me the solution

🙂

20. jonas Says:

Boaz Barak: right, but Scott has already used that joke in https://www.scottaaronson.com/blog/?p=1981

21. Ashley Says:

Scott,

“Just between you and me: you may have already visited Wikipedia to refresh your memory about the earlier items in this list, like the Chernoff bound.”

I would have loved to have you as my manager. You are so good at restoring the confidence of the guy in front :-).

22. RandomOracle Says:

Great list! Just to add to step 6, the triangle inequality is also extremely useful when bounding variation distance 🙂

23. Scott Says:

RandomOracle #22: Indeed!

24. Steve Says:

I usually also incorporate something like the following into my “upper bound” loop: Check that I didn’t overlook some property, like convexity, monotonicity, or positivity, that might preclude a bad event entirely or sharpen my bound on its probability. Convexity especially is both subtle and powerful.

25. mjgeddes Says:

You could write a new book with these steps as the central theme, Scott, and then branch off into various CS topics! Expert knowledge presented as flow-charts, a new paradigm! As one or two others in the thread implied, this could actually be your best post ever.

Going back to the broader theme of rationality, I could be wrong, but I rather suspect that most advances come from tightly focusing on specific problems, rather than unfocused speculations or airy-fairy theorizing. The adage that ‘necessity is the mother of invention’ looms large.

As I mentioned, it seems that the secret to scientific advances might be to focus on ‘problem solving’ (defining specifically what you want to do then considering specifically how to do it). Then new theory and abstraction arises naturally as a consequence of the problem-solving.

This could be relevant for AGI as well! Perhaps one should center all knowledge around something like flow-charts and the notion of procedural memory, with *DOING* the core concern. Then ontologies (knowledge representations) are simply something like attachments or extensions to the main flow-charts. (A representation is just what’s useful for doing things in specific contexts).

26. Terence Tao Says:

Another surprisingly useful trick is this: when one has a large number of independent sources of random inputs in a random variable one is trying to take the expectation of, try swapping the inputs one at a time to some better random input (e.g. a gaussian instead of a Bernoulli input), using tools such as Taylor expansion to control the error of each swap. I know this method as the “Lindeberg exchange method”, but I believe it has a different name in TCS (which escapes me at the moment, though).
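(A toy numeric sketch of the swap Tao describes, comparing only the all-Bernoulli and all-Gaussian endpoints of the exchange; the gap between them is the telescoped sum of the per-swap errors, each controllable by Taylor expansion. The test function and parameters are arbitrary illustrative choices.)

```python
import math
import random

# Replace +/-1 inputs by standard Gaussians and compare
# E[f(S / sqrt(n))] for a smooth, bounded f at the two endpoints.
random.seed(0)
n, trials = 30, 20_000
f = math.cos  # smooth, bounded test function

def mean_f(num_gaussian):
    """Monte Carlo estimate with the first `num_gaussian` inputs swapped."""
    total = 0.0
    for _ in range(trials):
        s = sum(random.gauss(0, 1) if i < num_gaussian
                else random.choice([-1, 1]) for i in range(n))
        total += f(s / math.sqrt(n))
    return total / trials

all_bernoulli = mean_f(0)   # about cos(1/sqrt(n))^n
all_gaussian = mean_f(n)    # about exp(-1/2)
gap = abs(all_gaussian - all_bernoulli)  # small: O(1/n) plus noise
```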

27. Scott Says:

Terence #26: Thanks for that! I’ve certainly done things that sound sort of like that—e.g. swapping discrete for continuous random variables or vice versa depending on which is more convenient, and controlling the error when I do so—but I don’t know a TCS name for it. Does anyone else?

28. venky Says:

Re: your comment #4, can you speak more to how you are balancing learning and researching and life? E.g., are you still learning new stuff at the same pace as grad school or are you banking on what you learned back when? Of course research is learning but I feel like these inequalities would not be useful to someone who didn’t take the time to learn them thoroughly in grad school, but would be a great repository for someone who did but has forgotten them now. Do you periodically go back to grad school mode to learn new stuff?

29. Scott Says:

venky #28: I often go into “procrastination mode” where I just read obsessively about some topic because it’s addictive, or as a way of avoiding harder work. But these days, when I’m learning a new part of math or TCS in a really focused way, it’s almost always because of something I’m trying to do, and is very tailored to the purpose at hand. I hope someday I’ll be able to go back into “grad student mode.”

30. Patrick Says:

Scott #27: I think the TCS technique he’s referring to might be hybrid arguments in cryptography: https://en.wikipedia.org/wiki/Hybrid_argument_(Cryptography)

31. Ravi Boppana Says:

Terence #26, Scott #27: In the TCS world, I’ve seen this method referred to as the Replacement Method or the Hybrid Method. See for example Section 11.5 (the Berry-Esseen theorem) of Ryan O’Donnell’s book Analysis of Boolean Functions. There Ryan indeed replaces each Bernoulli bit with a Gaussian, one at a time.

32. Ravi Boppana Says:

I should have included a link to Ryan’s book in my post above:
https://www.contrib.andrew.cmu.edu/~ryanod/

33. Scott Says:

Patrick #30: What Terry described sounds closely related to hybrid arguments. But in a typical TCS hybrid argument, you change your original probability distribution step-by-step to one that’s extremely far from the original, and argue that the two distributions are nevertheless computationally indistinguishable (say, to all polynomial-time algorithms), under some cryptographic assumption.

34. Jon K. Says:

In insurance and finance, the bad event regulators want companies to avoid is insolvency. Lots of theorizing is done to calculate how much “capital cushion” is needed to avoid insolvency, at say a 99% likelihood level. (The regulations are far more complicated, but this level of detail is good enough for this discussion.)

Right now, actuaries are looking at how to quantify longevity risk for companies that have sold annuities, taken on pension plan liabilities, or have some other commitment that will keep paying people as long as they are alive.

There seem to be two different schools of thought when trying to calibrate the capital cushion as it relates to longevity risk. 1) Extrapolate historical improvement rates, and assume improvement trends and random fluctuation will be comparable to historical experience. 2) Throw out the historical trends, and look at the underlying science of biology. Try to understand how regenerative medicine, anti-aging technologies, synthetic biology, machine learning, and other technologies on the horizon might have an effect on life spans over the next couple decades.

As you can imagine, people can get very different answers depending on which kind of model they use. Any thoughts on dealing with “black swan” events? (I assume there is no need for a reminder about what happened during the financial crisis a decade ago, which were fueled, in part, by financial models that didn’t accurately reflect risk.)

35. Rational Feed – deluks917 Says:

[…] Bounding Probabilities by Scott Aaronson – Scott Aaronson gives step-by-step guidelines for bounding probabilities. References a lot of technical math. […]

36. Scott Says:

Jon K. #34: That’s not my field; the tail risks I deal with professionally are ones that can in principle be rigorously upper-bounded. For me, though, the basic question about the 2008 financial crisis has always been, to what extent was it caused by traders actually failing to understand systemic risk, and to what extent was it caused by them understanding the risks perfectly well, but hoping (in many cases, correctly) that they could stick someone else with the tab?

37. Douglas Knight Says:

Scott, what do you mean by “traders”? The salesmen who dealt with individual house buyers didn’t have any skin in the game, but pretty much everyone else involved, everyone I would call a “trader,” failed to stick other people with the tab. They handed a lot of it to other parties, but they kept a lot for themselves. The main exception is that Goldman Sachs stuck AIG with the tab, which almost didn’t work. And I think the timing of that suggests that they didn’t have a lot of foresight, just enough to keep ahead of everyone else.

More specifically, the companies that employed traders were stuck with the tab. The traders themselves got to keep their annual bonuses, except when they were in the form of company stock.

I think that if the traders saw it coming, they would have gotten out of housing in, say, 2006, like the retail sellers did. This probably would have taken them to some other part of finance, which might not have done any better in 2008, but that’s what I’d expect.

38. Sandro Says:

Thanks for this post; it has given me a focused guide to approaching some problems for which I was otherwise preparing to break out the old probability texts and flail around until I found something remotely applicable.

39. Aaron G Says:

Hi Scott. Per your response in comment #4, you state that you speculate that the “mathematicized paranoia” that is a characteristic of TCS may be the reason why Israelis are over-represented in that field, in comparison to other areas of math, physics, or CS.

I know you meant that comment to be half-joking, but my initial reaction was to whether Israelis are, in fact, over-represented in TCS in comparison to other branches of math or computer science.

If in fact this is true, I would have speculated that this may have been due to large numbers of Israelis being descended from Ashkenazi Jews from Russia (including very recent emigres from Russia or elsewhere in the former Soviet Union), and thus perhaps inheriting the academic mathematical traditions that produced the likes of Andrey Kolmogorov, a pioneer in the field of TCS among many other fields.

40. Yiftach Barnea Says:

Aaron G, I don’t know the history of TCS in Israel. However, looking at some of the main characters, especially in Jerusalem, none seems to be of Russian origin or to have any special connections to Kolmogorov. For example, Michael Rabin who I believe is one of the fathers of TCS in Israel, did his PhD in Princeton with Church.

41. Peter Says:

A couple of additions (for my version of the algorithm, anyway).
(3a) Is the bad event that something doesn’t show up as often as it should? Try Janson’s inequality.

(4a) It’s a martingale, the worst-case differences are too big for Azuma to help, but it looks like they should really happen very rarely. Try Freedman’s inequality; if that fails, skip straight to (8), as it (usually) means the probability of a bad event really is too big.

42. Aaron G Says:

To Yiftach, comment #40,

Thanks for looking into this. It looks like my speculation about the history of TCS in Israel is incorrect.

43. Anonymous Kitty Says:

Just do all the problems in Lugosi et al.’s “Concentration Inequalities.” Lots of inequalities, lots of concrete applications. Then you’ll know all the secrets of the universe.