The Existential Risk of Math Errors - Gwern.net
How big is this upper bound? Mathematicians have often made errors in proofs. But it’s rarer for ideas to be accepted for a long time and then rejected. But we can divide errors into 2 basic cases corresponding to type I and type II errors:

1. Mistakes where the theorem is still true, but the proof was incorrect (type I)
2. Mistakes where the theorem was false, and the proof was also necessarily incorrect (type II)

Before someone comes up with a final answer, a mathematician may have many levels of intuition in formulating & working on the problem, but we’ll consider the final end-product where the mathematician feels satisfied that he has solved it. Case 1 is perhaps the most common case, with innumerable examples; this is sometimes due to mistakes in the proof that anyone would accept is a mistake, but many of these cases are due to changing standards of proof. For example, when David Hilbert discovered errors in Euclid’s proofs which no one noticed before, the theorems were still true, and the gaps more due to Hilbert being a modern mathematician thinking in terms of formal systems (which of course Euclid did not think in). (David Hilbert himself turns out to be a useful example of the other kind of error: his famous list of 23 problems was accompanied by definite opinions on the outcome of each problem and sometimes timings, several of which were wrong or questionable5.) Similarly, early calculus used ‘infinitesimals’ which were sometimes treated as being 0 and sometimes treated as an indefinitely small non-zero number; this was incoherent and strictly speaking, practically all of the calculus results were wrong because they relied on an incoherent concept - but of course the results were some of the greatest mathematical work ever conducted6 and when later mathematicians put calculus on a more rigorous footing, they immediately re-derived those results (sometimes with important qualifications), and doubtless as modern math evolves other fields have sometimes needed to go back and clean up the foundations and will in the future.7

...

Isaac Newton, incidentally, gave two proofs of the same solution to a problem in probability, one via enumeration and the other more abstract; the enumeration was correct, but the other proof totally wrong and this was not noticed for a long time, leading Stigler to remark:

...

TYPE I > TYPE II?
“Lefschetz was a purely intuitive mathematician. It was said of him that he had never given a completely correct proof, but had never made a wrong guess either.”
- Gian-Carlo Rota13

Case 2 is disturbing, since it is a case in which we wind up with false beliefs and also false beliefs about our beliefs (we no longer know that we don’t know). Case 2 could lead to extinction.

...

Except, errors do not seem to be evenly & randomly distributed between case 1 and case 2. There seem to be far more case 1s than case 2s, as already mentioned in the early calculus example: far more than 50% of the early calculus results were correct when checked more rigorously. Richard Hamming attributes to Ralph Boas a comment that while editing Mathematical Reviews that “of the new results in the papers reviewed most are true but the corresponding proofs are perhaps half the time plain wrong”.

...

Gian-Carlo Rota gives us an example with Hilbert:

...

Olga labored for three years; it turned out that all mistakes could be corrected without any major changes in the statement of the theorems. There was one exception, a paper Hilbert wrote in his old age, which could not be fixed; it was a purported proof of the continuum hypothesis, you will find it in a volume of the Mathematische Annalen of the early thirties.

...

Leslie Lamport advocates for machine-checked proofs and a more rigorous style of proofs similar to natural deduction, noting a mathematician acquaintance guesses at a broad error rate of 1/329 and that he routinely found mistakes in his own proofs and, worse, believed false conjectures30.

[more on these "structured proofs":
https://mathoverflow.net/questions/35727/community-experiences-writing-lamports-structured-proofs
]

We can probably add software to that list: early software engineering work found that, dismayingly, bug rates seem to be simply a function of lines of code, and one would expect diseconomies of scale. So one would expect that in going from the ~4,000 lines of code of the Microsoft DOS operating system kernel to the ~50,000,000 lines of code in Windows Server 2003 (with full systems of applications and libraries being even larger: the comprehensive Debian repository in 2007 contained ~323,551,126 lines of code) that the number of active bugs at any time would be… fairly large. Mathematical software is hopefully better, but practitioners still run into issues (eg Durán et al 2014, Fonseca et al 2017) and I don’t know of any research pinning down how buggy key mathematical systems like Mathematica are or how much published mathematics may be erroneous due to bugs. This general problem led to predictions of doom and spurred much research into automated proof-checking, static analysis, and functional languages31.

[related:
https://mathoverflow.net/questions/11517/computer-algebra-errors
I don't know any interesting bugs in symbolic algebra packages but I know a true, enlightening and entertaining story about something that looked like a bug but wasn't.

Define sinc𝑥=(sin𝑥)/𝑥.

Someone found the following result in an algebra package: ∫∞0𝑑𝑥sinc𝑥=𝜋/2
They then found the following results:

...

So of course when they got:

∫∞0𝑑𝑥sinc𝑥sinc(𝑥/3)sinc(𝑥/5)⋯sinc(𝑥/15)=(467807924713440738696537864469/935615849440640907310521750000)𝜋

hmm:
Which means that nobody knows Fourier analysis nowdays. Very sad and discouraging story... – fedja Jan 29 '10 at 18:47

--

Because the most popular systems are all commercial, they tend to guard their bug database rather closely -- making them public would seriously cut their sales. For example, for the open source project Sage (which is quite young), you can get a list of all the known bugs from this page. 1582 known issues on Feb.16th 2010 (which includes feature requests, problems with documentation, etc).

That is an order of magnitude less than the commercial systems. And it's not because it is better, it is because it is younger and smaller. It might be better, but until SAGE does a lot of analysis (about 40% of CAS bugs are there) and a fancy user interface (another 40%), it is too hard to compare.

I once ran a graduate course whose core topic was studying the fundamental disconnect between the algebraic nature of CAS and the analytic nature of the what it is mostly used for. There are issues of logic -- CASes work more or less in an intensional logic, while most of analysis is stated in a purely extensional fashion. There is no well-defined 'denotational semantics' for expressions-as-functions, which strongly contributes to the deeper bugs in CASes.]

...

Should such widely-believed conjectures as P≠NP or the Riemann hypothesis turn out be false, then because they are assumed by so many existing proofs, a far larger math holocaust would ensue38 - and our previous estimates of error rates will turn out to have been substantial underestimates. But it may be a cloud with a silver lining, if it doesn’t come at a time of danger.

https://mathoverflow.net/questions/338607/why-doesnt-mathematics-collapse-down-even-though-humans-quite-often-make-mista

more on formal methods in programming:
https://www.quantamagazine.org/formal-verification-creates-hacker-proof-code-20160920/
https://intelligence.org/2014/03/02/bob-constable/

Update: measured effort
In the October 2018 issue of Communications of the ACM there is an interesting article about Formally verified software in the real world with some estimates of the effort.

Interestingly (based on OS development for military equipment), it seems that producing formally proved software requires 3.3 times more effort than with traditional engineering techniques. So it's really costly.

On the other hand, it requires 2.3 times less effort to get high security software this way than with traditionally engineered software if you add the effort to make such software certified at a high security level (EAL 7). So if you have high reliability or security requirements there is definitively a business case for going formal.

WHY DON'T PEOPLE USE FORMAL METHODS?: https://www.hillelwayne.com/post/why-dont-people-use-formal-methods/
You can see examples of how all of these look at Let’s Prove Leftpad. HOL4 and Isabelle are good examples of “independent theorem” specs, SPARK and Dafny have “embedded assertion” specs, and Coq and Agda have “dependent type” specs.6

If you squint a bit it looks like these three forms of code spec map to the three main domains of automated correctness checking: tests, contracts, and types. This is not a coincidence. Correctness is a spectrum, and formal verification is one extreme of that spectrum. As we reduce the rigour (and effort) of our verification we get simpler and narrower checks, whether that means limiting the explored state space, using weaker types, or pushing verification to the runtime. Any means of total specification then becomes a means of partial specification, and vice versa: many consider Cleanroom a formal verification technique, which primarily works by pushing code review far beyond what’s humanly possible.

...

The question, then: “is 90/95/99% correct significantly cheaper than 100% correct?” The answer is very yes. We all are comfortable saying that a codebase we’ve well-tested and well-typed is mostly correct modulo a few fixes in prod, and we’re even writing more than four lines of code a day. In fact, the vast… [more]
ratty  gwern  analysis  essay  realness  truth  correctness  reason  philosophy  math  proofs  formal-methods  cs  programming  engineering  worse-is-better/the-right-thing  intuition  giants  old-anglo  error  street-fighting  heuristic  zooming  risk  threat-modeling  software  lens  logic  inference  physics  differential  geometry  estimate  distribution  robust  speculation  nonlinearity  cost-benefit  convexity-curvature  measure  scale  trivia  cocktail  history  early-modern  europe  math.CA  rigor  news  org:mag  org:sci  miri-cfar  pdf  thesis  comparison  examples  org:junk  q-n-a  stackex  pragmatic  tradeoffs  cracker-prog  techtariat  invariance  DSL  chart  ecosystem  grokkability  heavyweights  CAS  static-dynamic  lower-bounds  complexity  tcs  open-problems  big-surf  ideas  certificates-recognition  proof-systems  PCP  mediterranean  SDP  meta:prediction  epistemic  questions  guessing  distributed  overflow  nibble  soft-question  track-record  big-list  hmm  frontier  state-of-art  move-fast-(and-break-things)  grokkability-clarity  technical-writing  trust
july 2019 by nhaliday
Cleaner, more elegant, and harder to recognize | The Old New Thing
Really easy

Hard
Writing good error-code-based code

Really hard
Writing good exception-based code

--

Really easy
Recognizing that error-code-based code is badly-written
Recognizing the difference between bad error-code-based code and

Hard
Recognizing that error-code-base code is not badly-written

Really hard
Recognizing that exception-based code is badly-written
Recognizing that exception-based code is not badly-written
Recognizing the difference between bad exception-based code

https://ra3s.com/wordpress/dysfunctional-programming/2009/07/15/return-code-vs-exception-handling/
https://nedbatchelder.com/blog/200501/more_exception_handling_debate.html
techtariat  org:com  microsoft  working-stiff  pragmatic  carmack  error  error-handling  programming  rhetoric  debate  critique  pls  search  structure  cost-benefit  comparison  summary  intricacy  certificates-recognition  commentary  multi  contrarianism  correctness  quality  code-dive  cracker-prog  code-organizing
july 2019 by nhaliday
Thick and thin | West Hunter
There is a spectrum of problem-solving, ranging from, at one extreme, simplicity and clear chains of logical reasoning (sometimes long chains) and, at the other, building a picture by sifting through a vast mass of evidence of varying quality. I will give some examples. Just the other day, when I was conferring, conversing and otherwise hobnobbing with my fellow physicists, I mentioned high-altitude lighting, sprites and elves and blue jets. I said that you could think of a thundercloud as a vertical dipole, with an electric field that decreased as the cube of altitude, while the breakdown voltage varied with air pressure, which declines exponentially with altitude. At which point the prof I was talking to said ” and so the curves must cross!”. That’s how physicists think, and it can be very effective. The amount of information required to solve the problem is not very large. I call this a ‘thin’ problem’.

...

In another example at the messy end of the spectrum, Joe Rochefort, running Hypo in the spring of 1942, needed to figure out Japanese plans. He had an an ever-growing mass of Japanese radio intercepts, some of which were partially decrypted – say, one word of five, with luck. He had data from radio direction-finding; his people were beginning to be able to recognize particular Japanese radio operators by their ‘fist’. He’d studied in Japan, knew the Japanese well. He had plenty of Navy experience – knew what was possible. I would call this a classic ‘thick’ problem, one in which an analyst needs to deal with an enormous amount of data of varying quality. Being smart is necessary but not sufficient: you also need to know lots of stuff.

...

Nimitz believed Rochefort – who was correct. Because of that, we managed to prevail at Midway, losing one carrier and one destroyer while the the Japanese lost four carriers and a heavy cruiser*. As so often happens, OP-20-G won the bureaucratic war: Rochefort embarrassed them by proving them wrong, and they kicked him out of Hawaii, assigning him to a floating drydock.

The usual explanation of Joe Rochefort’s fall argues that John Redman’s ( head of OP-20-G, the Navy’s main signals intelligence and cryptanalysis group) geographical proximity to Navy headquarters was a key factor in winning the bureaucratic struggle, along with his brother’s influence (Rear Admiral Joseph Redman). That and being a shameless liar.

Personally, I wonder if part of the problem is the great difficulty of explaining the analysis of a thick problem to someone without a similar depth of knowledge. At best, they believe you because you’ve been right in the past. Or, sometimes, once you have developed the answer, there is a ‘thin’ way of confirming your answer – as when Rochefort took Jasper Holmes’s suggestion and had Midway broadcast an uncoded complaint about the failure of their distillation system – soon followed by a Japanese report that ‘AF’ was short of water.

Most problems in the social sciences are ‘thick’, and unfortunately, almost all of the researchers are as well. There are a lot more Redmans than Rocheforts.
west-hunter  thinking  things  science  social-science  rant  problem-solving  innovation  pre-2013  metabuch  frontier  thick-thin  stories  intel  mostly-modern  history  flexibility  rigidity  complex-systems  metameta  s:*  noise-structure  discovery  applications  scitariat  info-dynamics  world-war  analytical-holistic  the-trenches  creative  theory-practice  being-right  management  track-record  alien-character  darwinian  old-anglo  giants  magnitude  intersection-connectedness  knowledge  alt-inst  sky  physics  electromag  oceans  military  statesmen  big-peeps  organizing  communication  fire  inference  apollonian-dionysian  consilience  bio  evolution  elegance  necessity-sufficiency  certificates-recognition
november 2016 by nhaliday

Copy this bookmark: