Testing for Zero bugs

Testing for Zero bugs
Ariel Faigon
Overview:
The Software Quality Myth: "Software is always buggy !"
Conventional testing methods: Why we're shipping buggy software ?
Monkeys at work: Could this be a better method ?
Constrained Random Testing (CRT): A zillion monkeys, optimized
A proof by example: It Just Works !

Applying constrained random testing to a Compiler (proven)

Outline of CRT (example)

Constraint magic: Making a random program behave

Closing the loop: proving correctness
Practice & experience: Tune, tune, tune

More advice

Q & A

Applying constrained random testing to an OS (speculative)

Acknowledgments You know who you are :-)

The Software Quality Myth
A widely accepted premise on software quality is that software is so complex (in combinatorial terms) that it is impossible to have defect free software. We often hear maxims like "there's always one more bug", and "software testing can reveal the existence of bugs, but never prove their absence".
While the above maxims may be true, I'm afraid they tend to lead us to a state of mind where we accept them as an inevitable evil. If there's no way to make software clean - why bother.
Having some prior experience with software testing, I'm certain that we can do much better than we do today. Once we do this right, the results would pleasantly surprise us.

Conventional testing: How we test software?
Looking at the ways we test software, we see the following methods:

Known scenario replay (test suites, regression tests)
Random early exposure (Campus alphas, selected customer betas)

The known scenario replay is the single most commonly used method to test software. Unfortunately, it is the least effective method to uncover bugs as proven by the large number of bugs uncovered in later stages.
Regression tests and test suites are necessary but insufficient. They're a good way for conducting sanity checking and ensuring that popular and commonly compiled code runs correctly. However, they have two striking flaws:

Coverage is extremely limited. When you run the same suite so many times, obviously your bugs tend to hide elsewhere.

It is difficult to track bugs in case of failure. Test suites tend to be big pieces of code. When something goes wrong, we apply some further trial runs and search techniques until the problem is narrowed down to a single routine or line of source code.
Early exposure like alpha (and beta) testing has the advantage that it is much more random than the known-scenario replay. It provides more "real world" coverage, but it has its own flaws:

Since it is an informal testing method, reproducibility is a problem (e.g asking a beta customer: what exactly did you do to experience the problem?")
It relies on people's good will to investigate problems as they occur, and report bugs accurately and with enough detail when they are found.
It suffers for a small scope (time, number of machines, and people employed) compared to the real installed base and its usage patterns.
At least the Alpha part doesn't really emulate our customer environment: our environment is far from heterogenic: almost no other vendors' machines (Sun, HP, IBM, Mac, PCs) exist on campus.

Fortunately, there is a complementary testing method that covers the above flaws well.

Monkeys at work: Random Testing
It is said that if you give a zillion monkeys keyboards and let them type long enough, one of them would eventually produce (insert your favorite magnum opus here). Finding bugs is probably much easier than writing magnum opii. If you think this is absurd, just substitute the word "monkeys" with "customers".
For years we have been sending our software to much less than a zillion customers, and yet, without exception, they eventually hit hundreds of undiscovered bugs.
Optimization imperative #1 calls for applying zillions of computer cycles before shipment to try and beat the customers in the race to uncover the bugs. This is what random testing is all about. This is also obvious. The magic is in the details.

Constrained Random Testing: a zillion monkeys, optimized
Since the combinatorial space of software is so huge we would like the proverbial monkeys to be constrained in some way and not purely random. To give an example: if we had our monkeys test UNIX for us, we would want them to type stuff like ls, make, cc etc. rather than having them type stuff like %^3gkl'$#*(&% (*).
``Elementary,'' you may say, but read on.

A proof by example: It just works!
Before going into the details of Constrained Random Testing, let me state that from my experience applying this technique, to one problem space, it works far better than any other conventional testing method I've seen.
I used to work at the National Semiconductor microprocessor design center in Tel Aviv (NSTA) where I was a member of the Compilers team who wrote the compilers for all the 32000 Series of microprocessors.
For several years, our compilers were buggy as hell. Having worked closely with hardware engineers, we were urged to test the compilers "like hardware is tested", using random "vectors" of input. Once we were on the right track in our tought process things started to improve fast. It took one engineer about 4 months to come with a prototype for random compiler testing. He was generating random IR (Intermediate Representation) because it was easier to do than generating high-level code and we just needed a proof of concept. As a result we were testing only the back-end (Optimizer, Code generator, assembler, linker).
The introduction of random testing practically eliminated user bug reports on released back ends. At a certain point, we couldn't believe it ourselves, so we conducted an experiment by running the random program generator (it used to be called RIG, for Random IR Generator) and was implemented by Avi Bloch and Amit Matatia (see acknowledgments) on a previous release of the compiler. To our amazement, RIG was able to find over half of all the bugs reported by customers on the code generator in just one night of runtime.
Thanks to RIG, the GNX Compilers for the 32000 processor family have matured in a very short period to become one of the most reliable in the industry; I remember compiling an early version of perl (I think it was perl 2.0) with full optimizations and passing the whole test suite on a Sequent Balance (a big 32032-CPU SMP machine), while the SunOS and the HP/UX compilers I used for comparison weren't anywhere near this quality at that time (mind you, that was way back, in 1990 or so).
Based on this personal experience, I have no doubt this method is well worth pursuing here.

Testing a Compiler: CRT Outline
Here's the skeleton of RIG:
Loop:

Generate a random program subject to constraints
Compile it
Run it
Check that the results make sense:

if they do, discard of the program
if not, save the program and results for later human inspection

Generating a random program is done using recursive descent on the grammar of the language, while applying constraints at every node on the random generator (explained below). To give a simple example: pick an operand at random, check its arity, generate N random operands to operate on. Each operand can be either a constant or an arbitrary randomly generated new expression. etc.

Constraint Magic: making a random program behave
The outline sounds simple enough, but a few problems immediately come to mind:

How can you ensure that a randomly generated program terminates?
How can you ensure that a randomly generated program doesn't generate an exception?
How can you ensure that a perfectly "safe" optimization won't change the semantics of the program (e.g. use of an undefined variable, the behavior of which is undefined)

This is where the constraints come into play. The embedded constraints are what turns the method from a pure theoretical play into a practical and effective testing method.

To ensure termination, we artificially inject an independent counter into every randomly generated loop or recursive function call, if a configurable limit is exceeded, we just break out of the loop. Not perfect, indeed, but simple, straightforward, and does the job.

As for exceptions, there are two approaches: one is to simply allow them, if a randomly picked divisor happens to be zero, so be it. The output may be partial, but still deterministic. The other approach (if exceptions happen too often) is to regenerate those expressions that cause the exceptions, or to ensure by construction that they always fall within a certain range of values.
For example, we ensure that constant divisors are never zero, and that variable divisors are checked at run time (if they are zero don't divide). Likewise we may add a check before array accesses for legal indexes into a pre-generated array. From experience, the second approach is what you want in 99% of the cases, and again, it works well.

To make sure the program has outputs (so we can inspect its runtime results) we simply put a hook to print all the randomly generated variables and constants of the program from a customized exit() routine that is linked with the program.

Likewise, to ensure there's no use of undefined variables, we simply initialize all of them (with random values of course) just before we jump to main().

This is in essence what the constraints are all about.

Closing the loop: proving correctness
But then there's another more fundamental question:

If the program is randomly generated, (i.e. not known in advance) how can you predict its output, in order to verify that it ran correctly?

It turns out that even though the answer to the third question is "You can't", still, from a practical point of view, there's a nice solution to this. Not only it is simple, but it was also empirically proven to work very well.
Solution: Generate a reference output using another compiler or interpreter for the same language. E.g. for C, the reference may be a most vanilla compilation (without any optimizations) by a widely deployed and stable compiler like GNU cc (gcc). This can be even done on a different vendor's system.
Actually, for most cases, you don't even need an additional compiler: You may generate multiple outputs using the same compiler under test, each time with different compilation options. For example: If the result with optimization is different than the one without it, bingo: you've found a bug in the optimizer.

Practice & experience: The importance of space/time tuning
The constraints mentioned above are just part of the story. To be effective and efficient, the random program generator should be space/time tuned, for example: Testing a small array for an unsigned index wrap-around case is effective as checking a big array and is far less consuming in time and space. Thus: elements like the maximum size of a randomly generated array, maximum number of elements in a structure, or the maximum depth of an expression should all be configurable via a configuration file.
Results (in terms of detected bugs per unit of testing time) can change dramatically by fine tuning of these space/time constraint parameters. The general guideline is: strive to use constraints that will create the smallest/simplest case to exhibit the problem.
Some golden rules:

Don't generate big data structures.
Don't generate big programs, loops, or routines
For every possible range of values, use some bias towards extremum and "interesting" values (MAXINT, 1, 0, -1, last index of an array) as opposed to a uniform distribution of values.
Think hard about the constraints, let the random generator do the rest
For every generated program, run it against as many cases as possible (e.g. many possible compiler options) to amortize the generation overhead, and leverage the reference results over many test cases.

From my experience, all compiler bugs can be narrowed down and reproduced in a very small example. Come to think of it, This is exactly what makes random testing of compilers work so well. Also, the smaller the example, the easier it is to investigate and fix. This is where tuning pays big: if you can generate 100,000 tiny programs per night you'll be much more effective covering the compiler and fixing the bugs than if you generate 1000 larger programs per night.

More Advice
Start with simple expressions, including all types and all possible cast and type conversions, these are easy to test and are a first step in a CRT implementation that is natural to start with and build upon.
Pick a very good random number generator. If you don't and if you don't have much variability in the types of nodes that you randomly generate, your programs will tend to find the same bug multiple times. There are practical ways to minimize this: like changing the configuration parameters once a bug is found (and until the bug is fixed) but this requires further resources and effort.
Generate random C rather than random IR. You cannot compile proprietary formats with a reference compiler (not the one you're testing). You also get to test the language front ends in addition to the back end.
It may be easier to generate IR and upwardly generate high-level constructs from it. This is certainly a sensible strategy, especially in case the mapping between IR and C is one to one. even if not, this will enable testing Fortran with almost no additional effort (since we already have these "reverse compilers").
Generating test cases, running them, and comparing results normally take less time than many compilation with many options, so pick compilation cases that are as different as possible of each other to get better coverage. Also: try to combine several compilation options with each run (i.e. many optimizations together, vs. a vanilla compilation) to achieve good coverage of the compiler in as few compilations as possible.
Jumbo tip: An excellent way of tuning the constraints and the random program generator is to basic-block profile the compiler itself and see what parts of source code are not being exercised by the randomly generated programs. Then, tune the random generator a bit more. In other words, inject some white-box style testing into the random black-box approach.
Q & A
Fred Chow (of Stanford, SGI, and global optimization fame), sent me the following questions on the first draft of this doc, I enclose them with the answers... I also made the doc clearer (so the answers may now sound obvious.)
> >Question: What is the amount the effort (in terms of person-months) to >bring up the CRT compiler testing system? >
At NSTA it took about 4 man months of a very good person proficient with yacc, C, and perl (to write the wrapper/driver scripts driving the main engine). Our randomly generated IR (not high-level) didn't even cover all the IR structures (for example: we didn't do function calls in the first prototype) I estimate that for the full project you'll need about 6 man months. From there it can become a virtually endless refining/improvement cycle.
On the upside:
Keep in mind that if we have IR-to-C or IR-to-FORTRAN translators the effort may be smaller. Also at that time (circa 1990) our computing resources at NSTA (National's design center in Tel Aviv) were incomparably weaker than what we have today, e.g. a compilation took minutes instead of seconds.
>Concerns: > >1. Adherence to programming practices: There are standard programming >practices that programmers in general (or we recommend them) adhere to that >are not describable by grammars. E.g. the no alias rules for Fortran >parameters, and no wrap-around when two integers are added. Those are >rules that if not adhered to will cause optimized code that generates >different program outputs. It is only possible to test the compiler >by telling the back-end to those rules are NOT adhered to. >
I believe you can ensure no-aliasing by construction. i.e.: if your random-operator generator picks up an address of a variable, it gives it a unique name (e.g. addr_varname) and subsequently the rest of the random program generator uses only that name for referring to this address. This is just another rule in the constrains collection.
>2. Optimization coverage: Because the random program generation is driven by >syntax, it is likely that only a small percentage of the generated test cases >has optimization opportunities. And when they do occur, they may be >masked out or disabled by the occurrences of other constructs in the same >code. E.g. LNO and software pipelining are optimizations catered towards >loops for scientific and engineering computations, but how often can >CRT generates meaningful loops for them to work on? >
CRT is not only driven by Syntax. It is driven by syntax AND constraints the more constraints you have the more control you have on the generated code. To refer to the concern: if you tune your loop generator to use loop nesting of no more than 3 or so, and force it to use small amounts of code (like N expression nodes) within a loop, or use a small set of variables (e.g. only N arrays, and M induction variables) the prospects for covering the LNO, I believe, are excellent. I looked at those small generated programs many times, the key is a lot of constraint tuning and generating small enough programs (to prevent them from getting out of control due to cumulative randomization). That's the essence of it. It is not too difficult to do and it is surprisingly effective. So let's put the resources on it and just do it!

Applying CRT to OS testing Note: This part was written while I was working at SGI (I no longer work there) and interested in OS testing.
I just recently started thinking about the problem, so this is full of holes. Please bear with me.
At first thought OS testing is harder:

The initial language is bigger (a few hundred sys-calls vs. about 50 reserved words or operands in a language like C)
System calls usually have larger arities than language expression arities (three arguments sys-calls are most common).
The full grammar, including all possible constants passed to each sys-call is larger (need more manual description)
There are more temporal interdependencies between calls: e.g. you need to open(2) a file before doing a seek(2)
Kernels have parallelism
Kernel behavior is time dependent
Networking ops are slower (and thus harder to cover well)
Analyzing kernel crashes is more complex than figuring out why outputs from two identical program runs are different.
Still I noted that basically, the problem isn't essentially different: System calls are just like operators; they have certain arguments that should fall within a range, and can cause exceptions or return error conditions. It should be straightforward to write a random system call generator under constraints, create processes, write files, call OS services, stress the system until it breaks. It certain aspects, the system stress is even easier:

We don't really stress our machines on campus to the limit and it should be easy to do so automatically.
We care less if system calls fail, in fact they fail often in real life (errno is set)
We don't need to invest in creating correct outcomes (like written file contents) which is much less of a problem than crashes. What we need is mainly to create synthetic random workloads.

Disclaimer
CRT is not a magic bullet. It just works remarkably well. However, it should be seen as complementary to white box testing. Better not get rid of all those asserts in your code. Good practices for writing solid code are described in some excellent books. The best of these is probably Writing Solid Code by Steve Maguire which I heartily recommend.

Acknowledgments
CRT is not my idea. I was just a member in a team that made it a reality. I gave it a cute name, organized the thoughts in a nice and readable way, and put them up on a web page because I thought there must be value to something that actually worked while common knowledge on the subject is basically that "it cannot be done."
I hope that sharing these ideas would help improve software quality in general and free software in particular.
I'm not sure who at National was the first to think about it and apply it to software. In any case, the following people deserve the full credit for the idea and its implementation. They are all great programmers and have amazing intellects. I learned a huge amount from all of them:

Avi Bloch
Amit Matatia
Chaim Bendelac
Gadi Ehrlich
Yoav Hollander
Rick Pelleg
Yuval Shachar
Yehezkel Tzadik
Meir Tzadik
Avi Barel
Tal Eliashiv
... apologies for those I might have left out

Footnotes
(*) Yes, I'm very familiar with Bart Miller's fuzz which I think is ``kind of cool''. However, I believe that the approach described herein has greater practical merit.
Feel free to send comments to: