Testing for Zero bugs
Ariel Faigon

Overview:
The Software Quality Myth

A widely accepted premise on software quality is that software is so complex (in combinatorial terms) that it is impossible to have defect free software. We often hear maxims like "there's always one more bug", and "software testing can reveal the existence of bugs, but never prove their absence".

While the above maxims may be true, I'm afraid they tend to lead us to a state of mind where we accept them as an inevitable evil. If there's no way to make software clean - why bother.

Having some prior experience with software testing, I'm certain that we can do much better than we do today. Once we do this right, the results would pleasantly surprise us.


Conventional testing: How we test software?

Looking at the ways we test software, we see the following methods:

The known scenario replay is the single most commonly used method to test software. Unfortunately, it is the least effective method to uncover bugs as proven by the large number of bugs uncovered in later stages.

Regression tests and test suites are necessary but insufficient. They're a good way for conducting sanity checking and ensuring that popular and commonly compiled code runs correctly. However, they have two striking flaws:

Early exposure like alpha (and beta) testing has the advantage that it is much more random than the known-scenario replay. It provides more "real world" coverage, but it has its own flaws:

Fortunately, there is a complementary testing method that covers the above flaws well.


Monkeys at work: Random Testing

It is said that if you give a zillion monkeys keyboards and let them type long enough, one of them would eventually produce (insert your favorite magnum opus here). Finding bugs is probably much easier than writing magnum opii. If you think this is absurd, just substitute the word "monkeys" with "customers".

For years we have been sending our software to much less than a zillion customers, and yet, without exception, they eventually hit hundreds of undiscovered bugs.

Optimization imperative #1 calls for applying zillions of computer cycles before shipment to try and beat the customers in the race to uncover the bugs. This is what random testing is all about. This is also obvious. The magic is in the details.


Constrained Random Testing: a zillion monkeys, optimized

Since the combinatorial space of software is so huge we would like the proverbial monkeys to be constrained in some way and not purely random. To give an example: if we had our monkeys test UNIX for us, we would want them to type stuff like   ls, make, cc   etc.   rather than having them type stuff like %^3gkl'$#*(&% (*).

``Elementary,'' you may say, but read on.


A proof by example: It just works!

Before going into the details of Constrained Random Testing, let me state that from my experience applying this technique, to one problem space, it works far better than any other conventional testing method I've seen.

I used to work at the National Semiconductor microprocessor design center in Tel Aviv (NSTA) where I was a member of the Compilers team who wrote the compilers for all the 32000 Series of microprocessors.

For several years, our compilers were buggy as hell. Having worked closely with hardware engineers, we were urged to test the compilers "like hardware is tested", using random "vectors" of input. Once we were on the right track in our tought process things started to improve fast. It took one engineer about 4 months to come with a prototype for random compiler testing. He was generating random IR (Intermediate Representation) because it was easier to do than generating high-level code and we just needed a proof of concept. As a result we were testing only the back-end (Optimizer, Code generator, assembler, linker).

The introduction of random testing practically eliminated user bug reports on released back ends. At a certain point, we couldn't believe it ourselves, so we conducted an experiment by running the random program generator (it used to be called RIG, for Random IR Generator) and was implemented by Avi Bloch and Amit Matatia (see acknowledgments) on a previous release of the compiler. To our amazement, RIG was able to find over half of all the bugs reported by customers on the code generator in just one night of runtime.

Thanks to RIG, the GNX Compilers for the 32000 processor family have matured in a very short period to become one of the most reliable in the industry; I remember compiling an early version of perl (I think it was perl 2.0) with full optimizations and passing the whole test suite on a Sequent Balance (a big 32032-CPU SMP machine), while the SunOS and the HP/UX compilers I used for comparison weren't anywhere near this quality at that time (mind you, that was way back, in 1990 or so).

Based on this personal experience, I have no doubt this method is well worth pursuing here.


Testing a Compiler: CRT Outline

Here's the skeleton of RIG:

Loop:
  • Generate a random program subject to constraints
  • Compile it
  • Run it
  • Check that the results make sense:
    • if they do, discard of the program
    • if not, save the program and results for later human inspection

Generating a random program is done using recursive descent on the grammar of the language, while applying constraints at every node on the random generator (explained below). To give a simple example: pick an operand at random, check its arity, generate N random operands to operate on. Each operand can be either a constant or an arbitrary randomly generated new expression. etc.


Constraint Magic: making a random program behave

The outline sounds simple enough, but a few problems immediately come to mind:

This is where the constraints come into play. The embedded constraints are what turns the method from a pure theoretical play into a practical and effective testing method.

This is in essence what the constraints are all about.


Closing the loop: proving correctness

But then there's another more fundamental question:

It turns out that even though the answer to the third question is "You can't", still, from a practical point of view, there's a nice solution to this. Not only it is simple, but it was also empirically proven to work very well.

Solution: Generate a reference output using another compiler or interpreter for the same language. E.g. for C, the reference may be a most vanilla compilation (without any optimizations) by a widely deployed and stable compiler like GNU cc (gcc). This can be even done on a different vendor's system.

Actually, for most cases, you don't even need an additional compiler: You may generate multiple outputs using the same compiler under test, each time with different compilation options. For example: If the result with optimization is different than the one without it, bingo: you've found a bug in the optimizer.


Practice & experience: The importance of space/time tuning

The constraints mentioned above are just part of the story. To be effective and efficient, the random program generator should be space/time tuned, for example: Testing a small array for an unsigned index wrap-around case is effective as checking a big array and is far less consuming in time and space. Thus: elements like the maximum size of a randomly generated array, maximum number of elements in a structure, or the maximum depth of an expression should all be configurable via a configuration file.

Results (in terms of detected bugs per unit of testing time) can change dramatically by fine tuning of these space/time constraint parameters. The general guideline is: strive to use constraints that will create the smallest/simplest case to exhibit the problem.

Some golden rules:

From my experience, all compiler bugs can be narrowed down and reproduced in a very small example. Come to think of it, This is exactly what makes random testing of compilers work so well. Also, the smaller the example, the easier it is to investigate and fix. This is where tuning pays big: if you can generate 100,000 tiny programs per night you'll be much more effective covering the compiler and fixing the bugs than if you generate 1000 larger programs per night.


More Advice

Start with simple expressions, including all types and all possible cast and type conversions, these are easy to test and are a first step in a CRT implementation that is natural to start with and build upon.

Pick a very good random number generator. If you don't and if you don't have much variability in the types of nodes that you randomly generate, your programs will tend to find the same bug multiple times. There are practical ways to minimize this: like changing the configuration parameters once a bug is found (and until the bug is fixed) but this requires further resources and effort.

Generate random C rather than random IR. You cannot compile proprietary formats with a reference compiler (not the one you're testing). You also get to test the language front ends in addition to the back end.

It may be easier to generate IR and upwardly generate high-level constructs from it. This is certainly a sensible strategy, especially in case the mapping between IR and C is one to one. even if not, this will enable testing Fortran with almost no additional effort (since we already have these "reverse compilers").

Generating test cases, running them, and comparing results normally take less time than many compilation with many options, so pick compilation cases that are as different as possible of each other to get better coverage. Also: try to combine several compilation options with each run (i.e. many optimizations together, vs. a vanilla compilation) to achieve good coverage of the compiler in as few compilations as possible.

Jumbo tip: An excellent way of tuning the constraints and the random program generator is to basic-block profile the compiler itself and see what parts of source code are not being exercised by the randomly generated programs. Then, tune the random generator a bit more. In other words, inject some white-box style testing into the random black-box approach.


Q & A

Fred Chow (of Stanford, SGI, and global optimization fame), sent me the following questions on the first draft of this doc, I enclose them with the answers... I also made the doc clearer (so the answers may now sound obvious.)

>
>Question: What is the amount the effort (in terms of person-months) to
>bring up the CRT compiler testing system?
>
At NSTA it took about 4 man months of a very good person proficient with yacc, C, and perl (to write the wrapper/driver scripts driving the main engine). Our randomly generated IR (not high-level) didn't even cover all the IR structures (for example: we didn't do function calls in the first prototype) I estimate that for the full project you'll need about 6 man months. From there it can become a virtually endless refining/improvement cycle.

On the upside:
Keep in mind that if we have IR-to-C or IR-to-FORTRAN translators the effort may be smaller. Also at that time (circa 1990) our computing resources at NSTA (National's design center in Tel Aviv) were incomparably weaker than what we have today, e.g. a compilation took minutes instead of seconds.

>Concerns:
>
>1. Adherence to programming practices: There are standard programming 
>practices that programmers in general (or we recommend them) adhere to that 
>are not describable by grammars.  E.g. the no alias rules for Fortran
>parameters, and no wrap-around when two integers are added.  Those are
>rules that if not adhered to will cause optimized code that generates
>different program outputs.  It is only possible to test the compiler 
>by telling the back-end to those rules are NOT adhered to.
>
I believe you can ensure no-aliasing by construction. i.e.: if your random-operator generator picks up an address of a variable, it gives it a unique name (e.g. addr_varname) and subsequently the rest of the random program generator uses only that name for referring to this address. This is just another rule in the constrains collection.
>2. Optimization coverage: Because the random program generation is driven by 
>syntax, it is likely that only a small percentage of the generated test cases
>has optimization opportunities.  And when they do occur, they may be
>masked out or disabled by the occurrences of other constructs in the same
>code.  E.g. LNO and software pipelining are optimizations catered towards
>loops for scientific and engineering computations, but how often can
>CRT generates meaningful loops for them to work on?
>
CRT is not only driven by Syntax. It is driven by syntax AND constraints the more constraints you have the more control you have on the generated code. To refer to the concern: if you tune your loop generator to use loop nesting of no more than 3 or so, and force it to use small amounts of code (like N expression nodes) within a loop, or use a small set of variables (e.g. only N arrays, and M induction variables) the prospects for covering the LNO, I believe, are excellent. I looked at those small generated programs many times, the key is a lot of constraint tuning and generating small enough programs (to prevent them from getting out of control due to cumulative randomization). That's the essence of it. It is not too difficult to do and it is surprisingly effective. So let's put the resources on it and just do it!


Applying CRT to OS testing Note: This part was written while I was working at SGI (I no longer work there) and interested in OS testing.

I just recently started thinking about the problem, so this is full of holes. Please bear with me.

At first thought OS testing is harder:

Still I noted that basically, the problem isn't essentially different: System calls are just like operators; they have certain arguments that should fall within a range, and can cause exceptions or return error conditions. It should be straightforward to write a random system call generator under constraints, create processes, write files, call OS services, stress the system until it breaks. It certain aspects, the system stress is even easier:


Disclaimer

CRT is not a magic bullet. It just works remarkably well. However, it should be seen as complementary to white box testing. Better not get rid of all those asserts in your code. Good practices for writing solid code are described in some excellent books. The best of these is probably Writing Solid Code by Steve Maguire which I heartily recommend.


Acknowledgments

CRT is not my idea. I was just a member in a team that made it a reality. I gave it a cute name, organized the thoughts in a nice and readable way, and put them up on a web page because I thought there must be value to something that actually worked while common knowledge on the subject is basically that "it cannot be done."

I hope that sharing these ideas would help improve software quality in general and free software in particular.

I'm not sure who at National was the first to think about it and apply it to software. In any case, the following people deserve the full credit for the idea and its implementation. They are all great programmers and have amazing intellects. I learned a huge amount from all of them:

Footnotes

(*) Yes, I'm very familiar with Bart Miller's fuzz which I think is ``kind of cool''. However, I believe that the approach described herein has greater practical merit.

Feel free to send comments to: