Testing for Trustworthiness in Scientific Software

Daniel Hook, Queen's University, Kingston, Ontario
Diane Kelly, Royal Military College of Canada, Kingston, Ontario

Abstract

Two factors contribute to the difficulty of testing scientific software. One is the lack of testing oracles: a means of comparing software output to expected and correct results. The second is the large number of tests required when following any standard testing technique described in the software engineering literature. Due to the lack of oracles, scientists use judgment based on experience to assess the trustworthiness, rather than the correctness, of their software. This approach is well established for assessing scientific models. However, the problem of assessing software is more complex, exacerbated by the presence of code faults. This highlights the need for effective and efficient testing for code faults in scientific software. Our current research suggests that a small number of well-chosen tests may reveal a high percentage of the code faults in scientific software and allow scientists to increase their trust.

1. Introduction

In a 1982 paper, Elaine Weyuker [14] pointed out that we routinely assume a software tester can determine the correctness of the output of a test. The basis of the assumption is that the tester has an oracle, a means of comparing the software's output to expected, and correct, results. Weyuker notes that in many cases an oracle is pragmatically unattainable. With scientific software, this is almost always the case. Weyuker further comments that if the oracle is unattainable, then "from the view of correctness testing … there is nothing to be gained by performing the test." For scientific software, this is a depressing and disturbing conclusion. However, testing can instead address a goal of trustworthiness rather than correctness.

In this paper, we first outline a typical development environment for scientific software, then briefly discuss testing. This leads to a discussion of correctness and trustworthiness from the view of the scientist who develops and uses scientific software. We then introduce new results from our analysis of samples of scientific software using mutation testing. We conclude with future directions for research.

2. An Environment for Scientific Software

Scientific software resides in a rich environment with several layers of complexity that affect how the software can be successfully tested. The computer language representation, or software, is the culmination of a series of model refinements, each of which adds its own errors and/or approximations, as shown in Figure 1. The refinements are further complicated by transitions from one knowledge domain to another. For example, if we were coding an application for aircraft design, we could be using theory from physics, solution techniques from computational fluid dynamics, and algorithms and data structures from computer science. Each of these knowledge domains contributes errors and approximations to the models embedded in the computer code. The final computer output is an accumulation of all such errors and approximations. Assessing the correctness of the computer output becomes as complex as analyzing the entire environment shown in Figure 1.

Figure 1: Contributors to Computer Output Error for Scientific Software
3. Testing Scientific Software

Testing is usually couched in terms of verification and validation. The software engineering definitions of verification and validation, based on process [e.g., 6], provide no insight into suitable testing activities for scientific software. To confound things further, verification and validation are not consistently defined across the computational science and engineering communities [e.g., 3, 7, 9, 10, 11, 13]. This lack of consistency, plus the complexity of the scientific software environment, contributes to the omission of a major goal in testing, a goal that should be addressed by what we call code scrutinization. We suggest that for scientific software there are three separate testing goals that should be addressed, not two.

Validation for scientists primarily means checking the computer output against a reliable source, a benchmark that represents something in the real world. In the literature, scientists describe validation as the comparison of computer output against various targets such as measurements (of either real-world or bench-test events), analytical solutions of mathematical models, simplified calculations using the computational models, or output from other computer software. Whether that target is another computer program, measurements taken in the field, or human knowledge, the goal of validation is the same: is the computer output reasonably close to the real world?

Verification is also described as a comparison of the computer output to the output of other computer software or to selected solutions of the computational model. Roache succinctly calls verification "solving the equations right" [11]. This includes checking that expected values are returned and that convergence happens within reasonable times. The goal of verification is the assessment of the suitability of the algorithms and the integrity of the implementation of the mathematics.

A third goal of testing, that of searching specifically for code faults, is almost universally missing from the work practices of scientists [e.g., 12]. Yet Hatton and Roberts [5] carried out a detailed study demonstrating that accuracy degradation due to unnoticed code faults is a severe problem in scientific software. Hatton reiterated this observation in 2007 [4], commenting that problems with code faults in scientific software have not gone away.

We have come across only one technique, developed by Roache and colleagues [7, 10, 11], that specifically addresses faults in scientific code. The technique was developed for software that solves partial differential equations (PDEs). Called the Method of Manufactured Solutions, it involves manufacturing an exact analytical solution for the computational model. The computer output can then be compared to the manufactured solution for accuracy and convergence characteristics; the intent is that any code faults affecting either of these will be detected. The technique is used in niches like computational fluid dynamics, but it is limited in its applicability. Its limitations are due to the difference in breadth between analytical solutions and full computational solutions, the need in some cases to alter the code to use the technique, and the fact that PDE solvers are only a small fraction of the lines of code that make up the body of computational software.
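As a minimal sketch of the idea behind manufactured solutions, the MATLAB fragment below uses an illustrative problem of our own choosing (not an example drawn from [7, 10, 11]): a manufactured solution u(x) = sin(pi x) for a one-dimensional boundary value problem, a source term derived from it, and a check that a second-order finite difference solver converges at the expected rate. A code fault that degrades the discretization would show up as a drop in the observed order.

```matlab
% Sketch of the Method of Manufactured Solutions on an illustrative problem:
% manufacture u(x) = sin(pi*x) as the exact solution of -u''(x) = f(x) with
% u(0) = u(1) = 0, derive f, and check the solver's convergence order.
u_exact = @(x) sin(pi*x);
f_src   = @(x) pi^2 * sin(pi*x);            % f = -u'' for the manufactured u

hs   = [1/20, 1/40, 1/80];                  % sequence of grid spacings
errs = zeros(size(hs));
for k = 1:numel(hs)
    h = hs(k);
    n = round(1/h) - 1;                     % number of interior points
    x = h*(1:n)';                           % interior grid
    A = (2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1)) / h^2;
    u = A \ f_src(x);                       % second-order finite differences
    errs(k) = max(abs(u - u_exact(x)));     % maximum nodal error
end
order = log2(errs(1:end-1) ./ errs(2:end)); % observed convergence order (~2)
fprintf('observed convergence orders: %.2f %.2f\n', order);
```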
We suggest a new model for testing, as shown in Figure 2.

Figure 2: Model of Testing for Scientific Software

Figure 2 shows three circles, or cycles, that represent testing activities whose goals fall within the realm of three different specialists.

The outer cycle of testing addresses the need to assess the software against goals in the scientific domain. The ultimate goal of the scientist is to use the software to provide data and insight [13] for problems in his or her domain. This testing addresses the capability of the software as a tool for the scientist. We call this testing activity scientific validation.

The next cycle of testing addresses the integrity and suitability of the algorithms and other solution techniques used to provide the scientific solution. This is the domain of the numerical analyst. We introduce the term algorithm verification and refine a definition from Knupp and Salari [7]: "Algorithm verification is the process by which one assesses the code implementation of the mathematics underlying the science."

The inner-most cycle of testing addresses code faults that arise in the realization of the models in a computer language. We call this step in the scientist's testing activity code scrutinization. This step specifically looks for faults such as off-by-one errors, incorrect array indices, and the like. The goal is to ensure the integrity of the software code. This step is not concerned with evaluating the choice of scientific or mathematical models.

Ideally, these testing activities are carried out in order from the inner cycle to the outer cycle. A scientist finding problems with a scientific validation test would be hard pressed to decide whether the source of the problem is the scientific model, the mathematical model, or a code fault if neither algorithm verification nor code scrutinization has been carried out.

Code scrutinization is where testing techniques from software engineering could potentially provide benefit. We know of no testing techniques described in the software engineering literature that have been specifically developed and/or validated for use with computational software. The goal of our research is to identify such a technique.

4. Correctness versus Trustworthiness

Weyuker states [14] that "[being] able to determine whether or not [test] results obtained are correct ... is the fundamental limitation that testers must face." Correctness is essentially a Boolean decision, either true or false. Trust, on the other hand, is on a graduated scale, ultimately determined by judgment. The level of trust can be different for every example of scientific software, and possibly different for each individual output value. When a scientist says that a computer output is acceptable within forty percent of benchmark data, he or she is talking about trust. Nothing can be said about the Boolean concept of correctness.

If we consider the three cycles of testing in our model, scientific validation testing is about trust: is the software giving output that we believe? To gain trust, we exercise the software in different ways and compare the output to different benchmarks in the scientific domain. As the output conforms to our expectations, our trust increases. Similarly for algorithm verification: all we can do is exercise the implementation of our algorithms until our trust is sufficiently high. It is well known that we cannot do exhaustive testing. Scientists have developed a number of approaches to judge the trustworthiness of the implementation of their mathematics: for example, checking that quantities subject to conservation laws are in fact conserved, or that the matrix solver returns a message about ill-conditioning when expected.
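As a concrete, hypothetical illustration of such a check (our own example, with an assumed grid, wave speed, and initial profile), the sketch below advects a profile around a periodic domain with a conservative upwind scheme and confirms that the total "mass" is preserved to round-off; many code faults in the flux terms would break this invariant.

```matlab
% Hypothetical conservation check: a conservative upwind update for 1D
% advection with periodic boundaries should preserve sum(u)*h to round-off.
n  = 200;  h = 1/n;  a = 1.0;  dt = 0.4*h;    % assumed grid and wave speed
x  = h*(0:n-1)';
u  = exp(-100*(x - 0.5).^2);                  % initial profile
m0 = sum(u)*h;                                % initial total "mass"
for step = 1:500
    u = u - a*dt/h * (u - circshift(u,1));    % first-order upwind, periodic
end
fprintf('relative mass drift: %.2e\n', abs(sum(u)*h - m0)/m0);
```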
Only in the testing cycle of code scrutinization can we possibly tackle the Boolean true/false of correctness. For a specific code segment, we may have an oracle that would allow us to determine correctness. The impossibility of exhaustive testing still lingers, however. Practically, this means we make judicious choices for our testing, and we fall back on the sufficiency of trust.

5. Definition of Silent Fault

Trust means we use judgment to decide the acceptability of output from an example of software. Unfortunately, judgment can miss evidence of a serious software fault because of insufficient data or misinterpretation. For scientists, this is exacerbated by the accumulation of approximations and scientific error as the code is realized from the different models. Under many different conditions, a code fault may provide no obvious signal that something is wrong and slip past the scientist's judgment. This is what we term a "silent fault." More fully, we define silent faults as those that cause a change in output but do not cause any of (i) an error message, (ii) a system crash, (iii) wildly improbable results, or (iv) noticeably longer execution time (in the extreme, an endless loop).

Silent faults are what Hatton and Roberts [5] found when they compared the output of nine seismic data processing packages. Code faults such as off-by-one errors caused significant loss of accuracy. These faults went unnoticed by the geoscientists because the output of the code fell within the limits of plausible values. Yet, as Hatton points out in [4], this output data is used to site oil wells, and the loss of accuracy effectively randomized a multi-million dollar decision-making process. It is these types of faults we are interested in exposing. For this research, we are using a technique called mutation testing.

6. Analysis of Software Behaviour Using Mutation Testing

Mutation testing is a well-defined technique [e.g., 1, 2, 8] used to assess the adequacy of a set of tests for a particular piece of software, P0. The technique involves generating a number of new source codes, called mutants, by making one small change to the source code of P0 for each mutant, Pm. The types of small changes are based on a set of mutation operators that research has demonstrated [e.g., 2] to be representative of code faults in general. Each mutant then represents a single code fault that could occur in the original code P0.

The adequacy of a set of tests for P0 is established by running the set of tests on all the mutants that have been generated. A mutant is said to be "killed" by any one of the tests in the set if the output from the mutant Pm differs from the output of the original program P0. This is a simple binary decision. The adequacy of the set of tests is judged by the ability of the tests to kill the mutants.

For scientific software with floating point output, and with the accumulation of the many errors and approximations shown in Figure 1, a binary decision on equality is not useful. Instead, we compute a sensitivity measure S between the output of the mutated code and the output of the non-mutated code for a particular test tx:

S(P0, Pm, tx) = |Pm(tx) - P0(tx)| / |P0(tx)|

The sensitivity measure S gives a relative difference between the output for Pm and P0 when executing the test tx. We do not use any tests where |P0(tx)| = 0.
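A minimal sketch of this scoring, with names and structure of our own invention (saved as score_mutant.m): P0 and Pm are MATLAB function handles for the original and mutated code (assumed here to return a scalar), each test is a cell array of input arguments, and the function returns the maximum of S over the test set, which we use below to characterize each mutant. A run-time error in the mutant is recorded as an unbounded sensitivity.

```matlab
function maxS = score_mutant(P0, Pm, tests)
% Return the maximum over all tests of S(P0, Pm, tx) for one mutant Pm.
% P0, Pm : function handles (assumed scalar output)
% tests  : cell array; each entry is a cell array of input arguments
    maxS = 0;
    for k = 1:numel(tests)
        tx = tests{k};
        y0 = P0(tx{:});                    % output of the original program
        if y0 == 0, continue; end          % skip tests where |P0(tx)| = 0
        try
            S = abs(Pm(tx{:}) - y0) / abs(y0);
        catch
            maxS = Inf;                    % mutant raised an error condition
            return
        end
        maxS = max(maxS, S);
    end
end
```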
For our research, we generated a set of mutants for various examples of scientific software. We then ran a series of tests to examine the ability of the tests to expose the mutants. In other words, we wanted to determine which tests caused the output of the mutant Pm to be most different from the output of P0, and hence most visible; we are interested in tests with maximum S for each mutant Pm.

At the moment, we are using examples of code drawn from a library of MATLAB functions. The codes are all relatively small, but are intended to be part of a larger application. We feel the codes are similar to the small numerical code units typically used in larger computational codes. For each code example, we generated a set of mutants and two sets of tests. The first set of tests was generated randomly, with values chosen from the range of reasonable input values. The second set of tests was specifically designed to exercise the valid limits of the functions. The results we are observing so far suggest that a relatively small set of tests that "push" on the limits of the coded computation can reveal a large proportion of the code faults represented by the mutants.

Figures 3 and 4 illustrate our observations with an example of code that performs an integration using Simpson's Rule. The x-axis gives the real number range of values for each bar in the graph. The height of the bar gives the number of mutants whose maximum S falls in the associated range. The rightmost bar is marked "err"; this is the count of mutants that failed with an error condition for at least one test tx.

We call the source code implementing Simpson's Rule Ps. For Ps, 167 mutants were generated using MATLAB variants of the mutation operators described in [1]. The first set of tests we created consisted of 200 random tests. For each test and for each mutant, we calculated S(Ps, Pm, tx). Since we are interested in tests that reveal a mutant, we recorded the maximum S for each mutant. For the set of random tests, Figure 3 shows the counts of the number of mutants whose max(S) falls in the given ranges. If S(Ps, Pm, tx) = 0, then Ps and Pm are considered equivalent for tx. There were 11 mutants where S was zero for all tests in the set. For the purposes of our discussion, the range 0 < S < 1 was chosen as representing silent faults: faults whose output falls in this range under particular test conditions might be difficult for a scientist to detect. There were 50 mutants whose maximum S was in this range. There were 47 mutants whose maximum S was above one; we assumed this difference could be large enough to make the fault visible. Another 59 mutants caused an error condition, also very visible. If a test causes the sensitivity measure for a mutant to fall either into the range S > 1 or into an error condition, then we say the test strongly reveals the mutant. In summary, our first set of tests strongly revealed 106 of the 156 nonequivalent mutants, or 68% of the mutants.

Figure 3: Counts of mutants Pm with given maximum S for a set of 200 random tests.
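To make this kind of fault and test concrete, the sketch below pairs a hypothetical composite Simpson's Rule implementation with a single hand-written mutant containing an off-by-one change in the step-size computation. Neither function is the actual Ps, nor is the mutant one of the 167 generated for the study; each function would live in its own .m file.

```matlab
% Hypothetical stand-in for a Simpson's Rule unit: n panels, each panel
% spanning two subintervals; f must accept a vector argument.
function I = simpson(f, a, b, n)
    h = (b - a) / (2*n);
    x = a + h*(0:2*n);
    w = 2*ones(1, 2*n+1);  w(2:2:end-1) = 4;  w([1 end]) = 1;
    I = h/3 * sum(w .* f(x));
end

% A hand-written mutant: a single-token off-by-one change, 2*n -> 2*n + 1.
function I = simpson_mutant(f, a, b, n)
    h = (b - a) / (2*n + 1);
    x = a + h*(0:2*n);
    w = 2*ones(1, 2*n+1);  w(2:2:end-1) = 4;  w([1 end]) = 1;
    I = h/3 * sum(w .* f(x));
end
```

For f(x) = x^2 on [0, 1], a test using many panels (n = 1000) gives S of roughly 0.0015, which would pass silently, while a boundary test with n = 1 gives S of roughly 0.70, making the fault much more visible. This is the kind of behaviour that motivated our second set of tests.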
A second set of tests was designed to exercise the function by pushing on its boundaries, in ways that would be immediately recognizable to scientists. For example, the number of panels in the integration was chosen to be small (1 and 2) and large (1000); sections were chosen to include intercepts, symmetries, and non-symmetries. By taking valid combinations of these inputs, we generated 68 tests.

Figure 4 shows the counts of mutants with max(S) in the given ranges for this set of tests. We again observed 11 mutants in the range S = 0 (in other words, equivalent to Ps for these tests). However, 144 mutants either caused an error or had max(S) > 1. In other words, 144 of the 156 nonequivalent mutants, or 92%, were strongly revealed by this smaller set of tests.

Figure 4: Counts of mutants Pm with given maximum S for a set of 68 well-designed tests.

On closer examination, one test alone (test #4) strongly revealed 140 of the mutants. The addition of one other test (test #1) resulted in all 144 mutants being strongly revealed. The success of test #4 may have resulted from its un-mutated output being very close to zero; it was easy to perturb the output by code changes. Test #1 used the minimum value of one for the number of panels in the integration and revealed mutants that pushed the number of panels to illegal values. For this particular example, a set of two well-chosen tests revealed 92% of the code faults.

Our examination of other MATLAB examples of scientific software has given similar results. In all cases, the number of mutants generated was comparable to that for the Simpson's Rule function. Overall, for the examples we considered, 2 to 5 well-designed tests strongly revealed a large percentage of the mutants; the percentages ranged from 76 to 100 percent. The well-designed tests were based on ideas that would be immediately familiar to scientists.

7. Conclusions

The difficulty of testing scientific software arises from many factors, including the lack of test oracles and the difficulty of judging how much testing is enough. A testing goal of correctness is impractical and should be replaced by a goal of trustworthiness. Scientists are already comfortable with testing their software to increase trust in their science models and their solution techniques. Scientists also need to test the software to address code faults, with a cycle of testing that we call code scrutinization. The practical impossibility of finding all code faults is well understood, which leaves the question of how much testing is enough. This is of particular concern to scientists, whose time needs to be spent doing science, not testing software. To make a particular testing approach attractive to scientists, it must be time-efficient and results-effective. Our research is showing that it may be possible to characterize and define such an approach.

At this moment in our research, we are making judicious choices for our testing with the help of 20/20 hindsight. We need to move on to identifying the commonality in these choices and provide scientists with guidance on how to secure trust in their software. Our aim is to identify a time-efficient and results-effective testing approach for scientific software.

8. Acknowledgements

This work is funded by NSERC (Natural Sciences and Engineering Research Council of Canada) and ARP (Canadian Department of National Defence Academic Research Program).

9. References

[1] James H. Andrews, Lionel C. Briand, Yvan Labiche, Akbar Siami Namin, "Using Mutation Analysis for Assessing and Comparing Testing Coverage Criteria", IEEE Transactions on Software Engineering, Vol. 32, No. 8, August 2006, pp. 608-624.
[2] James H. Andrews, Lionel C. Briand, Yvan Labiche, "Is Mutation an Appropriate Tool for Testing Experiments?", Proceedings of the 27th International Conference on Software Engineering (ICSE '05), 2005, pp. 402-411.
[3] CSA N286.7-99, Quality Assurance of Analytical, Scientific, and Design Computer Programs for Nuclear Power Plants, Canadian Standards Association, March 1999.
[4] Les Hatton, "The Chimera of Software Quality", Computer, Vol. 40, No. 8, 2007, pp. 102-103.
[5] Les Hatton, Andy Roberts, "How Accurate is Scientific Software?", IEEE Transactions on Software Engineering, Vol. 20, No. 10, 1994, pp. 786-797.
[6] IEEE Standard 610.12-1990, IEEE Standard Glossary of Software Engineering Terminology, 1999.
[7] Patrick Knupp, Kambiz Salari, Verification of Computer Codes in Computational Science and Engineering, Chapman & Hall/CRC, USA, 2003.
[8] A. Jefferson Offutt, Roland H. Untch, "Mutation 2000: Uniting the Orthogonal", in Mutation Testing for the New Century, Kluwer Academic Publishers, Norwell, MA, USA, 2001, pp. 34-44.
[9] Douglass E. Post, Lawrence G. Votta, "Computational Science Demands a New Paradigm", Physics Today, January 2005, pp. 35-41.
[10] Patrick J. Roache, "Building PDE Codes to be Verifiable and Validatable", Computing in Science and Engineering, September/October 2004, pp. 30-38.
[11] Patrick J. Roache, Verification and Validation in Computational Science and Engineering, Hermosa Publishers, USA, 1998.
[12] Rebecca Sanders, Diane Kelly, "Scientific Software: Where's the Risk and How Do Scientists Deal with It?", IEEE Software, July/August 2008, pp. 21-28.
[13] D.E. Stevenson, "A Critical Look at Quality in Large-Scale Simulations", Computing in Science and Engineering, May/June 1999, pp. 53-63.
[14] Elaine Weyuker, "On Testing Non-testable Programs", The Computer Journal, Vol. 25, No. 4, 1982, pp. 465-470.