Scientific Software Testing: Analysis with Four Dimensions

Diane Kelly and Stefan Thorsteinson, Royal Military College
Daniel Hook, Engineering Seismology Group Solutions

An exercise to analyze scientific software testing in terms of context, goals, technique, and adequacy evolves to make better use of the scientist's dual role of developer and user.

Software testing is a time-consuming, often frustrating activity, and the software engineering literature related to it is overwhelming, especially for scientists writing computational software in scientific disciplines outside software engineering. Judith Segal studied the general cultural differences between the software engineering and scientific communities [1]. Even though both communities emphasize testing's importance, the gap between their respective understanding of testing concepts seems particularly wide [2].

We conducted an exercise to test an example of scientific software. The interesting outcome is not so much the number of code defects the testing activity detected but the form and evolution of the activity itself. To analyze how the activity evolved, we applied a four-dimensional view of testing. As far as we know, this view is novel. The four dimensions help shift the view of testing from a single attribute of the software (for example, "It's tested!") to a more complete picture that lets us understand the differences in concepts and priorities between testing as it's described in the software engineering literature and as it's applied to a scientific application.

Four Dimensions of Testing

The test dimensions that guided our analysis were context, goals, techniques, and adequacy. These four dimensions began as eight, which one of the authors (Diane Kelly) used to teach testing in a graduate course and at instructional workshops [3]. She found significant overlap in the concepts included under each of the original dimensions, which allowed her to reduce their number to four. These four dimensions represent an orthogonal, minimal set, sufficient to support an interesting analysis of a testing activity.

Context

To fully understand context in terms of what mattered to the testing exercise, we had to cast a wide net. We included the software's historical and technical background, its applications, and the roles and knowledge of its users and developers, as well as the details of what we needed to test for the exercise itself.

Goals

Test goals are sometimes confused with statements such as, "I need to do boundary value testing." However, testing is an information-gathering activity. The initial information-gathering goal for our exercise was simply to better understand the domain content of the software and how it was expressed in code. As our understanding increased, we articulated more focused goals.

Techniques

Our exercise included both static and dynamic techniques. For example, static reviews of source code address maintainability goals better than running the executable. On the other hand, running the software dynamically on target platforms works better for goals related to accuracy. In our context, we found it important to consider the tester as an active part of the system under test. The tester's knowledge and goals were key factors in the choice of technique.

Adequacy

Adequacy often subsumes the goals of a testing exercise.
Software engineering literature often reduces adequacy to a measure of coverage or to bug counts. If time-to-market is the project's highest priority, adequate testing might depend on when time or money runs out. In safety- or business-critical situations, adequacy might reflect the completion of a predetermined verification exercise or the reduction of failures below a statistical limit. In our case, goals determined adequacy, and the tester determined whether the goals were satisfied. This is a softer measure of adequacy, but it can be perfectly valid given the scientific context and goals.

Testing an Astronomy Software Package

Our exercise involved testing StarImg, an astronomy software package. All three of us have undergraduate degrees in science or engineering disciplines other than computing or software. Our graduate degrees differ: one of us, Stefan Thorsteinson, has a graduate degree in physics, while the other two, Daniel Hook and Diane Kelly, have graduate degrees in computer science and software engineering. We felt that this mixture of backgrounds could help us bridge the cultural gap that Segal described [1]. Each of us had different aims for the project, but the exercise increased everyone's understanding of what it means to test scientific software, particularly in the context of a single scientist making changes to an industrial product. It's common for scientists to have the dual roles of developer and user with scientific software.

The StarImg Context

Our study's expanded context included not only the software's technical aspects but also the scientific domain's content as it related to the software's functionality, historical development, and future work. In other words, the context we considered was broad, and it proved to be an important factor in understanding effective testing.

The StarImg software package normally runs automatically (without human intervention) to detect artifacts in astronomical imagery. Its development was spurred by a wealth of imagery that became available in 2006 from a new observatory. However, as is often the case with scientific software, StarImg adapted parts and ideas from older software packages that involved several developers scattered across different institutions.

Technical context. The software itself is not large: about 10,000 lines of code (LOC) written in Matlab and C++. However, the input images that StarImg analyzes are each on the order of 1 to 4 Mbytes. The output is a set of image coordinates marking the locations of identified artifacts, along with metrics calculated for each one. In its normal use, the package runs nightly, analyzing hundreds of astronomical images, together called bulk images. We chose the latest version of the StarImg code base for our testing exercise. Only its C++ modules are reasonably well documented. There are technical notes documenting algorithms copied from an earlier code package, but they don't necessarily correspond exactly to the algorithms presently in the code. No corresponding documentation was ever developed for StarImg. Its code is sparsely commented, and names of variables and functions that made some sense to the original author are not obvious otherwise.
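To make the shape of that output concrete, the sketch below shows one plausible way to represent a single detection record in C++. The type and field names are our own invention for illustration (the metrics are the ones listed under "Scientific context" below); they are not taken from the StarImg code base.

    // Hypothetical record for one identified artifact: its image coordinates
    // plus the metrics computed for it. All names here are invented.
    #include <vector>

    struct ArtifactDetection {
        double x = 0.0;               // image coordinates of the artifact
        double y = 0.0;
        double length = 0.0;
        double orientation = 0.0;
        double brightness = 0.0;
        double signalToNoise = 0.0;
        double eccentricity = 0.0;
    };

    // A nightly bulk run yields a collection of these records per image.
    using DetectionList = std::vector<ArtifactDetection>;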
Scientific context. StarImg analyzes an image taken in star-stare mode. This is a wide-field image taken by a charge-coupled device (CCD) camera and telescope. The image might contain faint artifacts left by objects that are moving relative to the brighter background stars. The faint artifacts can be difficult to detect because of background noise, of which CCD noise is the largest contributor. CCD noise comes mostly from dark current, the thermal energy emitted by the silicon lattice composing the CCD. The camera records this thermal energy as a signal, and its effects are related to the image's exposure time. In addition, individual pixels can exhibit higher-than-normal dark current. Nevertheless, because dark current is a fixed characteristic of the CCD, software can be written to calculate suitable corrections ahead of time. Other contributors to background noise are light pollution from natural sources (bright stars, zodiacal light, and clouds) and artificial sources (nearby electric lights and telescope imperfections).

A major function of StarImg is to estimate the background noise and subtract it from the entire image. If the estimation is in error, the weak artifacts StarImg is trying to locate can be lost. After subtracting the background from the image, StarImg creates a binary version of it. All image signals are identified as either stars or artifacts. For each artifact, StarImg calculates metrics such as length, orientation, brightness, signal-to-noise ratio, and eccentricity. It then uses the metrics' values to determine whether to include an artifact in its output. (A simplified sketch of this detection pipeline appears at the end of this section.)

Historical context. StarImg's background noise algorithms were ported to Matlab from an earlier C++ image analysis package. An even earlier detection package (Match) supplied some of the algorithms but no code. Match requires a priori information on an artifact's size and orientation to identify and extract its signal. StarImg doesn't use a priori information, which allows it to detect unexpected and multiple artifacts that Match wouldn't see. However, not taking advantage of prior knowledge about artifact characteristics reduces StarImg's ability to detect very weak signals. Estimates of the artifacts it misses range from 10 to 25 percent. StarImg's performance on the large, bulk-image sets was nevertheless sufficient to justify its use.

One astronomer developed StarImg's initial set of detection and sorting algorithms and tested them against sample benchmark imagery. The algorithms were then passed on to two other astronomers and a physicist, and the four scientists carried out further tests, development, troubleshooting, and debugging. Their development work was at times functionally separated and at other times overlapping, particularly during testing.

Current development and use. StarImg has no formal maintenance or development plan. Scientist users fix problems as needed and send updates to one of the original scientist-developers, who acts as gatekeeper for changes. The scientists perform regression tests by choosing samples from the thousands of archived images. As is typical with almost all scientific software, the scientist must use experience-based judgment to determine whether an update or new functionality is working. StarImg is currently deployed in an automated image acquisition and processing environment. Scientists are continuing work on it to handle different image types from new observatories. Because each observatory is in a different location with its own camera and telescope, each produces different image sizes and, most importantly, different image backgrounds with different noise and signal characteristics. Other new development has added the ability to track artifacts as well as identify them.
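The sketch below gives a minimal, hypothetical version of the detection pipeline described under "Scientific context": estimate a background level, subtract it, and binarize the residual. StarImg's real algorithms are more elaborate; the global median estimate, the threshold rule, and all names here are simplifying assumptions of ours, not the production code.

    // Minimal sketch of background estimation, subtraction, and binarization.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // A grayscale image stored row-major: width * height pixel values.
    struct Image {
        int width = 0;
        int height = 0;
        std::vector<double> pixels;
    };

    // Estimate a single global background level as the median pixel value.
    // (A real implementation would model spatial variation across the CCD.)
    double estimateBackground(const Image& img) {
        if (img.pixels.empty()) return 0.0;
        std::vector<double> sorted = img.pixels;
        std::nth_element(sorted.begin(),
                         sorted.begin() + sorted.size() / 2, sorted.end());
        return sorted[sorted.size() / 2];
    }

    // Subtract the background and binarize: a pixel counts as signal when its
    // residual exceeds k times the background level (an assumed threshold rule).
    std::vector<bool> binarize(const Image& img, double background, double k) {
        std::vector<bool> mask(img.pixels.size(), false);
        for (std::size_t i = 0; i < img.pixels.size(); ++i) {
            mask[i] = (img.pixels[i] - background) > k * background;
        }
        return mask;
    }

In a sketch of this form, an error in the estimation step feeds directly into both the residual and the threshold, which echoes the point above that a bad background estimate can cause faint artifacts to be lost.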
Initial Goals for the Testing Exercise

We began the exercise with different testing goals. Thorsteinson had just inherited StarImg and would be adding a significant new function. He was interested in assessing the trust he could place in the current software package. That trust touches his dual roles: as a user, he must trust the software's output; as a developer, he must trust that he can successfully alter the software without destroying the trust he needs as a user.

Kelly and Hook, on the other hand, were interested in the effectiveness of two different quality assessment techniques as applied to scientific software. This led them to choose techniques independent of the scientist's goals. Both the techniques and the goals evolved as the exercise took place. In a research environment, this might be acceptable, but it's not the most effective way to proceed in industry. By the end of the exercise, both the goals and the techniques had crystallized into something far more useful for the scientist.

Initial Selection of Techniques

The first of the two major software engineering activities planned for StarImg was to create a set of unit tests that Hook could assess for effectiveness in the context of scientific software and output accuracy. We asked Thorsteinson to generate a set of unit tests for several StarImg functions. We wanted to automate the test execution, including the decision process for determining whether each test was successful. The scientist's goal was to conduct some specific, in-depth StarImg testing. The software engineer's goal was to better understand how to test scientific software. The ultimate goal was to provide guidance to scientists in their choice of tests.

The second activity was to carry out a software inspection of StarImg. Inspection has been described as the single most effective software quality assessment activity [4]. By inspection, we mean a formalized static assessment of a software product, usually source code, that includes a well-defined process and results in a record of found code defects. Researchers at the Royal Military College of Canada (RMC) have developed and evaluated an inspection technique, called task-directed inspection [5], [6], that meets this purpose. It integrates code inspection for defects with production of a useful product such as design documentation. Thorsteinson agreed to use this technique to write a functional description for each StarImg function and record what code defects he found in the process.

Adequacy Criteria to Judge Completion

Initially, we defined adequacy, or the stopping criterion for the exercise, in relation to the techniques: the exercise would be complete when all software functions were inspected and unit tests that exercised every line of code were created. This is a common process-based choice for adequacy in software engineering. The problem with it is the lack of focus on either the product or the person involved. For scientists interested in advancing their theoretical or engineering understanding, this approach is tedious and lacking in motivation. As our exercise progressed, the adequacy criteria shifted to a focus on both the product and the scientist.
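As a concrete, entirely hypothetical illustration of the original plan, the sketch below pairs a small, low-level helper in the style of StarImg's simple utilities with an automated test whose pass/fail decision requires no human judgment and whose cases exercise every line, including the error-checking path. The function, values, and names are invented.

    // Hypothetical low-level helper plus a self-checking unit test.
    #include <cmath>
    #include <iostream>

    // Clamp a pixel value into a sensor's valid range; malformed bounds are
    // reported as NaN, so full statement coverage needs an error-path case.
    double clampPixel(double value, double low, double high) {
        if (low > high) return std::nan("");  // malformed input
        if (value < low) return low;
        if (value > high) return high;
        return value;
    }

    int main() {
        int failures = 0;
        auto check = [&failures](bool ok, const char* name) {
            if (!ok) { ++failures; std::cout << "FAIL: " << name << "\n"; }
        };
        check(clampPixel(5.0, 0.0, 10.0) == 5.0,   "value in range");
        check(clampPixel(-1.0, 0.0, 10.0) == 0.0,  "value below range");
        check(clampPixel(42.0, 0.0, 10.0) == 10.0, "value above range");
        check(std::isnan(clampPixel(1.0, 10.0, 0.0)), "malformed bounds");
        std::cout << (failures == 0 ? "all tests passed\n" : "some tests failed\n");
        return failures;  // nonzero exit status signals failure to automation
    }

Even for a function this small, reaching the error-checking line requires a deliberately malformed input, which is exactly the kind of effort the exercise soon called into question.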
How the Exercise Unfolded

For a relatively small piece of code and two reasonably well-defined software engineering activities, it was surprising how soon the process activities changed.

Testing from the Scientist's Viewpoint

The scientist started by creating tests for the simple, low-level functions that didn't call any other function. Input was either a single value or an array of values. The testing technique was the familiar white-box technique driven by statement coverage. For each function, the scientist created enough tests to ensure that each line of the code was executed at least once. For the simple functions, he verified the coverage by hand because he considered the effort and time needed to learn and adapt a coverage tool to be prohibitive. However, as the functions increased in size, code coverage became too difficult to track manually.

Writing unit tests for the low-level functions provided a degree of confidence in the code, but their overall usefulness was questionable. Providing full coverage required considerable time. Full coverage meant including test cases with malformed inputs to exercise error-checking code. The scientist found problems in the low-level functions, but their significance was low compared to the time expended to find them. In addition, the low-level functions are self-contained and therefore unlikely to change as StarImg is developed further, so unit tests developed for these functions are unlikely to be used again. At this point, the scientist's goal began to shift toward supporting planned software changes. The adequacy criterion was also shifting to include risk-based considerations, which changed the focus to the product rather than the process.

Writing unit tests for higher-level functions was a more challenging task but seemed more satisfying than working with the low-level functions. This was because the science in the more complex functions was more interesting for the scientist to explore. The scientist wasn't sure that the code in these functions was working well or that his changes wouldn't affect the functions in an unexpected way. He found that his motivation to complete the higher-level function tests came from wanting to fully understand how they worked. This became the main motivating goal for the testing exercise. It shifted the adequacy criteria to now include the scientist: increasing his understanding until he reached a comfort level.

The input to each high-level function was often entire images or subimages. Achieving full code coverage would have required painstaking work to alter the images. Given what we learned from creating unit tests for the low-level functions, we didn't think full code coverage would necessarily be a worthwhile pursuit. Instead, we found a new approach to testing. The scientist carefully considered each function's scientific goal and how to test it with a reasonable range of inputs. Instead of trying to format images to reach every line in the function, he selected input imagery that was typical of each usual case: images containing one, multiple, or no artifacts; faint artifacts or artifacts positioned along the image boundary; and malformed artifacts. The scientist thus changed techniques from white-box testing driven by statement coverage to black-box testing using scenarios. In this approach, the scientist determines the different scenarios that the code must handle and creates tests for each scenario; a sketch of such a scenario suite follows.
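The sketch below shows what a scenario suite of this kind might look like. The scenario names mirror the cases listed above; the entry point, file names, and expected counts are invented placeholders, not StarImg's actual interface or data.

    // Hypothetical scenario-based (black-box) tests for the detection step.
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stub standing in for a call into StarImg's detection code; it would
    // normally load the named image and return the number of artifacts found.
    std::size_t detectArtifacts(const std::string& /*imageFile*/) { return 0; }

    struct Scenario {
        std::string name;
        std::string imageFile;      // representative archived image
        std::size_t expectedCount;  // artifacts the scientist expects to find
    };

    int main() {
        const std::vector<Scenario> scenarios = {
            {"no artifacts",           "clear_field.fits",   0},
            {"single bright artifact", "one_streak.fits",    1},
            {"multiple artifacts",     "three_streaks.fits", 3},
            {"faint artifact",         "faint_streak.fits",  1},
            {"artifact on image edge", "edge_streak.fits",   1},
        };
        int failures = 0;
        for (const auto& s : scenarios) {
            const std::size_t found = detectArtifacts(s.imageFile);
            if (found != s.expectedCount) {
                ++failures;
                std::cout << "FAIL " << s.name << ": expected " << s.expectedCount
                          << ", found " << found << "\n";
            }
        }
        return failures;
    }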
Creating test cases this way has the additional benefit of identifying a representative set of input imagery that could be documented and reused for any function requiring an input image. This improves on the current system testing process, which involves selecting test images each time from the thousands of archived images.

The software engineer asked the scientist about the possible impact of a code error that escaped testing and inspection. The biggest impact would be false negatives, in other words, missed artifacts in the astronomical images. Of less concern were false positives, that is, incorrect identifications of artifacts or miscalculations of their metrics. Because the scientist manually examines each identified artifact, these errors would be immediately obvious and the output would be rejected. By considering risk, we refined the testing exercise's goals into something more specific: detecting errors that would cause false-negative output.

Testing from the Software Engineer's Viewpoint

Segal's case study looked at scientists and software engineers working together to write new software and the differences that hampered their efficiency and productivity [1]. In our case, the software engineer had a background in engineering physics, which gave him the fundamentals of the application area. However, he lacked domain knowledge specific to processing astronomical images, which made it difficult for him to conduct effective testing without the scientist's help. His comments reveal the breadth of the difficulty: "I didn't know which parts of the software were most important (therefore deserving of more attention), and I didn't know how much error tolerance each function should be given. In short, I didn't have enough domain experience to develop the intuition and expertise that would allow me to test the routines effectively."

At the same time, the software engineer felt he had a positive influence on the scientist's testing practices. For example, the scientist's first batch of tests focused on robustness testing, that is, testing with nonsense inputs to ensure the software handles them appropriately. However, because the significance of such problems in this context was low compared to the time expended on them, the software engineer suggested focusing instead on accuracy problems using realistic input data. This proved to be a more valuable use of the scientist's time. In addition, the software engineer provided expertise on different testing techniques and coverage metrics that the scientist could experiment with to determine which were most useful. He suggested statement coverage to the scientist, despite its known weaknesses. The suggestion was based on the availability of tools that provide statement coverage statistics and on our previous experience in testing scientific software [7].

We explored automating both the running of the tests and the decisions on test outcomes. In cases where a test oracle (expected output) was not obvious, the scientist wrote code or a script to evaluate the output's correctness. However, as the project moved forward, it became apparent that comparing test output to expected values was a serious issue. Commonly, comparison of a floating-point output x to some expected value y is handled simply by providing an error band ε and accepting the output when |x − y| < ε, as in the sketch that follows.
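A minimal form of this check, with an invented numerical example of how the choice of ε decides whether a small discrepancy is flagged at all (the values are illustrative only):

    // Tolerance-based comparison of computed output x against expected value y.
    #include <cmath>
    #include <iostream>

    bool withinTolerance(double x, double y, double epsilon) {
        return std::fabs(x - y) < epsilon;
    }

    int main() {
        const double expected = 100.0;  // y: the oracle value, when one exists
        const double computed = 100.4;  // x: output carrying a small discrepancy

        std::cout << std::boolalpha
                  << withinTolerance(computed, expected, 1.0)  << "\n"   // true
                  << withinTolerance(computed, expected, 1e-9) << "\n";  // false
        return 0;
    }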
In many cases with scientific software, the value of y is not clear; that is, there is no test oracle. A subtler problem exists with the error band ε. Normally, we think of ε as representing the round-off error imposed by the limitations of working with finite-length representations in computers. With scientific software, ε includes errors, simplifications, and approximations from modeling, measurements, and solution techniques, as well as floating-point round-off error. These various sources of error require the scientist's expertise to judge what's reasonable for both y and ε. A misjudgment in the size of ε can hide a code defect. Hook went on to demonstrate the impact of error tolerances on our ability to find code defects that affect accuracy in scientific software [8].

Software Inspections

Industry's adoption of software inspections has been slow to nonexistent [4]. Even without the overhead of multiple inspectors and organized meetings, inspections pose problems for scientific software, mainly in identifying an effective reading technique. In our case study, the scientist explored different approaches to guide the inspection. In the end, he developed two approaches that substantially increased his understanding of the code and enabled him to spot problems.

Typically, documentation is lacking for scientific software. The scientific theory might be documented, and the code authors might possibly still be available as resources. In our particular case, the scientist was the third person to inherit the original software package. It included a significant code base with which he was unfamiliar and for which there was little documentation. Given these limitations, he could inspect for self-consistency within the code, duplication of code pieces, and inconsistencies and invalid assumptions on the basis of his own knowledge of the application area.

Initially, the scientist followed an inspection regime that resembled the more traditional software inspection approaches. He read the Matlab functions line by line, choosing the functions alphabetically from the StarImg code base and then documenting each function's purpose. This approach was laborious and seemed to reveal very little: when chosen alphabetically, each function's purpose had little or nothing to do with the purpose of the one read just before it.

The scientist discovered a far better approach to choosing the order of the functions for inspection. He selected three typical images as input data and set a breakpoint at the first line of StarImg. He then read the code as it executed. The scientist reported that "this gave a much better feel for what each function was to do." The execution sequence provided additional information about the software and let the scientist make useful judgments about the source code's correctness. The scientist used his knowledge and expectations for the code as he cross-checked executed pieces of it against pieces that hadn't executed. This approach uncovered the problem of dead code: functions that would never be called and could therefore be discarded. He also found several functions whose algorithms accomplished the same task or were deprecated and not documented as such. Some functions had hard-coded assumptions that were no longer valid; the sketch below shows the general shape of such an assumption.
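The sketch below is a hypothetical illustration of this kind of finding: a constant baked in for one particular CCD and image size that silently becomes wrong when imagery from a different observatory is processed, together with a generalized version that turns the assumptions into parameters. The numbers, names, and calculation are invented; they are not StarImg's actual code or values.

    // A hard-coded assumption (invented example) and a generalized replacement.
    #include <vector>

    // Assumes every image comes from the original 1024 x 1024 sensor with a
    // fixed dark-current offset; both assumptions can silently become invalid.
    double meanResidualHardcoded(const std::vector<double>& pixels) {
        const double darkCurrentOffset = 12.5;  // valid only for the original CCD
        double sum = 0.0;
        for (double p : pixels) sum += p - darkCurrentOffset;
        return sum / (1024.0 * 1024.0);         // hard-coded image size
    }

    // The same calculation with the assumptions made explicit as parameters.
    double meanResidual(const std::vector<double>& pixels, double darkCurrentOffset) {
        if (pixels.empty()) return 0.0;
        double sum = 0.0;
        for (double p : pixels) sum += p - darkCurrentOffset;
        return sum / static_cast<double>(pixels.size());
    }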
Another function had a hard-coded assumption that was currently correct but would have to be generalized as StarImg added new imagery types to its functionality. This affected maintainability and therefore fit with the more focused goal of supporting planned software changes.

Interestingly, the scientist found creating the unit tests to be as beneficial as running them. This form of task-directed inspection, in which the task is "creation of tests," requires careful scrutiny of the code to determine test cases. In our exercise, a thorough understanding of each code section led to the identification of more problems. In particular, we found a problem in the code for removing background noise, which had been ported from the older C++ version of the software. It contained several hard-coded values related to the CCDs used for the images, but the values were no longer valid for the imagery types processed by StarImg. The hard-coded values didn't affect the metrics calculated for identified artifacts, so their impact wasn't immediately apparent in the StarImg output. However, further investigation showed that the hard-coded values did affect the threshold for whether an artifact is identified at all. This problem contributed to the estimated 10 to 25 percent of artifacts that StarImg missed; that is, it contributed to the false-negative problem. Inspection by a knowledgeable scientist was the only way to find this error.

Lessons Learned for Testing Scientific Software

Our initial expectations and approaches evolved throughout the exercise. Here, we look again at the four testing dimensions and discuss what we learned.

Context Considerations

To achieve the testing goals that eventually evolved, we had to understand much more than the source code in front of us. We had to understand the histories of different pieces of code, their current and ultimate uses, and the goals of the scientists using the software. Context identified risky code areas, helped prioritize StarImg failure types, and provided information for understanding the pedigree of different parts of the code. Effective testing was impossible without this full understanding of the science within the code.

We also had to use the scientist more effectively. The scientist was both a user and a developer. He had goals and knowledge that blended both roles. We found it better to make use of this blend rather than artificially separate the parts. It let us more effectively define the goals, techniques, and adequacy criteria.

Goal Considerations

Common software engineering goals to "improve quality" and "find bugs" are too vague to effectively guide the assessment of scientific software. Once we asked the question, "Who are we testing this for?" and answered, "the scientist," we formulated more realistic goals. These goals all involved the scientist and included improving scientific understanding, identifying code parts involved in future changes, and mitigating problems in high-risk areas of the code. Once we better understood the goals, we refined the techniques and adequacy criteria to match them.

Technique Considerations

The scientist said the most useful exercise was the line-by-line scrutiny of the code while it executed with selected test data. This exercise gave him a greater sense of accomplishment because its activities were constrained by the test data's execution path and it had a clear termination point.
Line-by-line scrutiny of selected functions to create unit test cases required the scientist to form a deep understanding of the code, which in turn revealed important problems. Both of these activities are a type of code inspection. When asked whether he would use code inspection again, the scientist answered with a definite "yes." He will inherit the Match code. Its increased sensitivity is needed for a space-based telescope, but like StarImg, Match must be automated. He commented that a "thorough code scrutiny provides a level of scientific understanding that is very much desirable."

The extent of the oracle and tolerance problems in testing scientific software requires novel testing approaches specific to this type of software. We found a problem-domain viewpoint to be more viable than a code-coverage viewpoint. As the exercise progressed, we streamlined the activities and made them more efficient. The scientist commented that they gave him "a level of knowledge and confidence in the code that wouldn't have been achieved otherwise."

Adequacy Considerations

Adequacy criteria shifted from the initial process-based criteria to criteria focused on the product and the scientist. This fits with the goals of increasing the scientist's understanding of the code and identifying and, as necessary, improving its high-risk areas. We judged adequacy by the scientist's ultimate satisfaction that he had a trustworthy tool to work with.

By analyzing our testing exercise through the four dimensions of context, goals, techniques, and adequacy, we developed a better understanding of how to effectively test a piece of scientific software. Once we considered the scientist-tester as part of the testing system, the exercise evolved in a way that made use of and increased his knowledge of the software. One result was an approach to software assessment that combines inspection with code execution. Another was the suppression of process-driven testing in favor of goal-centric approaches.

The combination of a software engineer working with a scientist was successful in this case. The software engineer brings a toolkit of ideas, and the scientist chooses and fashions the tools into something that works for a specific situation. Unlike many other types of software systems, scientific software includes the scientist as an integral part of the system. The tools that support the scientist must include the scientist's knowledge and goals in their design. This represents a different way of considering the juxtaposition of software engineering with scientific software development.

References

1. J. Segal, "Scientists and Software Engineers: A Tale of Two Cultures," Proc. Psychology of Programming Interest Group (PPIG 08), Lancaster Univ., 2008, pp. 44–51.
2. R. Sanders and D. Kelly, "Scientific Software: Where's the Risk and How Do Scientists Deal with It?" IEEE Software, vol. 25, no. 4, 2008, pp. 21–28.
3. T. Shepard and D. Kelly, Dimensions of Testing, tech. report TR-74.188-13, 2003; https://www-927.ibm.com/ibm/cas/publications/TR-74.188/13/index.pdf.
4. R.L. Glass, "Inspections—Some Surprising Findings," Comm. ACM, vol. 42, no. 4, 1999, pp. 17–19.
5. D. Kelly and T. Shepard, "Task-Directed Software Inspection," J. Systems and Software, vol. 73, no. 2, 2004, pp. 361–368.
6. D. Kelly and T. Shepard, "Task-Directed Software Inspection Technique: An Experiment and Case Study," Proc. IBM Centers for Advanced Studies Conf. (CASCON 2000), IBM Press, 2000; http://portal.acm.org/citation.cfm?id=782040.
7. D. Kelly, N. Cote, and T. Shepard, "Software Engineers and Nuclear Engineers: Teaming Up to Do Testing," Proc. Canadian Nuclear Soc. Conf., Canadian Nuclear Soc., June 2007.
8. D.A. Hook, "Using Code Mutation to Study Code Faults in Scientific Software," master's thesis, Queen's Univ., Kingston, Canada, 2009; https://qspace.library.queensu.ca/handle/1974/1765.

About the Authors

Diane Kelly is an associate professor in the Department of Mathematics and Computer Science at the Royal Military College (RMC). Her research interests are in identifying and improving software engineering techniques for use specifically with scientific software. Kelly has a PhD in software engineering from RMC. Contact her at [email protected].

Stefan Thorsteinson is a researcher at the Royal Military College (RMC) Center for Space Research. His research interests are in small-aperture space-based astronomy, astrodynamics, and image analysis. Thorsteinson has an MSc in physics from RMC. Contact him at [email protected].

Daniel Hook is a software researcher and developer for Engineering Seismology Group Solutions in Kingston, Ontario. His research interests are in the engineering and development of scientific software, especially the impact of software engineering on scientific software quality. Hook has an MSc in computing from Queen's University, Kingston. Contact him at [email protected].