Antonia Bertolino (http://www.isti.cnr.it/People/A.Bertolino) is a Research Director of the Italian National Research Council at ISTI in Pisa, where she leads the Software Engineering Laboratory. She also coordinates the Pisatel laboratory, sponsored by Ericsson Lab Italy. Her research interests are in architecture-based, component-based and service-oriented test methodologies, as well as methods for the analysis of non-functional properties. She is an Associate Editor of the Journal of Systems and Software and of the Empirical Software Engineering Journal, and has previously served for the IEEE Transactions on Software Engineering. She is the Program Chair of the joint ESEC/FSE Conference to be held in Dubrovnik, Croatia, in September 2007, and is a regular member of the Program Committees of international conferences, including ACM ISSTA, Joint ESEC-FSE, ACM/IEEE ICSE, and IFIP TestCom. She has (co)authored over 80 papers in international journals and conferences.

Software Testing Research: Achievements, Challenges, Dreams

Antonia Bertolino
Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo"
Consiglio Nazionale delle Ricerche
56124 Pisa, Italy
[email protected]

Abstract

Software engineering comprises several disciplines devoted to preventing and remedying malfunctions and to warranting adequate behaviour. Testing, the subject of this paper, is a widespread validation approach in industry, but it is still largely ad hoc, expensive, and unpredictably effective. Indeed, software testing is a broad term encompassing a variety of activities along the development cycle and beyond, aimed at different goals. Hence, software testing research faces a collection of challenges. A consistent roadmap of the most relevant challenges to be addressed is proposed here. Its starting point is constituted by some important past achievements, while its destination consists of four identified goals to which research ultimately tends, but which remain as unreachable as dreams. The routes from the achievements to the dreams are paved by the outstanding research challenges, which are discussed in the paper along with interesting ongoing work.

1. Introduction

Testing is an essential activity in software engineering. In the simplest terms, it amounts to observing the execution of a software system to validate whether it behaves as intended and to identify potential malfunctions. Testing is widely used in industry for quality assurance: indeed, by directly scrutinizing the software in execution, it provides realistic feedback on its behavior, and as such it remains the inescapable complement to other analysis techniques. Beyond the apparent straightforwardness of checking a sample of runs, however, testing embraces a variety of activities, techniques and actors, and poses many complex challenges. Indeed, with the complexity, pervasiveness and criticality of software growing ceaselessly, ensuring that it behaves according to the desired levels of quality and dependability becomes more crucial, and increasingly difficult and expensive. Earlier studies estimated that testing can consume fifty percent, or even more, of development costs [3], and a recent detailed survey in the United States [63] quantifies the high economic impact of an inadequate software testing infrastructure.
Correspondingly, novel research challenges arise, such as how to reconcile model-based derivation of test cases with modern dynamically evolving systems, or how to effectively select and use runtime data collected from real usage after deployment. These newly emerging challenges add to longstanding open problems, such as how to qualify and evaluate the effectiveness of testing criteria, or how to minimize the amount of retesting after the software is modified. Over the years, the topic has attracted increasing interest from researchers, as testified by the many specialized events and workshops, as well as by the growing percentage of testing papers in software engineering conferences; for instance, at the 28th International Conference on Software Engineering (ICSE 2006) four out of the twelve sessions in the research track focused on "Test and Analysis".

This paper organizes the many outstanding research challenges for software testing into a consistent roadmap. The identified destinations are a set of four ultimate and unachievable goals called "dreams". Aspiring to those dreams, researchers are addressing several challenges, which are here seen as interesting viable facets of the bigger unsolvable problem. The resulting picture is proposed to the software testing research community as a work-in-progress fabric to be adapted and expanded.

In Section 2 we discuss the multifaceted nature of software testing and identify a set of six questions underlying any test approach. In Section 3 we then introduce the structure of the proposed roadmap. We summarize some more mature research areas, which constitute the starting point for our journey along the roadmap, in Section 4. Then in Section 5, which is the main part of the paper, we overview several outstanding research challenges and the dreams towards which they tend. Brief concluding remarks in Section 6 close the paper.

2. The many faces of software testing

Software testing is a broad term encompassing a wide spectrum of different activities, from the testing of a small piece of code by the developer (unit testing), to the customer validation of a large information system (acceptance testing), to the monitoring at run-time of a network-centric service-oriented application. In the various stages, the test cases could be devised with different objectives in mind, such as exposing deviations from the user's requirements, assessing conformance to a standard specification, evaluating robustness to stressful load conditions or to malicious inputs, measuring given attributes such as performance or usability, estimating operational reliability, and so on. Besides, the testing activity could be carried out according to a controlled formal procedure, requiring rigorous planning and documentation, or rather informally and ad hoc (exploratory testing). As a consequence of this variety of aims and scope, a multiplicity of meanings for the term "software testing" arises, which has generated many specific research challenges. To organize the latter into a unifying view, in the rest of this section we attempt a classification of the problems common to the many meanings of software testing. The first concept to capture is the common denominator, if one exists, among all the possible different testing "faces".
We propose that such a common denominator can be the very abstract view that, given a piece of software (whatever its typology, size and domain), testing always consists of observing a sample of executions and giving a verdict over them. Starting from this very general view, we can then concretize different instances by distinguishing the specific aspects that can characterize the sample of observations:

WHY: why is it that we make the observations? This question concerns the test objective, e.g.: are we looking for faults? do we need to decide whether the product can be released? or rather do we need to evaluate the usability of the User Interface?

HOW: which sample do we observe, and how do we choose it? This is the problem of test selection, which can be done ad hoc, at random, or in a systematic way by applying some algorithmic or statistical technique. It has inspired much research, which is understandable not only because it is intellectually attractive, but also because how the test cases are selected -the test criterion- greatly influences test efficacy.

HOW MUCH: how big a sample? Dual to the question of how we pick the sample observations (test selection) is that of how many of them we take (test adequacy, or stopping rule). Coverage analysis and reliability measures constitute two "classical" approaches to answering this question.

WHAT: what is it that we execute? Given the (possibly composite) system under test, we can observe its execution either taking it as a whole, or focusing only on a part of it, which can be larger or smaller (unit test, component/subsystem test, integration test) and more or less well defined: this aspect gives rise to the various levels of testing, and to the scaffolding necessary to permit test execution of a part of a larger system.

WHERE: where do we perform the observation? Strictly related to what we execute is the question of whether this is done in house, in a simulated environment or in the target final context. This question assumes the highest relevance when it comes to the testing of embedded systems.

WHEN: when in the product lifecycle do we perform the observations? The conventional argument is that the earlier the better, since the cost of fault removal increases as the lifecycle proceeds. But some observations, in particular those that depend on the surrounding context, cannot always be anticipated in the laboratory, and we cannot carry out any meaningful observation until the system is deployed and in operation.

These questions provide a very simple and intuitive characterization schema of software testing activities, which can help in organizing the roadmap for future research challenges.

3. Software testing research roadmap

A roadmap provides directions to reach a desired destination starting from the "you are here" red dot. The software testing research roadmap is organised as follows:

• the "you are here" red dot consists of the most notable achievements from past research (but note that some of these efforts are still ongoing);

• the desired destination is depicted in the form of a set of (four) dreams: we use this term to signify that these are asymptotic goals at the end of four identified routes for research progress. They are unreachable by definition, and their value lies exactly in acting as poles of attraction for useful, farsighted research;

• in the middle are the challenges faced by current and future testing research, at more or less mature stages, and with greater or smaller chances of success.
These challenges constitute the directions to be followed in the journey towards the dreams, and as such they are the central, most important part of the roadmap.

The roadmap is illustrated in Figure 1. In it, we have situated the emerging and ongoing research directions in the center, with the more mature topics -the achievements- on their left, and the ultimate goals -the dreams- on their right.

Figure 1. Roadmap

Four horizontal strips depict the identified research routes toward the dreams, namely:

1. Universal test theory;
2. Test-based modeling;
3. 100% automatic testing;
4. Efficacy-maximized test engineering.

The routes are ordered bottom-up roughly according to progressive utility: the theory is at the basis of the adopted models, which in turn are needed for automation, which is instrumental to cost-effective test engineering. The challenges horizontally span six vertical strips corresponding to the WHY, HOW, HOW MUCH, WHAT, WHERE, and WHEN questions characterizing the faces of software testing (in no specific order). Software testing research challenges find their place in this plan, vertically depending on the long-term dream, or dreams, towards which they mainly tend, and horizontally according to which question, or questions, of the introduced software testing characterization they mainly center on. In the remainder of this paper, we will discuss the elements (achievements, challenges, dreams) of this roadmap. We will often compare this roadmap with its 2000 predecessor by Harrold [43], which we will henceforth refer to as FOSE2000.

4. You are here: Achievements

Before outlining the future routes of software testing research, a snapshot is attempted here of some topics which constitute the body of knowledge in software testing (for a ready, more detailed guide see also [8]), or in which important research achievements have been established. In the roadmap of Figure 1, these are represented on the left side.

The origins of the literature on software testing date back to the early 70's (although one can imagine that the very notion of testing was born simultaneously with the first experiences of programming): Hetzel [44] dates the first conference devoted to program testing to 1972. Testing was conceived as an art, and was exemplified as the "destructive" process of executing a program with the intent of finding errors, as opposed to design, which constituted the "constructive" part. To these years dates Dijkstra's most cited aphorism about software testing: that it can only show the presence of faults, but never their absence [25]. The 80's saw the elevation of testing to the status of an engineered discipline, and a change of view of its goal from mere error discovery to a more comprehensive and positive view of prevention. Testing is now characterized as a broad and continuous activity throughout the development process ([44], pg. 6), whose aim is the measurement and evaluation of software attributes and capabilities, and Beizer states: "More than the act of testing, the act of designing tests is one of the best bug preventers known" ([3], pg. 3).

Testing process. Indeed, much research in the early years has matured into techniques and tools which help make such "test-design thinking" more systematic and incorporate it within the development process.
Several test process models have been proposed for industrial adoption, among which the "V model" is probably the most popular. All of its many variants share the distinction of at least the unit, integration and system levels of testing. More recently, the V model's implication of a phased and formally documented test process has been criticized by some as inefficient and unnecessarily bureaucratic, and more agile processes have been advocated in contrast. Concerning testing in particular, a different model gaining attention is test-driven development (TDD) [46], one of the core extreme programming practices. The establishment of a suitable process for testing was listed in FOSE2000 among the fundamental research topics, and indeed this remains an active research area today.

Test criteria. The set of test criteria devised by past research to help the systematic identification of test cases is extremely rich. Traditionally these have been distinguished between white-box (a.k.a. structural) and black-box (a.k.a. functional), depending on whether or not the source code is exploited in driving the testing. A more refined classification can be made according to the source from which the test cases are derived [8], and many textbooks and survey articles (e.g., [89]) exist that provide comprehensive descriptions of existing criteria. Indeed, so many criteria now exist to choose from that the real challenge becomes the ability to make a justified choice, or rather to understand how they can be most efficiently combined. In recent years the greatest attention has been turned to model-based testing, see Section 5.2.

Comparison among test criteria. In parallel with the investigation of criteria for test selection and for test adequacy, a lot of research has addressed the evaluation of the relative effectiveness of the various test criteria, and especially of the factors which make one technique better than another at fault finding. Past studies have included several analytical comparisons between different techniques (e.g., [31, 88]). These studies have made it possible to establish a subsumption hierarchy of relative thoroughness among comparable criteria, and to understand the factors influencing the probability of finding faults, focusing in particular on comparing partition (i.e., systematic) testing against random testing. "Demonstrating effectiveness of testing techniques" was in fact identified as a fundamental research challenge in FOSE2000, and still today this objective calls for further research, whereby the emphasis is now on empirical assessment.

Object-oriented testing. Indeed, at any given period, the dominant development paradigm has catalyzed testing research for adequate approaches, as we further develop in Section 5.5. In the 90's the focus was on testing of object-oriented (OO) software. Once the myth that the enhanced modularity and reuse brought by OO programming could even prevent the need for testing had been rejected, researchers soon realized that not only did everything already learned about software testing in general also apply to OO code, but OO development also introduced new risks and difficulties, hence increasing the need for, and the complexity of, testing [14]. In particular, among the core mechanisms of OO development, encapsulation can help hide bugs and makes testing harder; inheritance requires extensive retesting of inherited code; and polymorphism and dynamic binding call for new coverage models.
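To make the last point concrete, consider a hedged, minimal sketch (hypothetical Shape, Circle, Square and Report names, plain Java with a simple assertion helper): a single call site such as shape.area() is covered by any test that reaches the statement, yet each possible dynamic binding of that call deserves its own test case.

    // Hypothetical example: one call site, several possible dynamic bindings.
    abstract class Shape {
        abstract double area();
    }

    class Circle extends Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        double area() { return Math.PI * r * r; }
    }

    class Square extends Shape {
        private final double side;
        Square(double side) { this.side = side; }
        double area() { return side * side; }
    }

    class Report {
        // The single call site shape.area() is bound dynamically: statement
        // coverage of this line says nothing about which implementations of
        // area() have actually been exercised.
        static double totalArea(java.util.List<? extends Shape> shapes) {
            double total = 0.0;
            for (Shape shape : shapes) {
                total += shape.area();
            }
            return total;
        }
    }

    // A binding-aware test suite exercises the call site once per concrete type.
    class ReportTest {
        static void assertClose(double expected, double actual) {
            if (Math.abs(expected - actual) > 1e-9) {
                throw new AssertionError("expected " + expected + " but was " + actual);
            }
        }

        public static void main(String[] args) {
            assertClose(Math.PI, Report.totalArea(java.util.List.of(new Circle(1.0))));
            assertClose(4.0, Report.totalArea(java.util.List.of(new Square(2.0))));
            // Mixed receivers: both bindings reached through the same call site.
            assertClose(Math.PI + 4.0,
                    Report.totalArea(java.util.List.of(new Circle(1.0), new Square(2.0))));
            System.out.println("all polymorphic bindings of area() exercised");
        }
    }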
Besides, appropriate strategies for effective incremental integration testing are required to handle the complex spectrum of possible static and dynamic dependencies between classes.

Component-based testing. In the late 90's, component-based (CB) development emerged as the ultimate approach that would yield rapid software development with fewer resources. Testing within this paradigm introduced new challenges, which we would distinguish as technical and theoretical in kind. On the technical side, components must be generic enough to be deployed on different platforms and in different contexts, therefore the component user needs to retest the component in the assembled system where it is deployed. But the crucial problem here is the lack of information for the analysis and testing of externally developed components. In fact, while component interfaces are described according to specific component models, these do not provide enough information for functional testing. Therefore research has advocated that appropriate information, or even the test cases themselves (as in Built-In Testing), be packaged along with the component to facilitate testing by the component user, and also that the "contract" the components abide by should be made explicit, to allow for verification. The testing of component-based systems was also listed as a fundamental challenge in FOSE2000. For a more recent survey see [70]. What remains an open, evergreen problem is the theoretical side of CB testing: how can we infer interesting properties of an assembled system, starting from the results of testing the components in isolation? The theoretical foundations of compositional testing still remain a major research challenge destined to last, and we discuss some directions for research in Section 5.1.

Protocol testing. Protocols are the rules that govern the communication between the components of a distributed system, and these need to be precisely specified in order to facilitate interoperability. Protocol testing is aimed at verifying the conformance of protocol implementations against their specifications. The latter are released by standards organizations, or by consortia of companies. In certain cases, a standard conformance test suite is also released. Pushed by the pressure of enabling communication, research in protocol testing has proceeded along a separate and, in a sense, privileged trail with respect to software testing. In fact, thanks to the existence of precise state-based specifications of the desired behaviour, research could very early develop advanced formal methods and tools for testing conformance to those established standard specifications [16]. Since these results were conceived for a restricted, well-defined field of application, they do not readily apply to general software testing. However, the same original problem of ensuring proper interaction between remote components and services arises today on a broader scale for any modern software; therefore software testing research could fruitfully learn from protocol testing the habit of adopting standardized formal specifications, which is the trend in modern service-oriented applications. Vice versa, while early protocols were simple and easily tractable, today the focus is shifting to higher levels of communication protocols, and hence the complexity plague more typical of software testing is starting to become pressing here as well.
Therefore, the conceptual separation between protocol testing and general software testing problems is progressively vanishing.

Reliability testing. Given the ubiquity of software, its reliability, i.e., the probability of failure-free operation for a specified period of time in a specified environment, today impacts any technological product. Reliability testing recognizes that we can never discover the last failure, and hence, by using the operational profile to drive testing, it tries to eliminate those failures which would manifest themselves most frequently: intuitively, the tester mimics how the users will employ the system. Software reliability is usually inferred based on reliability models: different models should be used depending on whether the detected faults are removed, in which case reliability grows, or not, in which case reliability is only certified. Research in software reliability has intersected research in software testing in many fruitful ways. Models for software reliability were actively studied in the 80's and 90's [58]. These models are now mature and can be engineered into the test process, providing quantitative guidance for how and how much to test. For instance, this was done by Musa in his Software-Reliability-Engineered Testing (SRET) approach ([58], Chapt. 6), and is also advocated in the Cleanroom development process, which pursues the application of statistical test approaches to yield certified reliability measures [69]. Unfortunately, the practice of reliability testing has not proceeded at the same pace as the theoretical advances in software reliability, probably because it is (perceived as) a complex and expensive activity, but also because of the inherent difficulty of identifying the required operational profile [41]. Yet today the demand for reliability and other dependability qualities is growing, and hence the need arises for practical approaches to coherently test the functional and extra-functional behaviour of modern software-intensive systems, as discussed further in Section 5.5. For future challenges in reliability testing we refer to Lyu's roadmap [57].

5. The routes

In this section we describe the dreams of software testing research, and for each of them some relevant challenges to be addressed to advance the state of the art closer to the dream itself.

5.1. Dream: Universal test theory

One of the longstanding dreams of software testing research would be to build a sound, comprehensive theory which is useful to back up and nourish test technology. By asking for a "universal" test theory I mean one coherent and rigorous framework to which testers can refer to understand the relative strengths and limitations of existing test techniques, and to be guided in selecting the most adequate one, or mix thereof, given the present conditions. Seminal work in software testing theory dates back to the late 70's, when the related notions of a "reliable" [45] or an "ideal" [36] test suite were first introduced. Thanks to this pioneering work, we have logical arguments to corroborate the quite obvious fact that testing can never be exact [25].
But such knowledge per se, in addition to the warning that even though many tests have passed the software can still be faulty, provides little guidance about what we can then conclude about the tested software after having applied a selected technique, or, going even further, about how we could dynamically tune our testing strategy as we proceed with accumulating test results, taking into account what we observe. The dream would be to have a test machinery which ties a statement of the goal of testing to the most effective technique, or combination of techniques, to adopt, along with the underlying assumptions that we need to make. Towards this dream, research needs to address several challenges.

Challenge: Explicit test hypotheses

Ultimately, given that testing is necessarily based on approximations (remember that we started from the statement that testing amounts to sampling some executions), this universal theory should also make explicit, for each technique, its underlying assumptions, or test hypotheses. First formalized in [6], the concept of a test hypothesis justifies the common and intuitive test practice behind the selection of every finite test set, by which a sample is taken as representative of several possible executions. With the exception of a few formal test approaches, test hypotheses are usually left implicit, while it would be of the utmost importance to make them explicit. In this way, if we perform "exhaustive" testing according to the selected test criterion, from successfully completing the testing campaign we could justifiably conclude that the software is correct under the stated hypotheses: i.e., we still know that the software could actually be faulty, but we also know what we assumed to be true at the origin and what could instead be false. This notion is similar to that of a "fault model", which is used instead in the realm of protocol testing, where a test suite is said to provide fault coverage guarantee for a given fault model. A summary of the test hypotheses behind the most common testing approaches is given, for instance, by Gaudel [34], who mentions among others the Uniformity Hypothesis for black-box partition criteria (the software is assumed to behave uniformly within each test subdomain), and the Regularity Hypothesis, which uses a size function over the tests. Such research should be extended to cover other criteria and approaches. The test hypotheses should be modularized by test objective: different theories/hypotheses would be necessary when testing for reliability, when testing for debugging, and so on. By making our assumptions explicit, this challenge refines the WHY of the executions we observe.

Challenge: Test effectiveness

To establish a useful theory for testing, we need to assess the effectiveness of existing and novel test criteria. Although, as noted among the achievements, several comparison studies have been conducted for this purpose, FOSE2000 already signalled that additional research was needed to provide analytical, statistical, or empirical evidence of the effectiveness of test-selection criteria in revealing faults, in order to understand the classes of faults for which the criteria are useful. These challenges are still alive. In particular, it is now generally agreed that it is always more effective to use a combination of techniques rather than applying only one, even if it is judged the most powerful, because each technique may target different types of faults and will suffer from a saturation effect [58].
Several works have contributed to a better understanding of the inherent limitations of different testing approaches, starting from Hamlet and Taylor's seminal paper discussing partition testing and its underlying assumptions [41]. Yet further work is needed, notably to contextualize such comparisons to the complexity of real-world testing (for instance, Zhu and He [90] analyse the adequacy of testing concurrent systems), as well as to refine the assumptions at the basis of such comparisons, to take into account progress in test automation. For example, even the conventional controversy about the relative merits of systematic vs. random techniques is today revitalized by the emerging sophisticated methods for automating random test generation (which are discussed in Section 5.3). This challenge addresses the WHY, HOW and HOW MUCH of testing, in terms of the faults (which and how many) we target.

Challenge: Compositional testing

The ever growing complexity of software makes testing hard, and hinders progress towards any of the research dreams, including test theory. Traditionally, test complexity has been addressed by the ancient divide et impera strategy, i.e., the testing of a large complex system is decomposed into the separate testing of its constituent "pieces". Much past research has addressed techniques and tools for supporting incremental testing strategies in organizing and executing progressively different aggregations of components. For example, different strategies have been proposed to generate the test order that is most efficient in minimizing the need for stubs and scaffolding; see [18] for a recent comparison. The problem has become particularly relevant today with the emergence of the CB development paradigm, as already discussed in FOSE2000, and even more so with the increasing adoption of dynamic system compositions. So, we need a chapter of testing theory addressing compositional testing: we need to understand how we can reuse the test results observed in the separate testing of the individual pieces (be they units, components or subsystems), in particular what conclusions can be inferred about the system resulting from the composition, and which additional test cases must be run on the integration. Several promising directions of study have been undertaken in different contexts. For instance, Hamlet has proposed a simple foundational theory for component-based software reliability [40], recently extended with the notion of state [39], but work is still needed to make it generally applicable. Blundell and coauthors [15] are instead investigating the application to testing of assume-guarantee reasoning, a verification technique used to infer global system properties by checking individual components in isolation. Since, to be able to verify a component individually, we need to make assumptions about its context, assume-guarantee verification checks whether a component guarantees a property assuming that the context behaves correctly, and then, symmetrically, the context is checked assuming that the component is correct. The promise of assume-guarantee testing would be that by observing the test traces of the individual components one could infer global behaviours. The protocol test community is also actively investigating compositional testing.
For example, van der Bijl and coauthors [81] have formally analysed the parallel composition of two given communication components, based on the ioco-test theory [79], which works on Labeled Transition Systems. In particular, if two components have been separately tested and proved to be ioco-correct, is their integration ioco-correct as well? The authors show that in general this cannot be concluded, but the answer can be affirmative for components whose inputs are completely specified [81]. Gotzhein and Khendek [37] have instead considered the glue code for the integration of communicating components, produced a fault model for it, and developed a procedure to find the test cases for the glue. This challenge is clearly related to WHAT we test.

Challenge: Empirical body of evidence

Today the importance of experimentation for advancing the maturity of the software engineering discipline certainly does not need to be underlined (Sjøberg and coauthors [77] discuss in depth the research challenges faced by empirical methods). In every topic of software engineering research, empirical studies are essential to evaluate proposed techniques and practices, to understand how and when they work, and to improve on them. This is obviously true for testing as well, for which controlled experimentation is an indispensable research methodology [26]. In FOSE2000, Harrold identified in this regard the following needs: controlled experiments to demonstrate techniques; collecting and making publicly available sets of experimental subjects; and industrial experimentation. All such needs can be confirmed today, and a more recent review of testing technique experiments [48] sadly concluded that "over half of the existing (testing technique) knowledge is based on impressions and perceptions and, therefore, devoid of any formal foundation". Indeed, by experimenting we should aim at producing an empirical body of knowledge which is the basis for building and evolving the theory of testing. We need to examine factors that can be used to estimate early where faults reside and why, so that test resources can be properly allocated. And to do this we need meaningful experiments, in terms of scale, of the subjects used, and of context, which is not always realistic. Trivially, for all three aspects, the barrier is cost: careful empirical studies on large-scale products, within real-world contexts (such as [66]), and possibly replicated by several professional testers so as to attain generally valid results, are of course prohibitively expensive. A possible way to overcome this barrier could be to join the forces of several research groups and carry out distributed, widely replicated experiments. Roughly, the idea would be to launch a sort of "Open Experiments" initiative, similar to how several Open Source projects have been successfully conducted. Awareness of the need to unite forces is spreading, and some efforts are already being made toward building shared data repositories, as in [26], or distributed experimental testbeds, such as the PlanetLab [68] collection of more than 700 machines connected around the world. This is a fundamental challenge which spans all six characterizing questions.

5.2. Dream: Test-based modeling

A great deal of research focuses nowadays on model-based testing, which we discuss below.
The leading idea is to use models defined during software construction to drive the testing process, in particular to automatically generate the test cases. The pragmatic approach that testing research takes is to follow the current trend in modeling: whatever the notation used, say UML or Z, we try to adapt a testing technique to it as effectively as possible. But if we are allowed to consider the dream, from the tester's viewpoint the ideal situation would be to reverse this approach with respect to what comes first and what comes after: instead of taking a model and seeing how we can best exploit it for testing, let us consider how we should ideally build the model so that the software can be effectively tested. Wouldn't it be nice if developers -fully aware of the importance and difficulty of thorough model-based testing- cared in advance about testing and derived appropriate models already enhanced with information instrumental for testing? This is the motivation why we are here reversing the current view of "model-based testing" towards the dream of "test-based modeling". Admittedly, this is just a new term for an old idea, as we can already find several research directions that have, more or less explicitly, been working toward this dream. On one side, this notion of test-based modeling is closely related to, actually a factor of, the old idea of "design for testability", which is primarily concerned with designing software so as to enhance controllability (of inputs) and observability (of outputs). Also related are earlier approaches to testing based on assertions, and more recent ones based on contracts. Assertions in particular were recognized early on as a useful tool to enhance testing, since they can verify at runtime the internal state of a program. Descending from assertions, contracts were originally introduced at the level of classes for OO software, and have then been adopted for components: intuitively, a contract establishes a "legal" agreement between two interacting parties, which is expressed by means of three different types of assertions: pre-conditions, post-conditions and invariants. The step to using such contracts as a reference for testing is short, and much interesting research is going on with promising results, e.g., [20, 52].

Challenge: Model-based testing

The trends often cited in this paper of rising levels of complexity and the need for high quality are driving the cost of testing higher, to the point where traditional testing practices become uneconomic; fortunately, at the other end, the increasing use of models in software development holds the promise of removing the main barrier to the adoption of model-based testing, which is the need for (formal) modeling skills. Model-based testing is actually a sort of Back to the Future movie for software testing. Indeed, the idea of model-based testing has been around for decades (Moore [62] started the research on FSM-based test generation in 1956!), but it is in the last few years that we have seen a groundswell of interest in applying it to real applications (for an introduction to the different approaches and tools in model-based testing see, e.g., [80]). Nonetheless, industrial adoption of model-based testing remains low, and the signals of the research-anticipated breakthrough are weak. Therefore, beyond theoretical challenges, researchers are today focusing on how to beat the barriers to wide adoption. There are important technical and process-related issues pending.
A widely recognized issue is how to combine different styles of modeling (such as transition-based, pre/post condition-based and scenario-based). For instance, we need to find effective ways to compose state-based and scenario-based approaches [9, 38]. At Microsoft, where model-based testing has been championed for several years now, but with limited follow-up, a multi-paradigmatic approach [38] is now pursued to favor wider adoption. The idea is that models stemming from different paradigms and expressed in any notation can be seamlessly used within one integrated environment. The lesson learned is in fact that forcing users to use a new notation does not work; instead, the core of a model-based testing approach should be agnostic and let developers use existing programming notations and environments [38]. We also need ways to combine model-based criteria with other approaches; for instance, a promising idea is to use testing over simulations [72] to optimize the test suite and to boost testing. Process-related issues concern the need to integrate model-based testing practice into current software processes: perhaps the crucial issues here are two related test-management needs: making test models as abstract as possible, while still retaining the ability to generate executable tests, on one side; and keeping traceability from requirements to tests all along the development process, on the other. We also finally need industrial-strength tools for authoring and interactive modeling, which can help compensate for the inadequate education of current testers (or maybe the excessive expertise requirements of the proposed techniques). A special case of model-based testing is conformance testing, i.e., checking whether the system under test complies with its specification, under some defined relation (which is strictly related to the test hypotheses previously discussed). Starting from the 70's, many algorithms have been proposed; a recent extensive overview of current open challenges is given in Broy and coauthors' tutorial on model-based testing for reactive systems [21]. The results achieved so far are impressive on theoretical grounds, but many of the proposed methods are hardly applicable to realistic systems, even though several tools have been produced and some of these are applied in specialized domains. A good overview of tools for model-based conformance testing built on a sound theory is provided by Belinfante and coauthors [4], who highlight the need to improve and ease the application of the theory. This challenge refers to HOW we select which test executions to observe, and partly to HOW MUCH of them.

Challenge: Anti-model-based testing

Parallel to model-based testing, several efforts are being devoted to novel forms of testing which rely directly on the analysis of program executions, rather than on an a priori model. Instead of taking a model, deriving from it a test plan, and then comparing the test results back to the model, these other approaches collect information from executing the program, either by actively soliciting some executions or passively during operation, and try to synthesize from it some relevant properties of the data or of the behaviour.
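As a toy illustration of this synthesis step, in the spirit of dynamic invariant detection but greatly simplified and with entirely hypothetical names, the sketch below observes the values reached at a program point over several executions and retains only the candidate properties that no observed execution violates.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified, hypothetical sketch of "anti-model-based" property synthesis:
    // observe variable values over many executions and keep only the candidate
    // invariants that no observed execution violates.
    class LikelyInvariantDetector {
        interface Candidate {
            String describe();
            boolean holds(int x, int y);
        }

        private final List<Candidate> surviving = new ArrayList<>();

        LikelyInvariantDetector() {
            // A tiny fixed pool of candidate properties over two observed variables.
            surviving.add(new Candidate() {
                public String describe() { return "x >= 0"; }
                public boolean holds(int x, int y) { return x >= 0; }
            });
            surviving.add(new Candidate() {
                public String describe() { return "y > x"; }
                public boolean holds(int x, int y) { return y > x; }
            });
            surviving.add(new Candidate() {
                public String describe() { return "x == y"; }
                public boolean holds(int x, int y) { return x == y; }
            });
        }

        // Called at the monitored program point for every observed execution.
        void observe(int x, int y) {
            surviving.removeIf(c -> !c.holds(x, y));
        }

        List<String> likelyInvariants() {
            List<String> out = new ArrayList<>();
            for (Candidate c : surviving) out.add(c.describe());
            return out;
        }

        public static void main(String[] args) {
            LikelyInvariantDetector d = new LikelyInvariantDetector();
            // Values as they might be traced from a few test runs.
            d.observe(0, 3);
            d.observe(2, 5);
            d.observe(7, 8);
            // Prints [x >= 0, y > x]: the candidates that survived all observations
            // form the inferred (likely, not proven) behavioural model.
            System.out.println(d.likelyInvariants());
        }
    }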
There can be cases in which the models simply do not exist or are not accessible, such as for COTS or legacy components; other cases in which the global system architecture is not decided a priori, but is created and evolves dynamically along the life of a system; or a model is originally created, but during development it becomes progressively less useful since its correspondence with the implementation is not enforced and is lost. Hence, symmetrically to model-based testing, we have that (explicitly or implicitly) a model is derived a posteriori via testing, which we refer to as anti-model-based testing, as anticipated in [11]. By this term we refer to all the various approaches that, by means of testing, reverse-engineer a model -in the form of an invariant over the program variables, of a state machine, of a Labelled Transition System, of a Sequence diagram, and so on- and then check such a model to detect whether the program behaves appropriately. Anti-model-based testing can rely on the great advances of dynamic program analysis, which is a very active research discipline today, as discussed by Canfora and Di Penta [22]. We need to be able to infer system properties by reasoning on a limited set of observed traces, or even partial traces, since we might be observing only some of the components that form the system. In a recent work, Mariani and Pezzè [59] propose the BCT technique to derive behaviour models for monitored COTS components. In their approach the behavioural models consist of both I/O models, obtained by means of the well-known Daikon dynamic invariant detector [30], and interaction models, in the form of Finite State Automata. Interestingly, these derived models can afterwards be used for model-based testing if and when the components are replaced by new ones. A related challenge is to keep the dynamically derived models up to date: depending on the type of upgrade to the system, the model may also need to be refined, as Mariani and Pezzè also observe, outlining some possible strategies. This challenge too refers to HOW and HOW MUCH we observe of the test executions.

Challenge: Test oracles

Strictly related to test planning, and specifically to the problem of how to derive the test cases, is the issue of deciding whether a test outcome is acceptable or not. This corresponds to the so-called "oracle", ideally a magical method that provides the expected outputs for each given test case; more realistically, an engine/heuristic that can emit a pass/fail verdict over the observed test outputs. Although it is obvious that a test execution for which we are not able to discriminate between success and failure is a useless test, and although the criticality of this problem was raised very early in the literature [85], the oracle problem has received little attention from research, and in practice few alternatives to human eyeballing still exist. But this state of affairs, which is already unsatisfactory today, is destined, with the increasing complexity and criticality of software applications, to become a blocking obstacle to reliable test automation (in fact, the test oracle challenge also overlaps the route toward test automation). Indeed, the precision and efficiency of oracles greatly affects testing cost/effectiveness: we do not want test failures to pass undetected, but on the other hand we do not want to be notified of many false positives either, which waste important resources.
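To fix ideas, an automated oracle can be as simple as a programmed predicate over the observed output. The hedged sketch below (hypothetical names, a sorting routine as the tested functionality) contrasts a precise expected-output oracle with a partial, contract-like oracle that only checks a postcondition and may therefore let some failures pass undetected.

    // Hypothetical sketch: two styles of automated oracle for a sort routine.
    import java.util.Arrays;

    class OracleSketch {
        // Verdict emitted by an oracle over one observed test outcome.
        enum Verdict { PASS, FAIL }

        // Precise oracle: compares against an independently computed expected output.
        static Verdict expectedOutputOracle(int[] input, int[] observedOutput) {
            int[] expected = input.clone();
            Arrays.sort(expected); // reference computation playing the "magical oracle" role
            return Arrays.equals(expected, observedOutput) ? Verdict.PASS : Verdict.FAIL;
        }

        // Partial, contract-like oracle: only checks a postcondition (output is ordered
        // and has the same length). Cheaper and reusable, but it would not notice,
        // for example, an output whose elements were silently replaced.
        static Verdict postconditionOracle(int[] input, int[] observedOutput) {
            if (observedOutput.length != input.length) return Verdict.FAIL;
            for (int i = 1; i < observedOutput.length; i++) {
                if (observedOutput[i - 1] > observedOutput[i]) return Verdict.FAIL;
            }
            return Verdict.PASS;
        }

        public static void main(String[] args) {
            int[] input = {3, 1, 2};
            int[] buggyOutput = {1, 1, 2}; // plausible faulty result: sorted, but wrong content
            System.out.println(expectedOutputOracle(input, buggyOutput));  // FAIL
            System.out.println(postconditionOracle(input, buggyOutput));   // PASS: a missed failure
        }
    }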
We need to find more efficient methods for realizing and automating oracles, modulo the information which is available. A critical survey of oracle solutions is provided by Baresi and Young [1], who conclude by highlighting areas where research progress is expected, which we borrow and expand below.

Concrete vs. abstract state and behavior: model-based testing promises to alleviate the oracle problem, since the same model can act as the oracle; however, for oracles based on abstract descriptions of program behavior, the problem remains of bridging the gap between the concrete observed entities and the abstract specified entities.

Partiality: plausibly, partial oracles are the only viable solution to oracle automation; the challenge is to find the best tradeoff between precision and cost.

Quantification: for test oracles implemented via executable specification languages, a compromise between expressiveness and efficiency must be sought; so far there is no clear optimum balance, nor any fully satisfactory approach to accommodating quantifiers.

Oracles and test case selection: ideally, oracles should be orthogonal to test case selection; however, in model-based testing the available models are often used to derive test classes and test-class-specific test oracles together.

This challenge refers to the WHY question, in the sense of what we test against.

5.3. Dream: 100% automatic testing

Far-reaching automation is one of the ways to keep quality analysis and testing in line with the growing quantity and complexity of software. Software engineering research puts great emphasis on automating the production of software, with a bulk of modern development tools generating ever larger and more complex quantities of code with less effort. The other side of the coin is the great danger that the methods to assess the quality of the software so produced, in particular testing methods, cannot keep pace with such software construction methods. A large part of current testing research aims at improving the degree of attainable automation, either by developing advanced techniques for generating the test inputs (this challenge is expanded below), or, beyond test generation, by finding innovative support procedures to automate the testing process. The dream would be a powerful integrated test environment which, by itself, as a piece of software is completed and deployed, can automatically take care of possibly instrumenting it and generating or recovering the needed scaffolding code (drivers, stubs, simulators), generating the most suitable test cases, executing them and finally issuing a test report. This idea, although chimeric, has attracted followers, for instance in the early DARPA-sponsored initiative on Perpetual Testing (also mentioned in FOSE2000) and, more recently, in Saff and Ernst's Continuous Testing approach [74], which aims precisely at running tests in the background on the developer's machine while they program. Quite promising steps have recently been made towards the realization of this dream for unit testing, which is widely recognized as an essential phase to ensure software quality, because by scrutinizing individual units in isolation it can detect early even subtle and deeply hidden faults which would hardly be found in system testing. Unfortunately, unit testing is often poorly performed or skipped altogether because it is quite expensive. We need approaches to make it feasible within industrial development processes.
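As a reminder of what needs to be automated, here is a minimal hand-written unit test in JUnit 4 style (hypothetical PriceCalculator and CurrencyService names): even this tiny test requires scaffolding, in the form of a stub standing in for the unit's environment, and an explicit functional check of the output.

    // Minimal JUnit 4-style unit test for a hypothetical PriceCalculator unit that
    // depends on an external CurrencyService: even this tiny test needs scaffolding
    // (a hand-written stub) and an explicit functional check of the output.
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    interface CurrencyService {                 // collaborator of the unit under test
        double rate(String from, String to);
    }

    class PriceCalculator {                     // the unit under test
        private final CurrencyService service;
        PriceCalculator(CurrencyService service) { this.service = service; }
        double convert(double amount, String from, String to) {
            return amount * service.rate(from, to);
        }
    }

    public class PriceCalculatorTest {
        // Stub standing in for the real environment: this is the extra coding
        // (drivers, stubs) whose cost the surrounding text refers to.
        private final CurrencyService stub = (from, to) -> 2.0;

        @Test
        public void convertsUsingTheStubbedRate() {
            PriceCalculator calculator = new PriceCalculator(stub);
            assertEquals(20.0, calculator.convert(10.0, "EUR", "USD"), 1e-9);
        }
    }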
A major component of the high cost of unit testing is the huge quantity of extra coding necessary for simulating the environment where the unit will run, and for performing the needed functional checking of the unit outputs. To alleviate such tasks, the frameworks belonging to the XUnit family have been highly successful among developers. Among these, the most successful is JUnit [47], which makes it possible to automate the coding and management of Java test cases, and has favored the spread of the already mentioned test-driven development. However, such frameworks do not help with test generation and environment simulation. We would like to push automation further, as for example in the Directed Automated Random Testing (DART) approach [35], which fully automates unit testing by: automated interface extraction through static source-code analysis; automated generation of a random test driver for this interface; and dynamic analysis of program behaviour during execution of the random test cases, aimed at automatically generating new test inputs that can direct the execution along alternative program paths. Another example is provided by the notion of "software agitation" [17], an automatic unit test technique supported by the Agitator commercial tool, which combines different analyses, such as symbolic execution, constraint solving, and directed random input generation for generating the input data, together with the already cited Daikon system [30]. Yet another approach is constituted by Microsoft Parameterized Unit Tests (PUT) [78], i.e., coded unit tests that are not fixed (as happens for those programmed in XUnit frameworks), but depend on some input parameters. PUTs can describe abstract behavior in a concise way by using symbolic execution techniques, and by constraint solving can find inputs for PUTs that achieve high code coverage. The three cited examples are certainly not exhaustive of the quite active and fruitful stage that test automation is currently enjoying. The common underlying trend that emerges is the effort to combine and efficiently engineer advances coming from different types of analysis, and this, together with the exponential increase of available computational resources, could be the really winning direction towards the 100% automation dream.

Challenge: Test input generation

Research in the automatic generation of test inputs has always been very active, and currently so many advanced technologies are under investigation that even devoting the whole paper just to this topic would not yield sufficient space to cover it adequately. What is dismaying is that until now all this effort has produced limited impact in industry, where the test generation activity remains largely manual (as reported, for instance, at the ISSTA 2002 panel [7]). But finally the combination of theoretical progress in the underlying technologies, such as symbolic execution, model checking, theorem proving, static and dynamic analyses, with technological advances in modeling industry-strength standards and with the available computational power seems to bring this objective closer and has revitalized researchers' faith. The most promising results are expected to come from three directions, and especially from their mutual convergence: the already widely discussed model-based approach, "clever" ways to apply random generation, and a rich variety of search-based techniques used for both white-box and black-box test generation.
Concerning model-based test generation, research clearly places great expectations in this direction, since the emphasis on using (formal) models to guide testing resides exactly in the potential for automated derivation of test cases. References to ongoing work have already been provided in speaking of model-based testing challenges. Many of the existing tools are state-based and do not deal with input data. Research is needed to understand how we can incorporate data models within more traditional state-based approaches; one direction could be the introduction of symbolism over the state-based models, which could avoid the state-space explosion during test generation, and would preserve the information present in data definitions and constraints for use during the test selection process. For example, such an approach is being realized by the Symbolic Transition Systems approach [33], which augments transition systems with an explicit notion of data and data-dependent control flow. We also need to increase the efficiency and potential of automated test generation by reusing within model-based approaches the latest advances in theorem proving, model checking and constraint satisfaction techniques. In particular, dating back to the mid 70's, symbolic execution might be considered the most traditional approach to automated test data generation. Such an approach, which had been put aside for some time because of the many underlying difficulties, is today revitalized by the resurgence of strongly typed languages and by the development of more powerful constraint solvers; Lee and coauthors [53] survey the most promising developments. Concerning random test generation, this used to be considered a shallow approach with respect to systematic techniques, which were deemed more comprehensive and capable of finding important corner cases that would likely be overlooked by random techniques. However, previous studies mainly compared strawman implementations of random testing to sophisticated implementations of systematic techniques. Today, several researchers are proposing clever implementations of random testing that appear to outperform systematic test generation, if nothing else in terms of feasibility. The underlying idea of such approaches is that the random generation is improved dynamically, by exploiting feedback information collected as the tests are executed. For instance, Sen and coauthors have built on top of the cited DART approach a notion of "concolic testing" [75], which is the combination of concrete (random) testing with symbolic execution. The concrete and symbolic executions are run in parallel and "help" each other. Instead, Pacheco and coauthors [67] randomly select a test case, execute it and check it against a set of contracts and filters. The most promising direction, then, is to figure out efficient ways to combine the respective strengths of systematic (model-based) and random testing. Finally, concerning search-based test generation, this consists of exploring the space of solutions (the sought test cases) for a selected test criterion, by using metaheuristic techniques that direct the search towards the potentially most promising areas of the input space. The attractive feature is that this approach appears to be fruitfully applicable to an unlimited range of problems; a recent survey is provided by McMinn [60]. Search-based test data generation is just one possible application of search-based software engineering [42]. This challenge addresses HOW the observations are generated.
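As a much-simplified, hypothetical sketch of the feedback-directed flavour of random generation described above (in the spirit of such approaches, not a faithful rendition of any published tool), the code below draws random inputs for a small unit, filters them through a contract-like precondition, and keeps only those that exercise a not-yet-covered branch, using that feedback to decide when to stop.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    // Hypothetical sketch of feedback-directed random test input generation:
    // random candidates are kept only if they satisfy a simple contract (filter)
    // and exercise a branch of the unit under test not covered so far.
    class FeedbackDirectedGeneration {

        // Unit under test (hypothetical): classifies a value and reports which
        // branch it took, playing the role of coverage feedback.
        static String classify(int x) {
            if (x < 0) return "negative";
            if (x == 0) return "zero";
            if (x % 2 == 0) return "even";
            return "odd";
        }

        // Contract-like filter: suppose the unit is only specified for |x| <= 1000.
        static boolean satisfiesContract(int x) {
            return Math.abs(x) <= 1000;
        }

        public static void main(String[] args) {
            Random random = new Random(42);
            Set<String> coveredBranches = new HashSet<>();
            List<Integer> keptInputs = new ArrayList<>();

            // Generate until all four known branches are covered (or the budget runs out).
            for (int attempt = 0; attempt < 10_000 && coveredBranches.size() < 4; attempt++) {
                int candidate = random.nextInt(4001) - 2000;   // raw random input
                if (!satisfiesContract(candidate)) continue;   // filtered out, no feedback
                String branch = classify(candidate);           // execute and observe
                if (coveredBranches.add(branch)) {             // feedback: new behaviour seen
                    keptInputs.add(candidate);                 // keep it as a test input
                }
            }
            System.out.println("selected test inputs: " + keptInputs);
            System.out.println("covered branches: " + coveredBranches);
        }
    }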
Challenge: Domain-specific test approaches

Domain-specific languages are emerging today as an efficient solution for allowing experts within a domain to express abstract specifications closer to their needs, which can then be automatically translated into optimized implementations. Testing as well can benefit from restricting the scope of application to the needs of a specific domain. Research should address how domain knowledge can improve the testing process. We need to extend domain-specific approaches to the testing stage, and in particular to find domain-specific methods and tools to push test automation. Domain-specific testing could use specialized kinds of approaches, processes and tools. These in turn need to make use of customizable modeling and transformation tools; hence this challenge also overlaps the test-based modeling route. Test techniques for specific domains have been investigated, for instance for databases, GUI usability, web applications, avionics, and telecommunication systems; but few works exist whose very focus is the development of methodologies for exploiting domain knowledge. One interesting pioneering work is due to Reyes and Richardson [71], who early on developed a framework, called Siddhartha, for developing domain-specific test drivers. Siddhartha implemented an example-driven, iterative method for developing domain-specific translators from the test specifications to a domain-specific driver. It however required tester input in the form of a general, manually coded example driver. More recently, Sinha and Smidts have developed the HOTTest technique [76], which relies on a strongly typed domain-specific language to model the system under test and demonstrates how this makes it possible to automatically extract and embed domain-specific requirements into the test models. I believe such research efforts show promising results in demonstrating the efficiency improvements of domain-specific test approaches, and hopefully further research will follow. This challenge refers to the kind of application being observed, i.e., the WHAT question.

Challenge: On-line testing

In parallel with the traditional view of testing as an activity carried out before release to check whether a program will behave as expected, a new concept of testing is emerging around the idea of monitoring a system's behaviour in real-life operation, using dynamic analysis and self-test techniques. Actually, runtime monitoring has been in use for over 30 years, but a renewed interest arises because of the increasing complexity and ubiquitous nature of software systems. Terminology is not uniform, and different terms such as monitoring, runtime testing and on-line testing are used in the literature (Delgado and coauthors [24] present a recent taxonomy). All approaches share the goal of observing the software behavior in the field, with the aim of determining whether it complies with its intended behavior, detecting malfunctions or also performance problems. In some cases an on-line recovery is attempted; in other cases the analysis is conducted off-line in order to produce a profile or to obtain reliability figures. One distinguishing feature of on-line testing is that we do not need to devise a test suite to stimulate the system under test, since we limit ourselves to passively observing what happens. In fact, in communication protocol testing, monitoring approaches are called passive testing.
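A minimal, hedged sketch of this passive style of checking is given below (a toy connection protocol with hypothetical states and events, not any specific standard): observed events are replayed against a small specification automaton, and any event the automaton does not allow in its current state is flagged as a violation.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal, hypothetical passive-testing sketch: a trace of observed protocol
    // events is checked against a small specification automaton. No stimuli are
    // injected; the checker only observes and emits verdicts.
    class PassiveTraceChecker {
        // Transition table of the specification: state -> (event -> next state).
        private final Map<String, Map<String, String>> spec = new HashMap<>();
        private String state = "IDLE";

        void allow(String from, String event, String to) {
            spec.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
        }

        // Returns true if the observed event conforms to the specification.
        boolean observe(String event) {
            Map<String, String> out = spec.getOrDefault(state, Map.of());
            String next = out.get(event);
            if (next == null) {
                System.out.println("violation: event '" + event + "' not allowed in state " + state);
                return false;
            }
            state = next;
            return true;
        }

        public static void main(String[] args) {
            PassiveTraceChecker checker = new PassiveTraceChecker();
            // Toy connection protocol: connect, then any number of sends, then disconnect.
            checker.allow("IDLE", "connect", "OPEN");
            checker.allow("OPEN", "send", "OPEN");
            checker.allow("OPEN", "disconnect", "IDLE");

            // Trace as it might be captured on the wire; the final 'send' after
            // 'disconnect' does not conform and is reported.
            for (String event : List.of("connect", "send", "disconnect", "send")) {
                checker.observe(event);
            }
        }
    }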
Message exchanges along the real channels are traced, and the observed patterns are compared to the specified ones. In principle, the inherent "passivity" of on-line testing approaches makes them less powerful than proactive approaches. All approaches can be reduced to verifying the observed execution traces against assertions which express desired properties, or against specification invariants. For instance, Bayse and coauthors [2] have developed a tool that supports passive testing against invariants derived from FSMs; they distinguish between simple invariants and obligation invariants. More generally, the effectiveness of on-line testing will depend on the identification of the reference assertions. Moreover, the collection of traces could degrade system performance: we need to understand which are the good places and the right timing for probing the system. Midway between classical testing before deployment and passive monitoring in the field, we could also conceive proactive testing in the field, i.e., actively stimulating the application after deployment, either when certain events happen, for instance when a component is substituted, or at scheduled intervals. A similar idea is exploited in the so-called "audition" framework [10], which proposes an admission testing stage for web services. Another issue concerns the ability to carry on testing in the field, especially for embedded applications that must be deployed in a resource-constrained environment, where the overhead required by testing instrumentation might not be affordable. An interesting new research direction has been taken by Kapfhammer et al. [49], who are developing the Juggernaut tool for testing Java applications within a constrained environment. Their original idea is to exploit execution information, so far used to tune the test suite, also for adapting the test environment (in particular, they use adaptive code unloading to reduce memory requirements). Such an idea is attractive and can certainly find many other useful applications. This challenge concerns mainly WHERE and WHEN to observe the test executions, with particular attention to dynamically evolving systems.
5.4. Dream: Efficacy-maximized test engineering
The ultimate goal of software testing research, today as it was in FOSE2000, remains that of cost-effectively engineering "practical testing methods, tools and processes for development of high quality software" [43]. All the theoretical, technical and organizational issues surveyed so far should be reconciled into a viable test process yielding the maximum efficiency and effectiveness (both summarized by the term efficacy). Besides, the inherent technicalities and sophistication of the advanced solutions proposed by researchers should be hidden behind easy-to-use integrated environments. This vision constitutes such a challenging endeavor that we qualify it as the highest, ultimate dream of software testing research. The main obstacle to this dream, which undermines all the research challenges mentioned so far, is the growing complexity of modern systems. This complexity growth affects not only the systems themselves, but also the environments in which they are deployed, which are strongly characterized by variability and dynamicity. Strategies to align the development process so as to maximize testing effectiveness belong to design for testability. We have already mentioned testability when speaking of models and precode artifacts that can be enhanced so as to facilitate testing.
However, testability is a broader concept than just how the system is modelled: it also involves characteristics of the implementation, as well as of the test technique itself and of its support environment. Indeed, design for testability has been indicated by practitioners as the primary cost driver in testing [5]. Efficacy-maximized test engineering passes through many challenges, some of which are discussed below.
Challenge: Controlling evolution
Most testing activities carried on in industry involve retesting already tested code, to ascertain that changes either in the program or in its context did not adversely affect system correctness. As pointed out in FOSE2000, because of the high cost of regression testing, we need effective techniques to reduce the amount of retesting, to prioritize regression test cases and to automate the re-execution of the test cases. In general, we need strategies to scale up regression testing to large composite systems. We have already discussed the theoretical issues behind compositional testing (see Section 5.1); we also need practical approaches to regression testing of global system properties as some system parts are modified. For instance, given a component which is originally designed with an architecture, we need to understand how to test whether this piece of software evolves in line with its architecture. Such a problem is also central to the testing of product families. A related idea is test factoring, which consists in converting a long-running system test into a collection of many small unit tests. These unit tests exercise a small part of the system in exactly the same way the system test did, but being more focused they can pinpoint failures in specific selected parts of the system. Test factoring is today actively investigated [73, 65, 28], since it promises order-of-magnitude improvements in the execution of regression tests. A common basic assumption of many existing approaches to regression testing is that some software artifact can be taken as a reference, for instance the requirements specification or the software architecture. For modern software applications, which continuously evolve, often dynamically and in ways we cannot know in advance, neither such precode artifacts nor even the source code itself might be available, and the testing paradigm is shifting from regression testing towards one of continuous testing at runtime. Hence, a crucial issue concerns how to maintain control of the quality of software which evolves dynamically in the field. We need to understand what regression testing means in such an evolutive context, and how we can modify and extend the basic idea of selective regression testing: how often do we need to check the execution traces? How can we compare the traces taken in different temporal intervals and understand whether the evolution introduced any malfunctions? This challenge concerns broadly the WHAT, WHERE and WHEN of replaying some executions following software evolution.
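The core intuition behind selective regression testing can be captured in a few lines. The sketch below (illustrative only, with hypothetical coverage data) keeps a map from each test case to the code entities it covers and, after a change, re-runs only the tests whose coverage intersects the modified entities, prioritizing those that touch more of them.

from typing import Dict, List, Set

def select_and_prioritize(
    coverage: Dict[str, Set[str]],   # test name -> entities (files, methods, ...) it covers
    changed: Set[str],               # entities modified since the last test run
) -> List[str]:
    """Select the tests affected by the change and order them by impact."""
    affected = {t: cov & changed for t, cov in coverage.items() if cov & changed}
    return sorted(affected, key=lambda t: len(affected[t]), reverse=True)

# Hypothetical coverage data collected during a previous run.
coverage = {
    "test_login":    {"auth.py", "session.py"},
    "test_checkout": {"cart.py", "payment.py"},
    "test_profile":  {"auth.py", "profile.py"},
}
print(select_and_prioritize(coverage, changed={"auth.py"}))
# -> ['test_login', 'test_profile']; test_checkout is skipped

A real selective technique must of course also argue the safety of the selection and cope with the fact that the change itself may invalidate the recorded coverage data.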
Challenge: Leveraging user population and resources
We have already mentioned the emerging trend of continuous validation after deployment, by means of on-line testing approaches (see Section 5.3), when traditional off-line testing techniques become ineffective. Since software-intensive systems can behave very differently in varying environments and configurations, we need practical ways to scale up on-line testing to cover the broad spectrum of possible behaviors. One rising approach to address this challenge is to augment in-house quality assurance activities with data dynamically collected from the field. This is promising in that it can help to reveal the real usage spectra and to expose real problems on which to focus the testing activities, as well as areas where testing is lacking. For instance, by giving each user a different default configuration, the user base can be leveraged to expose configuration conflicts or problems more quickly, as in [87]. Fielded profiles can also be used for improving a given test suite, as in [56, 64, 29]. Although some commercial initiatives are starting to appear, such as Microsoft's Customer Experience Improvement Program [61], these efforts are still in their infancy, and one important research challenge left open is how to define efficient and effective techniques to unleash the potential represented by a large number of users, running similar applications on interconnected machines. This high-level challenge involves several more specific challenges, among which:
- How can we collect runtime data from programs running in the field without imposing too much overhead?
- How can we store and mine the collected (potentially huge) amount of raw data so as to effectively extract the relevant information?
- How can we effectively use the collected data to augment and improve in-house testing and maintenance activities?
This challenge proposes that the users instantiate the WHERE and WHEN to scrutinize software runs.
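A common answer to the overhead question is sampling: each deployed instance reports only a small, randomly chosen fraction of the events it observes, so that the data aggregated across the whole user population still approximates the real usage spectrum. The sketch below is a deliberately simplified illustration of this idea (hypothetical event names, no real telemetry infrastructure), in the spirit of the sampling-based approaches cited above.

import random
from collections import Counter

class SampledFieldProfiler:
    """Records roughly one event in every 1/rate to keep the in-field overhead low."""
    def __init__(self, sampling_rate: float = 0.01):
        self.sampling_rate = sampling_rate
        self.profile = Counter()

    def record(self, event: str) -> None:
        # Cheap common path: most events are dropped without any bookkeeping.
        if random.random() < self.sampling_rate:
            self.profile[event] += 1

    def report(self) -> dict:
        """The payload a deployed instance would periodically send back in-house."""
        return dict(self.profile)

# Simulate one deployed instance that mostly exercises the 'browse' feature.
profiler = SampledFieldProfiler(sampling_rate=0.05)
for _ in range(10_000):
    profiler.record(random.choices(["browse", "search", "checkout"], [0.7, 0.2, 0.1])[0])
print(profiler.report())   # aggregated in-house, such profiles drive test suite augmentation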
Challenge: Testing patterns
We have already mentioned, under the dream of a useful test theory, that we need to understand the relative effectiveness of test techniques with respect to the types of faults they target. To engineer the test process, we need to collect evidence of such information, so as to be able to find the most effective pattern for testing a system. This is routinely done when, for instance, functional testing based on the requirements is combined with measures of code coverage adequacy. Another recurring recommendation is to combine operational testing with specific verification of special-case inputs. However, such practices need to be backed up by a systematic effort to extract and organize recurring, proven effective solutions to testing problems into a catalogue of test patterns, similarly to what is now a well-established scheme for design approaches. Patterns offer well-proven solutions to recurring problems; in other words, they make explicit and document problem-solving expertise. As testing is recognized as expensive and effort-prone, making successful procedures explicit is highly desirable. A related effort is Vegas and coauthors' characterization schema of how test techniques are selected [82]. They surveyed the type of knowledge that practitioners use to choose the testing techniques for a software project and have produced a formalized list of the relevant parameters. However, organizations that use the proposed schema might not have all the required information at their disposal; hence, more recently, they have also been investigating the sources of such information and how these sources are to be used [82]. Similar studies are needed to formalize and document successful practices for any other testing-related activity. In fact, this challenge spans all six questions.
Challenge: Understanding the costs of testing
Since testing does not take place in the abstract, but within the concrete world, with its risks and its safety as well as economic constraints, ultimately we want to be able to link the testing process and techniques with their cost. Each and every research article on software testing starts by claiming that testing is a very expensive activity, yet we lack up-to-date and reliable references; it is somewhat dismaying that still today references quantifying the high cost of testing cite textbooks dating back more than twenty years. This might admittedly be due to the sensitivity of failure data, which are company confidential. Nonetheless, to usefully transfer research advances to practice we need to be able to quantify the direct and indirect costs of software testing techniques. Unfortunately, most research in software testing takes a value-neutral position, as if every fault found were equally important or had the same cost, which is of course not true; we need ways to incorporate economic value into the testing process, to help test managers apply their judgement and select the most appropriate approaches. Boehm and colleagues have introduced the value-based software engineering (VBSE) paradigm [12], in which quantitative frameworks are sought to support software managers' decisions so as to enhance the value of the delivered software systems. In particular, various aspects of software quality assurance have been investigated, including value-based and risk-based testing, e.g., [13]. VBSE mainly concerns the management of processes; for instance, with respect to testing, different types of stakeholder utility functions are considered to trade off time of delivery against market value. We would also need to be able to incorporate estimation functions of the cost/effectiveness ratio of the available test techniques. The key question is: given a fixed testing budget, how should it be employed most effectively? This challenge clearly addresses mainly the HOW and HOW MUCH of testing.
Challenge: Education of software testers
Finally, for software testing as for any other software engineering activity, a crucial resource remains the human factor. Beyond the availability of advanced techniques and tools and of effective processes, the testers' skill, commitment and motivation can make the difference between a successful test process and an ineffective one. Research, on its side, should strive to produce engineered, effective solutions that are easily integrated into development and do not require deep technical expertise. But we also need to work in parallel on empowering the human potential. This is done by both education and motivation. Testers should be educated to understand the basic notions of testing and the limitations and possibilities offered by the available techniques. While it is research that can advance the state of the art, it is only through awareness and adoption of those results by the upcoming generation of testers that we can also advance the state of practice. Education must be continuing, to keep pace with the advances in testing technology. Education by itself poses several challenges, as discussed in [54]. It is evident that education must cover all the characterizing aspects of testing.
5.5. Transversal challenges
By transversal challenges we identify some research trends that cut across all four identified dreams. In particular, we discuss here two transversal challenges.
Challenge: Testing within the emerging development paradigm
The history of software engineering research is phased by the successive emergence of novel development paradigms, which promise to deliver higher quality and less costly software. Today the fashion is Service-oriented Computing, and many interesting challenges emerge for the testing of service-oriented applications. Several similarities exist with CB systems and, as in CB testing, services can be tested from different perspectives, depending on who is the involved stakeholder [23]. The service developer, who implements a service, the service provider, who deploys and makes it available, and the service integrator, who composes services possibly made available by others, access different kinds of information and have different testing needs. Except for the service developer, black-box test techniques need to be applied, because the design and implementation details of the services are not available. One peculiar aspect of services is that they are forced to make available a standard description in a computer-processable format, to enable search and discovery. Given that this is often the only information available for analysis, research is investigating how to exploit this compulsory specification for testing purposes. Currently, this description only includes the service interface in terms of the signatures of the provided methods (for instance the WSDL definition for Web Services). Clearly, method signatures provide poor expressiveness for testing purposes, and in fact researchers aim at enriching such descriptions to allow for more meaningful testing. Towards promoting interoperability, a first concern is to ensure that the services comply with established standardized protocols for message exchange. For instance, guidelines have been released by the WS-I (Web Services Interoperability) organization, along with testing tools to monitor and check that the exchanged messages comply with the guidelines. Such an approach is certainly necessary to assure interoperability, but not sufficient; in particular, it does not test dynamic behaviour and is not concerned with the verification of extra-functional properties. Conceptually, off-line and on-line testing of services can be distinguished. With regard to off-line testing, the two approaches that emerge are model-based and mutation-based. For model-based testing of services, the general idea is to assume that the service developers and/or integrators make available models suitable for the automatic derivation of test cases. We then need to adapt the wealth of existing approaches to model-based testing to the context and constraints of services, and several proposals are being made, e.g., [55, 32, 50]. Mutation strategies adapted to services instead foresee the mutation of the input messages, as presented in [23].
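As an illustration of message-level mutation (a simplified sketch with a hypothetical request format, not the specific operators of [23]), mutants of a valid service request can be generated by dropping fields, nullifying values and perturbing numeric boundaries; each mutant would then be sent to the service to check that it is rejected or handled gracefully.

import copy
from typing import Any, Dict, List

def mutate_request(request: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Generate mutants of a service input message (field omission, null, boundary value)."""
    mutants = []
    for field, value in request.items():
        dropped = copy.deepcopy(request)
        del dropped[field]                 # mutant: missing field
        mutants.append(dropped)

        nulled = copy.deepcopy(request)
        nulled[field] = None               # mutant: null value
        mutants.append(nulled)

        if isinstance(value, (int, float)):
            perturbed = copy.deepcopy(request)
            perturbed[field] = -1          # mutant: out-of-range boundary value
            mutants.append(perturbed)
    return mutants

# Hypothetical order request for a web service under test.
valid_request = {"customerId": 42, "item": "book", "quantity": 3}
for mutant in mutate_request(valid_request):
    print(mutant)   # each mutant would be wrapped in a SOAP/REST call and submitted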
On-line testing (discussed in Section 5.3) assumes a special importance for the testing of services, since monitoring the real-world execution is the only way to observe the application behaviour. A service-oriented application generally results from the integration of several services, controlled and owned by different organizations. As a result, the control of an application is distributed, and the services composing it discover each other at run time and can change without prior notice, so nobody can predict the receiver or the provider of a given service. Monitoring of services introduces many subtle problems, relative to performance degradation, the production of undesired side effects and cost. We need to understand, on one side, how we can observe the execution in a distributed network without (too) negatively affecting the system performance; on the other, we need means to reason at the abstract level of service composition and to understand what and when we need to check.
Challenge: Coherent testing of functional and extra-functional properties
By far the bulk of the software testing literature addresses functionality testing, i.e., checking that the observed behaviour complies with the logic of the specifications. But this is not enough to guarantee the real usefulness and fitness for purpose of the tested software: just as importantly, well-behaving software must fulfill extra-functional properties, depending on the specific application domain. Notably, while conventional functionality testing does not provide for any notion of time, many features of the exhibited behaviour of a piece of software can depend on when the results are produced, or on how long they take to be produced. Similarly, while functionality testing does not tackle resource usage and workloads, in specific domains, such as telecommunications, performance issues account for a major fault category [84]. We would like test approaches that can be applied at development time and provide feedback as early as possible. There is not much ongoing research that can pave the way, and the adopted approaches can be classified into model-based and genetic. Among the former, we need effective ways to enhance models with the desired extra-functional constraints. In this direction, researchers from Aalborg University [51] have long been investigating the extension of existing conformance testing theory to the timed setting, producing a tool that can generate test cases from Timed Automata and execute them while monitoring the produced traces. Model-based approaches are certainly an important instrument also for real-time embedded systems, but they will probably take a long course before being amenable to large-scale application, also in view of the many technical issues that need to be modelled, such as environment dependency, distribution and resource constraints. It is thus advisable to look in parallel for innovative approaches: for instance, Wegener and Grochtmann [83] have proposed the use of evolutionary algorithms. They reduce real-time testing to the optimization problem of finding the best-case and the worst-case values of the execution time. Such an idea could be extended to other extra-functional properties, by appropriately translating the constraint into an optimization problem.
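To give a flavour of the evolutionary formulation (an illustrative sketch only, with a toy function standing in for the real-time task and no claim of reproducing the setup of [83]), a simple mutation-based search can evolve inputs so as to maximize the measured execution time, which plays the role of the fitness function.

import random
import time

def system_under_test(x: int) -> None:
    """Toy stand-in for the task under test: its running time depends on the input."""
    for _ in range((x % 1000) * 500):
        pass

def measured_time(x: int) -> float:
    start = time.perf_counter()
    system_under_test(x)
    return time.perf_counter() - start      # fitness = observed execution time

def evolve_worst_case(generations: int = 50, population_size: int = 20) -> int:
    """Evolutionary search for an input that maximizes the execution time."""
    population = [random.randint(0, 10_000) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=measured_time, reverse=True)
        parents = ranked[: population_size // 2]                    # keep the slowest inputs
        offspring = [p + random.randint(-50, 50) for p in parents]  # mutate them
        population = parents + offspring
    return max(population, key=measured_time)

worst_input = evolve_worst_case()
print(worst_input, measured_time(worst_input))

In practice the measurements would be taken on the target platform and richer genetic operators would be used, but reducing a timing constraint to a fitness function is exactly the translation step hinted at above.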
6. Conclusions
We believe that software testing is a lively, difficult and richly articulated research discipline, and we hope that this paper has provided a useful overview of current and future challenges. Covering all ongoing and foreseen research directions in one article is impossible; we have privileged breadth over depth, and the contribution of this paper should rather be seen as an attempt to depict a comprehensive and extensible roadmap, in which any current and future research challenge for software testing can find its place. The picture which emerges must be taken as a work-in-progress fabric that the community may want to adapt and expand. It is obvious that the goals in this roadmap that have been set as the dreams are destined to remain so. However, in a research roadmap what really matters is not the label at the finish line, but the pathways along the traced routes. So, what is actually important for researchers to focus on in order to mark progress are the challenges, and the roadmap certainly provides plenty of them, some at a more mature stage, others just beginning to appear. What is assured is that software testing researchers do not risk running out of work. Software testing is and will continue to be a fundamental activity of software engineering: notwithstanding the revolutionary advances in the way software is built and employed (or perhaps exactly because of them), software will always need to be eventually tried and monitored. And, as extensively discussed in this paper, we will certainly need to make the process of testing more effective, predictable and effortless (which coincides with the ultimate one of the four testing dreams). Unfortunately, progress may be slowed down by the fragmentation of software testing researchers into several disjoint communities: for instance, different events have been established by these communities as the venues where they meet to discuss the latest results, such as the ACM International Symposium on Software Testing and Analysis (ISSTA) or the IFIP International Conference on the Testing of Communicating Systems (TESTCOM), just to cite a couple, with little overlap between PC members, participation, mutual knowledge and citations (which is a pity). In addition to the scientific challenges faced by testing research, which have been discussed in Section 5, we would therefore also like to raise a challenge which is opportunistic: the time has come for the different existing test research communities to finally converge and reconcile their respective achievements and efforts, since this would certainly be of the greatest benefit for advancing the state of the art. (Notably, the Marie Curie TAROT Network, http://www.int-evry.fr/tarot/, has among its goals that of joining researchers from the software testing and protocol testing communities.) A necessary concluding remark concerns the many fruitful relations between software testing and other research areas. By focusing on the specific problems of software testing, we have in fact overlooked many interesting opportunities arising at the border between testing and other disciplines. Some have just been touched upon in this paper, for instance model-checking techniques (see [27]), e.g., to drive model-based testing, or the use of search-based approaches (see [42]) for test input generation, or the application of test techniques to assess performance attributes (see [86]). We believe that a great many openings may arise from a more holistic approach to software testing research, and in [19] readers can certainly find and appreciate many new and interesting synergies spanning the research disciplines of software engineering.
7. Acknowledgements
Summarizing the quite broad and active field of software testing research has been a tough challenge. While I remain solely responsible for any imprecision or omission, there are many people to whom I am indebted. First, with the aim of being as comprehensive and unbiased as possible, I asked several colleagues to send me both a statement of what they considered the topmost outstanding challenge faced by software testing research, and a reference to relevant work (whether other authors' or their own) that this paper could not fail to cite. Out of the many I invited, I warmly thank for contributing: Ana Cavalli,
S.C. Cheung, Sebastian Elbaum, Mike Ernst, Mark Harman, Bruno Legeard, Alex Orso, Mauro Pezzè, Jan Tretmans, Mark Utting, Margus Veanes; their contributions have been edited and incorporated in the paper. I would also like to thank Lars Frantzen, Eda Marchetti, Ioannis Parissis, and Andrea Polini for the discussion on some of the presented topics. Daniela Mulas and Antonino Sabetta helped with drawing the Roadmap figure. I would also like to sincerely thank Lionel Briand and Alex Wolf for inviting me and for providing valuable advice. This work has been partially supported by the Marie Curie TAROT Network (MRTN-CT-2004-505121).
References
[1] L. Baresi and M. Young. Test oracles. Technical report, Dept. of Comp. and Information Science, Univ. of Oregon, 2001. http://www.cs.uoregon.edu/~michal/pubs/oracles.html.
[2] E. Bayse, A. R. Cavalli, M. Núñez, and F. Zaïdi. A passive testing approach based on invariants: application to the WAP. Computer Networks, 48(2):235–245, 2005.
[3] B. Beizer. Software Testing Techniques (2nd ed.). Van Nostrand Reinhold Co., New York, NY, USA, 1990.
[4] A. Belinfante, L. Frantzen, and C. Schallhart. Tools for test case generation. In [21].
[5] S. Berner, R. Weber, and R. Keller. Observations and lessons learned from automated testing. In Proc. 27th Int. Conf. on Sw. Eng., pages 571–579. ACM, 2005.
[6] G. Bernot, M. C. Gaudel, and B. Marre. Software testing based on formal specifications: a theory and a tool. Softw. Eng. J., 6(6):387–405, 1991.
[7] A. Bertolino. ISSTA 2002 Panel: is ISSTA research relevant to industrial users? In Proc. ACM/SIGSOFT Int. Symp. on Software Testing and Analysis, pages 201–202. ACM Press, 2002.
[8] A. Bertolino and E. Marchetti. Software testing (chapt. 5). In P. Bourque and R. Dupuis, editors, Guide to the Software Engineering Body of Knowledge SWEBOK, 2004 Version, pages 5-1–5-16. IEEE Computer Society, 2004. http://www.swebok.org.
[9] A. Bertolino, E. Marchetti, and H. Muccini. Introducing a reasonably complete and coherent approach for model-based testing. Electr. Notes Theor. Comput. Sci., 116:85–97, 2005.
[10] A. Bertolino and A. Polini. The audition framework for testing web services interoperability. In Proc. EUROMICRO '05, pages 134–142. IEEE, 2005.
[11] A. Bertolino, A. Polini, P. Inverardi, and H. Muccini. Towards anti-model-based testing. In Proc. DSN 2004 (Ext. abstract), pages 124–125, 2004.
[12] S. Biffl, A. Aurum, B. Boehm, H. Erdogmus, and P. Gruenbacher, editors. Value-Based Software Engineering. Springer-Verlag, Heidelberg, Germany, 2006.
[13] S. Biffl, R. Ramler, and P. Gruenbacher. Value-based management of software testing. In [12].
[14] R. V. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. Addison Wesley Longman, Inc., Reading, MA, 2000.
[15] C. Blundell, D. Giannakopoulou, and C. S. Pasareanu. Assume-guarantee testing. In Proc. SAVCBS '05, pages 7–14. ACM Press, 2005.
[16] G. V. Bochmann and A. Petrenko. Protocol testing: review of methods and relevance for software testing. In Proc. ACM/SIGSOFT Int. Symp. Software Testing and Analysis, pages 109–124, 1994.
[17] M. Boshernitsan, R. Doong, and A. Savoia. From Daikon to Agitator: lessons and challenges in building a commercial tool for developer testing. In Proc. ACM/SIGSOFT Int. Symp. Software Testing and Analysis, pages 169–180. ACM Press, 2006.
[18] L. Briand, Y. Labiche, and Y. Wang. An investigation of graph-based class integration test order strategies. IEEE Trans. Softw. Eng., 29(7):594–607, 2003.
[19] L. Briand and A. Wolf, editors. Future of Software Engineering 2007. IEEE-CS Press, 2007.
[20] L. C. Briand, Y. Labiche, and M. M. Sówka. Automated, contract-based user testing of commercial-off-the-shelf components. In Proc. 28th Int. Conf. on Sw. Eng., pages 92–101. ACM Press, 2006.
[21] M. Broy, B. Jonsson, J.-P. Katoen, M. Leucker, and A. Pretschner, editors. Model-Based Testing of Reactive Systems - Advanced Lectures, LNCS 3472. Springer Verlag, 2005.
[22] G. Canfora and M. Di Penta. The forthcoming new frontiers of reverse engineering. In [19].
[23] G. Canfora and M. Di Penta. Testing services and service-centric systems: Challenges and opportunities. IT Professional, 8(2):10–17, March/April 2006.
[24] N. Delgado, A. Q. Gates, and S. Roach. A taxonomy and catalog of runtime software-fault monitoring tools. IEEE Trans. Softw. Eng., 30(12):859–872, 2004.
[25] E. Dijkstra. Notes on structured programming. Technical Report 70-WSK03, Technological Univ. Eindhoven, 1970. http://www.cs.utexas.edu/users/EWD/ewd02xx/EWD249.PDF.
[26] H. Do, S. Elbaum, and G. Rothermel. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Softw. Eng., 10(4):405–435, 2005.
[27] M. Dwyer, J. Hatcliff, C. Pasareanu, Robby, and W. Visser. Formal software analysis: Emerging trends in software model checking. In [19].
[28] S. Elbaum, H. N. Chin, M. B. Dwyer, and J. Dokulil. Carving differential unit test cases from system test cases. In Proc. 14th ACM/SIGSOFT Int. Symp. on Foundations of Sw Eng., pages 253–264. ACM Press, 2006.
[29] S. Elbaum and M. Diep. Profiling deployed software: Assessing strategies and testing opportunities. IEEE Trans. Softw. Eng., 31(4):312–327, 2005.
[30] M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming, to appear.
[31] P. Frankl and E. Weyuker. Provable improvements on branch testing. IEEE Trans. Softw. Eng., 19(10):962–975, 1993.
[32] L. Frantzen, J. Tretmans, and R. d. Vries. Towards model-based testing of web services. In Proc. Int. Workshop on Web Services - Modeling and Testing (WS-MaTe2006), pages 67–82, 2006.
[33] L. Frantzen, J. Tretmans, and T. Willemse. A symbolic framework for model-based testing. In Proc. FATES/RV, LNCS 4262, pages 40–54. Springer-Verlag, 2006.
[34] M.-C. Gaudel. Formal methods and testing: Hypotheses, and correctness approximations. In Proc. FM 2005, LNCS 3582, pages 2–8. Springer-Verlag, 2005.
[35] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Proc. ACM SIGPLAN PLDI'05, pages 213–223, 2005.
[36] J. B. Goodenough and S. L. Gerhart. Toward a theory of test data selection. IEEE Trans. Softw. Eng., 1(2):156–173, June 1975.
[37] R. Gotzhein and F. Khendek. Compositional testing of communication systems. In Proc. IFIP TestCom 2006, LNCS 3964, pages 227–244. Springer Verlag, May 2006.
[38] W. Grieskamp. Multi-paradigmatic model-based testing. In Proc. FATES/RV, LNCS 4262, pages 1–19, August 15-16, 2006.
[39] D. Hamlet. Subdomain testing of units and systems with state. In Proc. ACM/SIGSOFT Int. Symp. on Software Testing and Analysis, pages 85–96. ACM Press, 2006.
[40] D. Hamlet, D. Mason, and D. Woit. Theory of software reliability based on components. In Proc. 23rd Int. Conf. on Sw. Eng., pages 361–370, Washington, DC, USA, 2001. IEEE Computer Society.
[41] D. Hamlet and R. Taylor. Partition testing does not inspire confidence. IEEE Trans. Softw. Eng., 16(12):1402–1411, 1990.
[42] M. Harman. The current state and future of search-based software engineering. In [19].
[43] M. J. Harrold. Testing: a roadmap. In A. Finkelstein, editor, The Future of Software Engineering, pages 61–72. IEEE Computer Society, 2000. In conjunction with ICSE2000.
[44] W. Hetzel. The Complete Guide to Software Testing, 2nd Edition. QED Inf. Sc., Inc., 1988.
[45] W. Howden. Reliability of the path analysis testing strategy. IEEE Trans. Softw. Eng., SE-2(3):208–215, 1976.
[46] D. Janzen, H. Saiedian, and L. Simex. Test-driven development: Concepts, taxonomy, and future direction. Computer, 38(9):43–50, Sept. 2005.
[47] JUnit.org. http://www.junit.org/index.htm.
[48] N. Juristo, A. M. Moreno, and S. Vegas. Reviewing 25 years of testing technique experiments. Empirical Softw. Eng., 9(1-2):7–44, 2004.
[49] G. M. Kapfhammer, M. L. Soffa, and D. Mossé. Testing in resource constrained execution environments. In Proc. 20th IEEE/ACM Int. Conf. on Automated Software Engineering, Long Beach, California, USA, November 2005. ACM Press.
[50] C. Keum, S. Kang, I.-Y. Ko, J. Baik, and Y.-I. Choi. Generating test cases for web services using extended finite state machine. In Proc. IFIP TestCom 2006, LNCS 3964, pages 103–117. Springer Verlag, 2006.
[51] K. G. Larsen, M. Mikucionis, B. Nielsen, and A. Skou. Testing real-time embedded software using UPPAAL-TRON: an industrial case study. In Proc. 5th ACM Int. Conf. on Embedded Softw., pages 299–306. ACM Press, 2005.
[52] Y. Le Traon, B. Baudry, and J.-M. Jézéquel. Design by contract to improve software vigilance. IEEE Trans. Softw. Eng., 32(8):571–586, 2006.
[53] G. Lee, J. Morris, K. Parker, G. A. Bundell, and P. Lam. Using symbolic execution to guide test generation: Research articles. Softw. Test. Verif. Reliab., 15(1):41–61, 2005.
[54] T. C. Lethbridge, J. Díaz-Herrera, R. J. LeBlanc, and J. Thompson. Improving software practice through education: Challenges and future trends. In [19].
[55] Z. Li, W. Sun, Z. B. Jiang, and X. Zhang. BPEL4WS unit testing: Framework and implementation. In Proc. of ICWS'05, pages 103–110, 2005.
[56] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pages 141–154. ACM Press, 2003.
[57] M. Lyu. Software reliability engineering: A roadmap. In [19].
[58] M. Lyu (ed.). Handbook of Software Reliability Engineering. McGraw-Hill, New York, and IEEE CS Press, Los Alamitos, 1996.
[59] L. Mariani and M. Pezzè. Dynamic detection of COTS components incompatibility. IEEE Software, to appear.
[60] P. McMinn. Search-based software test data generation: a survey. Software Testing, Verification and Reliability, 14(2):105–156, Sept. 2004.
[61] Microsoft Research. Customer experience improvement program, 2006. http://www.microsoft.com/products/ceip/.
[62] E. F. Moore. Gedanken-experiments on sequential machines. Automata Studies, pages 129–153, 1956.
[63] NIST. The economic impacts of inadequate infrastructure for software testing, May 2002. http://www.nist.gov/director/prog-ofc/report02-3.pdf.
[64] A. Orso, T. Apiwattanapong, and M. J. Harrold. Leveraging field data for impact analysis and regression testing. In Proc. Joint meeting of the European Soft. Eng. Conf. and ACM/SIGSOFT Symp. on Foundations of Soft. Eng. (ESEC/FSE'03), pages 128–137, 2003.
[65] A. Orso and B. Kennedy. Selective capture and replay of program executions. In Proc. 3rd Int. ICSE Workshop on Dynamic Analysis (WODA 2005), pages 29–35, St. Louis, MO, USA, May 2005.
[66] T. J. Ostrand, E. J. Weyuker, and R. M. Bell. Predicting the location and number of faults in large software systems. IEEE Trans. Softw. Eng., 31(4):340–355, 2005.
[67] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball. Feedback-directed random test generation. Technical Report MSR-TR-2006-125, Microsoft Research, Redmond, WA.
[68] L. Peterson, T. Anderson, D. Culler, and T. Roscoe. A blueprint for introducing disruptive technology into the internet. SIGCOMM Comput. Commun. Rev., 33(1):59–64, 2003.
[69] J. Poore, H. Mills, and D. Mutchler. Planning and certifying software system reliability. IEEE Software, pages 87–99, Jan. 1993.
[70] M. J. Rehman, F. Jabeen, A. Bertolino, and A. Polini. Testing software components for integration: a survey of issues and techniques. Software Testing, Verification and Reliability, to appear.
[71] A. Reyes and D. Richardson. Siddhartha: a method for developing domain-specific test driver generators. In Proc. 14th Int. Conf. on Automated Software Engineering, pages 81–90. IEEE, 12-15 Oct. 1999.
[72] M. J. Rutherford, A. Carzaniga, and A. L. Wolf. Simulation-based test adequacy criteria for distributed systems. In Proc. 14th ACM/SIGSOFT Int. Symp. on Foundations of Sw Eng., pages 231–241. ACM Press, 2006.
[73] D. Saff, S. Artzi, J. H. Perkins, and M. D. Ernst. Automatic test factoring for Java. In Proc. 20th Int. Conf. on Automated Software Engineering, pages 114–123, Long Beach, CA, USA, November 9–11, 2005.
[74] D. Saff and M. Ernst. An experimental evaluation of continuous testing during development. In Proc. ACM/SIGSOFT Int. Symp. on Software Testing and Analysis, pages 76–85. ACM, July 12–14, 2004.
[75] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In Joint meeting of the European Soft. Eng. Conf. and ACM/SIGSOFT Symp. on Foundations of Soft. Eng. (ESEC/FSE'05), pages 263–272. ACM, 2005.
[76] A. Sinha and C. Smidts. HOTTest: A model-based test design technique for enhanced testing of domain-specific applications. ACM Trans. Softw. Eng. Methodol., 15(3):242–278, 2006.
[77] D. Sjøberg, T. Dybå, and M. Jørgensen. The future of empirical methods in software engineering research. In [19].
[78] N. Tillmann and W. Schulte. Unit tests reloaded: Parameterized unit testing with symbolic execution. IEEE Softw., 23(4):38–47, 2006.
[79] J. Tretmans. Test generation with inputs, outputs and repetitive quiescence. Software – Concepts and Tools, 17:103–120, 1996.
[80] M. Utting and B. Legeard. Practical Model-Based Testing - A Tools Approach. Morgan Kaufmann, 2006.
[81] M. van der Bijl, A. Rensink, and J. Tretmans. Compositional testing with ioco. In Proc. FATES 2003, LNCS 2931, 2003.
[82] S. Vegas, N. Juristo, and V. Basili. Packaging experiences for improving testing technique selection. The Journal of Systems and Software, 79(11):1606–1618, Nov. 2006.
[83] J. Wegener and M. Grochtmann. Verifying timing constraints of real-time systems by means of evolutionary testing. Real-Time Syst., 15(3):275–298, 1998.
[84] E. Weyuker and F. Vokolos. Experience with performance testing of software systems: Issues, an approach, and case study. IEEE Trans. Softw. Eng., 26(12):1147–1156, 2000.
[85] E. J. Weyuker. On testing non-testable programs. The Computer Journal, 25(4):465–470, 1982.
[86] M. Woodside, G. Franks, and D. Petriu. The future of software performance engineering. In [19].
[87] C. Yilmaz, A. M. A. Porter, A. Krishna, D. Schmidt, A. Gokhale, and B. Natarajan. Preserving distributed systems critical properties: a model-driven approach. IEEE Software, 21(6):32–40, 2004.
[88] H. Zhu. A formal analysis of the subsume relation between software test adequacy criteria. IEEE Trans. Softw. Eng., 22(4):248–255, 1996.
[89] H. Zhu, P. A. V. Hall, and J. H. R. May. Software unit test coverage and adequacy. ACM Comput. Surv., 29(4):366–427, 1997.
[90] H. Zhu and X. He. A theory of behaviour observation in software testing. Technical Report CMS-TR-99-05, 24, 1999.