Systematic Review of Software Behavioral Model
Consistency Checking
FAIZ UL MURAM, HUY TRAN, and UWE ZDUN, University of Vienna, Faculty of Computer
Science, Software Architecture Research Group, Vienna, Austria
In software development, models are often used to represent multiple views of the same system. Such models need to be properly related to each other in order to provide a consistent description of the developed
system. Models may contain contradictory system specifications, for instance, when they evolve independently. Therefore, it is crucial to ensure that models conform to each other. In this context, we focus
on consistency checking of behavior models. Several techniques and approaches have been proposed in the
existing literature to support behavioral model consistency checking. This article presents a Systematic
Literature Review (SLR) that was carried out to obtain an overview of the various consistency concepts,
problems, and solutions proposed regarding behavior models. In our study, the identification and selection
of the primary studies was based on a well-planned search strategy. The search process identified a total
of 1,770 studies, out of which 96 have been thoroughly analyzed according to our predefined SLR protocol.
The SLR aims to highlight the state-of-the-art of software behavior model consistency checking and identify
potential gaps for future research. Based on research topics in selected studies, we have identified seven
main categories: targeted software models, types of consistency checking, consistency checking techniques,
inconsistency handling, type of study and evaluation, automation support, and practical impact. The findings of the systematic review also suggest directions for future research, such as improving the quality of study designs and evaluations and applying research outcomes in industrial settings. To this end, appropriate strategies for inconsistency handling, better tool support for consistency checking, and/or integration with development tools should be considered in future studies.
CCS Concepts: General and reference → Surveys and overviews; Design; Verification; Validation;
Software and its engineering → Consistency; Formal methods;
Additional Key Words and Phrases: Software behavioral model, consistency checking, consistency types,
systematic literature review
ACM Reference Format:
Faiz Ul Muram, Huy Tran, and Uwe Zdun. 2017. Systematic review of software behavioral model consistency
checking. ACM Comput. Surv. 50, 2, Article 17 (April 2017), 39 pages.
DOI: http://dx.doi.org/10.1145/3037755
The research is supported by the Wiener Wissenschafts-, Forschungs- und Technologiefonds (WWTF), Grant No. ICT12-001.
Authors’ addresses: F. Ul Muram, H. Tran, and U. Zdun, University of Vienna, Faculty of Computer Science, Software Architecture Research Group, Währinger Strasse 29, 1090 Vienna, Austria; emails: {faiz.ulmuram, huy.tran, uwe.zdun}@univie.ac.at.

1. INTRODUCTION
The development of a software system often goes through a number of different stages and iteration cycles, and each of them can introduce new elements or more detailed specifications of the system [Ruparelia 2010]. In addition, software systems are constantly evolving. In many areas of software engineering, behavioral models are used to
represent the behavioral aspects of a software system. Examples of behavior models are
UML activity models, state machines, and sequence models [OMG 2011b], Simulink
Stateflow [MathWorks 2015], the Business Process Model and Notation (BPMN) [OMG
2011a], the Business Process Execution Language (BPEL) [OASIS 2007], and Event-driven Process Chains (EPC) [Scheer and Nüttgens 2000], to name but a few. In the
past decades, a substantial number of software engineering research works have been
devoted to consistency checking between different models or multiple views of the models that are used in software development (for a comprehensive list of existing works,
please refer to Usman et al. [2008] and Lucas et al. [2009]).
An example is ensuring consistency over multiple abstraction levels: Many models
are created as “high-level” models [Stahl and Völter 2006; Van der Straeten 2005;
Huzar et al. 2005]. That is, they are mainly used to convey the core concepts or principles of the reality they represent in an abstract and/or concise way (e.g., requirements
models or design models). In addition, technical or “low-level” models are often created
as refinements of the high-level models with purposes such as providing a precise specification of the source code, executing the model (e.g., in a process engine, interpreter,
or virtual machine), or generating executable code directly from the model (e.g., in
model-driven software development (MDSD)) [Frankel 2002; Stahl and Völter 2006]. It
is crucial that the overlapping parts of the high-level and low-level models are in sync
with each other [Van der Straeten 2005; Spanoudakis and Zisman 2001; Huzar et al.
2005]. Unfortunately, multiple models of the same system (or reality) often drift apart over time, and inconsistencies arise among them when they are created by different stakeholders and evolve independently [Van der Straeten 2005; Spanoudakis
and Zisman 2001; Huzar et al. 2005].
For instance, high-level models might be changed according to new requirements,
and low-level models are changed as the implementation is modified. If changes are not systematically propagated to all other models of the same system (or reality), the
evolved models may include inconsistencies. Inconsistencies can also occur due to the
multiview nature of many models [Huzar et al. 2005; Fradet et al. 1999]. A system can
be defined by multiple views that specify different aspects of the system. Inconsistencies can occur due to the overlaps between the views [Huzar et al. 2005]. Consistency
is a general goal to be obtained while building the models. In particular, during system
(and model) evolution, it is crucial to maintain the consistency between different behavior models. As Spanoudakis and Zisman [2001] emphasized, model inconsistencies have severe negative effects: they may delay and, therefore, increase the cost of the system development process, jeopardize properties related to the quality of the system, and make the system more difficult to maintain. These negative effects are amplified many times over, especially in the context of developing modern large-scale software systems that are far more complex and might consist of numerous models and interconnected subsystems (e.g., cloud-based systems, Internet of Things systems, sensor-network-based systems, or cyber-physical systems) [Goth 2008; Atzori et al. 2010; Sommerville et al. 2012].
There are only a few secondary studies collecting and analyzing evidence regarding research on model consistency checking and management [Spanoudakis and Zisman 2001; Huzar et al. 2005; Usman et al. 2008; Lucas et al. 2009]. The survey of Spanoudakis and Zisman [2001] presents a broad view and discusses open research issues. Among those secondary studies, only the SLR conducted by Lucas et al. [2009] has systematically investigated and provided insights regarding consistency concepts, proposals, problems, supported models, and the maintainability of consistency management approaches, but only for UML models in the timespan from 2001 to 2007.
We note that the aforementioned studies, except the survey of Spanoudakis and
Zisman [2001], mainly focus on UML-based modeling and development. They aim to
cover broadly both structural and behavioral consistency checking of software artifacts.
Although UML is widely used in academia and industry, there are still a considerable
number of non-UML modeling and development methods, for instance, in the domains
of workflow and business process management, embedded and real-time systems, and
service-oriented systems, to name but a few. Considering only UML-based methods
and techniques can likely lead to bias against non-UML approaches. Thus, we decided to conduct a systematic literature review of consistency checking with a broader scope and, therefore, broader research objectives. Our study strives to investigate consistency checking beyond the domain of UML-based software development. As checking
of structural aspects has been extensively reviewed [Spanoudakis and Zisman 2001;
Huzar et al. 2005; Usman et al. 2008; Lucas et al. 2009], we opted to pay special attention to the behavioral aspects of software system modeling. We planned to carry
out the SLR at a finer granularity; for instance, we studied in depth the degrees of
automation support, inconsistency handling, tool support, evaluation, and evidence of
application in industry settings (these aspects are not yet considered or only in limited form in the previous studies). Chen and Ali Babar [2011] emphasized that these
aspects are important as they indicate the practical impacts of the research outcomes on both science and industrial practice. Nevertheless, we adopted and extended some
fundamental concepts and categories of consistency checking and management that
have been proposed in these studies.
The article is organized as follows: Section 2 describes the background and context of the consistency checking problem along with fundamental concepts. Section 3 explains the systematic literature review process used in this review. Section 4 discusses the results of the SLR in relation to the addressed research questions. Section 5 presents a discussion of the results of our SLR. Finally, Section 6 concludes the article with the main findings and potential future research directions.
2. BACKGROUND ON CONSISTENCY CHECKING OF SOFTWARE MODELS
According to Spanoudakis and Zisman [2001], an inconsistency is described as “a state
in which two or more overlapping elements of different software models make assertions about aspects of the system they describe which are not jointly satisfiable.” As
Spanoudakis and Zisman [2001] stated, the problem of inconsistencies in software models has been a major concern of the software engineering community for a long time.
As a system is often modeled from different viewpoints by different stakeholders, at different levels of abstraction and granularity [Finkelstein et al. 1993; Boiten et al.
2000], there may exist contradicting information [Engels et al. 2001; Spanoudakis and
Zisman 2001]. Several definitions of consistency and its classifications have emerged in the literature [Engels et al. 2001; Huzar et al. 2005; Van der Straeten 2005;
Usman et al. 2008; Lucas et al. 2009].
A typical consistency problem happens among different types of representations
(e.g., models) of the same aspect of a software system. For instance, for describing
the interactions among various objects, a UML sequence diagram and/or a communication diagram can be used with respect to a temporal or structural viewpoint, respectively [Engels et al. 2001]. Engels et al. [2001] consider this type of consistency
“horizontal consistency,” while Huzar et al. [2005] and Usman et al. [2008] call it
“intra-model consistency.”
Looking into another dimension, a software model can be refined or transformed into
a richer form with more details. This scenario happens quite often in model-driven
software development in which model transformations are extensively used to map a
high-level, abstract model down to a lower level of abstraction [Frankel 2002; Stahl
and Völter 2006]. This type of consistency is named vertical consistency [Engels et al.
2001], inter-model consistency [Huzar et al. 2005], or refinement [Lucas et al. 2009].
We note that the term “model” is interpreted rather differently by Huzar et al. [2005]
(as a set of diagrams) and Engels et al. [2001] (as an umbrella term that embraces the meaning of a diagram) and, therefore, their definitions of consistency types do not totally overlap. Van der Straeten [2005] used the term “evolution consistency” to indicate the consistency between different versions of the same model, which fits the category “vertical consistency” of Engels et al. [2001].
In the context of this study, we adopt the umbrella term “model” as proposed by
Spanoudakis and Zisman [2001] and Engels et al. [2001], and we use the terms “model”
and “diagram” interchangeably, unless stated otherwise where it is necessary to distinguish between these terms. As a result, the interpretation of
horizontal consistency and vertical consistency based on this point of view is used in
our SLR.
In addition, our initial study revealed that several approaches investigate the consistency of a single model itself for correctness, determinism, and so forth. These consistencies do not fit nicely into the two categories mentioned earlier. Inspired by the
philosophical stance presented by Mens and Van Gorp [2006] in categorizing model
transformations, we consider the following definitions of consistency types for software
models in this SLR.
—Endogenous consistency indicates consistencies within the same model regarding its own properties.
—Exogenous consistency denotes consistencies between different models. It can be refined into two subcategories: horizontal and vertical consistency.
—Horizontal consistency denotes consistencies among different kinds of models or of a model against rules and constraints that are manually specified or derived from other artifacts.
—Vertical consistency denotes consistencies among models of the same type at different levels of abstraction; this also includes the consistency between a design model and a corresponding implementation (a minimal sketch of such a check follows this list).
Existing studies distinguish between syntactic and semantic consistencies [Engels
et al. 2001]. Syntactical consistency aims to ensure that a model conforms to a predefined syntax specified by a certain meta-model or grammar [Engels et al. 2001]. This
ensures the well-formedness of the model [Engels et al. 2001]. Syntactic (or structural)
consistency checking has been extensively discussed in Huzar et al. [2005], Usman
et al. [2008], and Lucas et al. [2009].
As the main focus of our study is behavioral models, we investigate semantic consistency further. Semantic consistency addresses the semantic compatibility of models; for instance, the same identifier that occurs in different models should refer to the same entity (i.e., have the same meaning). Semantic consistency is stricter and often requires syntactic consistency beforehand. In the context of horizontal consistency, semantic consistency requires models of different viewpoints to be semantically compatible with regard to the aspects of the system that are specified in both submodels. For vertical
consistency problems, semantic consistency requires that a refined model is semantically consistent with the model it refines. Semantic consistency depends very much on
the underlying semantics of the models being used and of the development process.
Eshuis and Grefen [2007] also observed that several model structures are described by
different syntaxes but represent the same behavior. Therefore, the identification of a
suitable semantic domain is important for consistency checking of software models in
general and behavior models in particular [Engels et al. 2001].
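As a small illustration of the identifier requirement mentioned above, the following sketch (our own, with invented models and entity labels) flags identifiers that occur in two models but denote different entities:

# Each model maps identifiers to the entity they are meant to denote;
# both mappings and the entity labels are invented for illustration.
design = {"Order": "business object", "ship": "operation"}
implementation = {"Order": "business object", "ship": "data field"}

def shared_identifier_conflicts(m1, m2):
    # Identifiers used in both models must have the same meaning.
    return {name: (m1[name], m2[name])
            for name in m1.keys() & m2.keys()
            if m1[name] != m2[name]}

print(shared_identifier_conflicts(design, implementation))
# {'ship': ('operation', 'data field')} -> a semantic inconsistency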
3. SYSTEMATIC LITERATURE REVIEW PROCESS
An SLR provides a key instrument for identifying, evaluating, and interpreting all
existing research related to a particular research question, phenomenon of interest, or
topic area [Kitchenham et al. 2015; Kitchenham and Charters 2007]. SLRs are a form
of secondary study, while individual studies that contribute to the SLR are considered
primary studies. The process of conducting an SLR must be explicitly defined and
well executed. To facilitate the planning and execution of this SLR, we leverage the guidelines for performing SLRs in software engineering recommended by Kitchenham and Charters [2007], with adjustments recommended in Kitchenham and Brereton [2013] and Kitchenham et al. [2015]. The phases of the review process are described in this section.
(1) Planning the review: The goal of this phase is to define the research questions and
methods for developing a review protocol (see Section 3.1),
(2) Conducting the review: In this phase, the review protocol defined in the previous
phase will be enacted (see Section 3.2),
(3) Reporting the review: In this phase, the results of the review are documented,
validated, and reported (see Sections 4 and 5).
The aforementioned phases are not performed sequentially in a single pass but rather iteratively, with feedback loops. In particular, many activities are defined during the protocol development phase and are refined when the review is executed. For example, search terms, inclusion criteria, and exclusion criteria can
be refined during the course of the review. The protocol provides details of the plan for
the review, such as specifying the search process to be followed and the conditions to
apply when selecting relevant primary studies.
3.1. Planning the Review
The first phase of undertaking an SLR consists of specifying pre-review activities. The planning phase includes identification of the reasons for
carrying out the systematic review (see Section 3.1.1), specifying the research questions
(see Section 3.1.2), defining the search strategy (see Section 3.1.3), and establishing
the inclusion and exclusion criteria (see Section 3.1.4).
3.1.1. Need for the Literature Review. There are several reasons for undertaking this
systematic review. The main objective of this SLR is to systematically select and review
the published literature with regard to behavioral model consistency and summarize
all existing practices and information in a well-defined and unbiased manner, including problems, limitations, future trends, and possible opportunities within the context of software behavior model consistency.
This SLR aims at identifying the current state of the art of consistency checking
of software behavior models not only in the field of UML-based modeling but also
in various other domains. As mentioned earlier, the existing reviews merely covered UML models; hence, no comprehensive systematic review in the area of software behavioral model consistency checking has been previously published. Our study also
aims at a finer level of granularity with regard to automation support, inconsistency
handling, tool support, evaluation, and evidence of application in industry settings
(which have not yet been well considered in the previous studies), as these aspects
indicate the practical impacts of the research outcomes to both science and industrial
practices [Chen and Ali Babar 2011]. The perspective taken in this study is that of both practitioners and researchers working on behavioral model consistency checking; we aim to provide them with an up-to-date view of the state of the art and thereby guide them in identifying relevant studies that suit their own needs. Furthermore, we aim to appraise evidence
of research on consistency checking of software behavior models as well as to identify
challenges and open problems that may provide insights for further investigations.
3.1.2. Research Questions. Defining the research questions is one of the most important activities of any systematic review because they guide the selection of primary
studies and data extraction. For this purpose, we leverage the Goal-Question-Metric
(GQM) approach [Basili and Weiss 1984], which is a systematic method for organizing
measurement programs. The GQM model starts with specifying a certain goal (i.e., purpose, object, issue, and viewpoint). The goal is then refined into several questions, and each question is in turn refined into metrics [Basili and Weiss 1984]. By using GQM, a goal for conducting the SLR is defined. The goal is refined into several research questions, and, subsequently, these questions are refined into metrics that provide the means to answer them. By answering the questions, the data can be analyzed to identify whether the goal has been achieved. The goal for our SLR is:
Based on the aforementioned goal, we derived four research questions, RQ1–RQ4, which are discussed below.
RQ1 and RQ2 aim at describing the state-of-the-art of research on consistency checking of software behavioral models. The expected outcome will be a comprehensive view
of behavioral model consistency checking in various dimensions. This allows us to
categorize the state of the art in consistency checking with respect to the behavioral
models. RQ3 is proposed for learning about the practical impact of the existing methods
in academia and industry. It aims to provide both researchers and practitioners with
evidence about what methods could be used in practice and to which degree. A lack of
evidence or poor evidence could highlight the need for more rigorous studies and applications in real or quasi-real settings. RQ4 is formulated to identify gaps in current
research that could yield insights regarding the issues or open problems in software behavioral model consistency research and provide directions for further studies.
3.1.3. Search Strategy. Our search strategy aims to find a comprehensive and unbiased collection of primary studies from the literature related to the research questions. Therefore, we devised a search strategy that maximizes the possibility of discovering every relevant publication. For the electronic search, we leveraged the major databases that are widely used in computer science (CS) and software engineering (SE) research as reported in Zhang et al. [2011], which are the ACM Digital Library (http://dl.acm.org), the IEEE Xplore Digital Library (http://ieeexplore.ieee.org), SpringerLink (http://link.springer.com), and ScienceDirect (http://www.sciencedirect.com). These are rich and comprehensive databases containing bibliographic information on a plethora of publications from all major publishers of the computing literature. We note that the ISI Web of Science (WoS, http://webofknowledge.com) is also a large database of scientific studies. However, in the field of software engineering, WoS mainly records publications published in premium journals and only a few conferences, whereas conferences (and sometimes workshops and symposiums) are major outlets for publishing in CS and SE. Thus, we mainly use WoS along with Google Scholar (http://scholar.google.com) for cross-checking the search results from the chosen sources and for performing some meta-analyses. Our search terms stem from the research questions and can be categorized into two major dimensions, as shown in Table I. On the one hand, we aim at exploring the different behavior models or diagrams considered in the primary studies. On the other hand, we aim to discover relevant types of studies that have been performed for checking different types of inconsistencies. In order to ensure a sufficient scope of searching, we also consider alternative spellings (e.g., behavior/behaviour, model/diagram), synonyms (e.g., model, diagram, workflow, process), and abbreviations for the search terms. We then used the Boolean operator “OR” to join the alternative words and synonyms and the Boolean operator “AND” to form a sufficient search string.

Table I. Literature Search Dimensions and Keywords
Type of models: behaviour diagram, behavior diagram, behavioural diagram, behavioral diagram, behaviour model, behavior model, behavioural model, behavioral model, activity diagram, sequence diagram, state diagram, state machine, statechart, collaboration diagram, communication diagram, interaction diagram, timing diagram, workflow, business process, process model, BPMN, WSBPEL, EPC, finite state machine, FSM, state-transition, Stateflow
Type of studies: containment/contain/containing, refine/refining/refinement, inconsistencies/inconsistent/inconsistency, consistent/consistency/consistencies
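For illustration, the following minimal Python sketch assembles such a search string from the Table I keywords. The quoting and operator syntax are schematic assumptions, since each digital library accepts its own query dialect, and only a subset of the keywords is listed for brevity.

# A subset of the Table I keywords; the review used the full list.
model_terms = [
    "behaviour model", "behavior model", "behavioural model", "behavioral model",
    "activity diagram", "sequence diagram", "state machine", "statechart",
    "workflow", "business process", "BPMN", "WSBPEL", "EPC", "Stateflow",
]
study_terms = ["containment", "refinement", "inconsistency", "consistency"]

def or_group(terms):
    # Alternative words and synonyms of one dimension are joined with OR.
    return "(" + " OR ".join('"%s"' % t for t in terms) + ")"

# The two dimensions are combined with AND to form the search string.
search_string = or_group(model_terms) + " AND " + or_group(study_terms)
print(search_string)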
3.1.4. Inclusion and Exclusion Criteria. Inclusion and exclusion criteria are used to judge the suitability of primary studies and to decide whether an article is included in or excluded from the SLR, based on the addressed research questions. Our inclusion and
exclusion criteria are shown in Table II.
3.2. Conducting the Review
The second phase of the SLR is to perform the search strategy defined in the planning stage. The steps involved in conducting the review are primary study selection (Section 3.2.1), quality assessment (Section 3.2.2), and data extraction and synthesis (Section 3.2.3).
3.2.1. Primary Studies Selection. We illustrate the search process and the number of
primary studies identified at each stage in Figure 1. We strove for a comprehensive list of studies without imposing any additional constraints. That is, our search strategy dated back to the late 1980s, since that time period is often considered to have fostered consistency checking of software artifacts [Spanoudakis and Zisman 2001].
Table II. Inclusion and Exclusion Criteria
Type Description
Inclusion I1 Study is internal to software domain. We are only interested in consistency
checking for software systems.
I2 Study is about consistency checking related to software behavioral
models/diagrams.
I3 Study comes from an acceptable source such as a peer-reviewed scientific journal,
conference, symposium, or workshop.
I4 Study reports issues, problems, or any type of experience concerning software
behavioral model consistency.
I5 Study describes solid evidence on software behavioral model consistency
checking, for instance, by using rigorous analysis, experiments, case studies,
experience reports, field studies, and simulation.
Exclusion E1 Study is about hardware or other fields not directly related to software.
E2 Study is not clearly related to at least one aspect of the specified research
questions.
E3 Study reports only syntactic or structural consistency checking of
models/diagrams.
E4 Secondary literature reviews.
E5 Study does not present sufficient technical details of consistency checking related to software behavioral models (e.g., studies that have a different focus, such as version control, and provide insufficient detail).
E6 Study did not undergo a peer-review process, e.g., non-reviewed journal, magazine, or conference papers, master's theses, books, and doctoral dissertations (in order to ensure a minimum level of quality).
E7 Study is not in English.
E8 Study is a shorter version of another study which appeared in a different source
(the longer version will be included).
Fig. 1. Overview of the stages and results of our search process.
We started the search in June 2015 and ended it at the end of July 2015, using the search string described in Section 3.1.3; an initial search had already been performed in 2013. After consolidating the results, 1,770 studies were identified overall. In the
initial search process, we identified 1,698 studies, whereas the snowballing process
added 72 more studies.
Two researchers working independently identified the relevant studies by quickly scanning parts of each publication such as the title, abstract, and keywords (and sometimes the conclusion, in case it was difficult to extract information from the abstract). The second selection stage is based on the aforementioned inclusion and
exclusion criteria. We note that there were a number of duplicates for various reasons. For instance, some authors published journal articles that were extended versions of previously published workshop and/or conference papers. Therefore, we worked carefully to eliminate all potential duplicates and retained only the most complete (or most recent) versions. After the inclusion and exclusion stage (including the removal of duplicates), 70 primary studies remained out of the 1,698 studies identified in the initial search.
We are aware that it is impossible to obtain a complete set of publications using the aforementioned automated
process [Budgen et al. 2011] (i.e., manually scanning and analyzing the references
and citations of these primary studies) to ensure that our study also covered relevant
follow-up works that might exist but were not included in the search. In particular, we collected references and related works from each of the 70 primary studies selected in the automated search phase. The snowballing search continued until no more relevant studies were found. As a result, a total of 72 studies were collected from the snowballing process.
Some of these studies did not appear in the results of the aforementioned automated searches at all. Others were indexed but not found in the first search because either their abstracts are too short or their titles and abstracts do not explicitly include the proposed search terms (although their contents and contributions do). Twenty-six out of the 72 additional studies satisfied the inclusion and exclusion criteria and, therefore, were selected. Finally, we analyzed the remaining primary studies thoroughly to ensure that we had obtained the most relevant studies. This in-depth analysis resulted in a total of 96 studies, within the time period 1999–2015, that are included for further consideration in our study.
3.2.2. Quality Assessment. The quality assessment criteria are used to determine the rigor and credibility of the research methods used and the relevance of the studies. This assessment is important to limit bias in conducting this SLR, to obtain
insight into potential differences, and to support the interpretation of the results. Three
main quality assessment criteria have been applied that are based on the assessment
criteria introduced in Kitchenham and Charters [2007] and Kitchenham et al. [2009].
We used a checklist-based scoring procedure to evaluate the quality of each selected
study and to provide a quantitative comparison between the studies. The scoring procedure has only three possible answers: “Yes = 1,” “Partly = 0.5,” or “No = 0.” Therefore, the quality score of a given study is computed by summing up the scores of its answers to the quality assessment questions (a minimal sketch of this computation follows the list of criteria below).
—Relevance. Are all the collected studies relevant to the objective of this literature
review? In this SLR, the quality assessment is specifically focused on accumulating
only those studies that report adequate information to answer the targeted research
questions. The quality assessment has been performed according to our inclusion and
exclusion criteria by the first two researchers who have independently reviewed each
study. In case of contradicting opinions, the third researcher reviewed and resolved
the issues together with the two researchers.
—Coverage. Does the literature search cover all relevant studies? In this SLR, the two researchers have independently reviewed each study in a number of iterations to ensure that none of the relevant studies are missed. To achieve this, the search is performed on the entire list of relevant studies, followed by the screening of the titles, abstracts, keywords, and conclusions. Moreover, a snowballing process is conducted to broaden the scope of selected studies. Finally, in-depth analyses have been performed after accessing the full text.
—Validation. Do the collected studies contain adequate data and information? In this
SLR, it is analyzed whether the primary studies contain the necessary information
to answer the targeted research questions. In particular, we devised a number of
questions to assess the validation of the relevant studies, such as: Is the technique/
tool clearly defined? How rigorously is the technique evaluated? Does the study add
value to academia or industry?
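As a minimal sketch of the checklist-based scoring described above (the Yes/Partly/No scores come from the text; the example answers for a hypothetical study are invented):

# Answer scores as defined in the scoring procedure above.
SCORES = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def quality_score(answers):
    # The quality score of a study is the sum of the scores of its
    # answers to the quality assessment questions.
    return sum(SCORES[answer.lower()] for answer in answers)

# Hypothetical answers of one study to three assessment questions
# (e.g., regarding relevance, coverage, and validation).
print(quality_score(["yes", "partly", "yes"]))  # 2.5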
3.2.3. Data Extraction and Data Synthesis. All remaining 96 primary studies were analyzed in depth, and relevant data were extracted from them. In the data extraction step, we used spreadsheets to record and correlate the extracted information. In this SLR, we concentrate on extracting the data items shown in Table III, which also indicates the research questions to which each extracted item is relevant. After the data extraction stage was completed, we synthesized the resulting information such that it is suitable and sufficient for answering the review questions. We explain the rationale for extracting the data and analyze the extracted data in Section 4. For further details on the classification and encoding of every aspect related to the extracted data, please refer to Appendix C.

Table III. Data Collection (extracted data, with the relevant research questions)
Author(s): Study overview
Title: Study overview
Year: Study overview, RQ1
Studied domains: Study overview
Publication types and venues (cf. Section 4.1): Study overview, RQ1
Active research groups (cf. Section 4.1): Study overview, RQ1
Types of studied behavioral models/diagrams (cf. Section 4.2): RQ2.1
Consistency checking types (cf. Section 4.3): RQ2.2, RQ4
Consistency checking techniques (cf. Section 4.4): RQ2.3, RQ4
Formalization methods (cf. Section 4.4): RQ2.3, RQ4
Degree of inconsistency handling (cf. Section 4.5): RQ2.4, RQ4
Degree of automation support (cf. Section 4.6): RQ2.5, RQ4
Tool support for consistency checking (cf. Section 4.6): RQ2.5, RQ3, RQ4
Types of study and evaluation (cf. Section 4.7): RQ2.6, RQ4
Citations (cf. Section 4.8): RQ3
Development tool support and integration (cf. Section 4.8): RQ3, RQ4
Levels of application evidence (cf. Section 4.8): RQ3, RQ4
4. RESULTS
In this section, we summarize the main results obtained in the systematic review together with an analysis of the collected data in order to determine current research trends and identify existing gaps and open problems of the current methods. Table XVI in Appendix A shows the artifacts studied by each approach, the type of semantic domain and consistency checking technique used in the existing literature, and the studied
domains. For a list of these selected studies with full details, please refer to Table XVII
in Appendix B.
4.1. Historical Development (RQ1)
This section presents the results on the research and publication trends (i.e., time and venue) and the active research groups in the context of behavioral model consistency checking. Figure 2 shows an overview of the distribution of selected studies per year grouped by publication type (e.g., journal, conference, workshop, symposium). We can see that consistency checking of software behavior models has drawn considerable attention and became an active field at the beginning of the 2000s.
Fig. 2. Primary studies per year grouped by publishing venues.
Table IV. Active Research Groups and Numbers of Studies
Affiliations Studies Total
University of Paderborn, Germany S16, S17, S26, S29, S30, S39, S55 7
Nanjing University, China S20, S31, S32, S47, S73, S86 6
University of Potsdam, Germany S72, S81, S82, S83, S84 5
Universität München, Germany S5, S37, S64, S92 4
Eindhoven University of Technology, The Netherlands S19, S77, S78, S82 4
University of Toronto, Canada S56, S62, S63 3
Lingnan University, Hong Kong S88, S89, S90 3
The number of studies reached a peak around 2006–2008. Apart from that, the trend of research in behavioral model consistency checking seems stable over time. We did not set a lower boundary for the year of publication in our search process, yet the timeframe of the identified studies also reflects the period of activity and maturation of the behavioral model consistency checking field.
With regard to the active research groups within the area of behavioral model consistency checking, we looked at the affiliation details of the selected primary studies. The assignment of contributed studies to each active research group is based on the affiliations given in these studies. Table IV presents the active research groups (with at least three publications within behavioral model consistency checking) along with the corresponding number of contributed studies. The results show that the University of Paderborn, Germany, and Nanjing University, China, are the leading groups in terms of the number of publications.
4.2. Targeted Software Models (RQ2.1)
This subsection discusses different types of software models tackled in the current
literature regarding behavior model consistency. Table XVI in the Appendix A shows
the results of our SLR regarding targeted software models (i.e., the behavioral models
being investigated for consistency checking). In Figure 3, we depict the distribution
of different behavioral models considered in the selected studies. The UML specification 1.x uses the term “statecharts,” while the UML specification 2.x switches to “state
machines” [OMG 2011b, Sec. 15] instead. Please note that UML statecharts (or UML
state machines) are based on or derived from Harel’s statecharts, which extends the
classic notion of finite state machines with additional support for hierarchy, concurrency, and communication [Harel 1987]. For this reason, we use the term “statecharts”
in this study to denote the aforementioned variants of Harel’s statecharts and “state machine” to explicitly denote the classic definition of finite state machines.
Fig. 3. Types of studied software behavioral models.
Some other types of behavioral models, such as live sequence charts (LSCs) and Simulink Stateflow, are also extensions of Harel's statecharts but target different modeling purposes and application domains. Thus, we opted to separate these types of models as shown in Figure 3. We note that “process model” is an umbrella term for BPMN, BPEL, and workflow models that are used for describing the behavior of process-aware information systems (PAIS). The category “Other” indicates a mix of various types of
other behavioral models that do not belong to any of the aforementioned categories.
We can see that research on behavioral model consistency checking focuses on four major types of behavior models: statecharts (36.59%), sequence diagrams (21.14%), process models (14.63%), and activity diagrams (9.96%). Less attention (i.e., 1.63%) has been paid to collaboration diagrams, LSCs, Stateflow, and labeled transition systems.
4.3. Types of Consistency Checking (RQ2.2)
Another aspect to be considered in our study is the types of consistency checking tackled
by the selected publications. Static consistency checking techniques examine and analyze the targeted behavioral models and/or their abstracted versions without running the system. Examples are model checking techniques that systematically and exhaustively explore the states of software systems. The advantage of static checking is that it
can be performed in early phases of software modeling and development where no executable products are produced yet. However, static checking can be computationally expensive due to the cost of exhaustive analysis of large and complex models. For instance,
model checking often suffers from the problem of state explosion [Clarke et al. 2011].
In contrast, dynamic consistency checking aims to reveal inconsistencies while the
system is running, either by code instrumentation or monitoring. Dynamic checking can
achieve better computational performance, as it only examines a small subset of the system's state space (i.e., the actual execution traces). However, dynamic checking techniques require a running system and, in contrast to static checking, cannot reveal all potential errors.
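To illustrate the flavor of static (exhaustive) checking discussed above, the following minimal sketch (our own, not taken from a reviewed study) explores every reachable state of a small transition system to decide whether an undesired state can ever be reached; the rapid growth of such state spaces is exactly the state-explosion problem cited above. The system and the "bad" state are illustrative assumptions.

from collections import deque

# Transition system as a dict: state -> list of successor states.
transitions = {
    "idle": ["requesting"],
    "requesting": ["granted", "denied"],
    "granted": ["idle"],
    "denied": ["idle", "deadlock"],
}

def reachable_bad_state(start, bad):
    # Exhaustive breadth-first exploration of the state space.
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == bad:
            return True  # inconsistency: the bad state is reachable
        for succ in transitions.get(state, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

print(reachable_bad_state("idle", "deadlock"))  # True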
Table V. Types of Consistency Checking
Static: S1–S51, S53–S69, S71–S96 (Total: 94)
Dynamic: S65 (Total: 1)
SYM/SIM: S48, S50, S52, S70, S17, S96 (Total: 6)
Endogenous: S15, S25, S37, S42, S44, S47, S55, S56, S61, S67–S69, S71, S74, S76, S79 (Total: 16)
Horizontal (Exogenous): S1–S3, S6–S10, S13–S20, S22–S28, S30, S32–S36, S38, S41, S42, S44–S48, S50, S54, S56–S60, S62, S63, S65–S68, S70, S73, S77, S78, S84–S88, S90–S96 (Total: 66)
Vertical (Exogenous): S4, S5, S11, S12, S15, S21, S29, S31, S39, S40, S43, S49, S51–S53, S56, S58, S64, S71, S72, S74, S75, S80–S83, S89 (Total: 27)
Fig. 4. Primary studies per year grouped by types of consistency checking.
Apart from static and dynamic consistency checking, there are studies using symbolic execution or simulation approaches to validate the consistency of models. These
approaches often consider a subset of input parameters and verify the response and
consistency of the systems. As such, these studies do not exactly belong to either of
the aforementioned categories. Therefore, we placed them in an additional category,
namely, “symbolic execution/simulation” (SYM/SIM) to refer to these kinds of studies.
Table V depicts the distribution of consistency checking types in terms of static, dynamic, and symbolic execution/simulation checking. “Static consistency checking” is addressed prominently in 94 studies (93.07%), whilst “symbolic execution/simulation”
is used in 6 studies (5.94%). Only one study (0.99%) covers dynamic consistency
checking. Note that the sum of the numbers of studies on consistency checking types
exceeds the total number of studies within a specific category, because the same study
could address more than one type of consistency. Four studies (4.17%) out of a total
of 96 use a combination of static+simulation consistency checking, whereas only one
study (1.04%) uses a combination of static+dynamic consistency checking.
Figure 4 presents a bubble plot distributed over two dimensions: year of publication and consistency checking type. The results show that the majority of
research attention has been paid to the static consistency checking throughout the
years. Only one study has been found on dynamic consistency checking in the year
2009; however, the SYM/SIM consistency checking has received more attention in the
timeframe 2010–2013 but is still at a rather low level.
Based on the aforementioned statistics, we can see that there is much less research on dynamic and simulation consistency checking than on static consistency checking. This is perhaps in accordance with Giannakopoulou and Havelund's [2001] observation that dynamic checking often requires special treatments, ranging from extended semantics of temporal logics to monitoring and analysis algorithms.
Table VI. Support for the Notion of Time in Consistency Checking
Level 1: S1–S5, S7, S10–S17, S19, S23–S29, S33, S37, S40, S41, S43, S45, S46, S48, S49, S51–S56, S58, S60–S68, S70–S72, S74–S94, S96 (Total: 72)
Level 2: S6, S18, S22, S30, S32, S34, S39, S47, S50, S57, S73 (Total: 11)
Level 3: S8, S9, S20, S21, S31, S35, S36, S38, S42, S44, S59, S69, S95 (Total: 13)
We also examined the existing methods from the perspective of distinguishing endogenous and exogenous consistency checking. As mentioned in Section 2, endogenous
consistency concentrates on a single behavioral model, whereas exogenous consistency
targets various types of models. In addition, exogenous consistency is further classified
into vertical and horizontal consistency. Table V depicts the support of endogenous and
exogenous (i.e., vertical and horizontal) consistency types of the selected primary studies. We can see that exogenous consistency (85.32%) is addressed more prominently
than endogenous consistency (14.68%). Regarding exogenous consistency, we found
that 66 studies (60.55%) focus on horizontal consistencies, while 27 studies (24.77%)
investigate vertical consistencies.
We found that 7 studies (7.29%) out of 96 tackle the combination endogenous+horizontal and 2 studies (2.08%) the combination endogenous+vertical. There are two studies (2.08%) using the combination horizontal+vertical, and one study (1.04%) uses endogenous+horizontal+vertical.
The year-wise distribution of endogenous and exogenous consistency checking types
is shown in Figure 4. We can see that much research attention was paid to endogenous consistency checking in 2007 and 2008. Apart from that, horizontal consistency received more attention from 2005 to 2009, while comparatively less attention has been given to vertical consistency checking.
As the complexity of the systems increases, the timing requirements become more
and more stringent, especially if the system’s reliability is a key concern. It is, therefore,
necessary to investigate which studies support time-related consistency checking. Considering the existing methods in terms of support for a notion of time during consistency checking, we propose a three-level scale of time support: “1 = Not considered,” “2 = Implicit use of an underlying timing model or rules,” and “3 = Explicit timing model and analysis” (see Table XVIII in Appendix C for further details of the timing support scale). The first level refers to
studies that do not consider time during consistency checking. The second and third
levels refer to studies that consider implicit usage of an underlying time model and
an explicit timing model and analysis (i.e., using explicit timed models or real-time
constraints), respectively.
Table VI presents the result of our SLR regarding the notion of time in consistency
checking. We found that a majority of the existing studies do not consider support for
time, that is, 72 studies (75.00%) out of a total of 96. Only 11 studies (11.46%) fall into the second level (implicit use of an underlying timing model or rules), whereas 13 studies (13.54%) explicitly address timed models and real-time constraints. Naturally, these 13 studies stem from the domain of embedded and real-time systems, where time-critical conditions are often the highest priority.
4.4. Consistency Checking Techniques (RQ2.3)
This section discusses the techniques used for checking the consistency of software behavioral models. In particular, we have explored two major aspects of consistency checking techniques: the semantic domains (and corresponding formal paradigms) employed by existing methods for formalizing the input models and constraints, as well as the techniques (e.g., model checking, logical inference, theorem proving) used for performing consistency checking.
Table VII. Semantic Domains
State Transitions: S2–S6, S8–S14, S16, S18, S20–S25, S27, S28, S30–S34, S36–S39, S43–S48, S50, S51, S53–S57, S60, S63–S66, S68, S69, S71–S74, S76–S79, S82–S84, S86, S87, S90, S91, S93, S94, S95, S97 (Total: 69)
Logic: S4, S11, S13, S18, S19, S21–S23, S26, S32, S34, S39, S42, S44, S45, S47, S50, S54, S57, S61, S67, S79, S84, S92–S94 (Total: 26)
Process Algebra: S1, S8, S15, S29, S35, S40, S41, S52, S58, S59, S62, S80, S85, S87–S89, S96 (Total: 17)
Others: S7, S17, S26, S49, S52, S70, S72, S75, S81 (Total: 9)
The descriptive results of investigating the semantic domain used by each approach
are shown in Table XVI of Appendix A. A semantic domain, when appropriately employed, can help reduce the ambiguity in modeling behavior of software systems with
precise mathematical terms [Engels et al. 2001]. In addition, grounding in a solid semantic domain enables the use of several model checkers and theorem provers for
consistency checking. We adopt the classification of semantic domains into the following
formal paradigms as proposed by Habrias and Frappier [2006].
—State transition: The description defines the transition relation on a set of states.
—Algebra: The description specifies the set of operations and their relations. An event
is denoted by a function (also known as an operation). The behavior of functions is
specified by a set of equations (axioms) that describes how these functions are related.
A special form of algebra, namely, process algebra, in which operations are applied
to elementary processes and events, describes how events may occur.
—Logic: The description specifies behavior by means of logical formulas (axioms), from which properties are derived by inference [Lucas et al. 2009].
—Other: Formal semantic domains that do not exactly fit any of the aforementioned definitions are included in this category.
Table VII presents the studies classified according to the aforementioned semantic domains. 57.02% of the domain assignments (i.e., 69 out of 121) are based on state transitions. 17 studies (14.05%) use process algebra as the semantic domain. Twenty-six studies (21.49%) leverage different kinds of logics as their semantic domain. The category “Other” shown in Table VII embraces all studies (9, i.e., 7.44%) that use a semantic domain different from the three domains mentioned earlier. Please note that a study may employ more than one semantic domain and, therefore, may be based on multiple domains. 20 studies (20.83%) out of a total of 96 use the combination of state transitions+logic as their semantic domains, while two studies (2.08%) use state transitions+process algebra. There are three further studies, one each (1.04%), using the combinations state transitions+other, logic+other, and process algebra+other.
Considering the consistency checking techniques of the existing studies, we adopted
and extended the four main categories proposed by Spanoudakis and Zisman [2001],
which are: model checking (i.e., using or combining with existing model checkers), specialized algorithm (i.e., algorithms designed for analyzing models to detect
inconsistencies), logical inference (i.e., using formal inference techniques to derive inconsistencies), and theorem proving (i.e., reasoning using theorem provers).
Analyzing the selected primary studies, we have found that a wide range of studies used specialized algorithms to identify and detect the inconsistencies of software models (i.e., 51 studies, or 51.00%, out of 100 technique usages).
Table VIII. Consistency Checking Techniques
Model Checking: S1, S4, S8, S9, S11, S13–S16, S18, S22, S23, S29, S32, S33, S34, S36, S39, S44–S47, S50, S53, S54, S57–S59, S63, S70, S78, S84, S85, S87, S88, S90–S94, S96 (Total: 41)
Specialized Algorithm: S2, S4–S7, S11, S12, S17, S20, S21, S24–S28, S30, S31, S35, S37, S38, S40, S41, S43, S44, S48, S49, S51, S52, S55, S60, S62, S64, S65, S66, S68, S69, S71–S77, S79, S80–S83, S86, S89, S95 (Total: 51)
Logic Inference: S10, S19, S42, S61, S67 (Total: 5)
Theorem Proving: S3, S56, S67 (Total: 3)
The main reason is that these approaches investigate different sets of inconsistency problems in particular domains and, therefore,
often propose specific algorithms or heuristics for the particular contexts being studied.
About 41 studies (41.00%) use model checking techniques, given the existence of several model checkers such as NuSMV (http://nusmv.fbk.eu), SPIN (http://spinroot.com), FDR (https://www.cs.ox.ac.uk/projects/fdr), UPPAAL (http://www.uppaal.org), GROOVE (http://groove.cs.utwente.nl), and LTSA (http://www.doc.ic.ac.uk/ltsa), to name but a few.
We note that, even though a reasonable number of studies (21.49%) are based on the
logic semantic domain as mentioned earlier, logic inference techniques are used only in
five studies for identifying inconsistencies or checking consistency of software models
(5.00%). The use of theorem proving techniques is even less frequent; they are used in only three studies (3.00%). These results partially explain the emergence of model
checking techniques backed by powerful model checkers [Clarke et al. 1999].
Nonetheless, model checking techniques cannot thoroughly cover every aspect of consistency problems and may suffer from the problem of state explosion [Clarke et al.
2011]. This could explain why a majority of the existing methods need to develop
specialized algorithms for particular purposes. The summary of consistency checking
techniques is shown in Table VIII. Please note that the sum of the numbers of studies
on a technique category exceeds the total number of studies because some studies used
more than one technique. Two studies (2.08%) out of a total of 96 use the combination model checking+specialized algorithm to identify the inconsistencies of software
models, while one study (1.04%) uses the combination specialized algorithm+theorem
proving, and one study (1.04%) uses the combination logic inference+theorem proving
techniques.
4.5. Inconsistency Handling (RQ2.4)
Detecting behavioral inconsistencies of software systems is important but is still far from a complete solution. Handling inconsistencies has been considered a central task in consistency management [van Lamsweerde et al. 1998; Spanoudakis and Zisman 2001]. Inconsistency handling addresses how to deal with inconsistencies and how to analyze the impacts and consequences of particular ways of dealing with them [Spanoudakis and Zisman 2001]. We derive a scale from 1 to 5 for different degrees of inconsistency handling, based on the set of activities proposed by Spanoudakis and Zisman [2001]:
—1. Not mentioned/not considered
—2. Systematic inconsistency diagnosis
—3. Identifying handling actions
—4. Evaluating costs and risks
—5. Automated action selection and execution

Table IX. Degree of Inconsistency Handling
Level 1: S2–S10, S13, S14, S16, S17, S20–S27, S29–S34, S36–S38, S40–S44, S46–S59, S63–S70, S72–S78, S80–S82, S85–S96 (Total: 78)
Level 2: S1, S11, S12, S15, S18, S19, S22, S28, S35, S39, S45, S60–S62, S71, S79, S83, S84 (Total: 18)
Levels 3–5: none (Total: 0)
In summary, inconsistencies can be handled by different activities including systematically diagnosing inconsistencies and their causes, identifying necessary actions for
resolving inconsistencies, evaluating costs and risks, and automatically selecting and
execution the corresponding actions.
Table IX shows our analysis results for inconsistency handling. We found that
most of the existing approaches provide little or even no support for this important
aspect. In particular, a large number of studies (78, i.e., 81.25%) do not consider any
sort of inconsistency handling. Only 18 studies (18.75%) support certain forms of inconsistency diagnosis, for instance, analyzing the counterexamples or error traces back
to the origins of the problems. Unfortunately, none of the studies supports any higher
level of inconsistency handling (i.e., Levels 3 to 5).
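To illustrate what Level 2 diagnosis amounts to in practice, the following minimal sketch (our illustration, not taken from any primary study; the trace format and the state-to-element mapping are hypothetical assumptions) traces a model checker's counterexample back to the model elements involved:

    from typing import Dict, List

    def diagnose(counterexample: List[str], state_to_element: Dict[str, str]) -> List[str]:
        """Map each formal state in a counterexample trace back to the
        originating model element, preserving order and dropping duplicates."""
        seen, origins = set(), []
        for state in counterexample:
            element = state_to_element.get(state, "<unknown element>")
            if element not in seen:
                seen.add(element)
                origins.append(element)
        return origins

    # Hypothetical trace violating "every request is eventually answered":
    trace = ["s0", "s1", "s1"]  # the system loops in s1 without ever responding
    mapping = {"s0": "SequenceDiagram.sendRequest", "s1": "Statechart.Waiting"}
    print(diagnose(trace, mapping))  # ['SequenceDiagram.sendRequest', 'Statechart.Waiting']

Higher levels of handling (Levels 3 to 5) would build on such a diagnosis by proposing, costing, and finally executing repair actions.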
4.6. Automation Support (RQ2.5)
Some prior secondary studies, such as Lucas et al. [2009], examined the support for
automation in model consistency checking, but they propose only two levels: “yes” (automated) and “no” (manual). In fact, there exist techniques where human intervention is partially necessary for specifying models or consistency rules, while the rest
can be performed automatically. Moreover, there are different phases in consistency
checking, including specification, checking, and possibly handling inconsistencies (as
mentioned earlier). Therefore, we first refined the automation support into a three-level scale covering the aforementioned cases:
“1 = Manual,” “2 = Semi-automated,” and “3 = Fully automated.”14 Moreover, we investigated the automation support in the three main phases of consistency checking (i.e.,
specification, checking, and handling inconsistencies) as analyzed earlier and assigned
the corresponding level to each primary study.
We noticed that the consistency checking phase, due to sound formalization and
well-defined checking algorithms and/or techniques, can be performed automatically
(i.e., reaching Level 3, “Fully automated”). Unfortunately, this is not the case for inconsistency handling. As only a few of the existing approaches consider a limited form
of inconsistency handling, namely the diagnosis of the checking results, the level of
automation supported by these approaches is Level 2 (i.e., “Semi-automated”) because
human intervention is necessary for investigating the yielded errors. For these reasons,
we do not present details for the two phases consistency checking and inconsistency
handling but rather focus only on the specification phase.
14Please refer to Table XX in Appendix C for further details of the automation support levels.
Table X. Automation Support for Specification Phase
Automation Level Studies Total
1 S3, S5, S10, S11, S12, S18, S19, S21, S22, S23, S26, S29, S35, S43, S44, S46, S47, S48, S50, S54, S57, S59, S65–S67, S70, S77, S79, S93, S94 30
2 S1, S2, S4, S6, S7, S9, S13–S17, S24, S25, S27, S28, S30–S34, S36, S38–S42, S45, S49, S51–S53, S55, S56, S58, S60–S64, S68, S69, S71–S76, S78, S80–S92, S95, S96 62
3 S8, S20, S37, S74 4
Table XI. Evidence of Tool Support for Consistency Checking
Level Studies Total
1 S2, S6, S11, S17, S24, S27, S38, S42, S64, S66, S68, S71, S73, S75, S81, S83 16
2 S3, S4, S9, S13, S15, S22, S23, S29, S31, S33, S40, S41, S43, S47, S50, S55–S59, S69, S70, S74, S77, S78, S80, S85, S87–S91, S93, S95 34
3 S1, S5, S7, S8, S10, S12, S14, S16, S18–S21, S25, S26, S28, S30, S32, S34–S37, S39, S44–S46, S48, S49, S51–S54, S60–S63, S65, S67, S72, S76, S79, S82, S84, S86, S92, S94, S96 46
We present the distribution of the existing approaches in the literature with respect
to the aforementioned scale of automation support in the specification phase in Table X.
A considerable number of studies (31.25%) fall into Level 1 (i.e., “Manual”) because
they either assume the existence of formal logic constraints or require manual effort
in specifying the input consistency constraints (e.g., rules specified in LTL/CTL) or,
sometimes, the input models. We note that these manual tasks demand considerable
knowledge of the underlying formalisms and formal techniques and, therefore, are
often not easy to use for many software engineers. A larger number of studies (64.58%)
assume the existence of the input models but propose automated transformations of
these inputs into formal representations. For this reason, we consider them “semi-automated” because, strictly speaking, the input models assumed by these approaches
are often manually created. Nonetheless, these are design models that are created during
the modeling and development of software systems anyway, and automated
translation support is provided. Only four studies (4.17%) can be considered “fully automated,”
as they encode consistency rules in the corresponding algorithms or libraries and,
therefore, do not require any manual effort in the specification phase. However, the
downside could be that hard-coded consistency rules lessen the extensibility
and generality and, therefore, the scope of application of these approaches.
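The following minimal sketch illustrates the kind of automated transformation mentioned above for the semi-automated category: a simple state-transition model, assumed to have been extracted from a design model, is rendered in a NuSMV-like input language (the input format is our hypothetical assumption, not a format prescribed by any primary study):

    def to_smv(states, initial, transitions):
        """Render a flat state machine as a NuSMV-like module."""
        lines = ["MODULE main",
                 "VAR",
                 f"  s : {{{', '.join(states)}}};",
                 "ASSIGN",
                 f"  init(s) := {initial};",
                 "  next(s) := case"]
        for src, dsts in transitions.items():
            lines.append(f"    s = {src} : {{{', '.join(dsts)}}};")
        lines.append("    TRUE : s;")  # default case: remain in the current state
        lines.append("  esac;")
        return "\n".join(lines)

    print(to_smv(["idle", "busy"], "idle", {"idle": ["busy"], "busy": ["idle", "busy"]}))

Consistency rules (e.g., LTL properties) would still have to be supplied manually, which is precisely why such approaches are classified as semi-automated rather than fully automated.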
4.6.1. Tool Support for Consistency Checking (RQ2.5, RQ3). Another important dimension
is the tool support provided by the existing approaches for consistency checking.
We believe that more or better tool support would showcase a proof of feasibility and
attract the attention of practitioners to the corresponding techniques. The lack of any
sort of tool support (e.g., research prototypes, demos) will hinder practitioners from using
or at least exploring the corresponding proposed methods and techniques and, therefore,
hinder the transfer of research results into industrial practice.
We propose a three-level scale for evaluating tool support, as shown in Table XXII: “1 =
Not mentioned/Not considered,” “2 = Only using existing tools/libraries,” and “3 =
Prototypes (includes using existing tools/libraries).”15
The distribution of levels of tool support found in the literature on software model
consistency checking is shown in Table XI. We did not find evidence of tool support
for detecting and handling inconsistencies in 16 studies (16.67%). Thirty-four studies (35.42%) used existing tools and libraries to carry out consistency checking, for
instance, model checkers such as SPIN, (Nu)SMV, UPPAAL, FDR, GROOVE, LTSA,
and Maude, or theorem provers such as SPASS and Z/Eves. Out of 96, there are 46
studies (47.92%) that provide specific tool support for checking consistency between
software models, including the development of prototypes or tool-chains that combine
the implementation of various aspects such as model transformations, consistency
checking algorithms, and/or existing tools and libraries. Figure 5 shows the distribution of primary studies per year grouped by levels of tool support.
15Please refer to Table XXII in the Appendix C for a full description of tool support levels.
Fig. 5. Levels of tool support per year distribution.
Table XII. Types of Study and Evaluation
Types Studies Total
DC+EX S7, S16, S47, S50, S57 5
RA+DC S2, S13, S30, S31, S52 5
RA+DC+EX S8, S9, S15, S18–S20, S24, S25, S28, S32, S36–S39, S41, S45, S46, S48, S49, S53, S54, S58, S59, S61–S66, S74, S76, S79, S82, S87, S90 35
RA+DC+EX+LS S12, S22, S26, S35, S55, S72, S81, S83 8
RA+EX S1, S3, S4, S6, S10, S11, S14, S17, S21, S23, S27, S29, S33, S34, S40, S42–S44, S51, S56, S60, S67, S68, S71, S73, S75, S77, S78, S80, S84–S86, S88, S89, S91, S93–S96 39
RA+EX+ER S5, S69, S70, S92 4
We can see that tool support based on existing tools and libraries
(i.e., Level 2) received more attention from 2005 to 2008, whereas the
development of prototypes or tool-chains (i.e., Level 3) peaked in 2014.
4.7. Types of Study and Evaluation (RQ2.6)
As a part of this SLR, we investigated the nature of the evidence presented in the selected
primary studies and analyzed how evaluations are performed and reported. The main
motivation is that different types of studies provide different strengths of evidence
and evaluation. Practitioners could take the strength of evidence and evaluation into
consideration before adopting a specific methodology and tool.
We adopted the classification of types of study and evaluation proposed by Chen and Ali Babar [2011]. The types of study and evaluation are
Rigorous Analysis (RA), Case Study (CS), Discussion (DC), Example (EX), Experience
Report (ER), Field Study (FS), Laboratory Experiment with Human Subjects (LH),
Laboratory Experiment with Software Subjects (LS), and Simulation (SI).16
Analyzing the data extracted from the selected studies, we see that a study may
be a combination of different types. For instance, S8 presents and discusses formal
foundations and consistency problems, and gives concrete examples in BPEL. Thus, we
clustered the data and divided the studies into appropriate groups based on the combination
employed by each study. Table XII shows the results of the data analysis and clustering.
We observed that a large share of the studies use the combination RA+DC+EX (36.46%)
or RA+EX (40.63%). There are eight studies (8.33%) using the combination RA+DC+EX+LS that, in addition to what is done in RA+DC+EX, also perform
some sort of evaluation with software models.
16Please refer to Table XXIV in the Appendix C for a full description of these categories.
Fig. 6. Sum of rigor scores per dimension.
For instance, they rigorously estimate the scalability, correctness, or algorithm precision, or compare with other approaches.
Only a few studies (4.17%) use RA+EX+ER, reporting experiences in academic settings or
application scenarios derived from industrial practice. Unfortunately, none of the existing studies presents clear evidence using any sort of empirical study or validation.
Along with the aforementioned investigation of types of study and evaluation, we
also assessed the level of rigor of the reported evaluations. Rigor is concerned with
assessing how an evaluation is performed and reported [Galster et al. 2014]. In particular, we examined how well the selected studies report their evaluations with respect
to three dimensions: the context of the evaluations (C), the design of the studies conducted in the evaluation (S), and the validity discussion (V). We leverage the score for
each rigor dimension as proposed by Ivarsson and Gorschek [2011], which comprises three
levels: “1 = Weak,” “2 = Medium,” and “3 = Strong.”17 Figure 6 depicts an overview of
the results regarding the aforementioned dimensions of the evaluations' rigor.
We note that the rigor measure proposed above could be more appropriate for empirical studies. Applying the rigor measure uncovered that many studies mention
dimensions related to rigor but do not describe them fully. Nonetheless, we consider
this measure in our study in order to show the lack of rigorous evaluations and provide insights that researchers can consider when designing and conducting their
studies to achieve a reasonable degree of quality.
The results shown in Figure 6 indicate that very few studies scored the
strongest level in any dimension. The evidence regarding describing the study design
and the validity discussions is particularly poor. This also reflects our analysis of the types of
study and evaluation, where only a small number of studies perform rigorous
experiments with software models.
To illustrate the distribution of rigor measures, we considered a collective rigor
metric for each study, namely R, which is calculated as R = C + S + V, following the
suggestions of Ivarsson and Gorschek [2011] and Galster et al. [2014]. Thus, the value
of R for each study is an integer in the range of 3–9. We show the distribution of the
collective rigor metric R in Figure 7.
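As a worked illustration of this metric (with hypothetical scores, not our actual extracted data), a study scored C = 2, S = 1, V = 1 obtains R = 4; computing and bucketing R for all studies yields the kind of distribution plotted in Figure 7:

    from collections import Counter

    scores = {"Sx": (2, 1, 1), "Sy": (1, 1, 1), "Sz": (3, 3, 3)}  # hypothetical (C, S, V)
    R = {sid: c + s + v for sid, (c, s, v) in scores.items()}     # R = C + S + V, range 3-9
    print(R)                    # {'Sx': 4, 'Sy': 3, 'Sz': 9}
    print(Counter(R.values()))  # histogram of collective rigor scores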
In accordance with our prior observation, the collective rigor scores of
the selected studies are skewed significantly towards the lower end: 3 (40 studies),
4 (32 studies), and 5 (19 studies). None of the studies achieves a
collective score of 7 or 8. The only study that reaches the best collective score is S26, in
which the authors present all three dimensions of evaluation rigor rigorously and thoroughly.
17Please refer to Table XXV in the Appendix C for a full description of the rigor levels.
Fig. 7. Distribution of the collective rigor metrics R.
Fig. 8. Trend of rigor dimension per year.
Figure 8 presents the temporal distribution of the rigor metrics per study.
We can see that the rigor of evaluations has increased over the last 10 years,
but overall it remains rather low.
4.8. Practical Impact (RQ3)
Analyzing the practical impact, or in other words, the effectiveness of the existing
methods for consistency checking of software behavioral models, is a very challenging
task because, to the best of our knowledge, there is no consensus on an ultimate measurement or metric for this
aspect. After studying other relevant surveys in the
fields of CS and SE, such as Alves et al. [2010] and Galster et al. [2014], to name
but a few, we propose to examine the practical impact of the existing methods from
different dimensions, including citations (how a study influences others), CASE/IDE
tool integration (support in the daily working environment of practitioners), tool support
(as a proof of feasibility; see Section 4.6.1), degree of evidence, and the applicability
of the proposed techniques in industrial settings.
4.8.1. Citations. Table XIII presents an overview of the top 10 most highly cited studies,
along with the year of publication, the total number of citations, and the average number
of citations per year. The citation numbers are obtained from Google Scholar.
Table XIII. Top 10 Most Cited Studies
SID Author(s) Year Total Citations Avg. Citations per Year
S6 Bernardi et al. 2002 322 25
S63 Schäfer et al. 2001 247 18
S55 Nejati et al. 2007 243 30
S27 Harel et al. 2002 196 15
S36 Knapp et al. 2002 188 14
S15 Engels et al. 2001 169 12
S76 Van der Aalst et al. 2008 152 22
S18 Eshuis & Wieringa 2004 137 12
S28 Hausmann et al. 2002 127 10
S83 Weidlich et al. 2011 124 31
Table XIV. CASE/IDE Support and/or Integration
Level Studies Total
1 S2–S4, S6, S7, S9–S15, S17–S23, S26, S27, S29–S33, S35–S43, S45–S47, S49–S52, S55–S59, S61–S68, S71–S78, S80–S96 80
2 S5, S24, S28, S34, S48, S53, S54, S60, S69, S70, S79 11
3 S1, S8, S16, S25, S44 5
We note that these numbers are rough estimates because Google Scholar may mistakenly
count citations of different authors who have similar or the same names. Also, please
note that these numbers can change over time. We sampled twice, on August 9, 2015
and August 19, 2015, and noticed slight differences. The presented numbers are those
obtained on August 19, 2015. Nonetheless, the average number of citations per year is quite stable,
as the change margins are small.
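The averages in Table XIII appear consistent with dividing the total citation count by the number of years elapsed since publication as of the 2015 sampling date; a minimal sketch of this reading, under that assumption:

    def avg_citations_per_year(total: int, year: int, sampled: int = 2015) -> int:
        # Assumption: average = total citations / years since publication, rounded.
        return round(total / (sampled - year))

    print(avg_citations_per_year(322, 2002))  # 25, matching S6 in Table XIII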
Because each of the studies in the top 10 is highly and frequently cited, we can
conclude that at least these studies have had substantial influence
on other researchers. Unfortunately, we could not find sufficient evidence to conclude
on any correlation between the citations and other aspects such as the used semantic
domains, methods, techniques, or application domains.
4.8.2. Development Tool Support and Integration. The second dimension of practical
impact that should be considered is the integration of the proposed techniques into the
working environments of practitioners (such as software analysts, software architects,
and software developers), that is, development tool support. Here, specifically, CASE18 tools
and IDEs19 are relevant. CASE/IDE integration could foster the practical use of
consistency checking techniques. It could help users of the development tools to obtain
feedback and support them in handling inconsistencies detected in the models. Researchers could benefit by receiving improvement suggestions from practitioners
who use the corresponding CASE/IDE tools in industrial settings.
We propose a three-level scale to assess the evidence of CASE/IDE integration that
comprises “1 = Not mentioned/Not considered,” “2 = Proposed/planned integration,”
and “3 = Fully implemented integration.”20 Table XIV shows the analysis results of
CASE/IDE integration found in the selected studies. The weakest level applies to
studies that do not consider or mention integration with any CASE/IDE at all.
There are 80 studies (83.33%) that fall into this level. On the other end of the spectrum,
there are only five studies (5.21%) that develop fully working integrations with existing
CASE/IDE tools (i.e., reaching the strongest level).
18CASE: Computer-Aided/Assisted Software Engineering.
19IDE: Integrated Development Environment.
20Please refer to Table XXI in the Appendix C for a full description of CASE/IDE support/integration levels.
Table XV. Levels of Empirical Evidence
Level Studies Total
1 – 0
2 S2–S11, S13–S19, S21, S23–S31, S34, S36, S38, S40–S47, S49, S51, S53, S54, S56–S58, S60, S62–S64, S66–S68, S71, S73–S80, S84–S96 73
3 S1, S12, S22, S32, S33, S39, S44, S48, S50, S59, S61, S72 12
4 – 0
5 S20, S35, S37, S52, S55, S65, S69, S70, S81, S82, S83 11
6 – 0
The rest of the studies (11.46%, i.e., 11 out of the total of 96 studies) mainly propose plans
for CASE/IDE integration or only partially implement it.
4.8.3. Application in Industrial Settings. For many research works in the field of software
engineering, it is important to apply and evaluate newly proposed or improved techniques in real industrial settings. As Ivarsson and Gorschek [2011] emphasize, research methods and techniques need to provide tangible evidence of the advantages
of using them in order to have an impact on industry. Ivarsson and Gorschek [2011] also suggest
a step-wise validation to enable researchers to test and evaluate in real settings with
real users and applications (i.e., empirical evidence).
Hence, in our study, we examined the evidence of industrial settings presented
in the chosen primary studies. The types of evidence could be critical for researchers to
identify new topics for empirical studies, and for practitioners to assess the maturity
of a particular method or tool. Some of the approaches use (small-scale) industrial
scenarios to illustrate the applicability of their techniques, while others evaluate their
approaches using small examples or cases (i.e., providing the readers with a rough
idea of how to apply and use the proposed approaches). We adopted the six-level scale
of evidence proposed by Alves et al. [2010]21:
—1. No evidence provided.
—2. Evidence obtained from demonstration or working out toy examples.
—3. Evidence obtained from expert opinions or observations.
—4. Evidence obtained from academic studies.
—5. Evidence obtained from industrial studies.
—6. Evidence obtained from industrial practice.
The data of our investigation and analysis show that 73 out of the total of 96 studies
(76.04%) are tested and evaluated using toy examples. Twelve studies (12.50%) use
scenarios obtained from expert opinions or observations. Only 11 studies (11.46%)
show a higher level of evidence based on industrial studies. None of the studies reaches
Levels 4 or 6. Table XV shows the distribution of empirical evidence found in the
literature on software model consistency checking. We note that there are no studies
at Level 1 either. After closer analysis, we found that this is mainly because
studies falling into this level were excluded for not being presented at an
acceptable quality level (cf. Section 3.2.2).
We roughly map the six levels of evidence mentioned above onto two
levels that indicate whether the proposed methods/techniques have been tried out in industrial settings. Levels 4–6 of evidence can be considered as being applied or tested in
industrial settings, while Levels 1–3 cannot. We summarized the data according to this
two-level scale, as shown in Figure 9. Only 11 studies have validated or evaluated their
proposed techniques and tools in any sort of industrial setting. The majority of studies
(85, i.e., 88.54%) did not (relying instead on other means such as rigorous analysis, discussion, and toy examples).
21Please refer to Table XXIII in the Appendix C for a full description of evidence levels.
Fig. 9. Studies tried in industrial settings.
We also observed that only one of these 11 studies (S70) was evaluated in an industrial
setting as early as 2001, while all the others were
evaluated between 2007 and 2015. That is, while the level of industrial evidence seems
rather weak in the total set of studies, it has clearly increased in recent years.
5. DISCUSSIONS
In this section, we discuss the limitations of the existing methods in order to answer the remaining research question RQ4 (cf. Section 3.1.2), and we examine the validity of our study.
5.1. Limitations of the Existing Methods
This SLR attempts to give a comprehensive view of the state of the art of
research in software behavioral model consistency checking. Based on our analysis
and interpretation presented in this article, we observe a number of limitations and
open problems in the existing methods that could point to interesting
further research directions.
5.1.1. Unbalanced Focus of Consistency Checking Problems. There is a predominance of
primary studies concentrating on static consistency checking, and little attention has
been paid to dynamic checking or consistency checking by means of symbolic execution
or simulation. The major advantage of static checking is that many inconsistencies
can be identified without the need for and/or a connection to any running system. However, static
checking also has considerable limitations, such as suffering from the state explosion problem [Clarke et al. 2011] or being computationally expensive. A large number of static
checking–based approaches use model checkers, which only support finite and discrete
data types (e.g., Boolean, integer, vector) and provide limited or even no support for
continuous data types (e.g., floating-point numbers) or strings.
Many distributed software systems are built upon highly flexible and dynamic architectures such as event-driven architectures [Mühl et al. 2006]. Besides, there is also an
emerging trend in software development towards supporting higher degrees of flexibility [van
der Aalst 2013; Goedertier et al. 2015] or on-the-fly adaptations [Macías-Escrivà et al.
2013] at runtime. Static consistency checking of these types of flexible systems might
not be sufficient on its own, as several unanticipated changes can happen during system execution. Symbolic execution or simulation can help to partially examine these dynamic
changes, but ultimately, dynamic checking, through either system instrumentation or
monitoring, should be employed to catch any potential inconsistencies. However,
runtime verification is much more challenging [Giannakopoulou and Havelund 2001]
than static checking and, therefore, needs to draw more attention from both researchers
and practitioners.
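As a simple illustration of dynamic checking (our sketch, not a technique from any primary study; the events and the model are hypothetical), a runtime monitor can observe events emitted by the running system and flag executions that deviate from the behavior allowed by a design model:

    class ConformanceMonitor:
        """Check an event stream on the fly against a behavioral model,
        represented here as a transition relation: state -> {event: next_state}."""

        def __init__(self, transitions, initial):
            self.transitions = transitions
            self.state = initial

        def observe(self, event):
            """Advance on an allowed event; raise on an inconsistent one."""
            allowed = self.transitions.get(self.state, {})
            if event not in allowed:
                raise RuntimeError(f"inconsistent event '{event}' in state '{self.state}'")
            self.state = allowed[event]

    monitor = ConformanceMonitor({"idle": {"start": "busy"}, "busy": {"stop": "idle"}}, "idle")
    monitor.observe("start")  # allowed: idle -> busy
    monitor.observe("stop")   # allowed: busy -> idle
    # monitor.observe("stop") # would raise: 'stop' is not allowed in state 'idle'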
Similar observations can be made (in a less severe form) for the support of
time-related consistency checking. A rather small portion of the existing methods investigates time-related inconsistencies, especially in the context of real-time constraints.
Given that the domain of embedded and real-time systems is a niche fragment compared
to the remaining application domains, we do not see this as a critical issue.
In summary, while the consistency topic overall is mature, much more work needs
to be done in areas such as dynamic checking, simulation, and timing consistency
checking of software behavioral models.
5.1.2. Focus on Specific Model Types. Besides consistency checking problems, we identified that research on behavioral model consistency checking focuses on four major types
of behavior models: statecharts, sequence diagrams, process models, and
activity diagrams. Considerably less attention has been paid to collaboration diagrams, LSCs, Stateflow, and labeled transition systems. The semantics of Stateflow and
labeled transition systems are quite complex, which makes these models a weak option
for modeling system behavior. Hence, more work is required on all these models.
5.1.3. Lack of Consideration for Inconsistency Handling. Spanoudakis and Zisman [2001]
emphasized the vital role of inconsistency handling and suggested many aspects to
consider for adequately handling consistency issues. Improper or inadequate inconsistency handling could lead to severe negative effects [Spanoudakis and Zisman 2001]
and could be one of the major obstacles to making a significant practical impact or to
transferring the proposed methods and techniques into industrial practice. Unfortunately,
we found only limited evidence of inconsistency handling, mostly restricted to diagnosing inconsistencies.
Most of the studies produce consistency checking outcomes in terms of formal representations such as counterexamples [Clarke et al. 1999] or erroneous traces of states.
However, this practice should be reconsidered because most practitioners
(e.g., software engineers) do not have sufficient knowledge of the underlying formal
methods to understand these outcomes. As a result, a considerable amount of time
and effort is consumed by practitioners in order to handle the detected inconsistencies. It would be more pragmatic to present the consistency checking outcomes in
appropriate forms of representation (such as textual or graphical notations or
visualizations) with suitable abstractions that practitioners can better comprehend and
use to resolve the problems detected in the models. Further, important activities such
as identifying corresponding handling actions, estimating the costs and risks of handling inconsistencies, and eventually choosing and executing handling actions need to
be taken into account.
5.1.4. Lack of Rigorous Studies, Evaluation, and Practical Impacts. The lack of rigorous studies, evaluation, and practical impact reflects the quality of the studies conducted in
the literature on software behavioral model consistency checking. As we observed in
Section 4.7, several primary studies tend to focus on introducing new ideas and solutions to consistency checking problems but fail to evaluate their contributions properly
and show the validity of their approaches in larger contexts. This issue is mainly due
to insufficient consideration of rigorous study design and validity discussion. Moreover, our
findings in Section 4.7 also show that many of the research results have not been tried
out in any sort of industrial setting. As a consequence, these results remain purely academic contributions and exercises. This can be explained by the tendency of academic results to
be context-specific and thus difficult to generalize to many industrial situations [Alves
et al. 2010]. While the average rigor of evaluations has increased over the last 10 years,
it is still at a very low level. The strongest case found in this SLR is a study that has
been thoroughly and rigorously designed with test cases based on industrial practice. We
expect that the categories of types of study and evaluation presented in Table XXIV and
discussed in Section 4.7 could be considered for the design and evaluation of future studies
in order to achieve more scientific and repeatable results that have greater practical impact.
Another aspect that is crucial for gaining more practical impact is to provide adequate tool support for consistency checking and/or integration with existing development tools (such as CASE/IDE tools). It is noteworthy that automation is usually achieved
by integrating the techniques with existing software development tools. The benefit of tool support for consistency checking and development tool integration, which
has been underlined earlier, is twofold: it provides intuitive means for fostering the
application of the implemented techniques in industrial settings, and it allows for obtaining valuable feedback from practitioners who are able to use the tools for their
daily tasks. Unfortunately, only a small number of research results from the existing
studies have been incorporated into commercially used development tools. The rest
focus either on developing research prototypes or on using existing tools and libraries
(e.g., model checkers, theorem provers).
5.2. Study Validity
The main threats to validity of this systematic review are potential bias in our search
strategy, in the selection of the studies to be included, and in data extraction. We aimed for
an exhaustive list of high-quality primary studies. Therefore, we strictly followed the
guidelines recommended by Kitchenham and Charters [2007] and took into account
the lessons learned in Brereton et al. [2007]. We developed a clear protocol for searching and choosing primary studies (cf. Section 3), including defining research questions,
inclusion and exclusion criteria, and the search strategy. The first author prepared the
protocol, and the second and third authors reviewed and assessed whether the review
plan was appropriate for addressing the research questions. We also obtained critical
comments from the journal reviewers on earlier versions of the article that led to
revisions of our research questions and replications of our study.
Despite a well-defined systematic procedure, we acknowledge that primary studies may be
missing, for instance, due to the coverage and quality of the search
engines and search portals that we used. This is especially critical because
relevant keywords are not standardized or clearly defined [Brereton
et al. 2007]. That is, various studies may use the same terms with different semantics.
To deal with this kind of threat, we performed a number of activities. First,
we considered our attempt reported in the earlier version of the article as a
pilot study. We then synthesized relevant studies in the literature and, with help from
internal and external reviewers, came up with a set of clearly defined concepts and
categories as shown in Section 2. We used this set as a basis for guiding the replication
of our study and for analyzing the selected primary studies. Second, we decided to carry
out a “snowballing” process [Budgen et al. 2011] in which two researchers manually
scanned and analyzed the references and citations of the primary studies retrieved
from the automated search with the search engines/portals. The main goal was to make
sure that our study also covers follow-up works that might exist but had not been
included in the search. As presented in Section 3.2.1, the snowballing process
identified 72 additional studies, out of which 26 studies have been selected.
Bias in study selection may be due to different interpretations and understanding among
the researchers involved in the study and potential misalignment between our term and
category definitions and other definitions. To alleviate these problems, two researchers
performed an in-depth analysis and selection of primary studies with respect to the predefined inclusion and exclusion criteria and clearly recorded the reasons for inclusion
or exclusion. The third researcher reviewed the selection independently and provided corresponding
judgments, especially for cases of discrepancy or “in doubt.” Then all researchers
discussed and came to a final conclusion for each study.
During the data extraction phase, we confronted some difficulties, and thus potential
bias, in extracting objective information from the selected primary studies. First of all,
this issue could be due to subjective interpretation and/or poorly described reporting of
relevant terms, methods, techniques, evaluations, and so on. Another difficulty is that
different studies could choose different research and evaluation methods, as mentioned
in Section 4.7, which can compromise the accuracy of data extraction. We
tried to minimize this bias by, first of all, defining the extraction strategy and data
format with clear encoding descriptions in order to ensure consistent extraction of
relevant data. Furthermore, the extraction was performed independently by two
researchers. Any sort of discrepancy among the involved researchers was adequately
recorded. The third researcher then performed the final review and cross-checking.
After that, we conducted discussion meetings to resolve any remaining divergences and
disagreements. Nonetheless, we acknowledge that there is still a certain possibility of
misunderstanding regarding the way we extracted data from the primary studies.
Considering reliability, which concerns the ability to repeat a study with the
same results, we compared the results obtained in the earlier version and in the
replication of this SLR. Apart from additional studies due to newly added keywords and
the extended timeframe (shifted from 2013 to July 2015), we observed no significant
differences between the two results. Given the well-defined protocol, data encoding,
extraction, and analysis, we expect that replications of our study will offer
similar results. Precisely speaking, there might be slight variations in the search strategy and/or the set of selected primary studies, but the underlying trends of our findings
should remain unchanged.
Regarding external validity, which concerns generalizing our study's findings, we
aimed for a representative set of primary studies with a clear research scope and objectives. We set no constraints on the time period, such that our SLR covers the span from
the beginning of research on consistency checking in the eighties to the year 2015.
Nonetheless, our findings might not be the same when generalized to broader
scopes or time periods. This is because our current findings are
mainly based on qualitative analysis. We could consider rigorous quantitative analysis
and inference to enable analytical and statistical generalizations,
but this is beyond the scope of this SLR.
6. CONCLUSIONS
This article presents a systematic review of the literature on software behavioral
model consistency checking. We aimed to identify the current trends in this field
and investigate various dimensions, ranging from the methods, languages, and techniques used
to the practical impact of the identified studies. To achieve this, we identified a
total of 1,770 studies by combining automated searches and manual snowballing, out
of which 96 have been studied in depth according to our predefined SLR protocol.
Through in-depth analysis and interpretation of the collected data, we obtained many
interesting findings along with a number of gaps and open problems that could provide
insights for further investigation. In summary, there are promising accomplishments
thanks to the sound formal foundations that have been employed in most of the existing
studies for consistency checking of software behavioral models. Nevertheless, future
studies should pay sufficient attention to improving the quality of study design
and conducting evaluations in order to achieve solid and repeatable scientific results that have
greater impact in both academia and industry. Moreover, the application of research
outcomes in industrial settings should also be considered and planned adequately
in the earlier stages. For this purpose, appropriate strategies for inconsistency handling and
better tool support for consistency checking and/or development tool integration should
be considered in future studies. Our proposed categories and encodings in this SLR,
which are adapted from other relevant studies, can also serve as a guideline
or recommendation for practitioners in evaluating relevant methods and techniques,
as well as for researchers in designing rigorous studies and evaluations that aim for
higher practical impact.
APPENDIXES
A. OVERVIEW OF SELECTED PRIMARY STUDIES
Table XVI. Overview of 96 Selected Primary Studies
SID | Studied Artifacts | Semantic Domains | Checking Techniques | Studied Domains
S1 | Statechart, Activity Diagram (fUML) | Process Algebra (CSP) | Model Checking | fUML
S2 | Sequence Diagram, Statechart | State Transition (Transition Matrix) | Specialized Algorithm (Transition Set/Matrix, Super State Analysis) | UML
S3 | Statechart | State Transition (Z) | Theorem Proving | UML
S4 | Disjunctive Modal Transition Systems (DMTS) | Temporal Logic (LTL), State Transition | Model Checking + Specialized Algorithm (Synthesis) | General
S5 | Modal Transition System | State Transition, Quantified Boolean Formulae (QBF) | Specialized Algorithm | General
S6 | Statechart, Sequence Diagram | Petri Net (Generalized Stochastic PNs) | Specialized Algorithm (Analysable PNs) | UML
S7 | Process Model | Guard-Action-Trigger (based on Event-Condition-Action) | Specialized Algorithm (Guard-Action-Trigger) | BPM+SOA
S8 | Process Model (BPEL) | Process Algebra (D-LOTOS), Durational Action Timed Automata (DATA) | Model Checking | BPM+SOA
S9 | Message Sequence Chart (MSC) | State Transition (Timed Message-Passing Automata) | Model Checking | UML
S10 | Object Behavior Logic Models | State Transition (OBL) | Logical Inference | Object-Oriented Design
S11 | Modal Transition Systems | Temporal Logic (CTL), State Transition | Model Checking, Specialized Algorithm (Repair) | General
S12 | SCR Behavioral Requirements, Control-flow–based PDL | State Transition, Graph (DFG) | Specialized Algorithm (Dataflow Analysis) | SCR/PDL
S13 | SDL, Message Sequence Chart (MSC) | Temporal Logic (LTL), State Transition (Finite State Automata) | Model Checking (SPIN) | Distributed Systems
S14 | Service Behavioral Interface (STS) | State Transition (Symbolic Transition System STS) | Model Checking (Maude) | SOC/SOA
S15 | Statechart, Sequence Diagram (UML-RT) | Process Algebra (CSP) | Model Checking | Embedded/Real-time
S16 | Activity Diagram, Visual Contract | State Transition, Graph | Model Checking | BPM+SOA
S17 | Activity Diagram | Typed Graph | Specialized Algorithm (Graph Transformation) | UML
S18 | Activity Diagram | Temporal Logic, State Transition | Model Checking | UML
S19 | Workflow | Ontology, Description Logic (DL) | Ontology Logic Reasoning | Workflow
S20 | Activity Diagram (UML-MARTE) | State Transition (Time Transition System TTS) | Specialized Algorithm | Real-time
S21 | Sequence Diagram, Statechart (UML/SPT) | Temporal Logic (CTL), Timed Automata | Specialized Algorithm (Schedulability Analysis) | Embedded/Real-time
S22 | Statechart | Temporal Logic (CTL), State Transition | Model Checking (AGAVE) | UML
S23 | Statechart | Temporal Logic (ACTL), Automata (Hierarchical Automata) | Model Checking (JACK) | UML
S24 | Statechart, Sequence Diagram | State Transition (State Machine) | Specialized Algorithm | UML
S25 | Sequence Diagram, Use Case | Modal Sequence Diagram (based on Harel's LSC) | Specialized Algorithm (Playout + Synthesis) | UML
S26 | Process Model (BPMN) | Description Logic, GRL (i*+NFR) | Specialized Algorithm | BPM
S27 | Live Sequence Chart (LSC) | Global System Automaton | Specialized Algorithm | UML
S28 | Use Case, Activity Diagram, Collaboration Diagram | State Transition (Typed Graph) | Specialized Algorithm (Graph Transformation) | UML
S29 | Statechart | Process Algebra (CSP) | Model Checking (FDR) | UML
S30 | Sequence Diagram, Interface Automaton | State Transition (Interface Automata) | Specialized Algorithm (Existential Consistency) | Embedded/Real-time
S31 | Sequence Diagram w. Timing Constraints | State Transition (Real-time Interface Automata RIA) | Specialized Algorithm | Embedded/Real-time
S32 | Statechart, Sequence Diagram | Temporal Logic (LTL), State Transition | Model Checking | Software Architecture
S33 | Statechart, LTS | State Transition (Labeled Transition System) | Model Checking | UML
S34 | Sequence Diagram, Activity Diagram | Temporal Logic of Actions (cTLA) | Model Checking | UML
S35 | Statechart (Timed Resource-Oriented Statechart TRoS) | Process Algebra (ACSR, for Real-time) | Specialized Algorithm | Embedded/Real-time
S36 | Statechart, Sequence Diagram | Timed Automata | Model Checking | Real-time
S37 | Sequence Diagram | State Transition (csTS, Modal TS) | Specialized Algorithm (Synthesis) | UML
S38 | Statechart, Sequence Diagram (UML-RT) | State Transition | Specialized Algorithm | Embedded/Real-time
S39 | Process Model (BPMN) | Temporal Logic, Petri Nets | Model Checking (GROOVE) | BPMN
S40 | Statechart | Process Algebra (Pi-Calculus) | Specialized Algorithm (Weak Bisimulation) | UML
S41 | Statechart, Sequence Diagram | Process Algebra (Pi-Calculus) | Specialized Algorithm (Open Bisimulation) | UML
S42 | Sequence, Collaboration Diagram | Real-time Action Logic (RAL) | Logical Inference | Real-time
S43 | Modal Transition System | State Transition | Specialized Algorithm | General
S44 | Labeled Transition System, I/O FSM | State Transition, Modal Logic | Model Checking + Specialized Algorithm (Bit Vectors) | UML
S45 | Process Model, Domain Ontology | Kripke Structure, Description Logic (DL) | Model Checking | BPM
S46 | Statechart | State Transition (Automata) | Model Checking | UML
S47 | Sequence Diagram, Statechart | Temporal Logic, State Transition (SMV Language) | Model Checking | BPM+SOA
S48 | Statechart | State Transition, Operational Semantics | Specialized Algorithm (Assertion Checking) | UML
S49 | Sequence Diagram | Trace Semantics | Specialized Algorithm | UML
S50 | Statechart | Temporal Logic, State Transition (Kripke Structure) | Model Checking | Embedded/Real-time
S51 | Abstract and Executable Process Model (BPEL) | Petri Net, Communication Graph | Specialized Algorithm (Simulation) | BPM
S52 | Stateflow (Simulink) | Process Algebra (Circus: Z, CSP, Guarded Commands) | Specialized Algorithm (Simulation) | Simulink
S53 | Activity Diagram | State Transition | Model Checking (NuSMV) | UML
S54 | Process Model | Temporal Logic (CTL), State Transition (LTS) | Model Checking | BPM
S55 | Statechart | State Transition | Specialized Algorithm (Matching+Merging, Bisimulation) | UML
S56 | Statechart | State Transition (B) | Theorem Proving | UML
S57 | Statechart | Temporal Logic, State Transition (PROMELA) | Model Checking | Adaptive System
S58 | Statechart, Class Diagram | Process Algebra (CSP) | Model Checking (FDR) | UML
S59 | Statechart | Process Algebra (CSP) | Model Checking | UML
S60 | Activity Diagram, Statechart | State Transition | Specialized Algorithm (Rules) | BPM
S61 | Statechart | First Order Logic | Logical Inference (CrocoPat) | UML
S62 | Modal Transition System | Process Algebra (Modal Pi-Calculus) | Specialized Algorithm | Modal Transition Systems
S63 | Statechart, Sequence Diagram | State Transition | Model Checking | UML
S64 | Object Behavior Diagram | Petri Net (Object Behavior Diagram) | Specialized Algorithm | BPM
S65 | Statechart | State Transition (Symbolic Transition System STS) | Specialized Algorithm | UML
S66 | Statechart (Invariant Statechart) | State Transition | Specialized Algorithm (State Invariant) | UML
S67 | Sequence Diagram, Statechart | First Order Logic | Logical Inference, Theorem Proving | UML
S68 | Use Case, Activity Diagram, Statechart, Sequence Diagram | Petri Net (Colored PN) | Specialized Algorithm (PN Consistency) | UML
S69 | Statechart | State Transition (Timed Automata) | Specialized Algorithm (Reachability Analysis + Constraint Solving) | Real-time
S70 | Stateflow (Simulink) | Graph (Binary Decision Diagram BDD) | Model Checking (Salsa) | Simulink
S71 | Process Model | Petri Net | Specialized Algorithm | BPM
S72 | Process Model (BPEL) | Petri Nets, BPEL Program Dependence Graph (BPD) | Specialized Algorithm | BPM
S73 | Statechart | Petri Net (Object Behavior Diagram) | Specialized Algorithm (Observation Consistency) | UML
S74 | Activity Diagram | Petri Net (Instantiable PN) | Specialized Algorithm | UML
S75 | Use Case, Statechart | Graph | Specialized Algorithm (Graph Search) | UML
S76 | Process Model (BPEL), Service Behavior (Message) | Petri Net | Specialized Algorithm (Conformance Checking) | SOC/SOA
S77 | Use Case, Sequence Diagram, Statechart | State Transition (Petri Net) | Specialized Algorithm (Synthesis, Invariant) | UML
S78 | Statechart, Sequence Diagram | State Transition, Traces | Model Checking | UML
S79 | Workflow | Formal Workflow, First Order Logic | Specialized Algorithm (Rules) | Workflow
S80 | Process Model (BPEL 2.0) | Process Algebra (Pi-Calculus) | Specialized Algorithm (Bisimulation) | BPM+SOA
S81 | Process Model | Graph | Specialized Algorithm (Matching, Edit Distance) | BPM
S82 | Process Model | Petri Net | Specialized Algorithm (Causal Behavioral Profiles) | BPM
S83 | Process Model (BPMN, BPEL, EPC) | Petri Net | Specialized Algorithm (Behavioral Profiles) | BPM
S84 | Statechart, Sequence Diagram | Temporal Logic, State Transition (Extended Sequence Diagram) | Model Checking | SOC/SOA
S85 | Activity Diagram | Process Algebra (CSP) | Model Checking (FDR) | UML
S86 | Statechart, Sequence Diagram | Petri Net (Extended Colored PN) | Specialized Algorithm (Coverability Checking) | UML
S87 | Statechart | Process Algebra (CSP), Abstract Machine Notation (B) | Model Checking | UML
S88 | WS-CDL, BPEL | Process Algebra (CSP) | Model Checking | BPM+SOA
S89 | Process Model (ebXML BPSS, BPEL) | Process Algebra (Pi-Calculus) | Specialized Algorithm (Trace Refinement CSP) | BPM+SOA
S90 | Statechart, Sequence Diagram | Automata (Split Automata) | Model Checking | UML
S91 | Statechart, Message Sequence Chart (MSC) | Automata | Model Checking | UML
S92 | Statechart, Sequence Diagram | Propositional Logic | Model Checking (SAT) | UML
S93 | Statechart | Temporal Logic (LTL), Automata (Extended Hierarchical Automata) | Model Checking | UML
S94 | Statechart | Temporal Logic (CTL), State Transition (Kripke Structure) | Model Checking | UML
S95 | Modal Transition System | State Transition | Specialized Algorithm | Modal Transition Systems
S96 | Statechart | Process Algebra (CSP) | Model Checking | UML
B. SELECTED PRIMARY STUDIES
Table XVII. List of Selected Primary Studies
ID Full Reference
S1 Abdelhalim, I., Schneider, S., and Treharne, H. Towards a Practical Approach to Check UML/fUML
Models Consistency using CSP. 13th International Conference on Formal Methods and Software Engineering
(ICFEM), Durham, UK, 2011, Springer, 33–48.
S2 Alanazi, M. N. and Gustafson, D. A. Super State Analysis for UML State Diagrams. WRI World Congress
on Computer Science and Information Engineering (CSIE)- Volume 07, 2009, IEEE, 560–565.
S3 Amálio, N., Stepney, S., and Polack, F. Formal Proof from UML Models. 6th International Conference on
Formal Engineering Methods (ICFEM), 2004, Springer, 418–433.
S4 Beneš, N., Černá, I., and Křetínský, J. Modal Transition Systems: Composition and LTL Model
Checking. 9th International Symposium on Automated Technology for Verification and Analysis (ATVA), 2011,
Springer, 228–242.
S5 Beneš, N., Křetínský, J., and Larsen, K. G. Refinement Checking on Parametric Modal Transition
Systems. Acta Informatica, Vol. 52, No. 2–3 (2015), 269–297.
S6 Bernardi, S., Donatelli, S., and Merseguer, J. From UML Sequence Diagrams and Statecharts to
Analysable Petri Net Models. 3rd International Workshop on Software and Performance (WOSP), 2002,
ACM, 35–45.
S7 Bhuiyan, J., Nepal, S., and Zic, J. Checking Conformance between Business Processes and Web Service
Contract in Service Oriented Applications. Australian Software Engineering Conference (ASWEC), 2006,
IEEE, 80–89.
S8 Chama, I. E., Belala, N., and Saidouni, D.-E. FMEBP: A Formal Modeling Environment of Business
Process. 20th International Conference on Information and Software Technologies (ICIST), 2014, Springer,
211–223.
S9 Chandrasekaran, P. and Mukund, M. Matching Scenarios with Timing Constraints. 4th International
Conference on Formal Modeling and Analysis of Timed Systems (FORMATS), 2006, Springer, 98–112.
S10 Chang, K., Kung, D. C., and Hsia, P. OBL: A Formal Deduction Method for Object-Oriented Systems.
23rd International Computer Software and Applications Conference (COMPSAC), 1999, IEEE, 450–455.
S11 Chatzieleftheriou, G., Bonakdarpour, B., and Smolka, S. A. Abstract Model Repair. 4th International
Symposium on NASA Formal Methods (NFM), Norfolk, VA, 2012, Springer, 341–355.
S12 Chechik, M. and Gannon, J. Automatic Analysis of Consistency between Requirements and Designs.
IEEE Transactions on Software Engineering, Vol. 27, No. 7 (2001), 651–672.
S13 D’Souza, D. and Mukund, M. Checking Consistency of SDL+MSC Specifications. 10th International SPIN
Workshop on Model Checking Software, 2003, Springer, 151–166.
S14 Durán, F., Ouederni, M., and Salaün, G. Checking Protocol Compatibility using Maude. Electronic Notes
in Theoretical Computer Science, Vol. 255 (2009), 65–81.
S15 Engels, G., Küster, J. M., and Heckel, R. A Methodology for Specifying and Analyzing Consistency of
Object-Oriented Behavioral Models. 8th European Software Engineering Conference, Held jointly with 9th
ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), 2001, ACM,
186–195.
S16 Engels, G., Güldali, B., and Soltenborn, C. Assuring Consistency of Business Process Models and Web
Services Using Visual Contracts. Applications of Graph Transformations with Industrial Relevance, 17–31,
(2008).
S17 Ermel, C., Gall, J., and Lambers, L. Modeling with Plausibility Checking: Inspecting Favorable and
Critical Signs for Consistency between Control Flow and Functional Behavior. 14th International
Conference on Fundamental Approaches to Software Engineering (FASE): part of the joint European Conferences
on Theory and Practice of Software (ETAPS), 2011, Springer, 156–170.
S18 Eshuis, R. and Wieringa, R. Tool Support for Verifying UML Activity Diagrams. IEEE Transactions on
Software Engineering (TSE), Vol. 30, No. 7 (2004), 437–447.
S19 Fan, S., Zhao, J. L., and Dou, W. A Framework for Transformation from Conceptual to Logical
Workflow Models. Decis. Support Syst., Vol. 54, No. 1 (2012), 781–794.
S20 Ge, N., Pantel, M., and Crégut, X. Time Properties Dedicated Transformation from UML-MARTE
Activity to Time Transition System. SIGSOFT Software Engineering Notes, Vol. 37, No. 4 (2012), 1.
S21 Gherbi, A. and Khendek, F. Consistency of UML/SPT Models. 13th International SDL Forum Conference on
Design for Dependable Systems (SDL), 2007, Springer, 203–224.
S22 Ghezzi, C., Menghi, C., and Molzam Sharifloo, A. On Requirement Verification for Evolving Statecharts
Specifications. Requirements Engineering, Vol. 19, No. 3 (2014), 231–255.
S23 Gnesi, S., Latella, D., and Massink, M. Model Checking UML Statechart Diagrams Using JACK. 4th
IEEE International Symposium on High-Assurance Systems Engineering, 1999, 46–55.
S24 Graaf, B. and van Deursen, A. Model-Driven Consistency Checking of Behavioural Specifications. 4th
International Workshop on Model-Based Methodologies for Pervasive and Embedded Software (MOMPES),
2007, IEEE, 115–126.
S25 Greenyer, J. and Frieben, J. Consistency Checking Scenario-based Specifications of Dynamic Systems
by Combining Simulation and Synthesis. 4th Workshop on Behaviour Modelling - Foundations and
Applications (BM-FA), 2012, ACM, 2:1–2:9.
S26 Gröner, G., Asadi, M., and Mohabbati, B. Validation of User Intentions in Process Orchestration and
Choreography. Information Systems, Vol. 43, No. 0 (2014), 83–99.
S27 Harel, D. and Kugler, H. Synthesizing State-Based Object Systems from LSC Specifications.
International Journal of Foundations of Computer Science, Vol. 13, No. 1 (2002), 5–51.
S28 Hausmann, J. H., Heckel, R., and Taentzer, G. Detection of Conflicting Functional Requirements in a
Use Case-Driven Approach: A Static Analysis Technique Based on Graph Transformation. 24th
International Conference on Software Engineering (ICSE), 2002, ACM, 105–115.
S29 Heckel, R. and Küster, J. M. Behavioral Constraints for Visual Models. Electronic Notes in Theoretical
Computer Science, Vol. 50, No. 3 (2001), 257–265.
S30 Hu, J., Yu, X., Wang, L., et al. Scenario-Based Specifications Verification for Component-Based
Embedded Software Designs. International Conference on Parallel Processing Workshops (ICPPW), 2005,
IEEE, 240–247.
S31 Hu, J., Yu, X., and Zhang, Y. Checking Component-Based Embedded Software Designs for
Scenario-Based Timing Specifications. International Conference Embedded and Ubiquitous Computing
(EUC), 2005, Springer, 395–404.
S32 Inverardi, P., Muccini, H., and Pelliccione, P. Automated Check of Architectural Models Consistency
Using SPIN. 16th IEEE International Conference on Automated Software Engineering (ASE), 2001, IEEE,
346–349.
S33 Junwei, D., Zhongwei, X., and Meng, M.Verification of Scenario-Based Safety Requirement
Specification on Components Composition. 2008 International Conference on Computer Science and
Software Engineering (CSSE)- Volume 02, 2008, IEEE, 686–689.
S34 Kaliappan, P. and Koenig, H. An Approach to Synchronize UML-Based Design Components for
Model-Driven Protocol Development. IEEE 34th Software Engineering Workshop (SEW), 2011, IEEE,
27–35.
S35 Kim, J., Kang, I., and Choi, J.-Y. Formal Synthesis of Application and Platform Behaviors of Embedded
Software Systems. Software & Systems Modeling, Vol. 14, No. 2 (2015), 839–859.
S36 Knapp, A., Merz, S., and Rauh, C. Model Checking - Timed UML State Machines and Collaborations. 7th
International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems: Co-sponsored by
IFIP WG 2.2 (FTRTFT), 2002, Springer, 395–416.
S37 Krka, I. and Medvidovic, N. Component-Aware Triggered Scenarios. IEEE/IFIP Conference on Software
Architecture (WICSA), 2014, IEEE, 129–138.
S38 Küster, J. M. and Stroop, J. Consistent Design of Embedded Real-Time Systems with UML-RT. IEEE
International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC), 2001, IEEE, 31–40.
S39 Kwantes, P. M., Gorp, P. V., and Kleijn, J. Towards Compliance Verification between Global and Local
Process Models. 8th International Conference on Graph Transformation ICGT, Held as Part of STAF 2015,
2015, Springer, 221–236.
S40 Lam, V. S. W. and Padget, J. Consistency Checking of Statechart Diagrams of a Class Hierarchy. 19th
European conference on Object-Oriented Programming (ECOOP), 2005, Springer, 412–427.
S41 Lam, V. S. W. and Padget, J. Consistency Checking of Sequence Diagrams and Statechart Diagrams Using
the π-Calculus. 5th International Conference on Integrated Formal Methods (IFM), 2005, Springer, 347–365.
S42 Lano, K. Formal Specification Using Interaction Diagrams. 5th IEEE International Conference on Software
Engineering and Formal Methods (SEFM), 2007, IEEE, 293–304.
S43 Larsen, K. G., Nyman, U., and Wąsowski, A. On Modal Refinement and Consistency. 18th International
Conference on Concurrency Theory (CONCUR), 2007, Springer, 105–119.
S44 Lee, J.-D., Jung, J.-I., and Lee, J.-H. Verification and Conformance Test Generation of Communication
Protocol for Railway Signaling Systems. Computer Standards and Interfaces, Vol. 29, No. 2 (2007), 143–151.
S45 Letia, I. A. and Goron, A. Model Checking as Support for Inspecting Compliance to Rules in Flexible
Processes. Journal of Visual Languages & Computing, Vol. 28, No. C (2015), 100–121.
S46 Latella, D., Majzik, I., and Massink, M. Automatic Verification of UML Statechart Diagrams Using the SPIN
Model-Checker. Formal Aspects of Computing, Vol. 11, No. 6 (1999), 637–664, Springer.
S47 Li, B., Zhou, Y., and Pang, J. Model-Driven Automatic Generation of Verified BPEL Code for Web Service
Composition. 2009 16th Asia-Pacific Software Engineering Conference (APSEC), 2009, IEEE, 355–362.
S48 Liu, S., Liu, Y., and Sun, J. USMMC: A Self-contained Model Checker for UML State Machines. 9th Joint
Meeting on Foundations of Software Engineering (ESEC/FSE), 2013, ACM, 623–626.
S49 Lu, L. and Kim, D.-K. Required Behavior of Sequence Diagrams: Semantics and Conformance. ACM
Transactions on Software Engineering and Methodology, Vol. 23, No. 2 (2014), 15:1–15:28.
S50 Majzik, I., Micskei, Z., and Pintér, G. Development of Model Based Tools to Support the Design of Railway
Control Applications. 26th International Conference on Computer Safety, Reliability, and Security (SAFECOMP),
2007, Springer, 430–435.
S51 Martens, A. Consistency between Executable and Abstract Processes. IEEE International Conference on
e-Technology, e-Commerce and e-Service (EEE), 2005, IEEE, 60–67.
S52 Miyazawa, A. and Cavalcanti, A. Refinement-oriented Models of Stateflow Charts. Science of Computer
Programming, Vol. 77, No. 10-11 (2012), 1151–1177.
S53 Muram, F. U., Tran, H., and Zdun, U. Automated Mapping of UML Activity Diagrams to Formal
Specifications for Supporting Containment Checking. 11th International Workshop on Formal Engineering
Approaches to Software Components and Architectures (FESCA), 2014, arXiv.org, 1–16.
S54 Nagel, B., Gerth, C., and Engels, G. Ensuring Consistency Among Business Goals and Business Process
Models. 17th IEEE International Enterprise Distributed Object Computing Conference (EDOC), 2013, IEEE, 17–26.
S55 Nejati, S., Sabetzadeh, M., and Chechik, M. Matching and Merging of Statecharts Specifications. 29th
International Conference on Software Engineering (ICSE), 2007, IEEE, 54–64.
S56 Ossami, D. D. O., Jacquot, J.-P., and Souquières, J. Consistency in UML and B Multi-view Specifications. 5th
International Conference on Integrated Formal Methods (IFM), 2005, Springer, 386–405.
S57 Ramirez, A. J. and Cheng, B. H. C. Verifying and Analyzing Adaptive Logic through UML State Models.
International Conference on Software Testing, Verification, and Validation (ICST), 2008, IEEE, 529–532.
S58 Rasch, H. and Wehrheim, H. Checking Consistency in UML Diagrams: Classes and State Machines. 6th
IFIP WG 6.1 International Conference on Formal Methods for Open Object-Based Distributed Systems (FMOODS),
2003, Springer, 229–243.
S59 Roscoe, A. W. and Wu, Z. Verifying Statemate Statecharts Using CSP and FDR. 8th International Conference
on Formal Methods and Software Engineering (ICFEM), 2006, Springer, 324–341.
S60 Ryndina, K., Küster, J. M., and Gall, H. Consistency of Business Process Models and Object Life Cycles.
International Conference on Models in Software Engineering (MoDELS), 2006, Springer, 80–90.
S61 Sabetzadeh, M., Nejati, S., and Easterbrook, S. Global Consistency Checking of Distributed Models with
TReMer+. 30th International Conference on Software Engineering (ICSE), 2008, ACM, 815–818.
S62 Sassolas, M., Chechik, M., and Uchitel, S. Exploring inconsistencies between modal transition systems.
Software and Systems Modeling (SoSyM), Vol. 10, No. 1 (2010), 117–142.
S63 Schäfer, T., Knapp, A., and Merz, S. Model Checking UML State Machines and Collaborations. Electronic
Notes in Theoretical Computer Science, Vol. 55, No. 3 (2001), 357–369.
S64 Schrefl, M. and Stumptner, M. Behavior-Consistent Specialization of Object Life Cycles. ACM Transactions
on Software Engineering and Methodology, Vol. 11, No. 1 (2002), 92–148.
S65 Schwarzl, C. and Peischl, B. Static- and Dynamic Consistency Analysis of UML State Chart Models. 13th
International Conference on Model Driven Engineering Languages and Systems (MODELS) - Part I, 2010, Springer,
151–165.
S66 Sekerinski, E. Verifying Statecharts with State Invariants. 13th IEEE International Conference on
Engineering of Complex Computer Systems (ICECCS), 2008, IEEE, 7–14.
S67 Shan, L. and Zhu, H. A Formal Descriptive Semantics of UML. 10th International Conference on Formal
Methods and Software Engineering (ICFEM), 2008, Springer, 375–396.
S68 Shinkawa, Y. Inter-Model Consistency in UML Based on CPN Formalism. 13th Asia Pacific Software
Engineering Conference (APSEC), 2006, IEEE, 411–418.
S69 Shu, G., Li, C., Wang, Q., et al. Validating Object-Oriented Prototype of Real-Time Systems with
Timed Automata. 13th IEEE International Workshop on Rapid System Prototyping (RSP’02), 2002, IEEE,
99–106.
S70 Sims, S., Cleaveland, R., and Butts, K. Automated Validation of Software Models. 16th IEEE International
Conference on Automated Software Engineering (ASE), 2001, IEEE, 91–96.
S71 Smirnov, S., Farahani, A. Z., and Weske, M. State Propagation in Abstracted Business Processes. 9th
International Conference on Service-Oriented Computing (ICSOC), 2011, Springer, 16–31.
S72 Song, W., Zhang, W., and Zhang, G. Quantifying Consistency between Conceptual and Executable
Business Processes. IEEE International Conference on Services Computing (SCC), 2013, IEEE, 9–16.
S73 Stumptner, M. and Schrefl, M. Behavior Consistent Inheritance in UML. 19th International Conference on
Conceptual Modeling (ER), 2000, Springer, 527–542.
S74 Thierry-Mieg, Y. and Hillah, L.-M. UML Behavioral Consistency Checking Using Instantiable Petri
Nets. Innovations in Systems and Software Engineering (ISSE), Vol. 4, No. 3 (2008), 293–300.
S75 Truong, N.-T., Tran, T.-M.-T., and To, V.-K. Checking the Consistency between UCM and PSM Using a
Graph-Based Method. 1st Asian Conference on Intelligent Information and Database Systems (ACIIDS),
2009, IEEE, 190–195.
S76 van der Aalst, W. M. P., Dumas, M., and Ouyang, C. Conformance Checking of Service Behavior. ACM
Transactions on Internet Technology (TOIT), Vol. 8, No. 3 (2008), 1–30.
S77 van Hee, K., Sidorova, N., and Somers, L. Consistency in model integration. Data & Knowledge
Engineering, Vol. 56, No. 1 (2006), 4–22.
S78 Wang, H., Feng, T., and Zhang, J. Consistency check between behaviour models. IEEE International
Symposium on Communications and Information Technology (ISCIT), 2005, IEEE, 486–489.
S79 Wang, H. J. and Zhao, J. L. Constraint-centric Workflow Change Analytics. Decision Support Systems,
Vol. 51, No. 3 (2011), 562–575.
S80 Weidlich, M., Decker, G., and Weske, M. Efficient Analysis of BPEL 2.0 Processes Using π-Calculus. 2nd
IEEE Asia-Pacific Service Computing Conference (APSCC), 2007, IEEE, 266–274.
S81 Weidlich, M., Dijkman, R., and Mendling, J. The ICoP Framework: Identification of Correspondences
Between Process Models. 22nd International Conference on Advanced Information Systems Engineering
(CAiSE), 2010, Springer, 483–498.
S82 Weidlich, M., Polyvyanyy, A., and Mendling, J. Causal Behavioural Profiles - Efficient Computation,
Applications, and Evaluation. Fundamenta Informaticae, Vol. 113, No. 3–4 (2011), 399–435.
S83 Weidlich, M., Mendling, J., and Weske, M. Efficient Consistency Measurement Based on Behavioral
Profiles of Process Models. IEEE Transactions on Software Engineering, Vol. 37, No. 3 (2011), 410–429.
S84 Xie, Y., Du, D., and Liu, J. Towards the Verification of Services Collaboration. 33rd Annual IEEE
International Computer Software and Applications Conference (COMPSAC), 2009, IEEE, 428–433.
S85 Xu, D., Miao, H., and Philbert, N. Model Checking UML Activity Diagrams in FDR. 8th IEEE/ACIS
International Conference on Computer and Information Science (ICIS), 2009, IEEE, 1035–1040.
S86 Yao, S. and Shatz, S. M. Consistency Checking of UML Dynamic Models Based on Petri Net
Techniques. 15th International Conference on Computing (CIC), 2006, IEEE, 289–297.
S87 Yeung, W. L. Checking Consistency between UML Class and State Models Based on CSP and B.
Journal of Universal Computer Science (J.UCS), Vol. 10, No. 11 (2004), 1540–1559.
S88 Yeung, W. L. CSP-Based Verification for Web Service Orchestration and Choreography. Simulation,
Vol. 83, No. 1 (2007), 65–74.
S89 Yeung, W. L. A Formal Basis for Cross-Checking ebXML BPSS Choreography and Web Service
Orchestration. IEEE Asia-Pacific Services Computing Conference (APSCC), 2008, IEEE, 524–529.
S90 Zhao, X., Long, Q., and Qiu, Z. Model Checking Dynamic UML Consistency. 8th International Conference
on Formal Methods and Software Engineering (ICFEM), 2006, Springer, 440–459.
S91 Knapp, A. and Wuttke, J. Model Checking of UML 2.0 Interactions. International Conference on
Models in Software Engineering (MoDELS), 2006, Springer, 42–51.
S92 Kaufmann, P., Kronegger, M., Pfandler, A., Seidl, M., and Widl, M. A SAT-Based Debugging Tool for State
Machines and Sequence Diagrams. 7th International Conference on Software Language Engineering (SLE),
2014, Springer, 21–40.
S93 Dong, W., Wang, J., Qi, X., and Qi, Z. Model Checking UML Statecharts. 8th Asia-Pacific Software
Engineering Conference (APSEC), 2001, IEEE, 363–370.
S94 Zhao, Q. and Krogh, B. H. Formal Verification of Statecharts Using Finite-State Model Checkers. IEEE
Transactions on Control Systems Technology, Vol. 14 (2006), 943–950.
S95 Fischbein, D. and Uchitel, S. On Correct and Complete Strong Merging of Partial Behaviour Models.
16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2008, ACM,
297–307.
S96 Zhang, S. J. and Liu, Y. An Automatic Approach to Model Checking UML State Machines. 4th
International Conference on Secure Software Integration and Reliability Improvement Companion (SSIRI-C),
2010, IEEE, 1–6.
C. DATA EXTRACTION: DATA ITEMS AND ENCODING
Table XVIII. Evidence of Timing Support
Level Description
1 Not considered
2 Implicit use of an underlying timing model or rules, for instance, providing a way
to represent temporal information using timers, temporal logics, CSP, etc.
3 Explicit timing model and analysis, for instance, by using explicitly timed models
or real-time constraints
Table XIX. Evidence of Inconsistency Handling
(adapted from Spanoudakis and Zisman [2001])
Level Description
1 Not mentioned/Not considered
2 Systematic inconsistency diagnosis
3 Identifying handling actions
4 Evaluating costs and risks
5 Automated action selection and execution
Table XX. Evidence of Automation Support
Level Description
1 Manual: requiring human interaction or manually specifying rules and constraints
2 Semi-automated: assuming existing input models, with partial human interaction
3 Fully automated
Table XXI. Evidence of Development Tool
Support and Integration
Level Description
1 Not mentioned/Not considered
2 Proposed/planned integration
3 Fully implemented integration
Table XXII. Evidence of Tool Support for Consistency Checking
Level Description
1 Not mentioned / Not considered
2 Only using existing tools / libraries
3 Prototypes (including using existing tools / libraries)
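
To make the encoding concrete, the ordinal levels of Tables XVIII–XXII map naturally onto enumerations. The following Python sketch is purely illustrative: the enumeration and field names are our own shorthand for the levels above, and the sample record uses a hypothetical study identifier ("S00") rather than an actual coding from this review.

from dataclasses import dataclass
from enum import IntEnum

class TimingSupport(IntEnum):           # Table XVIII
    NOT_CONSIDERED = 1
    IMPLICIT = 2                        # timers, temporal logics, CSP, etc.
    EXPLICIT = 3                        # timed models or real-time constraints

class InconsistencyHandling(IntEnum):   # Table XIX
    NOT_CONSIDERED = 1
    DIAGNOSIS = 2
    HANDLING_ACTIONS = 3
    COSTS_AND_RISKS = 4
    AUTOMATED_SELECTION = 5

class AutomationSupport(IntEnum):       # Table XX
    MANUAL = 1
    SEMI_AUTOMATED = 2
    FULLY_AUTOMATED = 3

class DevToolIntegration(IntEnum):      # Table XXI
    NOT_CONSIDERED = 1
    PROPOSED = 2
    IMPLEMENTED = 3

class CheckingToolSupport(IntEnum):     # Table XXII
    NOT_CONSIDERED = 1
    EXISTING_TOOLS_ONLY = 2
    PROTOTYPE = 3

@dataclass
class StudyCoding:
    """Coding of one primary study; the values used below are hypothetical."""
    study_id: str
    timing: TimingSupport
    handling: InconsistencyHandling
    automation: AutomationSupport
    integration: DevToolIntegration
    tool_support: CheckingToolSupport

example = StudyCoding("S00", TimingSupport.NOT_CONSIDERED,
                      InconsistencyHandling.DIAGNOSIS,
                      AutomationSupport.SEMI_AUTOMATED,
                      DevToolIntegration.NOT_CONSIDERED,
                      CheckingToolSupport.PROTOTYPE)

Representing the levels as integers preserves their ordinal nature, so codings can be compared and aggregated across studies directly.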
Table XXIII. Level of Empirical Evidence (adapted from Kitchenham [2004]
and Alves et al. [2010])
Level Description
1 No evidence.
2 Evidence obtained from demonstration or working out toy examples.
3 Evidence obtained from expert opinions or observations.
4 Evidence obtained from academic studies, e.g., controlled experiments.
5 Evidence obtained from industrial studies.
6 Evidence obtained from industrial practice.
Table XXIV. Type of Study and Evaluation (adapted from Chen and Ali Babar [2011])
Type Description
Rigorous analysis (RA) Rigorous derivation and proof, suited for formal models
Case study (CS) An empirical inquiry that investigates a contemporary phenomenon within its real-life context, when the boundaries between phenomenon and context are not clearly evident, and in which multiple sources of evidence are used
Discussion (DC) Provides some qualitative, textual, opinion-based evaluation
Example (EX) Authors describe an application and provide an example to assist in the description; the example is “used to validate” or “evaluate” only as far as the authors suggest
Experience report (ER) The result has been used on real examples, but not in the form of case studies or controlled experiments; the evidence of its use is collected informally or formally
Field study (FS) Controlled experiment performed in industrial settings
Laboratory experiment with human subjects (LH) Identification of precise relationships between variables in a designed, controlled environment, using human subjects and quantitative techniques
Laboratory experiment with software subjects (LS) A laboratory experiment that compares the performance of a newly proposed system with other existing systems
Simulation (SI) Execution of a system with artificial data, using a model of the real world
Not mentioned Studies of this type have been excluded
Table XXV. Rigor (adapted from Ivarsson and Gorschek [2011])
Aspect: Context described (C)
Weak: There appears to be no description of the context in which the evaluation is performed.
Medium: The context in which the study is performed is mentioned or presented in brief, but not described to the degree to which a reader can understand and compare it to another context.
Strong: The context is described to the degree where a reader can understand and compare it to another context.
Aspect: Study design described (S)
Weak: There appears to be no description of the design of the presented evaluation.
Medium: The study design is briefly described, e.g., “10 students did step 1, step 2, and step 3”.
Strong: The study design is described to the degree where a reader can understand, e.g., the variables measured, the control used, the treatments, the selection/sampling used, etc.
Aspect: Validity discussed (V)
Weak: There appears to be no description of any threats to the validity of the evaluation.
Medium: The validity of the study is mentioned but not described in detail.
Strong: The validity of the evaluation is discussed in detail, where threats are described and measures to limit them are detailed.
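
The rigor rubric can also be quantified. As a minimal sketch, assuming the scoring convention of Ivarsson and Gorschek [2011], where weak, medium, and strong are weighted 0, 0.5, and 1, respectively, and a study’s rigor is the sum over the three aspects (the table above describes the levels but does not fix numeric weights):

# Minimal sketch of rigor scoring; assumes the weak/medium/strong
# weights 0 / 0.5 / 1 attributed to Ivarsson and Gorschek [2011].
LEVEL_SCORE = {"weak": 0.0, "medium": 0.5, "strong": 1.0}

def rigor(context: str, study_design: str, validity: str) -> float:
    """Sum the per-aspect scores (C + S + V); the result lies in [0, 3]."""
    return sum(LEVEL_SCORE[level] for level in (context, study_design, validity))

# Hypothetical study: context briefly mentioned, design well described,
# validity not discussed.
print(rigor("medium", "strong", "weak"))  # 1.5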
ACKNOWLEDGMENTS
The authors are grateful to the anonymous reviewers and the editor for the thorough feedback on earlier
versions of this article. Their comments and suggestions were of great help.
REFERENCES
Vander Alves, Nan Niu, Carina Alves, and George Valença. 2010. Requirements engineering for software
product lines: A systematic literature review. Information and Software Technology 52, 8 (Aug. 2010),
806–820. DOI:http://dx.doi.org/10.1016/j.infsof.2010.03.014
Luigi Atzori, Antonio Iera, and Giacomo Morabito. 2010. The internet of things: A survey. Computer Networks
54, 15 (Oct 2010), 2787–2805. DOI:http://dx.doi.org/10.1016/j.comnet.2010.05.010
Victor R. Basili and David M. Weiss. 1984. A methodology for collecting valid software engineering data.
IEEE Transactions on Software Engineering 10, 6 (Nov. 1984), 728–738. DOI:http://dx.doi.org/10.1109/
TSE.1984.5010301
Eerke Boiten, Howard Bowman, John Derrick, Peter Linington, and Maarten Steen. 2000. Viewpoint
consistency in ODP. Computer Networks 34, 3 (2000), 503–537. DOI:http://dx.doi.org/10.1016/S1389-
1286(00)00114-6
Pearl Brereton, Barbara A. Kitchenham, David Budgen, Mark Turner, and Mohamed Khalil. 2007. Lessons
from applying the systematic literature review process within the software engineering domain. Journal
of Systems and Software 80, 4 (2007), 571–583. DOI:http://dx.doi.org/10.1016/j.jss.2006.07.009
David Budgen, Andy J. Burn, O. Pearl Brereton, Barbara A. Kitchenham, and Rialette Pretorius. 2011.
Empirical evidence about the UML: A systematic literature review. Software: Practice and Experience.
41, 4 (April 2011), 363–392. DOI:http://dx.doi.org/10.1002/spe.1009
Lianping Chen and Muhammad Ali Babar. 2011. A systematic review of evaluation of variability management approaches in software product lines. Information and Software Technology 53, 4 (2011), 344–362.
DOI:http://dx.doi.org/10.1016/j.infsof.2010.12.006
E. M. Clarke, William Klieber, M. Nováček, and P. Zuliani. 2011. Model checking and the state explosion
problem. In International Summer School on Tools for Practical Software Verification (LASER) - Revised
Tutorial Lectures. Springer, 1–30. DOI:http://dx.doi.org/10.1007/978-3-642-35746-6_1
Edmund M. Clarke, Jr., Orna Grumberg, and Doron A. Peled. 1999. Model Checking. MIT Press, Cambridge,
MA.
Gregor Engels, Jochem M. Küster, Reiko Heckel, and Luuk Groenewegen. 2001. A methodology for
specifying and analyzing consistency of object-oriented behavioral models. In Proceedings of the
8th European Software Engineering Conference, Held jointly with 9th ACM SIGSOFT International
Symposium on Foundations of Software Engineering (ESEC/FSE). ACM, New York, NY, 186–195.
http://dx.doi.org/10.1145/503209.503235
Rik Eshuis and Paul Grefen. 2007. Structural matching of BPEL processes. In Proceedings of the 5th
European Conference on Web Services (ECOWS’07). IEEE, 171–180. DOI:http://dx.doi.org/10.1109/
ECOWS.2007.22
Anthony Finkelstein, Dov M. Gabbay, Anthony Hunter, Jeff Kramer, and Bashar Nuseibeh. 1993. Inconsistency handling in multi-perspective specifications. In Proceedings of the 4th European Software Engineering Conference (ESEC’93). Springer-Verlag, Berlin, 84–99.
Pascal Fradet, Daniel Le Métayer, and Michaël Périn. 1999. Consistency checking for multiple view software
architectures. ACM SIGSOFT Software Engineering Notes 24, 6 (Nov. 1999), 410–428.
David Frankel. 2002. Model Driven Architecture: Applying MDA to Enterprise Computing. John Wiley &
Sons, Inc., New York, NY.
Matthias Galster, Danny Weyns, Dan Tofan, Bartosz Michalik, and Paris Avgeriou. 2014. Variability in
software systems–A systematic literature review. IEEE Transactions on Software Engineering 40, 3
(March 2014), 282–306. DOI:http://dx.doi.org/10.1109/TSE.2013.56
Dimitra Giannakopoulou and Klaus Havelund. 2001. Runtime Analysis of Linear Temporal Logic Specifications. Technical Report, August 2001. Research Institute for Advanced Computer Science, 1–11.
Stijn Goedertier, Jan Vanthienen, and Filip Caron. 2015. Declarative business process modelling: principles
and modelling languages. Enterprise Information Systems 9, 2 (2015), 161–185. DOI:http://dx.doi.org/
10.1080/17517575.2013.830340
G. Goth. 2008. Ultralarge systems: Redefining software engineering? IEEE Software 25, 3 (May 2008),
91–94. DOI:http://dx.doi.org/10.1109/MS.2008.82
Henri Habrias and Marc Frappier. 2006. Software Specification Methods - An Overview Using a Case Study.
ISTE Ltd., London, UK.
David Harel. 1987. Statecharts: A visual formalism for complex systems. Science of Computer Programming
8, 3 (1987), 231–274. DOI:http://dx.doi.org/10.1016/0167-6423(87)90035-9
Zbigniew Huzar, Ludwik Kuzniarz, Gianna Reggio, and Jean Louis Sourrouille. 2005. Consistency problems
in UML-based software development. In Proceedings of the International Conference on UML Modeling
Languages and Applications (UML) Satellite Activities – Revised Selected Papers. Springer, Berlin, 1–12.
DOI:http://dx.doi.org/10.1007/978-3-540-31797-5_1
Martin Ivarsson and Tony Gorschek. 2011. A method for evaluating rigor and industrial relevance of
technology evaluations. Empirical Software Engineering 16, 3 (2011), 365–395. DOI:http://dx.doi.org/
10.1007/s10664-010-9146-4
Barbara Kitchenham. 2004. Procedures for performing systematic reviews. Keele, UK, Keele University 33,
2004 (2004), 1–26.
Barbara Kitchenham and Pearl Brereton. 2013. A systematic review of systematic review process research in software engineering. Information and Software Technology 55, 12 (2013), 2049–2075.
DOI:http://dx.doi.org/10.1016/j.infsof.2013.07.010
Barbara Kitchenham, David Budgen, and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. CRC Press, Boca Raton, FL.
Barbara Kitchenham and Stuart Charters. 2007. Guidelines for Performing Systematic Literature Reviews
in Software Engineering. Technical Report EBSE 2007-001. Keele University and Durham University
Joint Report.
Barbara Kitchenham, O. Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman.
2009. Systematic literature reviews in software engineering - A systematic literature review. Information
and Software Technology 51, 1 (Jan. 2009), 7–15. DOI:http://dx.doi.org/10.1016/j.infsof.2008.09.009
Francisco J. Lucas, Fernando Molina, and Ambrosio Toval. 2009. A systematic review of UML model
consistency management. Information and Software Technology 51, 12 (Dec. 2009), 1631–1645.
DOI:http://dx.doi.org/10.1016/j.infsof.2009.04.009
Frank D. Macías-Escriva, Rodolfo Haber, Raul Del Toro, and Vicente Hernández. 2013. Self-adaptive systems:
A survey of current approaches, research challenges and applications. Expert Systems with Applications
40, 18 (Dec. 2013), 7267–7279. DOI:http://dx.doi.org/10.1016/j.eswa.2013.07.033
MathWorks. 2015. Stateflow®: Model and simulate decision logic using state machines and flow charts.
Retrieved from http://www.mathworks.com/products/stateflow.
Tom Mens and Pieter Van Gorp. 2006. A taxonomy of model transformation. Electronic Notes in Theoretical
Computer Science 152 (March 2006), 125–142. DOI:http://dx.doi.org/10.1016/j.entcs.2005.10.021
Gero Mühl, Ludger Fiege, and Peter Pietzuch. 2006. Distributed Event-Based Systems. Springer, Secaucus,
NJ.
OASIS. 2007. Web Services Business Process Execution Language Version 2.0. Retrieved from http://docs.
oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html.
OMG. 2011a. Business Process Model and Notation (BPMN) Version 2.0.2. Retrieved from http://www.omg.
org/spec/BPMN/2.0.2/PDF.
OMG. 2011b. OMG Unified Modeling LanguageTM (OMG UML), Superstructure, Version 2.4.1. Retrieved
from http://www.omg.org/spec/UML/2.4.1/Superstructure/PDF.
Nayan B. Ruparelia. 2010. Software development lifecycle models. SIGSOFT Software Engineering Notes
35, 3 (May 2010), 8–13. DOI:http://dx.doi.org/10.1145/1764810.1764814
August-Wilhelm Scheer and Markus Nüttgens. 2000. ARIS architecture and reference models for business
process management. In Business Process Management, Models, Techniques, and Empirical Studies.
Springer, London, UK, 376–389.
Ian Sommerville, Dave Cliff, Radu Calinescu, Justin Keen, Tim Kelly, Marta Kwiatkowska, John Mcdermid,
and Richard Paige. 2012. Large-scale complex IT systems. Communications of the ACM 55, 7 (July 2012),
71–77. DOI:http://dx.doi.org/10.1145/2209249.2209268
George Spanoudakis and Andrea Zisman. 2001. Inconsistency management in software engineering: Survey
and open research issues. In Handbook of Software Engineering and Knowledge Engineering. World Scientific, Singapore, 329–380.
Thomas Stahl and Markus Völter. 2006. Model-Driven Software Development: Technology, Engineering,
Management. John Wiley & Sons, Hoboken, NJ.
Muhammad Usman, Aamer Nadeem, Tai-hoon Kim, and Eun-suk Cho. 2008. A survey of consistency checking
techniques for UML models. In Proceedings of the International Conference on Advanced Software
Engineering and Its Applications (ASEA). IEEE, 57–62. DOI:http://dx.doi.org/10.1109/ASEA.2008.40
Wil M. P. van der Aalst. 2013. Business process management: A comprehensive survey. ISRN Software
Engineering 2013 (2013), 1–37. DOI:http://dx.doi.org/10.1155/2013/507984
Ragnhild Van der Straeten. 2005. Inconsistency Management in Model-Driven Engineering. Ph.D. Dissertation. Vrije Universiteit Brussel, Dept. of Computer Science.
Axel van Lamsweerde, Emmanuel Letier, and Robert Darimont. 1998. Managing conflicts in goal-driven
requirements engineering. IEEE Transactions on Software Engineering. 24, 11 (Nov. 1998), 908–926.
DOI:http://dx.doi.org/10.1109/32.730542
He Zhang, Muhammad Ali Babar, and Paolo Tell. 2011. Identifying relevant studies in software engineering. Information and Software Technology 53, 6 (2011), 625–637. DOI:http://dx.doi.org/10.1016/
j.infsof.2010.12.010
Received April 2014; revised November 2016; accepted January 2017