Mining Executable Specifications of Web Applications from Selenium IDE Tests

Dianxiang Xu, National Center for the Protection of the Financial Infrastructure, Dakota State University, Madison, SD 57042, USA, [email protected]
Weifeng Xu, Bharath K. Bavikati, Computer and Information Science Department, Gannon University, Erie, PA 16541, USA, {xu001, bavikati001}@gannon.edu
W. Eric Wong, Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA, [email protected]

Abstract— A common practice for system testing of web-based applications is to perform the test cases through a web browser. These tests are often recorded and managed by a record and replay tool, such as Selenium IDE. Mining specifications from such tests can be very useful for understanding, verifying, and debugging the system under test. This paper presents an approach to mining a behavior specification from a Selenium IDE test suite such that (a) it captures the behavior of the tests at a high level of abstraction, (b) the behavior can be simulated, and (c) all the tests are completely reproducible from the specification. We first identify similar test actions through context-sensitive clustering so as to normalize the given Selenium IDE tests. Then, we mine patterns of test actions that represent meaningful functions and transform Selenium IDE tests into abstract tests, which are similar to the tests used in the existing model-mining techniques. From the abstract tests, we synthesize a high-level Petri net that captures both temporal constraints and data values. For evaluation purposes, we applied our approach to eight test suites of two real-world systems, Magento (an online shopping system being used by many live stores) and Amazon. Two of the test suites are for security testing, aiming at SQL injection and XSS vulnerabilities. The result shows that our approach is effective in producing abstract yet executable specifications and reducing the complexity of the models.
Keywords- software testing; test mining; Petri nets; web applications; model-based testing

I. INTRODUCTION

A common practice for system testing of web-based applications is to perform the test cases through a web browser. These tests are often recorded and managed by a record and replay tool, such as Selenium IDE (a Firefox plug-in, http://seleniumhq.org/projects/ide/). They can be replayed as needed. Mining specifications from such tests can be very useful for understanding, verifying, and debugging the system under test. However, the existing mining techniques are not yet applicable to web-based tests because they target well-structured program tests that are composed of method invocations. A web-based test usually consists of general-purpose GUI operations and dynamically generated data (e.g., hyperlinks). For example, each action in a Selenium IDE test is a triple <command, target, value>, where command is a GUI operation, target is a dynamically generated object, and value is a user input. When a web-based test is performed and recorded, one test step (e.g., login) may result in a number of actions. Section II will provide an in-depth analysis of the issues for mining a behavior specification from such tests. In this paper, we present an approach to mining a behavior specification from a given Selenium IDE test suite such that (a) it captures the behavior at a high level of abstraction, (b) the behavior can be simulated, and (c) all the tests are completely reproducible from the specification. The resultant specification, called MID (Model-Implementation Description) [1], consists of a Predicate/Transition (PrT) net and a MIM (Model-Implementation Mapping) description. The PrT net captures the interactions of Selenium IDE tests, whereas the MIM description maps the elements of the PrT net into Selenium IDE code and data. PrT nets [2] are a well-studied formal method for modeling and verifying distributed systems [3][4].
Since MID is the input language of the model-based test code generator, MISTA (Model-based Integration and System Test Automation) [1], the MID specification mined from Selenium IDE tests can be used for various purposes, such as analyzing the system behavior, selecting regression tests, and generating new tests. MISTA is not only able to generate executable test code for various coverage criteria (e.g., transition coverage, state coverage, and reachability coverage), but also provides a simulator and a model checker for analyzing the specification. Our approach consists of three steps. First, we identify similar test actions through context-sensitive clustering so as to normalize the given Selenium IDE tests. Second, we discover patterns of test actions that represent meaningful functions and transform Selenium tests into abstract tests. The abstract tests are similar to the tests used in the existing model mining techniques. Third, we exploit an existing process miner to synthesize a place/transition net and then upgrade it to a PrT net. To evaluate our approach, we applied it to eight test suites of two real-world systems, Magento (an online shopping system being used by many live stores) and Amazon. The result shows that our approach can produce abstract yet executable specifications for all of the test suites and that the pattern miner can reduce the complexity of the behavior models. The rest of this paper is organized as follows. Section II discusses the specific mining issues of Selenium IDE tests and gives an overview of our approach. Section III describes the pattern miner. Section IV presents the model miner. Section V describes the empirical studies. Section VI reviews the related work. Section VII concludes this paper.

2012 IEEE Sixth International Conference on Software Security and Reliability, 978-0-7695-4742-8/12 $26.00 © 2012 IEEE, DOI 10.1109/SERE.2012.39

II. PROBLEM AND APPROACH

A.
A Motivating Example

Consider the testing of a web-based shopping application. According to the test requirements, the tester has designed the following three test scripts with specific test data, such as user account, items, and quantities:
• Log in, add an item to the cart, remove the item from the cart, log out;
• Log in, add an item to the cart, check out, log out;
• Log in, log out.
Then, the tester decides to carry out and record these tests with Selenium IDE. Each step in the above tests requires a varying number of GUI actions. For example, to log in, the tester has to perform the following: open the web page, click on the login link, type user name, type password, and click on the submit button. When the test is recorded with Selenium IDE, each action results in a triple <command, target, value>. Each action of the test oracle (e.g., verifying the presence of particular text) is also represented by such a triple. We refer to <command, target, value> as a Selenium test action. Each test or test suite is a sequence of Selenium test actions. For discussion purposes, Tables I-III show the aforementioned Selenium IDE tests. We denote these tests as ST1 = <e1, …, e11>, ST2 = <e12, …, e36>, and ST3 = <e37, …, e42>, where ei represents a test action.

TABLE I. SELENIUM TEST ST1
     Command       Target                                        Value
e1   open          /magento/index.php/
e2   clickAndWait  link=Log In
e3   type          Email                                         [email protected]
e4   type          Pass                                          Mypswd1
e5   clickAndWait  send2
e6   clickAndWait  //img[@alt='Magento Commerce']
e7   clickAndWait  link=Acer Ferrari 3200 Notebook Computer PC
…
e11  clickAndWait  link=Log Out

TABLE II. SELENIUM TEST ST2
     Command       Target                                        Value
e12  open          /magento/index.php/
e13  clickAndWait  link=Log In
e14  type          Email                                         [email protected]
e15  type          Pass                                          Mypswd2
e16  clickAndWait  send2
e17  clickAndWait  //img[@alt='Magento Commerce']
e18  clickAndWait  link=Olympus Stylus 750 7.1MP Digital Camera
e19  type          Qty                                           2
…
e36  clickAndWait  link=Log Out

TABLE III.
SELENIUM TEST ST3
     Command       Target                Value
e37  open          /magento/index.php/
e38  clickAndWait  link=Log In
e39  type          Email                 [email protected]
e40  type          Pass                  Mypswd3
e41  clickAndWait  send2
e42  clickAndWait  link=Log Out

B. Research Issues

To mine a behavior specification of the above Selenium IDE tests, there are several issues.

(1) How to determine which types of Selenium test actions have the same behavior? In traditional tests, method invocations are considered to have the same behavior if they call the same method. A Selenium test action, however, has three components <command, target, value>. To apply the existing mining techniques, we have three possible options:
(a) treating command as the method and the pair <target, value> as the parameters;
(b) treating the pair <command, target> as the method and value as the parameter;
(c) using a regular expression on target to treat command and part of target as the method, and the rest of target and value as the parameters.
For option (a), two Selenium test actions <command1, target1, value1> and <command2, target2, value2> have the same behavior if command1 = command2. The behavior model simply depends on the sequences of commands, which are application-independent GUI operations (e.g., open, clickAndWait, and type). Due to over-generalization, such a model provides little useful information about the application-specific behaviors. For option (b), two test actions <command1, target1, value1> and <command2, target2, value2> have the same behavior if command1 = command2 and target1 = target2. This can cause over-specialization. Consider e7 in Table I and e18 in Table II. Although their targets are different, they are part of the same behavior "add item". For option (c), we may use a regular expression to normalize Selenium test actions as in Synoptic [5]. For example, test actions with the same command clickAndWait whose targets start with link= may be treated as the same behavior. In this case, e7 and e18 have the same behavior with different parameters in target, Acer Ferrari 3200 Notebook Computer PC and Olympus Stylus 750 7.1MP Digital Camera, respectively.
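To make the trade-off between the three options concrete, the following sketch (ours, not code from the paper) encodes each option as a "method key" over test actions; the regular expression in key_c is a hypothetical choice for illustration:

```python
# Sketch of the three candidate keys for deciding when two Selenium test
# actions have the same behavior. Action triples taken from Tables I-II.
import re
from collections import namedtuple

Action = namedtuple("Action", ["command", "target", "value"])

e2  = Action("clickAndWait", "link=Log In", "")
e7  = Action("clickAndWait", "link=Acer Ferrari 3200 Notebook Computer PC", "")
e18 = Action("clickAndWait", "link=Olympus Stylus 750 7.1MP Digital Camera", "")

def key_a(e):            # option (a): command only -> over-generalization
    return e.command

def key_b(e):            # option (b): (command, target) -> over-specialization
    return (e.command, e.target)

def key_c(e):            # option (c): command plus a target prefix via regex
    m = re.match(r"(link=|//img|//button)", e.target)
    return (e.command, m.group(1) if m else e.target)

assert key_a(e2) == key_a(e7) == key_a(e18)   # (a) lumps everything together
assert key_b(e7) != key_b(e18)                # (b) splits the same "add item"
assert key_c(e7) == key_c(e18) == key_c(e2)   # (c) still conflates e2 with e7/e18
```

The final assertion illustrates the residual problem with option (c): a statically defined regular expression cannot separate actions whose behavior depends on their surrounding context.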
However, e2 and e11 also have the same behavior as e7 and e18 although they have different purposes. In essence, whether test actions ei and ej have the same behavior also depends on their contexts - the test actions before and after them. It can be hard to use statically defined regular expressions to classify context-sensitive test actions. To address the above issue, we classify test actions by calculating context-sensitive similarity scores.

(2) How to discover patterns of GUI-based test actions that represent high-level system functions? Individual Selenium test actions may not reflect the system functions because of their low-level abstraction. The test in Table I has 11 test actions but involves only four system functions: "login", "add item", "remove item", and "log out". In fact, e1-e5 in Table I, e12-e16 in Table II, and e37-e41 in Table III represent the same function "login". The behavior model synthesized directly from the raw Selenium test actions can be too large or complex to manage. In comparison, a model that captures high-level functions is not only more manageable, but also more meaningful. This important issue has not been addressed in the literature. In this paper, we present a mining algorithm for discovering patterns of test actions across the given tests.

(3) How to synthesize an executable specification from the given tests so that all tests can be reproduced? One of our motivations is to use the synthesized specification for the selection of regression tests and the generation of new executable tests. This requires that the given tests be completely reproducible from the specification - not only the temporal constraints but also the specific data in the given tests. Existing techniques have focused on mining either models that can reproduce the sequences of test actions without specific data or models that capture the data constraints in the tests.

C.
Overview of Our Approach

This paper aims at discovering a MID specification from the given Selenium tests such that:
• it captures the behavior of all tests,
• it can be simulated or executed, and
• it can completely reproduce all tests.

Figure 1. Overview of our approach

A MID specification consists of two parts: a PrT net and a MIM description. The PrT net captures the abstract behaviors, whereas the MIM description maps the modeling elements (e.g., places and transitions) into implementation constructs. Figure 1 shows the round-trip workflow of our approach. First, we cluster individual test actions through context-sensitive similarity scores of test action pairs. The individual test actions in the same cluster are considered to have the same behavior. Second, we mine patterns of test actions in the normalized Selenium tests. A pattern is a sequence of test actions that occurs in individual tests or across different tests – all the occurrences have the same behavior. It is specified by the sequence of test actions, where differences between all the occurrences in the target entries are replaced with placeholder variables ?yi and value entries are replaced with placeholder variables ?xi. A specific occurrence can be created from its pattern by replacing ?yi with the corresponding difference of target and ?xi with the corresponding value. According to the patterns, the Selenium tests are transformed into abstract tests (similar to method calls in traditional tests). The patterns are also used to create the MIM specification, which maps each pattern to the test actions in the original Selenium tests. Third, we use an existing process miner to derive a traditional Petri net from the abstract tests, which can reproduce the control sequences of the abstract tests. Then we upgrade the resultant net to a PrT net with data flows. This PrT net and the MIM description together form a MID specification, which can be used by MISTA to re-generate executable Selenium tests.

Figure 2.
PrT net for the running example

TABLE IV. PARTIAL MIM SPEC FOR THE RUNNING EXAMPLE
Methods
  Model-Level Transition: T1(?x1, ?x2)
  Implementation/Selenium Code:
    open          /magento/index.php/
    clickAndWait  link=Log In
    type          Email  ?x1
    type          Pass   ?x2
    clickAndWait  send2
  Model-Level Transition: T3(?x1, ?y1)
  Implementation/Selenium Code:
    clickAndWait  //img[@alt='Magento Commerce']
    clickAndWait  link=?y1
    type          Qty    ?x1
    clickAndWait  //button[@onclick='productAddToCartForm.submit()']
Objects
  Model-Level Object  Implementation-Level Object
  C1                  [email protected]
  C2                  Acer Ferrari 3200 Notebook Computer PC

Applying our approach to the running example results in the MID specification in Figure 2 (PrT net) and Table IV (MIM description). In Figure 2, transitions T1-T5 represent five patterns of test actions discovered from the given tests. Conceptually, they correspond to log in, log out, add item, remove item, and check out, respectively. In Table IV, the "methods" section specifies each pattern. T1 is an abstraction of five test actions in the tests (e1-e5 in Table I, e12-e16 in Table II, and e37-e41 in Table III). It has two parameters, ?x1 and ?x2, representing the value entries of the pattern. T3 is an abstraction of four test actions (e6-e9 in Table I and e17-e20 in Table II). It has two parameters: ?x1 representing the value entries and ?y1 representing the difference between multiple occurrences. Since each data token in a PrT net is a constant (an identifier starting with an upper-case letter or a number), we use a named constant in the model to denote an actual parameter in test cases if the parameter does not conform to the syntax of data tokens. We use the "objects" section as shown in Table IV for mapping such named constants to the corresponding parameters. For example, C1 is a constant, denoting [email protected].
Our approach can reproduce the original tests as follows: the tests in Tables I, II, and III correspond to the firing sequences <T1, T3, T4, T2>, <T1, T3, T5, T2>, and <T1, T2>, respectively. They correspond to the original tests when the transitions and parameters are substituted with the code and data in the MIM specification.

III. PATTERN MINING

In this section, we deal with the first two issues in Section II.B. We first discuss context-sensitive similarity for clustering individual Selenium test actions. Then we present our pattern miner for identifying test action patterns that would represent abstract system functions.

A. Clustering Selenium Test Actions

We cluster Selenium test actions by grouping the individual test actions such that the test actions in the same group will be similar to each other but different from the test actions in other groups. Given a set of Selenium tests ST = {ST1, ST2, …, STn}, where n > 1, STi = <ei1, ei2, …, eij>, and j ≥ 1, SA = {e11, …, e21, …, en1, …} is the set of all test actions aggregated from ST. Let s(ej, ek) be the similarity score between test actions ej and ek, and λ be a similarity threshold. The clustering attempts to seek a k-partition of SA, {E1, E2, …, Ek}, such that:
(a) Ei ≠ ∅, 1 ≤ i ≤ k;
(b) s(ej, ek) ≥ λ for any ej, ek ∈ Ei, 1 ≤ i ≤ k;
(c) E1 ∪ E2 ∪ … ∪ Ek = SA;
(d) Ei ∩ Ej = ∅ for any 1 ≤ i, j ≤ k and i ≠ j.
Obviously, the results of clustering depend on the similarity matrix s and threshold λ. As mentioned in Section II.B, our similarity function is defined upon the pair of test actions as well as their contexts (preceding and succeeding commands). Our work is based on Blondel et al.'s approach to computing the similarity between vertices in graphs [6]. Let GA and GB be two directed graphs and s(i, j) be the similarity score between vertex i ∈ GA and vertex j ∈ GB.
The similarity between vertices in GA and GB is calculated as follows:

  δk+1 = ((A⊗B) + (A⊗B)T) δk / ||((A⊗B) + (A⊗B)T) δk||2    (1)

where A and B are the adjacency matrices of GA and GB, respectively, A⊗B is the Kronecker matrix product of the label (edge) matrices of GA and GB, (A⊗B)T is the transpose of A⊗B, and δk is the similarity matrix of the k-th iteration (in column-wise vector form). A⊗B indicates that the similarity of preceding vertices adds weight to the similarity s(i, j); (A⊗B)T indicates that the similarity of succeeding vertices adds weight to the similarity s(i, j). ||.||2 is the matrix norm defined as the square root of the sum of the absolute squares of the elements of a matrix. It guarantees that δk+1 converges to a stable value after iterations. We adapt Blondel et al.'s approach to context-sensitive similarity of test actions as follows.

First, GA and GB represent the same graph, formed by the concatenation of all tests ST1, ST2, …, STn, where each test action eij is a vertex and eij is connected to eik if eik is the succeeding action in the same test (k = j+1). Suppose the total number of test actions in all tests is m; then the size of the similarity matrix is m×m.

Second, we initialize the similarity matrix s0(i, j) as follows:

  s0(i, j) = 2 × |common(ei(target), ej(target))| / (|ei(target)| + |ej(target)|) if ei(command) = ej(command), and 0 otherwise,

where e(command) and e(target) represent the command and target of test action e, respectively, and common(t1, t2) denotes the longest common substring of t1 and t2. For any i and j, s0(i, j) is in the range [0, 1]. s0(i, j) = 0 if and only if the two test actions have no common substring in their targets, and s0(i, j) = 1 if and only if they have the same target. For example, s0(i, j) = 10/16 = 0.625 for test actions ei = "clickAndWait link=abc" and ej = "clickAndWait link=def" because the total length of the targets is 16 and the common substring of the targets is "link=". δi is the column-wise vector representation of similarity matrix si, i.e., δi = [si(1,1), …, si(m,1), si(1,2), …, si(m,2), …, si(1,m), …, si(m,m)].

Third, we represent the contexts of test actions by the adjacency matrices A and B. Each entry of A (or B) describes a relation between two vertices in a graph.
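The initialization of s0 can be sketched as follows. This is our reconstruction of the formula from the 10/16 example above, not the authors' code; it scores two actions by the longest common substring of their targets whenever the commands match:

```python
# Sketch (our reconstruction) of the initial similarity s0:
# 0 if the commands differ; otherwise
# 2*|longest common substring of targets| / (|target_i| + |target_j|).
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def s0(ei, ej):
    """ei, ej: (command, target) pairs."""
    if ei[0] != ej[0]:                      # e(command) must match
        return 0.0
    common = longest_common_substring(ei[1], ej[1])
    total = len(ei[1]) + len(ej[1])
    return 2 * len(common) / total if total else 1.0

# The example from the text: common substring "link=", 2*5/(8+8) = 0.625
ei = ("clickAndWait", "link=abc")
ej = ("clickAndWait", "link=def")
assert abs(s0(ei, ej) - 0.625) < 1e-9
```

Equal targets give a score of 1 and disjoint targets give 0, matching the range [0, 1] stated above.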
A(i, j) = <ei(command), ej(command)> if ei and ej are two adjacent test actions, i.e., two adjacent vertices in GA (or GB); otherwise A(i, j) = ∅. Fourth, we replace the Kronecker matrix product A⊗B with the following formula:

  A(i, j) • B(p, q) = 1 if A(i, j) = B(p, q) ≠ ∅, and 0 otherwise    (2)

The formula indicates that A(i, j) • B(p, q) = 1 only when the two contexts for i → j and p → q are the same. After the similarity matrix is calculated (when δk+1 = δk), it is used to group test actions according to the similarity threshold. Specifically, for each row i and column j of the similarity matrix, ei and ej belong to the same cluster if sk(i, j) ≥ λ. Figure 3 shows a portion of the similarity matrix in the running example. After the clusters for e1 and e2 are identified, we find the cluster for e3 as follows: in the row for e3, the entries for e14 and e39 have a similarity score of 1, greater than λ = 0.8. Thus, e3, e14, and e39 belong to the same cluster, say E3. Since the cluster for e14 and e39 is found, the rows for e14 and e39 are marked as traversed. Once all rows of the matrix are traversed, all test actions are clustered.

Figure 3. Similarity matrix

We abstract each cluster Ei = {e1, …, ek} into a parameterized test action <e', ?x, ?y>, where e' corresponds to the command of ei and the common part of the target of ei (1 ≤ i ≤ k), ?y is a variable representing the difference in target, and ?x is a variable representing the value field. In other words, <e', ?y> is equivalent to the pair <command, target>. We use e' or e'(?x, ?y) to denote a parameterized test action. Table V shows some clusters for the running example, where e7' for cluster E7 originates from e7 and e18.

TABLE V. CLUSTERS OF TEST ACTIONS IN THE EXAMPLE
Cluster  Variable  ST1  ST2  ST3
e1'                e1   e12  e37
e2'                e2   e13  e38
e3'      ?x        e3   e14  e39
…
e7'      ?y        e7   e18
…
e26'               e11  e36  e42

We also normalize each Selenium test by replacing each test action with the corresponding parameterized test action with the actual parameters.
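The iteration can be sketched compactly under our reading of Eqs. (1) and (2): since each vertex in a test sequence has at most one predecessor and one successor, the Kronecker product reduces to lookups of adjacent action pairs whose edge labels (command pairs) match. This is an illustrative reimplementation, not the authors' code:

```python
# Simplified sketch of the context-sensitive similarity iteration: an
# action pair (i, j) gains similarity from its successor and predecessor
# pairs, but only when the connecting edges carry the same command labels.
import math

def context_similarity(actions, nxt, prv, s, steps=20):
    """actions: list of (command, target); nxt[i]/prv[i]: index of the
    successor/predecessor of action i within its test, or None;
    s: initial similarity matrix (e.g., s0)."""
    n = len(actions)
    for _ in range(steps):
        new = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # successor context: labels of edges i->nxt[i], j->nxt[j] match
                if (nxt[i] is not None and nxt[j] is not None
                        and actions[i][0] == actions[j][0]
                        and actions[nxt[i]][0] == actions[nxt[j]][0]):
                    new[i][j] += s[nxt[i]][nxt[j]]
                # predecessor context: labels of edges prv[i]->i, prv[j]->j match
                if (prv[i] is not None and prv[j] is not None
                        and actions[i][0] == actions[j][0]
                        and actions[prv[i]][0] == actions[prv[j]][0]):
                    new[i][j] += s[prv[i]][prv[j]]
        norm = math.sqrt(sum(v * v for row in new for v in row)) or 1.0
        s = [[v / norm for v in row] for row in new]
    return s

# Two tiny two-action tests: <open, type> and <open, type>
acts = [("open", "/"), ("type", "Email"), ("open", "/"), ("type", "Pass")]
nxt = [1, None, 3, None]
prv = [None, 0, None, 2]
init = [[1.0 if a[0] == b[0] else 0.0 for b in acts] for a in acts]
s = context_similarity(acts, nxt, prv, init)
assert s[0][2] > 0.3      # same command in the same context: high similarity
assert s[0][1] == 0.0     # different commands: no similarity
```

Actions are then clustered by collecting, for each untraversed row i, every j with s[i][j] ≥ λ, as described above.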
For example, e7 is replaced with e7'(Acer Ferrari 3200 Notebook Computer PC), where Acer Ferrari 3200 Notebook Computer PC is the actual parameter of e7'(?y). As such, ST1, ST2, and ST3 are transformed into the following normalized tests, where, for simplicity, the actual parameters of ei' are omitted:

ST1': <e1', e2', e3', e4', e5', e6', e7', e8', e9', e10', e26'>
ST2': <e1', e2', e3', e4', e5', e6', e7', e8', e9', e12', …, e25', e26'>
ST3': <e1', e2', e3', e4', e5', e26'>

B. Mining Patterns of Test Actions

A pattern of test actions refers to a sequence of parameterized test actions that occurs in one or more normalized tests. The number of parameterized test actions in pattern Ti is called the length of Ti, denoted by |Ti|. Given a set of normalized tests ST' = {ST1', …, STn'}, let SA' = {e1', …, em'} be the set of all parameterized test actions (i.e., clusters) of ST'. The problem of pattern mining is to find a set of test action patterns T such that, for each STi' ∈ ST', there exists a subset of T, say {T1, …, Tk} ⊆ T, such that STi' is composed of T1, T2, …, and Tk (a sequence of occurrences of T1, T2, …, Tk). Obviously, there are different ways to group test actions into patterns. The simplest one is to treat each parameterized test action as a pattern, that is, T = {T1, …, Tm} and Ti = <ei'> (|Ti| = 1) for any i (1 ≤ i ≤ m). Obviously, such patterns are of no interest because they provide no abstraction for test actions. Our approach tries to maximize the length of patterns in the mining process by finding the longest common subsequences in ST'. By treating each normalized test as a string e1'…ek', we use the suffix tree technique [7] for calculating the occurrence frequency of test action sequences in ST'. In the suffix tree, each node is labeled by a node ID, a list of parameterized test actions, and a list of tests. For example, node 1e1' [1,2,3] means that the node ID is 1 and test action e1' occurs in tests 1, 2, and 3 (so its frequency is 3).
To reduce the size of the suffix tree, a node may represent a sequence of test actions if the node has no siblings. For example, the node 2e[2-4]' [1,2,3] represents the test action sequence <e2', e3', e4'>. After the suffix tree is constructed, we sort all nodes in decreasing order of test action frequency. This makes it convenient to determine the highest-frequency node for each test action. From such a node, we can retrieve the longest subsequence (substring) involving the corresponding test action. Figure 4 shows a portion of the suffix tree in the running example. The nodes with a frequency of 3 are 1e1', 2e[2-4]', 3e5', 10e[2-4]', 11e5', 18e5', and 43e26'.

Figure 4. Partial suffix tree in the running example

We use the following rules to process the frequency table so as to identify patterns of test actions:

Rule 1 (Highest frequency rule): if a test action appears in multiple nodes with different frequency counts, only the node with the highest frequency will be kept in the table and all other nodes will be removed from the table. For instance, e26' appears in nodes 25e26' and 43e26'. Node 25e26' is removed due to its lower frequency.

Rule 2 (A priori path rule): if nodes N1 and N2 have the same frequency and N1 is the parent node of N2, then N1 is removed from the frequency table. A property of the suffix tree is that, if a test has traversed a node, then all parents of the node must have been traversed by the test. Consider 5e9' [1,2], which is traversed by tests 1 and 2. Its parent 4e[6-8]' must have been traversed by the two tests as well. We remove the parent node 4e[6-8]' from the frequency table.

Rule 3 (Longest path rule): if a test action appears in multiple nodes with the same frequency, only the node with the longest path to the earliest ancestor that has the same frequency will remain in the frequency table; the other nodes are removed. Consider e5', which appears in nodes 3e5', 11e5', and 18e5' with the same frequency of 3.
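The frequency information that the suffix tree computes efficiently can be illustrated with a naive enumeration of contiguous subsequences (a quadratic sketch of ours, not the paper's algorithm):

```python
# Naive sketch of the frequency computation: count, for every contiguous
# subsequence of normalized test actions, how many tests it occurs in.
from collections import defaultdict

def substring_frequency(tests):
    freq = defaultdict(set)                 # subsequence -> set of test ids
    for tid, test in enumerate(tests):
        for i in range(len(test)):
            for j in range(i + 1, len(test) + 1):
                freq[tuple(test[i:j])].add(tid)
    return {sub: len(ids) for sub, ids in freq.items()}

# Two of the normalized tests from the running example (parameters omitted)
ST1 = ["e1'", "e2'", "e3'", "e4'", "e5'", "e6'", "e7'", "e8'", "e9'", "e10'", "e26'"]
ST3 = ["e1'", "e2'", "e3'", "e4'", "e5'", "e26'"]
freq = substring_frequency([ST1, ST3])
assert freq[("e1'", "e2'", "e3'", "e4'", "e5'")] == 2   # "login" occurs in both
assert freq[("e6'", "e7'", "e8'", "e9'")] == 1          # "add item" only in ST1
```

The suffix tree yields the same counts without materializing every subsequence, which matters for long tests such as ST2.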
Their paths are <1e1', 2e[2-4]', 3e5'>, <10e[2-4]', 11e5'>, and <18e5'>. Thus, nodes 11e5' and 18e5' will be removed because 3e5' has the longest path.

Rule 4 (No-duplication rule): if there are multiple longest paths in Rule 3, only one node will be kept in the frequency table, and all other nodes are removed. Consider nodes 5e9', 13e9', 20e9', and 27e9'. Their paths have the same length. Three of them are removed from the frequency table.

After the rules are applied, each node remaining in the frequency table indicates a pattern of test actions – all the test actions in the path from the earliest ancestor node with the same frequency to the given node. In the running example, the following nodes will remain in the frequency table: 3e5', 43e26', 5e9', 6e10', and 9e[12-25]'. Node 1e1' is the earliest ancestor of 3e5'. The test action sequence from 1e1' to 3e5' is <e1', e2', e3', e4', e5'>. Table VI shows the five patterns in the running example. We use Ti to denote patterns. For each pattern, we also collect and rename the variables in the sequence of normalized test actions. Recall that a normalized test action may have two parameters ?x and ?y, which represent the value and a portion of the target of a Selenium test action. Suppose ?x1, …, ?xr, ?y1, …, ?ys are the parameters in the sequence of normalized test actions of pattern Ti. We use Ti(?x1, …, ?xr, ?y1, …, ?ys) to denote pattern Ti. As such, we specify the resultant patterns as the "methods" section of a MIM as shown in Table IV.

TABLE VI. PATTERNS IN THE EXAMPLE
Node        Test Action Sequence          Pattern  Parameters
3e5'        e1', e2', e3', e4', e5'       T1       ?x1, ?x2
43e26'      e26'                          T2
5e9'        e6', e7', e8', e9'            T3       ?x1, ?y1
6e10'       e10'                          T4
9e[12-25]'  e12', e13', …, e24', e25'     T5       ?x1, ?x2, …, ?x7

Based on the patterns, we transform each normalized test into an abstract test by replacing its test actions with patterns with the corresponding actual parameters.
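This final transformation can be sketched as a greedy longest-pattern match over a normalized test (our simplification; actual parameters omitted, and only the patterns needed for ST1 are listed):

```python
# Sketch: rewrite a normalized test as a sequence of pattern names by
# greedily taking the longest pattern that matches at each position.
patterns = {                              # pattern name -> action sequence
    "T1": ["e1'", "e2'", "e3'", "e4'", "e5'"],   # log in
    "T2": ["e26'"],                              # log out
    "T3": ["e6'", "e7'", "e8'", "e9'"],          # add item
    "T4": ["e10'"],                              # remove item
}

def abstract(test):
    out, i = [], 0
    while i < len(test):
        # longest pattern whose action sequence matches at position i
        best = max((name for name, seq in patterns.items()
                    if test[i:i + len(seq)] == seq),
                   key=lambda n: len(patterns[n]))
        out.append(best)
        i += len(patterns[best])
    return out

ST1_norm = ["e1'", "e2'", "e3'", "e4'", "e5'",
            "e6'", "e7'", "e8'", "e9'", "e10'", "e26'"]
assert abstract(ST1_norm) == ["T1", "T3", "T4", "T2"]
```

By construction of the mining rules, every normalized test is exactly covered by the mined patterns, so this rewriting always succeeds on the given tests.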
For example, ST1' = <e1', e2', e3', e4', e5', e6', e7', e8', e9', e10', e26'> can be transformed into the following abstract test: <T1([email protected], Mypswd1), T3(…, Acer Ferrari 3200 Notebook Computer PC), T4, T2>. To specify actual parameters in a well-defined format, we introduce named constants Ci to denote an actual parameter that is neither a number nor an identifier. For example, we use C1 and C2 to represent [email protected] and Acer Ferrari 3200 Notebook Computer PC, respectively. Thus, the abstract test can be rewritten as <T1(C1, Mypswd1), T3(…, C2), T4, T2>. We refer to the occurrence of a pattern in an abstract test as an abstract test action, i.e., an abstract test is a sequence of abstract test actions. The introduction of named constants Ci leads to the object mapping in the MIM specification as shown in Table IV. As such, the pattern miner results in a set of abstract tests and a MIM specification.

IV. MINING EXECUTABLE SPECIFICATIONS

A. PrT Nets

PrT nets in this paper are a simplified version of those in MISTA [1] (e.g., inhibitor arcs and reset arcs are not used). A PrT net N is a tuple <P, T, F, L, ϕ, M0>, where (a) P is a set of places (i.e., predicates), T is a set of transitions, and F is a set of arcs; (b) L is a labeling function on arcs F: L(f) is a label for arc f, and each label is a tuple of constants and variables; (c) ϕ is a guard function on T: ϕ(t) is a guard condition for transition t; (d) M0 is an initial marking: M0 = ∪p∈P M0(p), where M0(p) is the set of tokens residing in place p. A token in p is a tuple of constants, denoted as p(X1, …, Xn). The zero-argument tuple is denoted as <¢>. For token <¢> in p, we simply denote it as p. We also associate a transition with a list of variables as formal parameters, if any. Let p and t be a place and a transition, respectively. p is called an input (or output) place of t if there is a normal arc from p to t (or from t to p). Let x/V be a variable binding, meaning that variable x is bound to value V. A variable substitution is a set of variable bindings. Let θ be a variable substitution and l be an arc label.
l/θ denotes the tuple (or token) obtained by substituting each variable in l with its bound value in θ. Transition t in a given PrT net is said to be enabled by substitution θ under a marking if (a) each input place p of t has a token that matches l/θ, where l is the label of the normal arc from p to t, and (b) the guard condition of t evaluates to true according to θ. Enabled transitions can be fired. Firing an enabled transition t with substitution θ under M0 removes the matching token from each normal input place and adds a new token l/θ to each output place, where l is the label of the arc from t to the output place. This leads to a new marking M1. We denote a firing sequence as M0 [t1θ1>M1 … [tnθn>Mn, where ti (1≤i≤n) is a transition, θi (1≤i≤n) is the substitution for firing ti, and Mi (1≤i≤n) is the marking after ti fires. A marking M is said to be reachable from M0 if there is such a firing sequence that transforms M0 to M. In Figure 2, T1-T5 are transitions (rectangles), whereas Pbegin, P1, P2, pt1, pt2, and pt3 are places (circles). The arc from pt1 to T1 is labeled by the tuple of variables <?x1, ?x2>. If an arc is not labeled, the default label is the zero-argument tuple <¢>. T1 is enabled by θ = {?x1/C1, ?x2/Mypswd1} because token <C1, Mypswd1> in pt1 matches <?x1, ?x2>/θ, and the token in Pbegin matches <¢>.

B. Mining PrT Nets from Abstract Tests

Our approach to mining a PrT net from a set of abstract tests consists of two steps. First, we synthesize a place/transition net (traditional Petri net) from the abstract tests without considering the actual parameters. Second, we upgrade the place/transition net into a PrT net according to the actual parameters of the abstract tests. As the first step is not the focus of this paper, we use an existing process miner in the ProM tool [8] (http://promtools.org/prom5/), i.e., the Parikh language-based region miner [9].
This miner was successful in producing sound place/transition nets from the abstract tests without the actual parameters in our empirical studies (we found that the place/transition nets of the running example produced by the well-known α algorithm [10] and its extensions were incorrect). A place/transition net created by the above miner consists of places, transitions, arcs, and an initial marking. Each transition corresponds to an abstract test action (i.e., a test pattern). The place/transition net can reproduce all abstract tests without actual parameters in that each abstract test is a firing sequence in the Petri net. To deal with the actual parameters in the abstract tests, we upgrade this net as follows. For each abstract test action with actual parameters Ti(X1, …, Y1, …), where X1, …, Y1, … are actual parameters of ?x1, …, ?y1, …, if place pti has not been created before, we create place pti, add an arc from place pti to transition Ti, label the arc with <?x1, …, ?y1, …>, and associate (?x1, …, ?y1, …) with Ti as its formal parameters. In addition, we add token <X1, …, Y1, …> to place pti in the initial marking. For the running example, we obtain the net in Figure 2.

Theorem 1. The MID specification (PrT net and MIM) mined from the given Selenium tests can reproduce all Selenium tests if the place/transition net mined from the abstract tests without actual parameters is reproducible.

Proof. For the given Selenium tests, the MID specification is produced by the following steps:
(1) Transforming Selenium tests into normalized tests (Section III.A)
(2) Transforming normalized tests into abstract tests and MIM (Section III.B)
(3) Transforming abstract tests and MIM into Petri net and MIM (MID) (this section)
The theorem can be proved by demonstrating that each of the above steps is reversible. In step (1), each Selenium test <e1, e2, …, en> is transformed into a normalized test <e1', e2', …, en'> with actual parameters for each ei' (1≤i≤n).
They have the same semantics because ei is equivalent to the definition of ei′ after variables ?x and ?y (if they exist) are substituted with their actual parameters. In step (2), each normalized test is converted into an abstract test <T1, T2,…, Tk> with the actual parameters of each Ti, where Ti is specified by a sequence of normalized test actions in the "methods" section of the MIM specification and the actual parameters may use constants in the "objects" section of the MIM specification. For each abstract test <T1, T2,…, Tk>, we can replace the actual parameters of Ti with the corresponding constants, replace Ti with the corresponding sequence of test actions, and replace the variables with the corresponding actual parameters. This exactly results in the corresponding normalized test. In step (3), we can prove that each abstract test <T1, T2,…, Tk> with the actual parameters of each Ti is a firing sequence in the resultant PrT net. By the assumption that the place/transition net mined from the abstract tests without actual parameters can reproduce those tests, <T1, T2,…, Tk> is a firing sequence in the place/transition net. For any i (1≤i≤k), in the place/transition net, each input place of Ti has at least one token before Ti is fired under marking Mi. If Ti has no actual parameter, it has the same input places in the PrT net as in the place/transition net; thus, Ti is firable in the PrT net. If Ti has actual parameters X1,…,Y1,…, then in the PrT net Ti has an additional input place pti with arc label <?x1,…,?y1,…>. The initial marking, however, has a token <X1,…,Y1,…> in pti. This token remains in pti after Tj (j<i) are fired because firing Tj does not remove it. Therefore, Ti is firable with respect to substitution {?x1/X1,…,?y1/Y1,…}. In summary, each abstract test is a firing sequence in the PrT net. This abstract test can be transformed into a normalized test, which corresponds to the original Selenium test. ∎

The PrT net mined from a given Selenium test suite is an executable model of the abstract behaviors of the tests.
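The firing rule this executability rests on can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation: the place names, arc labels, and T1's output place are assumed for the example, loosely following Figure 2.

```python
# Sketch of the PrT-net firing rule: a marking maps each place to a
# multiset of tokens (tuples); an arc label is a tuple of variable names;
# l/theta substitutes each variable with its bound value.
from collections import Counter

def substitute(label, theta):
    """Compute l/theta: bind each variable in the arc label to its value."""
    return tuple(theta[v] for v in label)

def is_enabled(marking, inputs, theta, guard=lambda th: True):
    """inputs maps each input place of t to its arc label; t is enabled
    under theta if every input place holds a matching token and the guard
    evaluates to true."""
    return guard(theta) and all(
        marking[p][substitute(label, theta)] > 0 for p, label in inputs.items())

def fire(marking, inputs, outputs, theta):
    """Remove the matching token from each input place and add label/theta
    to each output place, producing the next marking."""
    new = {p: Counter(toks) for p, toks in marking.items()}
    for p, label in inputs.items():
        new[p][substitute(label, theta)] -= 1
    for p, label in outputs.items():
        new[p][substitute(label, theta)] += 1
    return new

# T1 of the running example (net fragment assumed for illustration):
# the arc pt1 -> T1 is labeled <?x1, ?x2>; Pbegin -> T1 has the
# zero-argument label; P1 is taken here as T1's output place.
marking = {"pt1": Counter({("C1", "Mypswd1"): 1}),
           "Pbegin": Counter({(): 1}),
           "P1": Counter()}
theta = {"?x1": "C1", "?x2": "Mypswd1"}
inputs = {"pt1": ("?x1", "?x2"), "Pbegin": ()}
outputs = {"P1": ("?x1", "?x2")}

assert is_enabled(marking, inputs, theta)
m1 = fire(marking, inputs, outputs, theta)
```

Replaying an abstract test then amounts to checking enabledness and firing each Ti in sequence under its substitution, which is what a stepwise simulator does interactively.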
ISTA provides a stepwise simulator that can replicate all abstract tests in the net. In Figure 2, for example, T1 is highlighted (in red) because it is firable under the initial marking. If we fire T1, the marking will be updated and the transitions enabled under the new marking will be highlighted.

Mining behavior models from test cases is the reverse process of model-based testing. Model-based testing consists of generation of abstract tests from a model (e.g., PrT net or finite state machine) and transformation of abstract tests into concrete tests (e.g., Selenium tests). Tests generated from a model are abstract because the model is an abstract description of the system under test. Transformation of abstract tests into concrete tests maps each test action into an executable form, such as Selenium test actions. Mining a behavior model from abstract tests is the opposite of test generation from a model, whereas pattern mining (grouping concrete test actions into abstract test actions) is the opposite of transforming abstract test actions into concrete test actions. It is worth pointing out that test cases used to mine a behavior model are not necessarily generated from models. However, the process of creating system tests often includes test generation from test requirements or implicit mental models and transformation of these tests into an executable form. In this sense, mining an abstract model from concrete tests depends on the reversibility of both test generation and test transformation.

C. Applications

While model mining techniques have been used for various purposes (e.g., program understanding, debugging, and conformance verification), our approach allows for additional applications due to the ability to mine executable specifications that can completely reproduce the original tests. In the following, we briefly discuss several applications.
Incremental modeling: Modeling is useful for understanding the behavior of a system and verifying conformance to the system requirements. Our approach can facilitate interactive and incremental modeling. When trying to understand an existing system, a user can perform some exploratory tests, synthesize a model from the tests, edit the model or the MIM specification, perform more tests, synthesize a new model, and so on. This is particularly beneficial for reverse engineering and incremental development.

Test updating: System testing often requires a large number of test cases, and maintenance of system tests is a nontrivial task. One problem is that many tests may need to be updated with respect to a particular function due to changes in requirements, design, or implementation. Because of system changes, for example, we may not be able to replay the tests recorded by a record-and-replay tool like Selenium IDE. Our approach can alleviate this problem. A pattern of test actions often represents a high-level function. We can easily modify the sequence of test actions in a pattern so as to update the original tests. In this case, the abstract tests and the model may remain unchanged.

Automated test selection and prioritization: In the software development process, regression testing is conducted from time to time because of frequent requirements changes. As test execution can be very time-consuming, test selection and test prioritization have been two major regression testing problems. Test selection chooses a subset of the existing tests. Test prioritization sorts a set of tests so that test failures, if they exist, can be reported as early as possible. Our work provides an automated model-based approach to addressing these problems. The model mined from the given test cases can be used to automatically select and prioritize the original tests according to a test coverage criterion (e.g., transition coverage) of the model.
Considering transition coverage of the model in Figure 2, for example, the first two tests can be selected.

Generation of new executable tests: Our approach can be used to generate new executable tests from the mined MID specification (PrT net and MIM description). The generation and execution of these tests need no human intervention. We can use the test generators for coverage criteria of PrT nets (e.g., transition coverage, state coverage, and round-trip coverage) or the random generator to produce new tests. Before test generation, we can also provide new test parameters in the PrT net and new test oracles in the concrete test actions in the MIM description. Our approach is particularly useful for permuting existing test data. For example, the Selenium tests in the running example include such test data as user accounts, quantities, and products. ISTA can automatically generate new executable tests that cover all combinations of these user accounts, quantities, and products. It is worth pointing out that, while the net mined from a test suite can reproduce all original tests, it also includes behaviors beyond the given tests. A new test derived from the net may not be valid. Consider a test suite that involves a normal customer and a system administrator. A new test resulting from the permutation of test data may exercise system administration functions through a normal customer account. This is invalid, although it can be modified to produce a useful negative test. To produce effective behavior models, we may classify tests into different pools before mining.

V. EMPIRICAL STUDIES

The main aspects of our approach are the pattern mining and model mining techniques. Theorem 1 in Section IV demonstrates that our model mining technique is sound. Therefore, our empirical studies focus on the following questions, which aim at evaluating our pattern miner.
(1) Do the discovered patterns of test actions reflect high-level functions, and do the mined models represent meaningful system behaviors?
(2) To what extent does pattern mining reduce the complexity of the behavior models?
(3) To what extent does pattern mining reduce the number of transition coverage tests?

Table VII lists the Selenium test suites used in our empirical studies. They were created for two systems: Magento [11] and Amazon (amazon.com). The tests were created with different levels of automation:
• Full automation (test suites 1 and 2): Selenium tests are generated completely from given MID specifications by ISTA. These tests came from our previous work on model-based generation of security tests from threat models [11].
• Partial automation (test suites 3-6): Abstract tests are generated from MID specifications (using PrT nets, not MIM descriptions) by ISTA. Concrete tests are derived and performed manually.
• No automation (test suites 7-8): Abstract tests are created without a formal test model, and concrete tests are derived and performed manually.

Test suite 5 is a subset of test suite 6 for comparing model mining with and without the pattern mining technique. ProM, which is used to mine a place/transition net from a test suite, could not handle the entire set of tests in test suite 6 when the pattern miner was not applied. The size of test suite 5 (35 tests) is the maximum number of raw Selenium tests in test suite 6 that ProM could process. The same holds for test suite 7, a subset of test suite 8: its size (31 tests) is the maximum number of raw Selenium tests in test suite 8 that ProM could process.

TABLE VII.
SELENIUM TEST SUITES IN THE EMPIRICAL STUDIES

No | Description | #Tests | #Selenium Actions | #Average Actions
1 | Magento XSS tests | 8 | 78 | 9.75
2 | Magento SQL injection tests | 12 | 135 | 11.25
3 | Magento function tests | 15 | 222 | 14.8
4 | Magento function tests | 25 | 334 | 13.36
5 | Magento function tests | 35 | 877 | 25
6 | Magento function tests | 121 | 3,778 | 27.91
7 | Amazon function tests | 31 | 525 | 17
8 | Amazon function tests | 74 | 1,477 | 19.95

A. Quality of Patterns and Models

For all the test suites, the discovered patterns of Selenium test actions represent meaningful functions, like those in the running example (such as "add item"). The models all capture the interactions of these functions in the given tests, although they reflect only partial behaviors of the underlying system. For test suites 1-6, we also compared the mined models with the pre-existing models from which the tests were generated. A main difference is that a sequence of transitions (abstract test actions) in a pre-existing net corresponds to one transition in the mined net. This is due to the "greedy" nature of the pattern miner, which always tries to find the longest subsequence. Consider again the running example. Suppose the third test ST3 is not used. T1 ("login") and T3 ("add item") would then be merged into one larger pattern, meaning "log in and add item". This has two implications. First, the pattern miner depends on the coverage of functions by the test suite as well as the size of the test suite. Second, the pattern miner may not be the best solution to automatic abstraction, if a best abstraction can be defined at all. One potential improvement is to introduce additional pattern parameters (e.g., maximum pattern size).

Table VIII shows the number of patterns and the number of abstract test actions for each test suite, where "reduction of test actions" = (number of Selenium test actions − number of abstract test actions) / number of Selenium test actions. The reduction of test actions by the pattern miner ranges from 45% to 78%.
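The reduction metric just defined is simple arithmetic; as a sanity check (our own sketch, not part of the paper's tooling), the following reproduces a few of the reduction figures reported in Table VIII:

```python
# "Reduction of test actions" = (Selenium actions - abstract actions) / Selenium actions,
# expressed as a rounded percentage.
def reduction(selenium_actions, abstract_actions):
    return round(100 * (selenium_actions - abstract_actions) / selenium_actions)

assert reduction(135, 42) == 69    # test suite 2
assert reduction(222, 78) == 65    # test suite 3
assert reduction(877, 240) == 73   # test suite 5
assert reduction(3778, 841) == 78  # test suite 6
```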
An interesting finding is that the discovered patterns in test suites 5 and 6 are the same (test suite 5 is a subset but covers the same functions), and the discovered patterns in test suites 7 and 8 are the same (test suite 7 is a subset but covers the same functions). This indicates that, although mining a meaningful behavior model may require a certain number of tests, an increase in test suite size does not always improve accuracy.

TABLE VIII. PATTERNS AND ABSTRACT TESTS

No | #Patterns | #Average Pattern Size | #Abstract Test Actions | Reduction of Test Actions
1 | 11 | 1.54 | 31 | 58%
2 | 8 | 4.25 | 42 | 69%
3 | 8 | 2 | 78 | 65%
4 | 8 | 1.75 | 154 | 54%
5 | 6 | 4.66 | 240 | 73%
6 | 6 | 4.66 | 841 | 78%
7 | 12 | 1.5 | 277 | 47%
8 | 12 | 1.5 | 80 | 45%

B. Pattern Mining vs. Model Complexity

Given a test suite, we may synthesize a PrT model without using the pattern miner, although the result can be too complex to manage. With the pattern miner, our approach can reduce the complexity of the resultant model. Table IX shows how the pattern miner has reduced the complexity of the resultant models for the given test suites, where #T and #P stand for the numbers of transitions and places, respectively. The reduction in the number of transitions ranges from 35% to 85%, and the reduction in the number of places ranges from 26% to 86%. The reduction rate is not applicable to test suite 6 or test suite 8 because ProM could not process the raw Selenium tests when the pattern miner was not used.

TABLE IX. PATTERN MINING AND REDUCTION OF COMPLEXITY

No | #T (without mining) | #P (without mining) | #T (with mining) | #P (with mining) | %T reduction | %P reduction
1 | 17 | 11 | 11 | 5 | 35% | 55%
2 | 46 | 36 | 8 | 5 | 78% | 86%
3 | 29 | 32 | 6 | 14 | 79% | 56%
4 | 18 | 22 | 8 | 12 | 56% | 45%
5 | 41 | 31 | 6 | 14 | 85% | 55%
6 | N/A | N/A | 6 | 14 | N/A | N/A
7 | 43 | 27 | 12 | 20 | 72% | 26%
8 | N/A | N/A | 12 | 20 | N/A | N/A

C. Pattern Mining and Transition Coverage Tests

In a mined PrT net, each transition represents one or more Selenium test actions. One strategy for regression test selection is to cover these test actions at least once.
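Such a strategy can be sketched as a greedy set cover over the mined net's transitions. The greedy heuristic and the test-to-transition mappings below are our own illustration (loosely patterned on the running example), not necessarily the selection algorithm used by the paper's tooling.

```python
# Greedy selection of existing tests, viewed as transition sequences in
# the mined net, until every transition is covered at least once.
def select_for_transition_coverage(tests):
    """tests: {test name: list of transitions it fires}."""
    uncovered = {t for seq in tests.values() for t in seq}
    chosen = []
    while uncovered:
        # pick the test covering the most still-uncovered transitions
        best = max(tests, key=lambda n: len(uncovered & set(tests[n])))
        gain = uncovered & set(tests[best])
        if not gain:  # remaining transitions are not coverable by any test
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

# Illustrative suite: two tests suffice to cover T1..T5.
suite = {"ST1": ["T1", "T2", "T5"],
         "ST2": ["T1", "T3", "T4", "T5"],
         "ST3": ["T1", "T3", "T2", "T5"]}
selected = select_for_transition_coverage(suite)
```

Prioritization can reuse the same scoring: ordering tests by their marginal coverage gain pushes likely-failure-revealing diversity to the front of the run.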
Table X shows the impact of the pattern miner on the number of tests needed for transition coverage of the mined models. For the large test suites 6 and 8, only a few (yet lengthy) tests are needed to cover all transitions after the pattern miner is applied, because the resultant models are strongly connected. For test suites 2-5 and 7, the reduction of transition coverage tests ranges from 33% to 85%. An exceptional case is test suite 1, where the pattern miner does not reduce the number of transition coverage tests. The reason is that the tests in test suite 1 were derived from a test tree generated by ISTA; the mined model is also a tree structure, so the transition coverage tests need to exercise all leaves of the tree.

TABLE X. PATTERN MINING AND TRANSITION COVERAGE TESTS

No | Without Mining | With Mining | Reduction %
1 | 8 | 8 | 0%
2 | 12 | 5 | 58%
3 | 6 | 1 | 83%
4 | 3 | 2 | 33%
5 | 7 | 2 | 71%
6 | N/A | 2 | N/A
7 | 21 | 5 | 76%
8 | N/A | 5 | N/A

In summary, the empirical studies based on real-world applications have demonstrated that our approach can produce useful specifications from Selenium tests. The main threats to validity are the small number of applications and test suites, the size of the test suites (the largest test suite has only 121 tests), and the average length of the tests (the largest average is 27.91 Selenium test actions).

VI. RELATED WORK

This paper is related to two areas of research: test mining and process mining.

A. Test Mining

Existing test mining research falls into two categories: (a) detecting invariants over data values or event sequences, and (b) generating state transition models from execution traces. This paper is related to the latter. Most existing approaches are based on the kTail algorithm [12] or its variants. kTail merges states based on the similarity of the next k invocations in the trace. GK-tail [13] generates Extended Finite State Machines (EFSMs) from positive invocation sequences and uses Daikon to infer constraints on state transitions from data values. Walkinshaw and Bogdanov [14] and Lo et al. [15] augment the inference of FSMs with a steering mechanism based on temporal properties. Schmerl et al. [16] proposed techniques for discovering software architectures from execution traces. While this paper bears similarity to the above work on mining state models, there are several differences. First, our approach produces a PrT net, rather than an FSM or EFSM. PrT nets are more expressive in that they can model and simulate both control flows and data flows. Second, our approach produces a model from GUI-based Selenium tests. As discussed in Section II.B, these tests are different from program tests that are composed of method invocations. The existing techniques do not mine patterns of test actions. Third, in our approach, all tests are completely reproducible from the resultant specification. Jääskeläinen et al. [17] presented a domain-specific approach for synthesizing test models from GUI tests of smartphone applications. It is domain-specific in that the set of keywords and the corresponding weight values must be decided based on domain knowledge. To achieve useful models, this approach relies on human interaction to merge test cases. The synthesized model tends to be complex due to the lack of pattern discovery from GUI tests. Synoptic [5] produces a model of system logs that satisfies temporal invariants. It uses regular expressions for clustering context-insensitive events in the logs. It is not concerned with pattern mining or reproducibility.

B. Process Mining

Creating an accurate workflow model is a complicated, time-consuming, and error-prone process. Execution logs are often the only means for gaining insight into the ongoing processes [10]. Process mining aims to extract workflow models from process execution logs, also called event logs, audit trails, or transaction logs [18]. Process mining has been used for the discovery, conformance analysis, and extension of workflow models.
The focus of most research is on mining heuristics based on the ordering relations of events in the process log. Although we use a process miner for producing a place/transition net from abstract tests, our work is different from process mining. First, process mining assumes that the set of events is given, whereas our approach needs to discover such a set of events (i.e., patterns). The pattern miner is different from the pre-processing of process mining, e.g., removing noise in the log or using event timestamps to extract causality information. Second, workflow nets focus on the control-flow aspect of workflows [19]. Mining is based on the temporal ordering of event occurrences in the logs; places in the resultant workflow net are created to represent control dependency, not data dependency. Rozinat et al. [20] considered data dependency by using colored Petri nets as process models. They analyzed data dependency at single decision points to determine the guard conditions of individual transitions after the net structure is mined by the α algorithm [10]. This approach is unable to reproduce the event sequences or the actual parameters.

VII. CONCLUSIONS

We have presented an approach to mining an executable specification from a given Selenium IDE test suite. The resultant specification provides an abstract behavior model that captures the interactions of the tests. This is very useful for understanding the system under test. The specification can also be used to select tests for regression testing and to generate new executable tests. Our work also demonstrates that the two opposite processes of model mining and model-based test generation can be integrated into one framework. This round-trip paradigm can provide an effective means for interactive model synthesis, simulation, and test generation. For example, a major barrier to applying model-based testing is the building of high-quality test models.
The model synthesized from some exploratory tests can serve as a starting point for building a comprehensive test model. In this paper, our approach was evaluated on a number of test suites for real-world systems. However, the evaluation was limited by the size of the test suites and the length of the individual tests. Our future work will further evaluate the scalability of our approach and its impact on model-based selection and prioritization of regression tests by using much larger test suites. In addition, Selenium IDE tests are essentially GUI-based system tests. We plan to investigate how our approach can be adapted to mining specifications from test suites produced by other record and replay tools.

ACKNOWLEDGMENT

This work was supported in part by the open projects program of the State Key Laboratory of Novel Software Technology, Nanjing University, China.

REFERENCES

[1] D. Xu, "A tool for automated test code generation from high-level Petri nets," Proc. of Petri Nets'2011, LNCS 6709, pp. 308-317.
[2] H.J. Genrich, "Predicate/transition nets," Petri Nets: Central Models and Their Properties, pp. 207-247, 1987.
[3] D. Xu and K.E. Nygard, "Threat-driven modeling and verification of secure software using aspect-oriented Petri nets," IEEE Trans. on Software Engineering, vol. 32, no. 4, April 2006, pp. 265-278.
[4] D. Xu, J. Yin, Y. Deng, and J. Ding, "A formal architectural model for logical agent mobility," IEEE Trans. on Software Engineering, vol. 29, no. 1, January 2003, pp. 31-45.
[5] I. Beschastnikh, Y. Brun, S. Schneider, M. Sloan, and M.D. Ernst, "Leveraging existing instrumentation to automatically infer invariant-constrained models," Proc. of ESEC/FSE'11, Sept. 2011, Hungary.
[6] V.D. Blondel, A. Gajardo, M. Heymans, M. Senellart, and P.V. Dooren, "A measure of similarity between graph vertices: applications to synonym extraction and web searching," SIAM Review, 46(4), 647-666, 2004.
[7] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1999.
[8] B. van Dongen, A. de Medeiros, H. Verbeek, A. Weijters, and W. van der Aalst, "The ProM framework: A new era in process mining tool support," Petri Nets'05, LNCS Vol. 3536, pp. 444-454, 2005.
[9] J.M. Werf, B.F. Dongen, C.A. Hurkens, and A. Serebrenik, "Process discovery using integer linear programming," Petri Nets'08, pp. 368-387, 2008, Springer-Verlag.
[10] W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster, "Workflow mining: discovering process models from event logs," IEEE TKDE, 16(9):1128-1142, 2004.
[11] L. Thomas, W. Xu, and D. Xu, "Mutation analysis of Magento for evaluating threat model-based security testing," Proc. of the 3rd IEEE International Workshop on Software Test Automation (STA'11), in conjunction with COMPSAC 2011, Munich, Germany, July 2011.
[12] A. Biermann and J. Feldman, "On the synthesis of finite state machines from samples of their behavior," IEEE Transactions on Computers, 21:592-597, 1972.
[13] D. Lorenzoli, L. Mariani, and M. Pezzè, "Automatic generation of software behavioral models," Proc. of ICSE'08, pp. 501-510, Leipzig.
[14] N. Walkinshaw and K. Bogdanov, "Inferring finite-state models with temporal constraints," Proc. of ASE'08, pp. 248-257.
[15] D. Lo, L. Mariani, and M. Pezzè, "Automatic steering of behavioral model inference," Proc. of ESEC/FSE'09, pp. 345-354, Amsterdam.
[16] B. Schmerl, J. Aldrich, D. Garlan, R. Kazman, and H. Yan, "Discovering architectures from running systems," IEEE Trans. on Software Engineering, vol. 32, no. 7, pp. 454-466, 2006.
[17] A. Jääskeläinen, A. Kervinen, M. Katara, A. Valmari, and H. Virtanen, "Synthesizing test models from test cases," Proc. of the 4th International Haifa Verification Conf. on Hardware and Software: Verification and Testing, pp. 179-193, 2009.
[18] R. Agrawal, D. Gunopulos, and F. Leymann, "Mining process models from workflow logs," Proc. of the Sixth International Conf. on Extending Database Technology, pp. 469-483, 1998.
[19] W.M.P. van der Aalst, "The application of Petri nets to workflow management," The Journal of Circuits, Systems and Computers, 8(1):21-66, 1998.
[20] A. Rozinat, R.S. Mans, M. Song, and W.M.P. van der Aalst, "Discovering colored Petri nets from event logs," International Journal on Software Tools for Technology Transfer (STTT), 10(1): 57-74, 2008.