State of the Art: Automated Black-Box Web Application Vulnerability Testing

Jason Bau, Elie Bursztein, Divij Gupta, John Mitchell
Stanford University, Stanford, CA
{jbau, divijg}@stanford.edu, {elie, mitchell}@cs.stanford.edu

Abstract—Black-box web application vulnerability scanners are automated tools that probe web applications for security vulnerabilities. In order to assess the current state of the art, we obtained access to eight leading tools and carried out a study of: (i) the class of vulnerabilities tested by these scanners, (ii) their effectiveness against target vulnerabilities, and (iii) the relevance of the target vulnerabilities to vulnerabilities found in the wild. To conduct our study we used a custom web application vulnerable to known and projected vulnerabilities, and previous versions of widely used web applications containing known vulnerabilities. Our results show the promise and effectiveness of automated tools, as a group, and also some limitations. In particular, "stored" forms of Cross Site Scripting (XSS) and SQL Injection (SQLI) vulnerabilities are not currently found by many tools. Because our goal is to assess the potential of future research, not to evaluate specific vendors, we do not report comparative data or make any recommendations about purchase of specific tools.

Keywords-Web Application Security; Black Box Testing; Vulnerability Detection; Security Standards Compliance

I. INTRODUCTION

Black-box web application vulnerability scanners are automated tools that probe web applications for security vulnerabilities, without access to the source code used to build the applications. While there are intrinsic limitations of black-box tools, in comparison with code walkthrough, automated source code analysis tools, and procedures carried out by red teams, automated black-box tools also have advantages. Black-box scanners mimic external attacks from hackers, provide cost-effective methods for detecting a range of important vulnerabilities, and may be used to configure and test defenses such as web application firewalls. Since the usefulness of black-box web scanners is directly related to their ability to detect vulnerabilities of interest to web developers, we undertook a study to determine the effectiveness of leading tools. Our goal in this paper is to report test results and identify the strengths of current tools, their limitations, and strategic directions for future research on web application scanning methods. Because this is an anonymized conference submission, we note that the authors of this study are university researchers.

Web application security vulnerabilities such as cross-site scripting, SQL injection, and cross-site request forgeries are acknowledged problems with thousands of vulnerabilities reported each year. These vulnerabilities allow attackers to perform malevolent actions that range from gaining unauthorized account access [1] to obtaining sensitive data such as credit card numbers [2]. In the extreme case, these vulnerabilities may reveal the identities of intelligence personnel [3]. Because of these risks, web application vulnerability remediation has been integrated into the compliance process of major commercial and governmental standards, e.g. the Payment Card Industry Data Security Standard (PCI DSS), the Health Insurance Portability and Accountability Act (HIPAA), and the Sarbanes-Oxley Act. To meet these mandates, organizations turn to web application scanners that detect vulnerabilities, offer remediation advice, and generate compliance reports.
Over the last few years, the web vulnerability scanner market has become a very active commercial space, with, for example, more than 50 products approved for PCI compliance [4]. This paper reports a study of current automated black-box web application vulnerability scanners, with the aim of providing the background needed to evaluate and identify the potential value of future research in this area. To the best of our knowledge, this paper is the most comprehensive research on any group of web scanners to date. Because we were unable to find competitive open-source tools in this area (see Section VII), we contacted the vendors of eight well-known commercial vulnerability scanners and tested their scanners against a common set of sample applications. The eight scanners are listed in Table I. Our study aims to answer these three questions:

1) What vulnerabilities are tested by the scanners?
2) How representative are the scanner tests of vulnerability populations in the wild?
3) How effective are the scanners?

Because our goal is to assess the potential impact of future research, we report aggregate data about all scanners, and some data indicating the performance of the best-performing scanner on each of several measures. Because this is not a commercial study or comparative evaluation of individual scanners, we do not report comparative detection data or provide recommendations of specific tools. No single scanner is consistently top-ranked across all vulnerability categories.

We now outline our study methodology and summarize our most significant findings. We began by evaluating the set of vulnerabilities tested by the scanners. Since most of the scanners provide visibility into the way that target vulnerability categories are scanned, including details of the distribution of their test vector sets by vulnerability classification, we use this and other measures to compare the scanner target vulnerability distribution with the distribution of in-the-wild web application vulnerabilities. We mine the latter from incidence rate data as recorded by VUPEN Security [5], an aggregator and validator of vulnerabilities reported by various databases such as the National Vulnerability Database (NVD) provided by NIST [6]. Using database results, we also compare the incidence rates of web application vulnerabilities as a group against incidence rates for system vulnerabilities (e.g. buffer overflows) as a group.

In the first phase of our experiments, we evaluate scanner detection performance on established web applications, using previous versions of Drupal, phpBB, and Wordpress, released around January 2006, all of which include well-known vulnerabilities. In the second phase of our experiments, we construct a custom testbed application containing an extensive set of contemporary vulnerabilities in proportion with the vulnerability population in the wild. Our testbed covers all of the vulnerabilities in the NIST Web Application Scanner Functional Specification [7] and tests 37 of the 41 scanner vulnerability detection capabilities in the Web Application Security Consortium [8] evaluation guide for web application scanners (see Section VII). Our testbed application also measures scanner ability to understand and crawl links written in various encodings and content technologies.
We use our custom application to measure elapsed scanning time and scanner-generated network traffic, and, most importantly, we tested the scanners for vulnerability detection and false positive performance. Our most significant findings include:

1) The vulnerabilities for which the scanners test most extensively are, in order, Information Disclosure, Cross Site Scripting (XSS), SQL Injection, and other forms of Cross Channel Scripting (XCS). This testing distribution is roughly consistent with the vulnerability population in the wild.
2) Although many scanners are effective at following links whose targets are textually present in served pages, most are not effective at following links through active content technologies such as Java applets, SilverLight, and Flash.
3) The scanners as a group are effective at detecting well-known vulnerabilities. They performed capably at detecting vulnerabilities already reported to VUPEN from historical application versions. Also, the scanners detected basic "reflected" cross-site scripting well, with an average detection rate of over 60%.
4) The scanners performed particularly poorly at detecting "stored" vulnerabilities. For example, no scanner detected any of our constructed second-order SQLI vulnerabilities, and the stored XSS detection rate was only 15%. Other limitations are discussed further in this paper.

Our analysis suggests room for improvement in detecting vulnerabilities inserted in our testbed, and we propose potential areas of research in Section VIII. However, we have made no attempt to measure the financial value of these tools to potential users. Scanners performing as shown may have significant value to customers when used systematically as part of an overall security program. In addition, we did not quantify the relative importance of detecting specific vulnerabilities. In principle, a scanner with a lower detection rate may be more useful if the smaller number of vulnerabilities it detects are individually more important to customers.

Section II of this paper discusses the black box scanners and their vulnerability test vectors. Section III establishes the population of reported web vulnerabilities. Section IV presents scanner results on Wordpress, phpBB, and Drupal versions released around January 2006. Section V discusses testbed results by vulnerability category for the aggregated scanner set and also false positives. Section VI contains some remarks on individual scanner performance as well as user experience. Section VII discusses related work and Section VIII concludes by highlighting research opportunities resulting from this work.

II. BLACK BOX SCANNERS

We begin by describing the general usage scenario and software architecture of the black-box web vulnerability scanners. We then discuss the vulnerability categories which they aim to detect, including test vector statistics where available. Table I lists the eight scanners incorporated in our study, which include products from several of the most-established security companies in the industry. All the scanners in the study are approved for PCI Compliance testing [4].

Table I. STUDIED VULNERABILITY SCANNERS

Company    Product            Version   Scanning Profiles Used
Acunetix   WVS                6.5       Default and Stored XSS
Cenzic     HailStorm Pro      6.0       Best Practices, PCI Infrastructure, and Session
HP         WebInspect         8.0       All Checks
IBM        Rational AppScan   7.9       Complete
McAfee     McAfee SECURE      Web       Hack Simulation and DoS
N-Stalker  QA Edition         7.0.0     Everything
Qualys     QualysGuard PCI    Web       N/A
Rapid7     NeXpose            4.8.0     PCI
The prices of the scanners in our study range from hundreds to tens-of-thousands of dollars. Given such a wide price range and also variations in usability, potential customers of the scanners would likely not make a purchase decision on detection performance alone.

A. Usage Scenario

To begin a scanning session using a typical scanner, the user must enter the entry URL of the web application as well as provide a single set of user login credentials for this application. The user must then specify options for the scanner's page crawler, in order to maximize page scanning coverage. Most scanners tested allow a "crawl-only" mode, so that the user can verify that the provided login and the crawler options are working as expected. After setting up the crawler, the user then specifies the scanning profile, or test vector set, to be used in the vulnerability detection run, before launching the scan. All scanners can proceed automatically with the scan after profile selection, and most include interactive modes where the user may direct the scanner to scan each page. In our testbed experiments, we always set the scanner to run, in automated mode, the most comprehensive set of tests available, to maximize vulnerability detection capability.

B. Software Architecture Descriptions

We ran two of the tested scanners, McAfee and Qualys, as remote services, whereby the user configures the scanner via a web interface before launching the scan from a vendor-run server farm. The other six scanners were tested as software packages running on a local computer, although the NeXpose scanner runs as a network service accessed by browser via an IP port (thus naturally supporting multiple scanner instances run by one interface). All scanners, as would be expected of black box web-application testers, generate HTTP requests as test vectors and analyze the HTTP responses sent by the web server for vulnerabilities. All local scanner engines seem to run in a single process, except for the Cenzic scanner, which runs a separate browser process that appears to actually render the HTTP response in order to find potential vulnerabilities therein.
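To make this request-and-analyze architecture concrete, the following is a minimal sketch of a reflected-XSS test vector of the kind a black-box scanner might send: it injects a uniquely marked script payload into a query parameter and checks whether the response echoes it back unencoded. This is an illustration only, not any vendor's implementation; the target URL and parameter name are hypothetical.

<?php
// Minimal sketch of a reflected (type 1) XSS test vector: inject a uniquely
// marked payload into one parameter and look for it, unencoded, in the response.
$target  = 'http://testbed.example/search.php';   // hypothetical page under test
$param   = 'q';                                    // hypothetical parameter name
$marker  = uniqid('xss');                          // unique token for this probe
$payload = "<script>alert('$marker')</script>";

$url = $target . '?' . http_build_query(array($param => $payload));

$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,   // capture the response body
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 10,
));
$body = curl_exec($ch);
curl_close($ch);

// If the payload comes back verbatim rather than HTML-encoded, the parameter
// is a candidate reflected-XSS finding.
if ($body !== false && strpos($body, $payload) !== false) {
    echo "Possible reflected XSS via parameter '$param'\n";
}

A real scanner repeats such a loop over every crawled URL, form field, cookie, and header, with many payload variants per category, which is where the per-category test vector counts discussed below come from.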
C. Vulnerability Categories Targeted by Scanners

As each scanner in our study is qualified for PCI compliance, all are mandated to test for each of the Open Web Application Security Project (OWASP) Top Ten 2007 [9] vulnerability categories. We also examine the scanning profile customization features of each scanner for further insight into their target vulnerability categories. All scanners except Rapid7 and Qualys allow views of the scanning profile by target vulnerability category; these categories are often taken directly from the OWASP Top Ten 2007 and 2010 RC1, the Web Application Security Consortium (WASC) Threat Classification version 1 [10], or the Common Weakness Enumeration (CWE) Top 25 [11]. In fact, each of these six scanners allows very fine-grained test customization, resulting in a set of over 100 different targeted vulnerability categories, too numerous to list here. However, when related vulnerability categories were combined into more general classifications, we were able to find a set of consensus classifications for which all tools test. Table II presents this list of consensus classifications, along with some example vulnerabilities from each classification.

Table II. CONSENSUS VULNERABILITY CLASSIFICATION ACROSS SCANNERS

Classification               Example Vulnerabilities
Cross-Site Scripting (XSS)   Cross-Site Scripting
SQL Injection (SQLI)         SQL Injection
Cross Channel Scripting      Arbitrary File Upload, Remote File Inclusion,
                             OS Command Injection, Code Injection
Session Management           Session Fixation, Session Prediction,
                             Authentication Bypass
Cross-Site Request Forgery   Cross-Site Request Forgery
SSL/Server Configuration     SSL Misconfiguration, Insecure HTTP Methods
Information Leakage          Insecure Temp File, Path Traversal,
                             Source Code Disclosure, Error Message Disclosure

We have kept Cross-Site Scripting and SQL Injection as their own vulnerability classifications due to their preponderant rate of occurrence (supported by "in the wild" data in the next section) and their targeting by all scanners. The Cross Channel Scripting (XCS) classification [12] includes all vulnerabilities, aside from XSS and SQLI, that allow a user to inject code "across a channel" onto the web server, where it then executes on the server or in a client browser; the table lists several examples.

D. Test Vector Statistics

We were able to obtain detailed enough test profile information for four scanners (McAfee, IBM, HP, and Acunetix) to evaluate how many test vectors target each vulnerability classification, a rough measure of how much "attention" scanner vendors devote to each classification. Figure 1 plots the percentage of vectors targeting each classification, aggregated over the four scanners. The results show that scanners devote most testing to information leakage vulnerabilities, followed by XSS and SQLI vulnerabilities.

Figure 1. Scanner Test Vector Percentage Distribution (categories: XSS, SQLI, XCS, Session, CSRF, Configuration, Info Leaks)

III. VULNERABILITY POPULATION FROM VUPEN-VERIFIED NVD

In order to evaluate how well the vulnerability categories tested by the scanners represent the web application vulnerability population "in the wild", we took all of the web vulnerability categories forming the consensus classifications from Table II and performed queries against the VUPEN Security Vulnerability Notification Service database for the years 2005 through 2009. We chose this particular database as our reference as it aggregates vulnerabilities, verifies them through the generation of successful attack vectors, and reports them to sources such as the Common Vulnerabilities and Exposures (CVE) [13] feed of the National Vulnerability Database.

We collected from the VUPEN database the relative incidence rate trends of the web application vulnerability classes, which are plotted in Figure 2. Figure 3 plots incidences of web application vulnerabilities against incidences of system vulnerabilities, e.g. Buffer Overflow, Integer Overflow, Format String, Memory Corruption, and Race Conditions, again collected by us using data from VUPEN.

Figure 2. Comparison of Web Application Vulnerability Classes in VUPEN Database (classes: XSS, SQLI, XCS, Session, CSRF, SSL, Information Leak; 2005-2009)

Figure 3. Web Application Vulnerabilities versus System Vulnerabilities in VUPEN Database (2005-2009)

Figure 2 demonstrates that Cross-Site Scripting, SQL Injection, and other forms of Cross-Channel Scripting have consistently counted as three of the top four reported web application vulnerability classes, with Information Leak being the other top vulnerability. These are also the top four vulnerability classes by scanner test vector count. Within these four, scanner test vectors for Information Leak amount to twice that of any other vulnerability class, but the Information Leak incidence rates in the wild are generally lower than those of XSS, SQLI, and XCS. We speculate that perhaps test vectors for detecting information leakage, which may be as simple as checking for accessible common default pathnames, are easier to create than other test types.
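As a simple illustration of this kind of test vector (a hypothetical sketch with an illustrative path list, not any vendor's actual checks), an information-leakage probe can amount to requesting well-known default or backup paths and flagging any that the server serves successfully:

<?php
// Sketch of a basic information-leakage check: probe common default or
// backup paths and report any that return HTTP 200.
$base  = 'http://testbed.example';
$paths = array('/phpinfo.php', '/backup.sql', '/config.php.bak', '/.svn/entries', '/admin/');

foreach ($paths as $path) {
    $ch = curl_init($base . $path);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,   // a HEAD request suffices for a status check
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    ));
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status === 200) {
        echo "Potential information leak: $path is accessible\n";
    }
}

By contrast, XSS and SQLI vectors must be tailored to the input handling of each page, which may help explain the larger share of information-leakage vectors noted above.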
Overall, however, it does appear that the testing emphasis for black-box scanners as a group is reasonably proportional to the verified vulnerability population in the wild.

We believe that the increase in SSL vulnerabilities shown in Figure 2 does not indicate a need for increased black-box scanning. A large number of SSL vulnerabilities were reported in 2009, causing the upward trend in SSL incidences. However, these are actually certificate spoofing vulnerabilities that allow a certificate issued for one domain name (usually one containing a null character) to be accepted as valid for another domain name [14], [15]. As this vulnerability is caused by mistakes made by the certificate authority and the client application (usually a browser), it cannot be prevented by the website operator and thus cannot be detected by web application scanning. In effect, the number of SSL/Server Configuration vulnerabilities that web application scanners may reasonably aim to detect does not appear to increase with the increased SSL vulnerability incidence rate.
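To illustrate the mechanism (a hypothetical sketch with made-up domain names, not an analysis of any particular client), the flaw arises when a validator treats the certified name as a C-style string and therefore stops comparing at an embedded null character:

<?php
// Illustration of null-character certificate spoofing [14], [15]: a certificate
// naming "www.bank.example\0.attacker.example" may be issued to the owner of
// attacker.example, yet a client that truncates at the first null byte will
// accept it for www.bank.example.
$certifiedName = "www.bank.example\0.attacker.example";
$requestedHost = "www.bank.example";

// Correct, binary-safe comparison: the names do not match.
$strictMatch = ($certifiedName === $requestedHost);   // false

// Flawed comparison that stops at the embedded null, as a C-string check would.
$parts      = explode("\0", $certifiedName);
$truncated  = $parts[0];
$looseMatch = ($truncated === $requestedHost);        // true: the spoof is accepted

var_dump($strictMatch, $looseMatch);

Since the defect lives in the issuing authority and the validating client, no HTTP probe against the web application itself can expose it, which is the point made above.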
Finally, Figures 2 and 3 suggest that 2006 was a particularly high-incident year for web application vulnerabilities, with incidents actually decreasing in subsequent years. (This trend is also confirmed by searches in the CVE database.) While it is impossible to be certain, evidence gathered during the course of this study, including the effectiveness of the scanners at detecting basic XSS and SQLI vulnerabilities, suggests that the decrease may be attributable to headway made by the security community against these vulnerabilities. Improved security, however, has been answered in turn by efforts to uncover more novel forms of the vulnerabilities.

IV. SCANNER RESULTS ON COMMON WEB APPLICATIONS

Having confirmed that the testing vector distribution of black-box web vulnerability scanners as a group roughly correlates with the vulnerability population trends in the wild, we now examine whether the scanners are actually successful at finding existing vulnerabilities. We ran all scanners on three popular web applications, Drupal, phpBB2, and Wordpress, all with known vulnerabilities. We chose to scan application versions released around January 2006, as this was prior to the peak in vulnerability reports in 2006. While these are field applications with some inherent uncertainty as to their exact vulnerability content, the early release dates mean these application versions are the most field-tested, with most vulnerabilities likely to have been recorded by VUPEN via the NVD. Table III lists the specific application versions tested as well as the number of known vulnerabilities, including those reported by the VUPEN database for each of these versions. For all applications, we installed only the default modules and included no add-ons.

Table III. PREVIOUSLY-REPORTED VS SCANNER-FOUND VULNERABILITIES FOR DRUPAL, PHPBB2, AND WORDPRESS

            Drupal 4.7.0    phpBB2 2.0.19   Wordpress 1.5 (Strayhorn)
Category    Known  Found    Known  Found    Known  Found
XSS           6      2        5      2       13      7
SQLI          2      1        1      1        8      4
XCS           4      0        1      0        8      3
Session       5      4        4      4        6      5
CSRF          2      0        1      0        1      1
Info Leak     4      3        1      1        6      4

Table III also shows the number of vulnerabilities found by any scanner in the group, out of the set of known vulnerabilities. As the table shows, the scanners in total did a generally good job of detecting these previously known vulnerabilities. They did particularly well in the Information Disclosure and Session Management classifications, leading to the hypothesis that effective test vectors are easier to add for these categories than for others. The scanners also did a reasonable job of detecting XSS and SQLI vulnerabilities, with a detection rate of about 50% for both. The low detection rate in the CSRF classification may possibly be explained by the small number of CSRF test vectors. Anecdotally, one scanner vendor confirmed that they do not report CSRF vulnerabilities due to the difficulty of determining which forms in the application require protection from CSRF.

V. SCANNER RESULTS ON CUSTOM TESTBED

In addition to testing scanner detection performance on established web applications, we also evaluated the scanners in a controlled environment. We developed our own custom testbed application containing hand-inserted vulnerabilities, each of which has a proven attack pattern. We verified each of the vulnerabilities present in this environment, allowing us significantly smaller uncertainty in vulnerability content than in the case of field-deployed applications. (The scanners as a group did not uncover any unintended vulnerabilities in our web application.) We plan to release this testbed publicly.

For each vulnerability classification, we incorporated both "textbook" instances and forward-looking instances, such as XSS with non-standard tags. However, we kept the vulnerability content of our testbed fairly proportional to the vulnerability population in the wild. Our testbed has around 50 unique URLs and around 3000 lines of code, installed on a Linux 2.6.18-128.1.6.el5 server running Apache 2.2.3, MySQL 5.0.45, and PHP 5.1.6. PhpMyAdmin was also running on our server alongside the testbed application, solely for administrative purposes; we thus ignored any scanner results having to do with phpMyAdmin.
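As an example of the kind of hand-inserted flaw such a testbed can contain (a hypothetical sketch for illustration, not our actual testbed code), the page below stores visitor comments verbatim and echoes them back unescaped to later visitors, i.e. a stored (type 2) XSS. Detecting it requires the scanner to correlate a payload submitted in one request with output returned in a later one.

<?php
// guestbook.php -- hypothetical stored (type 2) XSS example.
// Comments are persisted verbatim and echoed back unescaped to later visitors.
$file = 'comments.txt';

if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['comment'])) {
    // Vulnerable: the comment is stored without sanitization or encoding.
    file_put_contents($file, $_POST['comment'] . "\n", FILE_APPEND);
}
?>
<form method="post">
  <input type="text" name="comment">
  <input type="submit" value="Post">
</form>
<ul>
<?php
$comments = file_exists($file) ? file($file) : array();
foreach ($comments as $comment) {
    // Vulnerable: output is not passed through htmlspecialchars(), so a stored
    // payload such as <script>...</script> executes in every later visitor's browser.
    echo '<li>' . $comment . '</li>';
}
?>
</ul>

A purely reflected-XSS probe of the POST request sees only the static form in the immediate response, which is one reason stored vulnerabilities are harder for black-box tools to find than reflected ones.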
The remainder of this section is devoted to scanner testbed data. We begin by presenting the performance footprint of each scanner on our testbed. Following this, we report page coverage results, designed to test scanner understanding of various content technologies. We then present vulnerability detection results, first an overview and subsequently by vulnerability classification, giving a brief overview of our testbed design for each classification. Finally, we discuss false positives, including experimentally designed false-positive "traps" as well as scanner results.

A. Scanner Time and Network Footprint

Figures 4a and 4b respectively plot the time required to scan the testbed application and the number of network bytes sent/received by each scanner, as measured on the web server by tcpdump. Scanning time ranged from 66 to 473 minutes, while network traffic ranged from 80 MB to nearly 1 GB. Perhaps surprisingly, the scanning time and network traffic statistics seem to be relatively independent of each other, as exemplified by the Rapid7, Qualys, N-Stalker, and McAfee results. It is interesting that the two remote services, Qualys and McAfee, generated comparatively low amounts of network traffic. Finally, we wish to note that the footprint statistics are not indicative of vulnerability detection performance.

Figure 4. Scanner Footprint. (a) Scanner execution time in minutes: Acunetix 241, Cenzic 109, HP 87, IBM 66, McAfee 138, N-Stalker 168, Qualys 473, Rapid7 118. (b) Data sent/received in MB: Acunetix 123/146, Cenzic 76/116, HP 35/206, IBM 71/125, McAfee 25/53, N-Stalker 122/877, Qualys 48/145, Rapid7 186/649.

B. Coverage Results

To experimentally evaluate site coverage, we wrote hyperlinks using the technology in each category shown in Figure 5 and embedded tracker code in each landing page that measured whether the link was followed. For Java, SilverLight, and Flash, the linked applet or movie is a simple, bare shell containing only the hyperlink. We then link from the application home page, which is written in regular PHP, to the technology page containing the link. The link encoding category encompasses links written in hexadecimal, decimal, octal, and HTML encodings, with the landing page file named in regular ASCII. The "POST link" test involves a link that only shows up when certain selections are made on a POST form. The other technologies are self-explanatory. Figure 5 shows the experimental results, where the measure is the percentage of links successfully crawled out of the total links present, by technology category.

Figure 5. Successful Link Traversals over Total Links by Technology Category, Averaged Over All Scanners (Javascript events 79.16%, AJAX 50%, SilverLight 37.5%, Flash 12.5%, Java Applets 12.5%, PHP redirects 100%, Meta-refresh tag 100%, Link encoding 53.12%, Dynamic javascript 50%, Pop-up 100%, Iframe 87.5%, VBScript 62.5%, POST link 75%)

Figure 5 shows that the scanners as a group have fairly low comprehension of active technologies such as Java applets, SilverLight, and, surprisingly given its widespread use, Flash. We speculate that some scanners only perform textual analysis of HTTP responses in order to collect URLs, thus allowing them to perform decently on script-based links, which are represented in text, but not allowing them to follow links embedded in compiled objects such as Java applets and Flash movies. This would also explain the better coverage of SilverLight over Flash and Java, as SilverLight is delivered in a text-based markup language. We also see that the scanners could improve their understanding of various link encodings.

C. Vulnerability Detection Results

1) Overall Results: Figure 6 presents, by vulnerability classification, the vulnerability detection rate averaged over all scanners. The detection rate is simply calculated as the number of vulnerabilities found divided by the (known) total number of vulnerabilities. Results for each vulnerability classification, including an added malware detection classification, are explained in detail in the individual sub-sections to follow. Each sub-section describes the testbed for the category, plots the average detection rate over all scanners, and also plots anonymized individual scanner results for the category, sorted from best- to worst-performing.

Figure 6. Average Scanner Vulnerability Detection Rate by Category (XSS type 1 62.5%, XSS type 2 15%, XSS advanced 11.25%, XCS 20.4%, CSRF 15%, SQL 1st order 21.4%, SQL 2nd order 0%, Session 26.5%, Config 32.5%, Info leak 31.2%, Malware 0%)

The results show that the scanners as a group are fairly effective at detecting basic "reflected" cross-site scripting (XSS type 1), with a detection rate of over 60%. Also, although not shown, basic forms of first-order SQL Injection were detected by a majority of scanners.
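To clarify the distinction drawn in the results below (a hypothetical sketch with illustrative table and column names, not our testbed code): in first-order injection the attacker-controlled value reaches a query in the same request, whereas in second-order injection it is stored first and only later concatenated into a query, so the injecting request itself produces no visible symptom.

<?php
// Hypothetical first-order vs. second-order SQL injection (names illustrative).
$db = new mysqli('localhost', 'app', 'secret', 'testbed');

// First-order: the request parameter flows into a query in the same request,
// so a value such as  ' OR '1'='1  changes the query immediately.
$user = $_GET['user'];
$db->query("SELECT * FROM accounts WHERE username = '$user'");

// Second-order: the value is stored safely now (even via a prepared statement)...
$stmt = $db->prepare('INSERT INTO profiles (nickname) VALUES (?)');
$stmt->bind_param('s', $user);
$stmt->execute();

// ...and is only injected later, when another page trusts the stored value and
// concatenates it into a new query. The response to the storing request shows
// nothing unusual, which makes this hard to detect by black-box probing.
$row = $db->query('SELECT nickname FROM profiles ORDER BY id DESC LIMIT 1')->fetch_assoc();
$db->query("SELECT * FROM accounts WHERE username = '{$row['nickname']}'");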
Unfortunately, the overall results for the first-order SQL vulnerability 33779.16 50 37.5 12.5 12.5 100 100 53.12 50 100 87.5 62.5 75 % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% Javascript events AJAX Silver Light Flash Java Applets PHP redirects Meta-refresh tag Link encoding Dynamic javascript Pop-up Iframe VBScript POST link Figure 5. Successful Link Traversals over Total Links by Technology Category, Averaged Over All Scanners. 62.5 15 11.25 20.4 15 21.4 0 26.5 32.5 31.2 0 XSS type 1 XSS type 2 XSS advance XCS CSRF SQL 1st order SQL 2nd order Session Config Info leak Malware 0% 10% 20% 30% 40% 50% 60% Figure 6. Average Scanner Vulnerability Detection Rate By Category classification were dragged down by poor scanner detection of more complex forms of first-order SQL injection that use different keywords. Aside from the XSS type 1 classification, there were no other vulnerability classifications where the scanners as a group detected more than 32.5% of the vulnerabilities. In some cases, scanners were unable to detect testbed vulnerabilities which were an exact match for a category listed in the scanning profile. We also note how poorly the scanners performed at detecting “stored” vulnerabilities, i.e. XSS type 2 and second-order SQL injection, and how no scanner was able to detect the presence of malware. We will discuss our thoughts on how to improve detection of these under-performing categories in Section VIII. 2) Cross-Site Scripting: Due to the preponderance of Cross-Site Scripting vulnerabilities in the wild, we divided Cross-Site Scripting into three sub-classes: XSS type 1, XSS type 2, and XSS advanced. XSS type 1 consists of textbook examples of reflected XSS, performed via the