Design and use of reference data sets for
testing scientific software
M.G. Cox*, P.M. Harris
Centre for Information Systems Engineering, National Physical Laboratory, Teddington, Middlesex TW11 0LW, UK
Received 1 July 1997; accepted 8 July 1998
Abstract
A general methodology for evaluating the accuracy of the results produced by scientific software has been developed at the
National Physical Laboratory. The basis of the approach is the design and use of reference data sets and corresponding
reference results to undertake black-box testing.
The approach enables reference data sets and results to be generated in a manner consistent with the functional specification
of the problem addressed by the software. The results returned by the software for the reference data are compared objectively
with the reference results. Quality metrics are used for this purpose that account for the key aspects of the problem.
In this paper it is shown how reference data sets can be designed for testing software implementations of solutions to a
broad class of problems arising throughout science. It is shown how these data sets can be used in practice and how the results
provided by software under test can properly be compared with reference results. The approach is illustrated with three
examples: (i) mean and standard deviation, (ii) straight-line fitting, and (iii) principal components analysis. Software for such
problems is used routinely in many fields, including optical spectrometry. © 1999 The National Physical Laboratory. Published by Elsevier Science B.V. All rights reserved.
Keywords: Reference data sets; Software testing; Black-box testing; Quality metrics
1. Introduction
There is a growing need to ensure that software used
by scientists is fit for purpose and especially that the
results it produces are correct, to within a prescribed
accuracy, for the problems purportedly solved. To this
end the National Physical Laboratory (NPL) has
developed a general methodology which is applicable
to a range of scientific software. The basis of the
approach is the design and use of reference data sets
and corresponding reference results to undertake
black-box testing.
The approach allows for reference data sets and
results to be generated in a manner consistent with the
functional specification of the problem addressed by
the software. In addition, data sets corresponding to
problems with various ‘‘degrees of difficulty’’, or with
application-specific properties, may be produced. The
comparison of the results returned by the test software
with the reference results is made objective by the use
of quality metrics that account for the absolute differences between the test and reference results, the
computational precision of the underlying hardware,
and the ‘‘degree of difficulty’’ of the data set.
The methodology has been applied in the NPL’s
own software development, and to test specific parts of
proprietary software packages, including spreadsheets, in common use.
In this paper it is shown how reference data sets can
be designed for testing software implementations of
solutions to a broad class of problems arising throughout science. It is shown how these data sets can be used
in practice and how the results provided by software
under test can properly be compared with reference
results. The approach is illustrated with three examples: (i) mean and standard deviation, (ii) straight-line
regression, and (iii) principal components analysis.
Software for such problems is used routinely in many
fields, including optical spectrometry.
1.1. Why software fails or produces unreliable
results
Software used in scientific disciplines can be unreliable because it implements numerical algorithms
that are unstable or not robust. Some of the reasons
[1] for such failings are:
1. failure to scale, translate, normalise or otherwise
transform the input data appropriately before
solution (and to perform the inverse operation if
necessary following solution);1
2. the use of an unstable parametrisation2 of the
problem;
3. the use of a solution process that exacerbates the
inherent (natural) ill-conditioning of the problem;3
and
4. a poor choice of formula from a set of mathematically (but not numerically) equivalent forms (a brief illustration follows).
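As a small illustration of point 4, anticipating the standard-deviation example of Section 5.1, the following Python sketch (ours; the data values are illustrative only) contrasts a stable two-pass formula for the sample standard deviation with a mathematically equivalent one-pass form that suffers from subtractive cancellation.

```python
import math

def std_two_pass(x):
    """Stable form: subtract the mean before squaring (cf. Eq. (5) in Section 5.1)."""
    m = len(x)
    xbar = sum(x) / m
    return math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (m - 1))

def std_one_pass(x):
    """Mathematically equivalent but unstable form (cf. Eq. (6)):
    sum of squares minus m times the squared mean."""
    m = len(x)
    xbar = sum(x) / m
    return math.sqrt((sum(xi ** 2 for xi in x) - m * xbar ** 2) / (m - 1))

# Illustrative data: values near 1e6 with spread 0.1 (true sample standard deviation 0.1)
x = [1.0e6 - 0.1, 1.0e6, 1.0e6 + 0.1]
print(std_two_pass(x))   # ~0.1 to near full working precision
print(std_one_pass(x))   # only a few correct figures, owing to subtractive cancellation
```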
1.2. Requirements for testing: unambiguous
specification
Because detailed documentation on the algorithms
employed within software implementations is often
unavailable and interaction with the software is limited to ‘‘data in, results out’’, it is necessary to use
some form of black-box testing to discern whether any
particular item of software is fit for purpose. Such
testing can be carried out if there is a specification
available of the task carried out by the software. The
availability of such a specification is assumed here.
The specification may be simple, for example: ‘‘this software calculates the arithmetic mean and standard deviation of a prescribed set of numerical values’’. Even
in this simple case, the specification is required to be
unambiguous. Hence, it is stated that the arithmetic
mean is of concern (rather than, say, the geometric or
the harmonic mean). In a more complicated case, such
as regression, it would be necessary to state such
aspects as whether (i) the regression is carried out
in a least-squares or in some other sense and (ii) the
residuals (the departures of the specified data points
from the model) are measured ‘‘vertically’’ (i.e., in the
‘‘direction’’ of the dependent variable), perpendicular
to the response surface, or in some other way.
Moreover, it is also necessary to define the ‘‘set of
numerical values’’, e.g., whether they form a column
vector or a row vector (possibly a contiguous set of
cells in a row or column in a spreadsheet model) or are
defined by some other data structure.
If the specification is inconsistent with the task
carried out by the software, testing in accordance with
the specification would yield the conclusion that the
software was deficient, when in fact it might be
acceptable.
1.3. Black-box testing and reference data sets
Although scientific software is very widely used, it
is rarely tested in an objective and impartial manner.
Some approaches for such testing have been developed, however. These address the individual software
modules called directly by scientists or as part of
software packages used in science [2,3,4]. Even when
1A simple instance of data transformation is ‘‘coding’’, as used traditionally in many statistical studies. A data value $x$ is replaced by a value $x' = Ax + B$ which has been scaled (by $A$) and translated (by $B$) such that all such $x$-values lie in a normalised range such as $-1$ to $1$.
2Problem parametrisation is considered in Section 2.3.
3The condition of a problem is a measure of the sensitivity of the
solution to changes in the ‘‘input data’’. A well-conditioned
problem has the property that a small change to the input data
results in a correspondingly small change in the solution (when
solved using an optimally stable algorithm; see Section 3.1),
whereas for an ill-conditioned problem the results would change
appreciably. A poor algorithm can introduce the effects of ill-conditioning, even if not present in the problem. Moreover, it can
worsen any ill-conditioning effects already present.
the state-of-the-art has developed to the extent that
integrated systems can be tested properly, it will
remain necessary to test individual modules, because
of their importance in themselves, and their considerable use by software packages. Integrated-systems
testing concentrates on the interfaces between modules and such careful testing in conjunction with
individual module tests constitutes a key part of an
overall test.
The concern here is with reference data sets for the
black-box testing of scientific software. Such testing involves submitting reference data sets to the software under test (the test software); the resulting output (the test results) is compared with known reference results. For the application of this approach, it is necessary to be able
to generate reference pairs, i.e., reference data sets and
the corresponding reference results.
Section 2 of this paper addresses some of the main
considerations in designing reference data sets and
Section 3 the use of these data sets. Section 4 is
concerned with the production of reference data sets
in the important case of least-squares regression.
Section 5 presents examples of application. These
are (i) mean and standard deviation (the problem of
determining which can be posed as fitting a constant
function to data), (ii) straight-line regression (the
‘‘next-hardest’’ problem following fitting a constant),
and (iii) principal components analysis. Section 6
contains concluding remarks.
Some technical terms are used in this paper which
may not be familiar to the intended audience. Informal
definitions of these terms are included as footnotes.
2. Design of reference data sets
Some of the main considerations in designing
reference data sets for checking the fitness for the
purpose of software are:
1. the identification of possible deficiencies in the
test software, such as severe rounding-error
effects, subtractive cancellation4 and overflow5 in
the computation;
2. the incorporation of known properties in the data
sets, such as the ‘‘degree of difficulty’’6 of the
associated problem; and
3. the need to mimic actual measurement data sets,
i.e., the construction of data sets for which the
reference results are known and which are ‘‘close
to’’ the data sets used in a particular application
area.
In principle, the approach outlined here is capable
of application to a very general class of problems, viz.,
the determination of a minimum of a general function
of any number of variables. Here, however, the more
modest class of problems of least-squares regression7
is considered. This class contains the problem of
computing the arithmetic mean and standard deviation
of a data set.
2.1. Fitness for purpose
Software can, at least in part, be checked for fitness for purpose by ensuring that it performs properly for reference data sets that are representative of the application. It can further be so checked by including data sets designed to reveal the extent to which the software yields sufficiently accurate solutions. For
both these purposes it is highly desirable that the data
sets possess certain properties. One valuable property
is that the problems identified by the data sets possess
known solutions.
Regression software can be deficient in diverse
ways, such as an inability to deal accurately with
observational data containing a significant noise component, large numbers of observations, or certain
distributions of points. Suitable tests can be devised
to help reveal such deficiencies.
2.2. Design based on prior knowledge of solution
The approach adopted at the NPL to the design of
reference data sets is based on the characterisation of
the solution to the problem addressed by the software.
For example, a necessary condition [5] for a solution
4An example of subtractive cancellation is given in Section 5.1.
5Overflow is the consequence of performing an arithmetic
operation such that the result is too large (in magnitude) to be held
within the arithmetic system employed. Similarly, underflow can
arise; the result is too small to be represented and is replaced by
zero or an approximation.
6The degree of difficulty of a problem is closely related to
condition (Section 1.1) and also incorporates the effect of problem
scaling (Section 3.1).
7Least-squares regression is the fitting of a model to data using
the method of least squares.
of a nonlinear least-squares problem is that $J^{\mathrm{T}}e = 0$. Here $e$ is the vector of residuals and $J$ is the Jacobian matrix, both evaluated at the solution. ($J$ is the rectangular matrix of partial derivatives of the algebraic expressions for the residuals with respect to the problem parameters, evaluated for each data point.) Note that $J^{\mathrm{T}}J$ is the normal matrix, i.e., the matrix occurring in the normal-equations approach ([6], p. 237) to regression.
Reference data sets are constructed, using a data
generator, to have known solutions, i.e., solutions
specified a priori. The use of data generators is illustrated in Section 4.1.
There is an alternative approach to employing a data
generator, viz., to use reference software. Such software is software written to an extremely high standard
to solve the problem given in the functional specification. It is akin to a primary standard in measurement,
such as a kilogram mass to which secondary standards
are compared for calibration purposes. It has been
stated [3], however, that reference software is very
demanding to provide. At the NPL we have found that
the effort required to produce a data generator can be a
small fraction of that to produce reference software.
The reason is that the production of a data generator
tends to be a much more compact operation, with
fewer numerical pitfalls, and hence its testing overhead is significantly less than that of reference software for the forward calculation (i.e., of determining
reference results from reference data sets).
There is a further significant advantage of the use of
data generators over that of reference software. The
latter can of course be used to furnish the solution
corresponding to any data set within its domain of
applicability.8 However, in the case of the former,
required specific properties can be built in, e.g., in
polynomial regression a data set for which a particular
high-degree polynomial would be required.
Finally, real measurement data sets often have
certain properties that in total can be quite difficult
to characterise. However, data sets can be devised
which are ‘‘close’’ to such data sets and which correspond to known solutions [7].
2.3. Problem parametrisation
Problem parametrisation can readily be demonstrated using a straight line. A straight line can be expressed in a variety of ways, including
$$ y = a_1 + a_2 x, \qquad (1) $$
where $a_1$ is the $y$-axis intercept and $a_2$ the gradient, and
$$ y = a_1 + a_2 (x - c), \qquad (2) $$
where $a_2$ is the gradient as before, but $a_1$ now represents the intercept with the line $x = c$, the value of $c$ being at choice, rather than with $x = 0$, the $y$-axis.
Parametrisation (2) is clearly a generalisation of (1). This generality can be used to advantage in designing an algorithm for straight-line fitting, by selecting a value for $c$ that yields desirable properties. The choice of $c$ as the arithmetic mean of the data abscissa values $x_i$ has several such properties ([2], p. 120).
There is an alternative to re-parametrisation in this case, viz., data shifting. If the data abscissa values $x_i$ are replaced before fitting by $x_i - c$, the form (1) then obtains.
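The numerical benefit of such a shift can be seen directly in the conditioning of the fitting problem. The following sketch (ours; the abscissa values are arbitrary, chosen far from the origin) compares the condition number of the normal matrix $J^{\mathrm{T}}J$ for parametrisation (1) with raw abscissae and for the equivalent form obtained by shifting the data by $c$ equal to the mean of the $x_i$.

```python
import numpy as np

# Abscissae far from the origin; the values are illustrative only
x = np.linspace(1.0e4, 1.0e4 + 1.0, 21)

def design_matrix(x):
    """Design (Jacobian) matrix for the straight-line parametrisation y = a1 + a2*x."""
    return np.column_stack([np.ones_like(x), x])

J_raw = design_matrix(x)                  # parametrisation (1), raw abscissae
J_shifted = design_matrix(x - x.mean())   # equivalent to parametrisation (2) with c = mean(x_i)

# Condition numbers of the normal matrices; the shifted form is far better conditioned
print(np.linalg.cond(J_raw.T @ J_raw))
print(np.linalg.cond(J_shifted.T @ J_shifted))
```

With these values the first condition number exceeds the second by many orders of magnitude, which is the kind of unnecessary ill-conditioning that a sensible parametrisation or data transformation avoids.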
The type of testing considered here can implicitly
identify whether it is likely that a sensible parametrisation has or has not been used (or equivalently
whether a suitable data transformation has been
adopted).
Problem parametrisation can raise a difficulty for
the software tester. If there is no widely recognised or
‘‘canonical’’ parametrisation, either the values of the
parameters corresponding to the actual parametrisation used by the software must be compared with
reference values based on the same parametrisation,
or some subsidiary or derived quantities must be
assessed. The former means that the software tester
must be in a position to prepare reference solutions for
possibly a wide range of different parametrisations,
thus increasing the cost of the test. This is the solution
adopted by the International Organization for Standardization (ISO) [8] in its work on preparing a standard for
testing software implementations of a class of nonlinear regression algorithms used in industrial inspection. The latter implies that some invariant, e.g.,
canonical, form must be determined and moreover,
careful analysis carried out to ensure that the test on
these quantities implies an acceptable result for the
primary parameters. However, within the area of
8The domain of applicability of an item of software is the set of
inputs that the software has been designed to accept and for which
it should produce reliable results.
regression, the residuals of a correctly computed
solution are invariant to the parametrisation and are
often suitable for testing purposes. Section 4.1 considers the use of residuals in testing straight-line
regression software.
3. Use of reference data sets
This section treats the factors influencing the comparison of reference results and test results, and indicates the quality metrics used for this purpose. For
further details, see [2].
Major factors affecting the comparison of test and
reference results are the computational precision $\eta$ of the arithmetic used to produce the test results,9 the numerical precision of the reference data sets, the degree of difficulty $K$ (including the scale) of the problem, and the problem parametrisation. The effects of $\eta$ and $K$ are considered here, assuming that problem-scale effects have been incorporated into $K$ and
that the reference results are known to adequate
accuracy for comparison purposes. Further information concerning these factors and their influence and
other effects such as input–output is available [2,9].
3.1. Quality metrics
After the test software has been applied to each
reference data set, (at least) three quality metrics
which indicate how the test software has performed
on that data set can be determined [2]:
1. the difference between the test and reference
results, an absolute measure of departure;
2. the number of figures of agreement, a relative
measure of departure; and
3. a performance measure depending on the major
factors above.
The use of the third metric is considered here. It is
indicated below how it is determined and used in
practice.
Express the reference results and test results as vectors of floating-point numbers. Let $d = b(\mathrm{test}) - b(\mathrm{ref})$, where $b(\mathrm{test})$ denotes the test results and $b(\mathrm{ref})$ the reference results. $d(b) = \mathrm{RMS}(d)$ is used here to measure this difference, where $\mathrm{RMS}(x) = \|x\|_2/\sqrt{n}$ denotes the root-mean-square value of an $n$-vector $x$.
For each data set the problem degree of difficulty $K$ is determined such that $d(b)$ can sensibly be compared with a tolerance equal to $K\eta$. (The determination of $K$ would normally be carried out by a numerical analyst.) Thus, $K$ depends on the problem condition and scale.10 The quantity $d(b)/(K\eta)$ will then have the property that it is of order unity for software that performs near-optimally and is possibly very large for poorly performing software.
The performance measure $P(b)$ is defined by
$$ P(b) = \log_{10}\left(1 + \frac{d(b)}{K\eta}\right). \qquad (3) $$
It indicates the number of figures of accuracy lost by the test software over and above what software based on an optimally stable algorithm11 would produce, being near zero if $b(\mathrm{test})$ and $b(\mathrm{ref})$ agree as closely as could be expected, and having a value of about $k$ if the agreement is $k$ figures less than this. A related performance measure is used in testing software for evaluating special functions [4].
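A direct transcription of these quantities into code is straightforward. The sketch below is ours (the function names and the use of NumPy are assumptions); it computes $d(b)$ and the performance measure $P(b)$ of Eq. (3) for given test results, reference results and degree of difficulty $K$, with the IEEE double-precision value of $\eta$ taken from footnote 9.

```python
import numpy as np

ETA = np.finfo(float).eps   # computational precision of IEEE double-precision arithmetic (~2e-16)

def rms(v):
    """Root-mean-square value of an n-vector: ||v||_2 / sqrt(n)."""
    v = np.asarray(v, dtype=float)
    return np.linalg.norm(v) / np.sqrt(v.size)

def performance_measure(b_test, b_ref, K, eta=ETA):
    """Performance measure P(b) = log10(1 + d(b)/(K*eta)) of Eq. (3),
    where d(b) = RMS(b_test - b_ref) and K is the degree of difficulty."""
    d = rms(np.asarray(b_test, dtype=float) - np.asarray(b_ref, dtype=float))
    return np.log10(1.0 + d / (K * eta))
```

Plotting $P$ against $K$ for a sequence of graded data sets gives a performance profile of the kind discussed in Section 4.3.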
4. Production of reference data sets
This section is concerned with the use of data
generators to produce reference data sets in regression
and related areas, leading to the null-space approach
to data generation (based on solution characterisation). It also covers graded data sets and performance
profiles. The approach is applicable to all aspects of
least-squares model fitting with physical or empirical
models, data with constant or non-constant variance,
9For the very commonly used floating-point arithmetic, $\eta$ is the smallest positive representable number $u$ such that the value of $1 + u$, computed using the arithmetic, exceeds unity. For the many floating-point processors which today employ IEEE arithmetic, $\eta = 2^{-52} \approx 2 \times 10^{-16}$, corresponding to approximately 16-digit working.
10The scale of a problem is taken into account through the
dimensions, physical or otherwise, of the results. It is essential that
the expression of the same results in different units, such as
kilograms or micrograms, does not adversely influence the
perception of the behaviour of the software.
11An optimally stable algorithm would produce as much
accuracy as is possible in solving the problem defined by the
specification and the data set, using the available computational
precision.
and data with or without correlated errors. Here, we
only consider the simplest case.
4.1. Reference software and data generators
A reference pair, i.e., a reference data set and the
corresponding reference results, may be produced in
two ways (Section 2.2):
1. Start with a reference data set and apply reference
software (as, e.g., in [10]) to it to produce the
corresponding reference results.
2. Start with some reference results and apply a data
generator to them to produce a corresponding
reference data set.
For problems with a unique solution there is one set
of reference results corresponding to a given reference
data set. Conversely, for given reference results there
is in general an infinite number of corresponding
reference data sets. This latter property can be used
to considerable advantage in generating reference data
sets, both in terms of forming data sets having selected
degrees of difficulty, and in determining data sets
which mimic actual data sets from applications. As
a simple example, there is a unique sample standard
deviation s for a given sample of two or more numbers,
whereas given s there are infinitely many data sets
having this value as their standard deviation. We return
to this example in Sections 4.2 and 5.1.
Null-space methods [7,9] provide a facility for
generating families or classes of data sets. When
any of these sets is submitted to correctly working
software, nominally the same solution is produced in
each case. Moreover, this solution is identical to a
known solution. A further advantage of null-space
methods is that they can be used to characterise the
space of reference data sets having the given solution.
In particular, a sequence of data sets from this
space can be extracted having a range of degrees of
difficulty that proves to be invaluable for software
testing.
The null-space approach is now illustrated for the linear least-squares model $y = Ja + e$, where $J$ denotes the observation (Jacobian) matrix, $y$ the observation vector, $a$ the vector of model parameters and $e$ the vector of residuals. For example, if the data to be fitted is $\{(x_i, y_i)\}$, $i = 1, \ldots, m$, and the model is the straight line with parametrisation (1), then
$$ J = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix}, \quad a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}. $$
Now the least-squares solution is characterised by $J^{\mathrm{T}}e = 0$ [6]. This equation implies two conditions, one that
$$ \sum_{i=1}^{m} e_i = 0, $$
i.e., the sum of the residuals is zero, and the other that
$$ \sum_{i=1}^{m} x_i e_i = 0, $$
i.e., the first moment (with respect to the abscissa values) of the residuals is zero. Let $N$ be a basis12 for the null space of $J^{\mathrm{T}}$, i.e., $J^{\mathrm{T}}N = 0$. Then, for a vector $r = Nu$, for any choice of vector $u$, the replacement of $y$ by $y + r$ will leave the solution vector $a$ unchanged. This result can be seen as follows. The condition $J^{\mathrm{T}}e = 0$ and the model $y = Ja + e$ imply the normal equations $J^{\mathrm{T}}Ja = J^{\mathrm{T}}y$. But, from the definition of the null space, $J^{\mathrm{T}}r = J^{\mathrm{T}}Nu = (J^{\mathrm{T}}N)u = 0$. Thus, from any one data set any number of further data sets having the same solution can readily be constructed by choosing vectors $u$.
The problem of reference software has hence essentially been ‘‘replaced’’ by that of a reference implementation of ‘‘null’’, which is at the heart of data generation for all regression problems. The null space of a general matrix $J$ can be calculated very reliably using the singular-value decomposition ([6], p. 602). One such implementation is the null function in Matlab [11]. Additionally, it is possible to check the accuracy of the generated data sets by forming $J^{\mathrm{T}}e$ and assessing the closeness of $\|J^{\mathrm{T}}e\|_2$ to zero, relative to the value of $\|J^{\mathrm{T}}\|_2\,\|e\|_2$. This check is generic and does not require the availability of software to solve the particular problem considered.
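As a concrete sketch of this construction (ours; the paper cites the null function of Matlab, and NumPy is used here instead), the following code computes an orthonormal basis $N$ for the null space of $J^{\mathrm{T}}$ from the singular-value decomposition of $J$, builds a reference data set with a solution prescribed a priori, and applies the generic $\|J^{\mathrm{T}}e\|_2$ check.

```python
import numpy as np

def null_space_of_transpose(J):
    """Orthonormal basis N for the null space of J^T (so that J^T N = 0),
    taken from the left singular vectors of J beyond its column count."""
    U, _, _ = np.linalg.svd(J)
    return U[:, J.shape[1]:]

# Straight-line model y = a1 + a2*x at five illustrative abscissae
x = np.linspace(0.0, 1.0, 5)
J = np.column_stack([np.ones_like(x), x])   # observation (Jacobian) matrix
a_ref = np.array([1.0, 0.5])                # reference solution, fixed a priori

N = null_space_of_transpose(J)              # here 5 - 2 = 3 basis vectors
u = np.array([0.02, -0.01, 0.03])           # arbitrary choice of the vector u
e = N @ u                                   # residual vector lying in the null space of J^T
y = J @ a_ref + e                           # reference data set with known solution a_ref

# Correctly working least-squares software should recover a_ref from (x, y)
a_fit, *_ = np.linalg.lstsq(J, y, rcond=None)
print(a_fit)                                # ~[1.0, 0.5]

# Generic accuracy check on the generated data set:
# ||J^T e||_2 should be tiny relative to ||J^T||_2 * ||e||_2
print(np.linalg.norm(J.T @ e) / (np.linalg.norm(J.T, 2) * np.linalg.norm(e)))
```

Different choices of $u$, including random ones, give different data sets with nominally the same solution, which is the basis of the graded and replicated data sets discussed below.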
12This basis can be expressed as an independent set of vectors
which together form N. This set is in general not unique. It is
usually determined to be orthogonal, i.e., the inner product of any
two (different) vectors is zero. A vector in the null space can be
represented as a linear combination of these ‘‘null’’ vectors.
4.2. Arithmetic mean and standard deviation
The arithmetic mean and standard deviation of a set
of numbers are considered as an illustrative example in
Section 5.1, but here we indicate how the above
concepts may be applied in this simple case. This
example is closely related to the above straight-line
model, since determining the mean of a set of numbers
is equivalent to fitting them by a constant, and their
standard deviation is identical to the root-mean-square
residual of the fit. Any two values having $\bar{x}$ as their mean are given by $\bar{x} \pm \delta$ for any $\delta$. (This is the only null-space perturbation possible in this case.) Further, any two values given by $\mu \pm s/\sqrt{2}$ have $s$ as their (sample) standard deviation for any value $\mu$. Thus, by varying $\mu$, e.g., different data sets of size two can be generated, where in each case the solution standard deviation is $s$.
These data sets contain only two values, and in each case there are just two solution parameters, $\bar{x}$ and $s$.
Null-space methods [7] permit the above concept to be
extended to any number of values. Moreover, as
intimated earlier, they permit not just simple problems
such as that above to be handled, but also extend to
very complicated problems, even involving nonlinear
optimisation with constraints.13
In order to emphasise the importance of testing
standard-deviation software, we comment that a particular spreadsheet package loses 12 of the 14 decimal figures available when $s = 0.1$ and $\mu = 10^6$, whereas another loses just three figures. When $\mu$ is increased to $10^7$, the first spreadsheet package returns zero(!) as the answer (as a result of losing all figures), whereas the second delivers a value differing from the true value by less than $10^{-9}$. Thus, $\mu$ (or $K$) can be used as a performance parameter, i.e., a varying parameter to investigate performance.
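A sketch of this construction follows (ours; the specific values mirror the test just described, and the naive one-pass formula merely stands in for whatever unstable formula a package might use internally). Each data set consists of the two values $\mu \pm s/\sqrt{2}$, so its sample standard deviation is exactly $s$, and $\mu$ is varied as the performance parameter.

```python
import math

def two_value_data_set(mu, s):
    """Two values with mean mu and sample standard deviation exactly s."""
    return [mu - s / math.sqrt(2.0), mu + s / math.sqrt(2.0)]

def std_one_pass(x):
    """Unstable one-pass form of the sample standard deviation (cf. Eq. (6) of Section 5.1);
    a marginally negative computed value is replaced by zero, as some packages do internally."""
    m = len(x)
    xbar = sum(x) / m
    d = sum(xi ** 2 for xi in x) - m * xbar ** 2
    return math.sqrt(max(d, 0.0) / (m - 1))

s = 0.1
for mu in (1.0e3, 1.0e6, 1.0e7):
    print(mu, std_one_pass(two_value_data_set(mu, s)))   # degrades from ~0.1 towards zero as mu grows
```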
4.3. Graded data sets and performance profiles
Software used in scientific disciplines often has the
property that it works well for certain data sets but
produces unacceptable results for other sets. There is
therefore a danger that limited testing might not reveal
its deficiencies. An example is polynomial regression
software, where the results of fitting a polynomial of
low-degree can be perfectly acceptable, whereas those
for higher degrees would be contaminated by errors
introduced by the solution algorithm. In Section 5.2
the simplest polynomial (apart from the constant), the
straight line, is again considered.
The testing of software on a small number of data
sets can provide valuable information about the software, particularly if the sets are selected to be as
disparate as possible in terms of their locations in the
space of possible data inputs. Such data sets provide
‘‘spot checks’’ of the test software, and are useful in
the sense that the software can be regarded as deficient
if it fails to perform adequately on any one of them.
They are also useful for the software developer to test,
e.g., particular paths through the software. They are
less effective for black-box testing where without
additional information little can be assumed about
the algorithms that have been implemented. Data sets
of this form have been used for a number of years in
testing software used in industrial inspection for fitting
mathematical features such as spheres, cylinders and
cones to data acquired using co-ordinate measuring
machines [12] (a practice that will become more formally established when an ISO standard is published [8]), and in chemometrics [2,13].
Graded data sets can be used to help quantify the
behaviour of scientific software. A sequence of data
sets is prepared, where each data set in the sequence
represents a problem that in some sense is more
difficult than the previous member in the sequence.
Such a suite of graded data sets is used to explore one
dimension of the problem addressed by the software.
The resulting performance profile, the graph of the
values of P against the degree of difficulty K (or some
related performance parameter), can provide valuable
information about the software. If, for graded reference data sets (taken in order), the values of the
resulting performance measure P(b) have a tendency
to increase, it is likely that an unstable method has
been employed. (Performance profiles in a somewhat
different setting have also been studied by Lyness
[14].) The performance profile can be expressed as a
series of points (or the set of straight-line segments
joining them). By introducing variability into each
data set for each value of the performance parameter,
13Constrained nonlinear optimisation is the problem of minimising a mathematical function of a number of variables subject to a set of mathematical equalities and inequalities involving these variables [5].
e.g., by using random numbers in a controlled way to
simulate observational error, the performance profile
will become a swathe of ‘‘replicated’’ points, which
can be advantageous in the fair comparison of rival
algorithms. Examples of performance profiles are
given in the following section.
5. Examples of application
We give three examples of application, the first two
of which arise in virtually all fields of science, and the
third in areas such as optical spectrometry and analytical chemistry. Although these examples are elementary, they are sufficient to illustrate the key aspects of
software testing using the presented methodology.
5.1. The sample arithmetic mean and standard
deviation
The arithmetic mean of $m$ numbers $x_i$, $i = 1, \ldots, m$, is defined by
$$ \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i. \qquad (4) $$
Apart from exceptional cases, most software for
computing $\bar{x}$ from given $x_i$ will produce a result that
has a relative accuracy approaching the computational
precision and hence be acceptable in all application
areas. The quality of software for computing the
standard deviation is in comparison very varied across
implementations. In fact, for some implementations,
the software degrades rapidly in its performance for a
sequence of graded data sets. The sample standard
deviation $s$ is given (stably) by
$$ s = \left\{ \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \right\}^{1/2}, \qquad (5) $$
where $\bar{x}$ is the arithmetic mean defined in (4). Some software uses the mathematically equivalent (but unstable) form
$$ s = \left\{ \frac{1}{m-1} \left( \sum_{i=1}^{m} x_i^2 - m\bar{x}^2 \right) \right\}^{1/2}. \qquad (6) $$
It has the property that it suffers from subtractive
cancellation for data sets with small coefficient of variation $s/\bar{x}$.
For instance, if the values of $x_i$ are [0.98, 0.99, 1.00, 1.01, 1.02], then $\bar{x} = 1.00$, $\sum_i x_i^2 = 5.0010$ and $m\bar{x}^2 = 5.0000$, and hence three to four significant figures are lost due to subtractive cancellation in forming $\sum_i x_i^2 - m\bar{x}^2 = 5.0010 - 5.0000 = 0.0010$. In some cases, e.g., a series of accurate replicate weighings of a standard kilogram to within 1 mg, $\sum_i x_i^2$ will be equal to $m\bar{x}^2$ to all figures held on the computer. It is even possible that the computed value of $\sum_i x_i^2$ will be (marginally) less than that of $m\bar{x}^2$ owing to the rounding errors incurred in forming these quantities.
In the former case, $s$ is computed as zero (even though the $x_i$ are not all identical). In the latter case, some software packages internally replace the computed value of $\sum_i x_i^2 - m\bar{x}^2$ by zero in order to avoid a failure due to an attempt to take the square root of a negative number. Such cases are not pathological. In a comparison [13] of various software packages and electronic calculators, not one of the eight tested on weighing data produced a value of $s$ correct to two significant figures, although a perfectly acceptable value would have been obtained by the use of formula (5).
Consider a basic data set defined by $X_0 = [-nh, -(n-1)h, \ldots, nh]$. The arithmetic mean and standard deviation of this data set are $0$ and $s = h\{(n + 1/2)(n + 1)/3\}^{1/2}$, respectively. A family of graded data sets can be derived from this set by adding an increasing sequence of values: $X_k = X_0 + \mu q^k$, $k = 1, \ldots, N$, for some $q > 1$. The mean and standard deviation of $X_k$ are $\mu q^k$ and $s$, respectively. Graded data sets were so constructed using $\mu = 3.172$, $h = 0.1$, $q = 1.5$, $n = 12$ and $N = 60$. A value of the degree of difficulty $K$ was calculated in each case as the inverse coefficient of variation $\bar{x}/s$ [2]. Software implementations of the above two formulae were applied to these data sets and a performance profile determined (see Fig. 1).
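The construction just described, and the determination of the corresponding performance profile, can be sketched as follows (our code; the stable and unstable formulae are those of Eqs. (5) and (6), and the performance measure follows Eq. (3) with the known standard deviation $s$ as the reference result).

```python
import numpy as np

ETA = np.finfo(float).eps                       # computational precision of IEEE double arithmetic

def std_stable(x):                              # Eq. (5): two-pass form
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (x.size - 1))

def std_unstable(x):                            # Eq. (6): one-pass form
    x = np.asarray(x, dtype=float)
    d = np.sum(x ** 2) - x.size * x.mean() ** 2
    return np.sqrt(max(d, 0.0) / (x.size - 1))

def performance(s_test, s_ref, K):              # Eq. (3) applied to a single (scalar) result
    return np.log10(1.0 + abs(s_test - s_ref) / (K * ETA))

mu, h, q, n, N = 3.172, 0.1, 1.5, 12, 60        # values quoted in the text
X0 = h * np.arange(-n, n + 1)                   # basic data set, mean 0
s_ref = h * np.sqrt((n + 0.5) * (n + 1) / 3.0)  # known standard deviation of every X_k

Ks, P_stable, P_unstable = [], [], []
for k in range(1, N + 1):
    Xk = X0 + mu * q ** k                       # graded data set: mean mu*q^k, standard deviation s_ref
    K = mu * q ** k / s_ref                     # degree of difficulty: inverse coefficient of variation
    Ks.append(K)
    P_stable.append(performance(std_stable(Xk), s_ref, K))
    P_unstable.append(performance(std_unstable(Xk), s_ref, K))

# Plotting P against K (log scale) for the two formulae should show the character of Fig. 1:
# an essentially horizontal line for Eq. (5) and a generally rising line for Eq. (6).
```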
It is observed in Fig. 1 that the stable algorithm
(points joined by straight-line segments to give an
essentially horizontal line) performs completely satisfactorily for problem degrees of difficulty K spanning
10 decades. The unstable algorithm (points joined by
straight-line segments to give a generally increasing
line) loses a number of decimal figures over and above
that which would be lost by a stable algorithm in a
manner essentially proportional to $\log K$. (This result is predicted by a detailed floating-point error analysis of formula (6).) A performance profile for standard-deviation software similar in behaviour to the generally rising line in Fig. 1 could be taken as an indication that the unstable formula (6) (or a comparable formula) had been used. It is to be noted that, for
almost 20% of the 60 cases, the results produced
by the unstable algorithm are as accurate as those
for the stable algorithm. This demonstration of the
fact that an unstable algorithm is (almost randomly)
capable of providing good results is a reason why
minimal checking of a piece of software is unsatisfactory.
There are counterparts of this result for many other
mathematical and statistical calculations.
5.2. Straight-line regression
Polynomial-regression software can be deficient in
a number of respects. Prime causes [2] are:
1. the parametrisation of the polynomial (i.e., the
choice of basis functions (monomials, Chebyshev
polynomials,14 etc.) for representing the polynomial);
2. the scaling or normalisation of the problem data;
and
3. the algorithm used to solve the linear algebraic
equations defining the polynomial coefficients.
Failure to scale or normalise the data sensibly or a
poor choice of parametrisation can cause the resulting
equations defining the parameters of the polynomial to
be unnecessarily ill-conditioned. A poor choice of
algorithm to solve this system can also induce additional ill-conditioning (which would have been
avoided by using a stable algorithm). Since the straight
Fig. 1. Performance profile for sample standard deviation against inverse coefficient of variation calculated using a stable algorithm (generally
horizontal line) and an unstable algorithm (generally increasing line).
14The monomial representation of a polynomial $p(x)$ is $a_0 + a_1 x + a_2 x^2 + \cdots$. Chebyshev polynomials provide another way of expressing $p(x)$, viz., $a_0 T_0(x) + a_1 T_1(x) + a_2 T_2(x) + \cdots$, for different coefficients. Here, $T_j(x)$ is a polynomial of degree $j$ in $x$, given by $T_0(x) = 1$, $T_1(x) = x$, $T_j(x) = 2x T_{j-1}(x) - T_{j-2}(x)$, $j > 1$ [15].
line is a special case of a polynomial, the above causes
can also apply, although the first cause does not
because in this case the Chebyshev polynomials and
the monomials are identical.
For a chosen value of $m$, a basic data set $\{(x_i, y_i)\}_{i=1}^{m}$ was devised by generating values $y_i$ lying exactly on a straight line at $m$ equispaced points $x_i$ in $[-1, 1]$. Null-space methods were then used to perturb these values to simulate random noise. The actual test detailed below applied to $m = 41$ but, to illustrate the null-space approach, the case $m = 3$ is considered first.
The $x$-values in the case $m = 3$ are $-1$, $0$ and $1$. The corresponding Jacobian matrix (see Section 4.1) is
$$ J = \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, $$
and, as can easily be verified by forming $J^{\mathrm{T}}N$, the corresponding null-space matrix is
$$ N = \frac{1}{\sqrt{6}} \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}. $$
Now, for any specific given straight line, generate the three $y$-values corresponding to these values of $x$. If the line is
$$ y = 5 + 2x, \qquad (7) $$
the $y$-values would be 3, 5 and 7. From the theory, any vector that can be written as a linear combination of the columns of $N$ can be added to the vector of $y$-values, and the least-squares straight line for these perturbed values would be identical to the original line (7). In this simple case the null matrix is a vector, so any multiple, $\epsilon\sqrt{6}$ say, of the null-space vector can be added to the $y$ vector to give
$$ y_{\mathrm{new}} = \begin{bmatrix} 3 + \epsilon \\ 5 - 2\epsilon \\ 7 + \epsilon \end{bmatrix}. $$
A small value of $\epsilon$ can be used to simulate a small-residual problem and a large value a large-residual problem. The significance of this statement is that some software can perform less well for large-residual problems, and graded data sets spanning a wide range of $\epsilon$ can be helpful in identifying the point at which, for practical purposes, accuracy starts to be lost. It is to be noted that for nonlinear least-squares problems, for which iterative methods of solution, such as Gauss–Newton and its variants [5], are employed, the convergence rates of these methods are affected by the size of the residuals. Thus the performance (in terms of execution time, accuracy of solution, ability to converge to the required solution, etc.) of such algorithms with respect to this ‘‘size’’ can be examined.
For straight-line regression, a more important effect is the data normalisation. In this regard, the sensitivity of test software to the choice of origin for the $x$-values can be investigated. If a constant is added to the $x$-values, the solution mathematically will be trivially
changed (Section 2.3): the gradient will be unaffected,
but the intercept term, if the parametrisation (1) is
used, will change. The change in the intercept value is
proportional to the constant translation of the x-values.
The residuals are of course unaffected.
Two items of straight-line regression software were applied to a data set containing $m = 41$ points, and the performance measure $P$ based on (3), applied to the residuals of the fitted function, was determined in each case. The resulting performance profiles ($P$ against degree of difficulty $K$, equal to the distance of the data set from the origin) are shown in Fig. 2. The generally lower line was obtained by joining the points representing the results for software based on a stable algorithm, A, which uses a normalised variable, the resulting overdetermined linear system being solved using QR decomposition ([6], p. 239) (a stable algorithm for linear least-squares problems). The generally higher line corresponds to values obtained from the use of a less stable algorithm, B, which uses the untransformed variable and forms and solves the normal equations.
Note that Algorithm A performed near-optimally, whereas the software based on Algorithm B steadily lost accuracy as $K$ increased.
5.3. Principal components analysis
In building a good calibration model in analytical chemistry to predict concentration levels of the analyte of interest, a key step is to assemble a representative range of samples [16]. Principal components
analysis (PCA) is a tool that assists in determining
this model. It is valuable because of its ability to
reduce the number of variables in a model to the
minimum number that provide the required information. Mathematically, the PCA problem is as follows.
Given an $n \times p$ observation matrix $X$, determine a
number, l