Design and use of reference data sets for
testing scientific software
M.G. Cox*, P.M. Harris
Centre for Information Systems Engineering, National Physical Laboratory, Teddington, Middlesex TW11 0LW, UK
Received 1 July 1997; accepted 8 July 1998
Abstract
A general methodology for evaluating the accuracy of the results produced by scientific software has been developed at the
National Physical Laboratory. The basis of the approach is the design and use of reference data sets and corresponding
reference results to undertake black-box testing.
The approach enables reference data sets and results to be generated in a manner consistent with the functional specification
of the problem addressed by the software. The results returned by the software for the reference data are compared objectively
with the reference results. Quality metrics are used for this purpose that account for the key aspects of the problem.
In this paper it is shown how reference data sets can be designed for testing software implementations of solutions to a
broad class of problems arising throughout science. It is shown how these data sets can be used in practice and how the results
provided by software under test can properly be compared with reference results. The approach is illustrated with three
examples: (i) mean and standard deviation, (ii) straight-line fitting, and (iii) principal components analysis. Software for such
problems is used routinely in many fields, including optical spectrometry. © 1999 The National Physical Laboratory. Published by Elsevier Science B.V. All rights reserved.
Keywords: Reference data sets; Software testing; Black-box testing; Quality metrics
1. Introduction
There is a growing need to ensure that software used
by scientists is fit for purpose and especially that the
results it produces are correct, to within a prescribed
accuracy, for the problems purportedly solved. To this
end the National Physical Laboratory (NPL) has
developed a general methodology which is applicable
to a range of scientific software. The basis of the
approach is the design and use of reference data sets
and corresponding reference results to undertake
black-box testing.
The approach allows for reference data sets and
results to be generated in a manner consistent with the
functional specification of the problem addressed by
the software. In addition, data sets corresponding to
problems with various ‘‘degrees of difficulty’’, or with
application-specific properties, may be produced. The
comparison of the results returned by the test software
with the reference results is made objective by the use
of quality metrics that account for the absolute differences between the test and reference results, the
computational precision of the underlying hardware,
and the ‘‘degree of difficulty’’ of the data set.
The methodology has been applied in the NPL’s
own software development, and to test specific parts of
proprietary software packages, including spreadsheets, in common use.
In this paper it is shown how reference data sets can
be designed for testing software implementations of
solutions to a broad class of problems arising throughout science. It is shown how these data sets can be used
in practice and how the results provided by software
under test can properly be compared with reference
results. The approach is illustrated with three examples: (i) mean and standard deviation, (ii) straight-line
regression, and (iii) principal components analysis.
Software for such problems is used routinely in many
fields, including optical spectrometry.
1.1. Why software fails or produces unreliable
results
Software used in scientific disciplines can be unreliable because it implements numerical algorithms
that are unstable or not robust. Some of the reasons
[1] for such failings are:
1. failure to scale, translate, normalise or otherwise
transform the input data appropriately before
solution (and to perform the inverse operation if
necessary following solution);1
2. the use of an unstable parametrisation2 of the
problem;
3. the use of a solution process that exacerbates the
inherent (natural) ill-conditioning of the problem;3
and
4. a poor choice of formula from a set of mathematically (but not numerically) equivalent forms (a brief illustration follows).
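As a small illustration of point 4, anticipating the standard-deviation example of Section 5.1, the following Python sketch (ours; the data values are illustrative only) contrasts a stable two-pass formula for the sample standard deviation with a mathematically equivalent one-pass form that suffers from subtractive cancellation.

```python
import math

def std_two_pass(x):
    """Stable form: subtract the mean before squaring (cf. Eq. (5) in Section 5.1)."""
    m = len(x)
    xbar = sum(x) / m
    return math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (m - 1))

def std_one_pass(x):
    """Mathematically equivalent but unstable form (cf. Eq. (6)):
    sum of squares minus m times the squared mean."""
    m = len(x)
    xbar = sum(x) / m
    return math.sqrt((sum(xi ** 2 for xi in x) - m * xbar ** 2) / (m - 1))

# Illustrative data: values near 1e6 with spread 0.1 (true sample standard deviation 0.1)
x = [1.0e6 - 0.1, 1.0e6, 1.0e6 + 0.1]
print(std_two_pass(x))   # ~0.1 to near full working precision
print(std_one_pass(x))   # only a few correct figures, owing to subtractive cancellation
```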
1.2. Requirements for testing: unambiguous
specification
Because detailed documentation on the algorithms
employed within software implementations is often
unavailable and interaction with the software is limited to ‘‘data in, results out’’, it is necessary to use
some form of black-box testing to discern whether any
particular item of software is fit for purpose. Such
testing can be carried out if there is a specification
available of the task carried out by the software. The
availability of such a specification is assumed here.
The specification may be simple, for example: ‘‘this software calculates the arithmetic mean and standard deviation of a prescribed set of numerical values’’. Even
in this simple case, the specification is required to be
unambiguous. Hence, it is stated that the arithmetic
mean is of concern (rather than, say, the geometric or
the harmonic mean). In a more complicated case, such
as regression, it would be necessary to state such
aspects as whether (i) the regression is carried out
in a least-squares or in some other sense and (ii) the
residuals (the departures of the specified data points
from the model) are measured ‘‘vertically’’ (i.e., in the
‘‘direction’’ of the dependent variable), perpendicular
to the response surface, or in some other way.
Moreover, it is also necessary to define the ‘‘set of
numerical values’’, e.g., whether they form a column
vector or a row vector (possibly a contiguous set of
cells in a row or column in a spreadsheet model) or are
defined by some other data structure.
If the specification is inconsistent with the task
carried out by the software, testing in accordance with
the specification would yield the conclusion that the
software was deficient, when in fact it might be
acceptable.
1.3. Black-box testing and reference data sets
Although scientific software is very widely used, it
is rarely tested in an objective and impartial manner.
Some approaches for such testing have been developed, however. These address the individual software
modules called directly by scientists or as part of
software packages used in science [2,3,4]. Even when
1A simple instance of data transformation is ‘‘coding’’, as used traditionally in many statistical studies. A data value $x$ is replaced by a value $x' = Ax + B$ which has been scaled (by $A$) and translated (by $B$) such that all such $x$-values lie in a normalised range such as $-1$ to $1$.
2Problem parametrisation is considered in Section 2.3.
3The condition of a problem is a measure of the sensitivity of the
solution to changes in the ‘‘input data’’. A well-conditioned
problem has the property that a small change to the input data
results in a correspondingly small change in the solution (when
solved using an optimally stable algorithm; see Section 3.1),
whereas for an ill-conditioned problem the results would change
appreciably. A poor algorithm can introduce the effects of ill-conditioning, even if not present in the problem. Moreover, it can
worsen any ill-conditioning effects already present.
the state-of-the-art has developed to the extent that
integrated systems can be tested properly, it will
remain necessary to test individual modules, because
of their importance in themselves, and their considerable use by software packages. Integrated-systems
testing concentrates on the interfaces between modules and such careful testing in conjunction with
individual module tests constitutes a key part of an
overall test.
The concern here is with reference data sets for the
black-box testing of scientific software. Such testing involves submitting reference data sets to the software under test (the test software); the resulting output (the test results) is compared with known reference results. For the application of this approach, it is necessary to be able
to generate reference pairs, i.e., reference data sets and
the corresponding reference results.
Section 2 of this paper addresses some of the main
considerations in designing reference data sets and
Section 3 the use of these data sets. Section 4 is
concerned with the production of reference data sets
in the important case of least-squares regression.
Section 5 presents examples of application. These
are (i) mean and standard deviation (the problem of
determining which can be posed as fitting a constant
function to data), (ii) straight-line regression (the
‘‘next-hardest’’ problem following fitting a constant),
and (iii) principal components analysis. Section 6
contains concluding remarks.
Some technical terms are used in this paper which
may not be familiar to the intended audience. Informal
definitions of these terms are included as footnotes.
2. Design of reference data sets
Some of the main considerations in designing
reference data sets for checking the fitness for the
purpose of software are:
1. the identification of possible deficiencies in the
test software, such as severe rounding-error
effects, subtractive cancellation4 and overflow5 in
the computation;
2. the incorporation of known properties in the data
sets, such as the ‘‘degree of difficulty’’6 of the
associated problem; and
3. the need to mimic actual measurement data sets,
i.e., the construction of data sets for which the
reference results are known and which are ‘‘close
to’’ the data sets used in a particular application
area.
In principle, the approach outlined here is capable
of application to a very general class of problems, viz.,
the determination of a minimum of a general function
of any number of variables. Here, however, the more
modest class of problems of least-squares regression7
is considered. This class contains the problem of
computing the arithmetic mean and standard deviation
of a data set.
2.1. Fitness for purpose
Software can, at least in part, be checked for fitness for purpose by ensuring that it performs properly for reference data sets that are representative of the application. It can further be so checked by including data sets designed to reveal the extent to which the software yields sufficiently accurate solutions. For
both these purposes it is highly desirable that the data
sets possess certain properties. One valuable property
is that the problems identified by the data sets possess
known solutions.
Regression software can be deficient in diverse
ways, such as an inability to deal accurately with
observational data containing a significant noise component, large numbers of observations, or certain
distributions of points. Suitable tests can be devised
to help reveal such deficiencies.
2.2. Design based on prior knowledge of solution
The approach adopted at the NPL to the design of
reference data sets is based on the characterisation of
the solution to the problem addressed by the software.
For example, a necessary condition [5] for a solution
4An example of subtractive cancellation is given in Section 5.1.
5Overflow is the consequence of performing an arithmetic
operation such that the result is too large (in magnitude) to be held
within the arithmetic system employed. Similarly, underflow can
arise; the result is too small to be represented and is replaced by
zero or an approximation.
6The degree of difficulty of a problem is closely related to
condition (Section 1.1) and also incorporates the effect of problem
scaling (Section 3.1).
7Least-squares regression is the fitting of a model to data using
the method of least squares.
of a nonlinear least-squares problem is that $J^{\mathrm{T}}e = 0$. Here $e$ is the vector of residuals and $J$ is the Jacobian matrix, both evaluated at the solution. ($J$ is the rectangular matrix of partial derivatives of the algebraic expressions for the residuals with respect to the problem parameters, evaluated for each data point.) Note that $J^{\mathrm{T}}J$ is the normal matrix, i.e., the matrix occurring in the normal-equations approach ([6], p. 237) to regression.
Reference data sets are constructed, using a data
generator, to have known solutions, i.e., solutions
specified a priori. The use of data generators is illustrated in Section 4.1.
There is an alternative approach to employing a data
generator, viz., to use reference software. Such software is software written to an extremely high standard
to solve the problem given in the functional specification. It is akin to a primary standard in measurement,
such as a kilogram mass to which secondary standards
are compared for calibration purposes. It has been
stated [3], however, that reference software is very
demanding to provide. At the NPL we have found that
the effort required to produce a data generator can be a
small fraction of that to produce reference software.
The reason is that the production of a data generator
tends to be a much more compact operation, with
fewer numerical pitfalls, and hence its testing overhead is significantly less than that of reference software for the forward calculation (i.e., of determining
reference results from reference data sets).
There is a further significant advantage of the use of
data generators over that of reference software. The
latter can of course be used to furnish the solution
corresponding to any data set within its domain of
applicability.8 However, in the case of the former,
required specific properties can be built in, e.g., in
polynomial regression a data set for which a particular
high-degree polynomial would be required.
Finally, real measurement data sets often have
certain properties that in total can be quite difficult
to characterise. However, data sets can be devised
which are ‘‘close’’ to such data sets and which correspond to known solutions [7].
2.3. Problem parametrisation
Problem parametrisation can readily be demonstrated using a straight line. A straight line can be expressed in a variety of ways, including
$$ y = a_1 + a_2 x, \qquad (1) $$
where $a_1$ is the $y$-axis intercept and $a_2$ the gradient, and
$$ y = a_1 + a_2 (x - c), \qquad (2) $$
where $a_2$ is the gradient as before, but $a_1$ now represents the intercept with the line $x = c$, the value of $c$ being at choice, rather than with $x = 0$, the $y$-axis.
Parametrisation (2) is clearly a generalisation of (1). This generality can be used to advantage in designing an algorithm for straight-line fitting, by selecting a value for $c$ that yields desirable properties. The choice of $c$ as the arithmetic mean of the data abscissa values $x_i$ has several such properties ([2], p. 120).
There is an alternative to re-parametrisation in this case, viz., data shifting. If the data abscissa values $x_i$ are replaced before fitting by $x_i - c$, the form (1) then obtains.
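The numerical benefit of such a shift can be seen directly in the conditioning of the fitting problem. The following sketch (ours; the abscissa values are arbitrary, chosen far from the origin) compares the condition number of the normal matrix $J^{\mathrm{T}}J$ for parametrisation (1) with raw abscissae and for the equivalent form obtained by shifting the data by $c$ equal to the mean of the $x_i$.

```python
import numpy as np

# Abscissae far from the origin; the values are illustrative only
x = np.linspace(1.0e4, 1.0e4 + 1.0, 21)

def design_matrix(x):
    """Design (Jacobian) matrix for the straight-line parametrisation y = a1 + a2*x."""
    return np.column_stack([np.ones_like(x), x])

J_raw = design_matrix(x)                  # parametrisation (1), raw abscissae
J_shifted = design_matrix(x - x.mean())   # equivalent to parametrisation (2) with c = mean(x_i)

# Condition numbers of the normal matrices; the shifted form is far better conditioned
print(np.linalg.cond(J_raw.T @ J_raw))
print(np.linalg.cond(J_shifted.T @ J_shifted))
```

With these values the first condition number exceeds the second by many orders of magnitude, which is the kind of unnecessary ill-conditioning that a sensible parametrisation or data transformation avoids.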
The type of testing considered here can implicitly
identify whether it is likely that a sensible parametrisation has or has not been used (or equivalently
whether a suitable data transformation has been
adopted).
Problem parametrisation can raise a difficulty for
the software tester. If there is no widely recognised or
‘‘canonical’’ parametrisation, either the values of the
parameters corresponding to the actual parametrisation used by the software must be compared with
reference values based on the same parametrisation,
or some subsidiary or derived quantities must be
assessed. The former means that the software tester
must be in a position to prepare reference solutions for
possibly a wide range of different parametrisations,
thus increasing the cost of the test. This is the solution
adopted by the International Organization for Standardization (ISO) [8] in its work on preparing a standard for
testing software implementations of a class of nonlinear regression algorithms used in industrial inspection. The latter implies that some invariant, e.g.,
canonical, form must be determined and moreover,
careful analysis carried out to ensure that the test on
these quantities implies an acceptable result for the
primary parameters. However, within the area of
8The domain of applicability of an item of software is the set of
inputs that the software has been designed to accept and for which
it should produce reliable results.
regression, the residuals of a correctly computed
solution are invariant to the parametrisation and are
often suitable for testing purposes. Section 4.1 considers the use of residuals in testing straight-line
regression software.
3. Use of reference data sets
This section treats the factors influencing the comparison of reference results and test results, and indicates the quality metrics used for this purpose. For
further details, see [2].
Major factors affecting the comparison of test and
reference results are the computational precision $\eta$ of the arithmetic used to produce the test results,9 the numerical precision of the reference data sets, the degree of difficulty $K$ (including the scale) of the problem, and the problem parametrisation. The effects of $\eta$ and $K$ are considered here, assuming that problem-scale effects have been incorporated into $K$ and
that the reference results are known to adequate
accuracy for comparison purposes. Further information concerning these factors and their influence and
other effects such as input–output is available [2,9].
3.1. Quality metrics
After the test software has been applied to each
reference data set, (at least) three quality metrics
which indicate how the test software has performed
on that data set can be determined [2]:
1. the difference between the test and reference
results, an absolute measure of departure;
2. the number of figures of agreement, a relative
measure of departure; and
3. a performance measure depending on the major
factors above.
The use of the third metric is considered here. It is
indicated below how it is determined and used in
practice.
Express the reference results and test results as vectors of floating-point numbers. Let $d = b(\mathrm{test}) - b(\mathrm{ref})$, where $b(\mathrm{test})$ denotes the test results and $b(\mathrm{ref})$ the reference results. $d(b) = \mathrm{RMS}(d)$ is used here to measure this difference, where $\mathrm{RMS}(x) = \|x\|_2/\sqrt{n}$ denotes the root-mean-square value of an $n$-vector $x$.
For each data set the problem degree of difficulty $K$ is determined such that $d(b)$ can sensibly be compared with a tolerance equal to $K\eta$. (The determination of $K$ would normally be carried out by a numerical analyst.) Thus, $K$ depends on the problem condition and scale.10 The quantity $d(b)/(K\eta)$ will then have the property that it is of order unity for software that performs near-optimally and is possibly very large for poorly performing software.
The performance measure $P(b)$ is defined by
$$ P(b) = \log_{10}\left(1 + \frac{d(b)}{K\eta}\right). \qquad (3) $$
It indicates the number of figures of accuracy lost by the test software over and above what software based on an optimally stable algorithm11 would produce, being near zero if $b(\mathrm{test})$ and $b(\mathrm{ref})$ agree as closely as could be expected, and having a value of about $k$ if the agreement is $k$ figures less than this. A related performance measure is used in testing software for evaluating special functions [4].
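A direct transcription of these quantities into code is straightforward. The sketch below is ours (the function names and the use of NumPy are assumptions); it computes $d(b)$ and the performance measure $P(b)$ of Eq. (3) for given test results, reference results and degree of difficulty $K$, with the IEEE double-precision value of $\eta$ taken from footnote 9.

```python
import numpy as np

ETA = np.finfo(float).eps   # computational precision of IEEE double-precision arithmetic (~2e-16)

def rms(v):
    """Root-mean-square value of an n-vector: ||v||_2 / sqrt(n)."""
    v = np.asarray(v, dtype=float)
    return np.linalg.norm(v) / np.sqrt(v.size)

def performance_measure(b_test, b_ref, K, eta=ETA):
    """Performance measure P(b) = log10(1 + d(b)/(K*eta)) of Eq. (3),
    where d(b) = RMS(b_test - b_ref) and K is the degree of difficulty."""
    d = rms(np.asarray(b_test, dtype=float) - np.asarray(b_ref, dtype=float))
    return np.log10(1.0 + d / (K * eta))
```

Plotting $P$ against $K$ for a sequence of graded data sets gives a performance profile of the kind discussed in Section 4.3.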
4. Production of reference data sets
This section is concerned with the use of data
generators to produce reference data sets in regression
and related areas, leading to the null-space approach
to data generation (based on solution characterisation). It also covers graded data sets and performance
profiles. The approach is applicable to all aspects of
least-squares model fitting with physical or empirical
models, data with constant or non-constant variance,
9For the very commonly used floating-point arithmetic, $\eta$ is the smallest positive representable number $u$ such that the value of $1 + u$, computed using the arithmetic, exceeds unity. For the many floating-point processors which today employ IEEE arithmetic, $\eta = 2^{-52} \approx 2 \times 10^{-16}$, corresponding to approximately 16-digit working.
10The scale of a problem is taken into account through the
dimensions, physical or otherwise, of the results. It is essential that
the expression of the same results in different units, such as
kilograms or micrograms, does not adversely influence the
perception of the behaviour of the software.
11An optimally stable algorithm would produce as much
accuracy as is possible in solving the problem defined by the
specification and the data set, using the available computational
precision.
and data with or without correlated errors. Here, we
only consider the simplest case.
4.1. Reference software and data generators
A reference pair, i.e., a reference data set and the
corresponding reference results, may be produced in
two ways (Section 2.2):
1. Start with a reference data set and apply reference
software (as, e.g., in [10]) to it to produce the
corresponding reference results.
2. Start with some reference results and apply a data
generator to them to produce a corresponding
reference data set.
For problems with a unique solution there is one set
of reference results corresponding to a given reference
data set. Conversely, for given reference results there
is in general an infinite number of corresponding
reference data sets. This latter property can be used
to considerable advantage in generating reference data
sets, both in terms of forming data sets having selected
degrees of difficulty, and in determining data sets
which mimic actual data sets from applications. As
a simple example, there is a unique sample standard
deviation s for a given sample of two or more numbers,
whereas given s there are infinitely many data sets
having this value as their standard deviation. We return
to this example in Sections 4.2 and 5.1.
Null-space methods [7,9] provide a facility for
generating families or classes of data sets. When
any of these sets is submitted to correctly working
software, nominally the same solution is produced in
each case. Moreover, this solution is identical to a
known solution. A further advantage of null-space
methods is that they can be used to characterise the
space of reference data sets having the given solution.
In particular, a sequence of data sets from this
space can be extracted having a range of degrees of
difficulty that proves to be invaluable for software
testing.
The null-space approach is now illustrated for the linear least-squares model $y = Ja + e$, where $J$ denotes the observation (Jacobian) matrix, $y$ the observation vector, $a$ the vector of model parameters and $e$ the vector of residuals. For example, if the data to be fitted is $\{(x_i, y_i)\}$, $i = 1, \ldots, m$, and the model is the straight line with parametrisation (1), then
$$ J = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix}, \quad a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}. $$
Now the least-squares solution is characterised by $J^{\mathrm{T}}e = 0$ [6]. This equation implies two conditions, one that
$$ \sum_{i=1}^{m} e_i = 0, $$
i.e., the sum of the residuals is zero, and the other that
$$ \sum_{i=1}^{m} x_i e_i = 0, $$
i.e., the first moment (with respect to the abscissa values) of the residuals is zero. Let $N$ be a basis12 for the null space of $J^{\mathrm{T}}$, i.e., $J^{\mathrm{T}}N = 0$. Then, for a vector $r = Nu$, for any choice of vector $u$, the replacement of $y$ by $y + r$ will leave the solution vector $a$ unchanged. This result can be seen as follows. The condition $J^{\mathrm{T}}e = 0$ and the model $y = Ja + e$ imply the normal equations $J^{\mathrm{T}}Ja = J^{\mathrm{T}}y$. But, from the definition of the null space, $J^{\mathrm{T}}r = J^{\mathrm{T}}Nu = (J^{\mathrm{T}}N)u = 0$. Thus, from any one data set any number of further data sets having the same solution can readily be constructed by choosing vectors $u$.
The problem of reference software has hence essentially been ‘‘replaced’’ by that of a reference implementation of ‘‘null’’, which is at the heart of data generation for all regression problems. The null space of a general matrix $J$ can be calculated very reliably using the singular-value decomposition ([6], p. 602). One such implementation is the null function in Matlab [11]. Additionally, it is possible to check the accuracy of the generated data sets by forming $J^{\mathrm{T}}e$ and assessing the closeness of $\|J^{\mathrm{T}}e\|_2$ to zero, relative to the value of $\|J^{\mathrm{T}}\|_2\,\|e\|_2$. This check is generic and does not require the availability of software to solve the particular problem considered.
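As a concrete sketch of this construction (ours; the paper cites the null function of Matlab, and NumPy is used here instead), the following code computes an orthonormal basis $N$ for the null space of $J^{\mathrm{T}}$ from the singular-value decomposition of $J$, builds a reference data set with a solution prescribed a priori, and applies the generic $\|J^{\mathrm{T}}e\|_2$ check.

```python
import numpy as np

def null_space_of_transpose(J):
    """Orthonormal basis N for the null space of J^T (so that J^T N = 0),
    taken from the left singular vectors of J beyond its column count."""
    U, _, _ = np.linalg.svd(J)
    return U[:, J.shape[1]:]

# Straight-line model y = a1 + a2*x at five illustrative abscissae
x = np.linspace(0.0, 1.0, 5)
J = np.column_stack([np.ones_like(x), x])   # observation (Jacobian) matrix
a_ref = np.array([1.0, 0.5])                # reference solution, fixed a priori

N = null_space_of_transpose(J)              # here 5 - 2 = 3 basis vectors
u = np.array([0.02, -0.01, 0.03])           # arbitrary choice of the vector u
e = N @ u                                   # residual vector lying in the null space of J^T
y = J @ a_ref + e                           # reference data set with known solution a_ref

# Correctly working least-squares software should recover a_ref from (x, y)
a_fit, *_ = np.linalg.lstsq(J, y, rcond=None)
print(a_fit)                                # ~[1.0, 0.5]

# Generic accuracy check on the generated data set:
# ||J^T e||_2 should be tiny relative to ||J^T||_2 * ||e||_2
print(np.linalg.norm(J.T @ e) / (np.linalg.norm(J.T, 2) * np.linalg.norm(e)))
```

Different choices of $u$, including random ones, give different data sets with nominally the same solution, which is the basis of the graded and replicated data sets discussed below.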
12This basis can be expressed as an independent set of vectors
which together form N. This set is in general not unique. It is
usually determined to be orthogonal, i.e., the inner product of any
two (different) vectors is zero. A vector in the null space can be
represented as a linear combination of these ‘‘null’’ vectors.
4.2. Arithmetic mean and standard deviation
The arithmetic mean and standard deviation of a set
of numbers are considered as an illustrative example in
Section 5.1, but here we indicate how the above
concepts may be applied in this simple case. This
example is closely related to the above straight-line
model, since determining the mean of a set of numbers
is equivalent to fitting them by a constant, and their
standard deviation is identical to the root-mean-square
residual of the fit. Any two values having $\bar{x}$ as their mean are given by $\bar{x} \pm \delta$ for any $\delta$. (This is the only null-space perturbation possible in this case.) Further, any two values given by $\mu \pm s/\sqrt{2}$ have $s$ as their (sample) standard deviation for any value $\mu$. Thus, by varying $\mu$, e.g., different data sets of size two can be generated, where in each case the solution standard deviation is $s$.
These data sets contain only two values, and in each case there are just two solution parameters, $\bar{x}$ and $s$.
Null-space methods [7] permit the above concept to be
extended to any number of values. Moreover, as
intimated earlier, they permit not just simple problems
such as that above to be handled, but also extend to
very complicated problems, even involving nonlinear
optimisation with constraints.13
In order to emphasise the importance of testing
standard-deviation software, we comment that a particular spreadsheet package loses 12 of the 14 decimal figures available when $s = 0.1$ and $\mu = 10^6$, whereas another loses just three figures. When $\mu$ is increased to $10^7$, the first spreadsheet package returns zero(!) as the answer (as a result of losing all figures), whereas the second delivers a value differing from the true value by less than $10^{-9}$. Thus, $\mu$ (or $K$) can be used as a performance parameter, i.e., a varying parameter to investigate performance.
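A sketch of this construction follows (ours; the specific values mirror the test just described, and the naive one-pass formula merely stands in for whatever unstable formula a package might use internally). Each data set consists of the two values $\mu \pm s/\sqrt{2}$, so its sample standard deviation is exactly $s$, and $\mu$ is varied as the performance parameter.

```python
import math

def two_value_data_set(mu, s):
    """Two values with mean mu and sample standard deviation exactly s."""
    return [mu - s / math.sqrt(2.0), mu + s / math.sqrt(2.0)]

def std_one_pass(x):
    """Unstable one-pass form of the sample standard deviation (cf. Eq. (6) of Section 5.1);
    a marginally negative computed value is replaced by zero, as some packages do internally."""
    m = len(x)
    xbar = sum(x) / m
    d = sum(xi ** 2 for xi in x) - m * xbar ** 2
    return math.sqrt(max(d, 0.0) / (m - 1))

s = 0.1
for mu in (1.0e3, 1.0e6, 1.0e7):
    print(mu, std_one_pass(two_value_data_set(mu, s)))   # degrades from ~0.1 towards zero as mu grows
```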
4.3. Graded data sets and performance profiles
Software used in scientific disciplines often has the
property that it works well for certain data sets but
produces unacceptable results for other sets. There is
therefore a danger that limited testing might not reveal
its deficiencies. An example is polynomial regression
software, where the results of fitting a polynomial of
low-degree can be perfectly acceptable, whereas those
for higher degrees would be contaminated by errors
introduced by the solution algorithm. In Section 5.2
the simplest polynomial (apart from the constant), the
straight line, is again considered.
The testing of software on a small number of data
sets can provide valuable information about the software, particularly if the sets are selected to be as
disparate as possible in terms of their locations in the
space of possible data inputs. Such data sets provide
‘‘spot checks’’ of the test software, and are useful in
the sense that the software can be regarded as deficient
if it fails to perform adequately on any one of them.
They are also useful for the software developer to test,
e.g., particular paths through the software. They are
less effective for black-box testing where without
additional information little can be assumed about
the algorithms that have been implemented. Data sets
of this form have been used for a number of years in
testing software used in industrial inspection for fitting
mathematical features such as spheres, cylinders and
cones to data acquired using co-ordinate measuring
machines [12] (a practice that will become more formally established when an ISO standard is published [8]), and in chemometrics [2,13].
Graded data sets can be used to help quantify the
behaviour of scientific software. A sequence of data
sets is prepared, where each data set in the sequence
represents a problem that in some sense is more
difficult than the previous member in the sequence.
Such a suite of graded data sets is used to explore one
dimension of the problem addressed by the software.
The resulting performance profile, the graph of the
values of P against the degree of difficulty K (or some
related performance parameter), can provide valuable
information about the software. If, for graded reference data sets (taken in order), the values of the
resulting performance measure P(b) have a tendency
to increase, it is likely that an unstable method has
been employed. (Performance profiles in a somewhat
different setting have also been studied by Lyness
[14].) The performance profile can be expressed as a
series of points (or the set of straight-line segments
joining them). By introducing variability into each
data set for each value of the performance parameter,
13Constrained nonlinear optimisation is the problem of minimising a mathematical function of a number of variables subject to a set of mathematical equalities and inequalities involving these variables [5].
e.g., by using random numbers in a controlled way to
simulate observational error, the performance profile
will become a swathe of ‘‘replicated’’ points, which
can be advantageous in the fair comparison of rival
algorithms. Examples of performance profiles are
given in the following section.
5. Examples of application
We give three examples of application, the first two
of which arise in virtually all fields of science, and the
third in areas such as optical spectrometry and analytical chemistry. Although these examples are elementary, they are sufficient to illustrate the key aspects of
software testing using the presented methodology.
5.1. The sample arithmetic mean and standard
deviation
The arithmetic mean of $m$ numbers $x_i$, $i = 1, \ldots, m$, is defined by
$$ \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i. \qquad (4) $$
Apart from exceptional cases, most software for
computing $\bar{x}$ from given $x_i$ will produce a result that
has a relative accuracy approaching the computational
precision and hence be acceptable in all application
areas. The quality of software for computing the
standard deviation is in comparison very varied across
implementations. In fact, for some implementations,
the software degrades rapidly in its performance for a
sequence of graded data sets. The sample standard
deviation $s$ is given (stably) by
$$ s = \left\{ \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \right\}^{1/2}, \qquad (5) $$
where $\bar{x}$ is the arithmetic mean defined in (4). Some software uses the mathematically equivalent (but unstable) form
$$ s = \left\{ \frac{1}{m-1} \left( \sum_{i=1}^{m} x_i^2 - m\bar{x}^2 \right) \right\}^{1/2}. \qquad (6) $$
It has the property that it suffers from subtractive
cancellation for data sets with small coefficient of variation $s/\bar{x}$.
For instance, if the values of $x_i$ are [0.98, 0.99, 1.00, 1.01, 1.02], then $\bar{x} = 1.00$, $\sum_i x_i^2 = 5.0010$ and $m\bar{x}^2 = 5.0000$, and hence three to four significant figures are lost due to subtractive cancellation in forming $\sum_i x_i^2 - m\bar{x}^2 = 5.0010 - 5.0000 = 0.0010$. In some cases, e.g., a series of accurate replicate weighings of a standard kilogram to within 1 mg, $\sum_i x_i^2$ will be equal to $m\bar{x}^2$ to all figures held on the computer. It is even possible that the computed value of $\sum_i x_i^2$ will be (marginally) less than that of $m\bar{x}^2$ owing to the rounding errors incurred in forming these quantities.
In the former case, $s$ is computed as zero (even though the $x_i$ are not all identical). In the latter case, some software packages internally replace the computed value of $\sum_i x_i^2 - m\bar{x}^2$ by zero in order to avoid a failure due to an attempt to take the square root of a negative number. Such cases are not pathological. In a comparison [13] of various software packages and electronic calculators, not one of the eight tested on weighing data produced a value of $s$ correct to two significant figures, although a perfectly acceptable value would have been obtained by the use of formula (5).
Consider a basic data set defined by $X_0 = [-nh, -(n-1)h, \ldots, nh]$. The arithmetic mean and standard deviation of this data set are $0$ and $s = h\{(n + 1/2)(n + 1)/3\}^{1/2}$, respectively. A family of graded data sets can be derived from this set by adding an increasing sequence of values: $X_k = X_0 + \mu q^k$, $k = 1, \ldots, N$, for some $q > 1$. The mean and standard deviation of $X_k$ are $\mu q^k$ and $s$, respectively. Graded data sets were so constructed using $\mu = 3.172$, $h = 0.1$, $q = 1.5$, $n = 12$ and $N = 60$. A value of the degree of difficulty $K$ was calculated in each case as the inverse coefficient of variation $\bar{x}/s$ [2]. Software implementations of the above two formulae were applied to these data sets and a performance profile determined (see Fig. 1).
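The construction just described, and the determination of the corresponding performance profile, can be sketched as follows (our code; the stable and unstable formulae are those of Eqs. (5) and (6), and the performance measure follows Eq. (3) with the known standard deviation $s$ as the reference result).

```python
import numpy as np

ETA = np.finfo(float).eps                       # computational precision of IEEE double arithmetic

def std_stable(x):                              # Eq. (5): two-pass form
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (x.size - 1))

def std_unstable(x):                            # Eq. (6): one-pass form
    x = np.asarray(x, dtype=float)
    d = np.sum(x ** 2) - x.size * x.mean() ** 2
    return np.sqrt(max(d, 0.0) / (x.size - 1))

def performance(s_test, s_ref, K):              # Eq. (3) applied to a single (scalar) result
    return np.log10(1.0 + abs(s_test - s_ref) / (K * ETA))

mu, h, q, n, N = 3.172, 0.1, 1.5, 12, 60        # values quoted in the text
X0 = h * np.arange(-n, n + 1)                   # basic data set, mean 0
s_ref = h * np.sqrt((n + 0.5) * (n + 1) / 3.0)  # known standard deviation of every X_k

Ks, P_stable, P_unstable = [], [], []
for k in range(1, N + 1):
    Xk = X0 + mu * q ** k                       # graded data set: mean mu*q^k, standard deviation s_ref
    K = mu * q ** k / s_ref                     # degree of difficulty: inverse coefficient of variation
    Ks.append(K)
    P_stable.append(performance(std_stable(Xk), s_ref, K))
    P_unstable.append(performance(std_unstable(Xk), s_ref, K))

# Plotting P against K (log scale) for the two formulae should show the character of Fig. 1:
# an essentially horizontal line for Eq. (5) and a generally rising line for Eq. (6).
```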
It is observed in Fig. 1 that the stable algorithm
(points joined by straight-line segments to give an
essentially horizontal line) performs completely satisfactorily for problem degrees of difficulty K spanning
10 decades. The unstable algorithm (points joined by
straight-line segments to give a generally increasing
line) loses a number of decimal figures over and above
that which would be lost by a stable algorithm in a
manner essentially proportional to $\log K$. (This result is predicted by a detailed floating-point error analysis of formula (6).) A performance profile for standard-deviation software similar in behaviour to the generally rising line in Fig. 1 could be taken as an indication that the unstable formula (6) (or a comparable formula) had been used. It is to be noted that, for
almost 20% of the 60 cases, the results produced
by the unstable algorithm are as accurate as those
for the stable algorithm. This demonstration of the
fact that an unstable algorithm is (almost randomly)
capable of providing good results is a reason why
minimal checking of a piece of software is unsatisfactory.
There are counterparts of this result for many other
mathematical and statistical calculations.
5.2. Straight-line regression
Polynomial-regression software can be deficient in
a number of respects. Prime causes [2] are:
1. the parametrisation of the polynomial (i.e., the
choice of basis functions (monomials, Chebyshev
polynomials,14 etc.) for representing the polynomial);
2. the scaling or normalisation of the problem data;
and
3. the algorithm used to solve the linear algebraic
equations defining the polynomial coefficients.
Failure to scale or normalise the data sensibly or a
poor choice of parametrisation can cause the resulting
equations defining the parameters of the polynomial to
be unnecessarily ill-conditioned. A poor choice of
algorithm to solve this system can also induce additional ill-conditioning (which would have been
avoided by using a stable algorithm). Since the straight
Fig. 1. Performance profile for sample standard deviation against inverse coefficient of variation calculated using a stable algorithm (generally
horizontal line) and an unstable algorithm (generally increasing line).
14The monomial representation of a polynomial $p(x)$ is $a_0 + a_1 x + a_2 x^2 + \cdots$. Chebyshev polynomials provide another way of expressing $p(x)$, viz., $a_0 T_0(x) + a_1 T_1(x) + a_2 T_2(x) + \cdots$, for different coefficients. Here, $T_j(x)$ is a polynomial of degree $j$ in $x$, given by $T_0(x) = 1$, $T_1(x) = x$, $T_j(x) = 2x T_{j-1}(x) - T_{j-2}(x)$, $j > 1$ [15].
line is a special case of a polynomial, the above causes
can also apply, although the first cause does not
because in this case the Chebyshev polynomials and
the monomials are identical.
For a chosen value of $m$, a basic data set $\{(x_i, y_i)\}_{i=1}^{m}$ was devised by generating values $y_i$ lying exactly on a straight line at $m$ equispaced points $x_i$ in $[-1, 1]$. Null-space methods were then used to perturb these values to simulate random noise. The actual test detailed below applied to $m = 41$ but, to illustrate the null-space approach, the case $m = 3$ is considered first.
The $x$-values in the case $m = 3$ are $-1$, $0$ and $1$. The corresponding Jacobian matrix (see Section 4.1) is
$$ J = \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, $$
and, as can easily be verified by forming $J^{\mathrm{T}}N$, the corresponding null-space matrix is
$$ N = \frac{1}{\sqrt{6}} \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}. $$
Now, for any specific given straight line, generate the three $y$-values corresponding to these values of $x$. If the line is
$$ y = 5 + 2x, \qquad (7) $$
the $y$-values would be 3, 5 and 7. From the theory, any vector that can be written as a linear combination of the columns of $N$ can be added to the vector of $y$-values, and the least-squares straight line for these perturbed values would be identical to the original line (7). In this simple case the null matrix is a vector, so any multiple, $\epsilon\sqrt{6}$ say, of the null-space vector can be added to the $y$ vector to give
$$ y_{\mathrm{new}} = \begin{bmatrix} 3 + \epsilon \\ 5 - 2\epsilon \\ 7 + \epsilon \end{bmatrix}. $$
A small value of $\epsilon$ can be used to simulate a small-residual problem and a large value a large-residual problem. The significance of this statement is that some software can perform less well for large-residual problems, and graded data sets spanning a wide range of $\epsilon$ can be helpful in identifying the point at which, for practical purposes, accuracy starts to be lost. It is to be noted that for nonlinear least-squares problems, for which iterative methods of solution, such as Gauss–Newton and its variants [5], are employed, the convergence rates of these methods are affected by the size of the residuals. Thus the performance (in terms of execution time, accuracy of solution, ability to converge to the required solution, etc.) of such algorithms with respect to this ‘‘size’’ can be examined.
For straight-line regression, a more important effect is the data normalisation. In this regard, the sensitivity of test software to the choice of origin for the $x$-values can be investigated. If a constant is added to the $x$-values, the solution mathematically will be trivially
changed (Section 2.3): the gradient will be unaffected,
but the intercept term, if the parametrisation (1) is
used, will change. The change in the intercept value is
proportional to the constant translation of the x-values.
The residuals are of course unaffected.
Two items of straight-line regression software were applied to a data set containing $m = 41$ points, and the performance measure $P$ based on (3), applied to the residuals of the fitted function, was determined in each case. The resulting performance profiles ($P$ against degree of difficulty $K$, equal to the distance of the data set from the origin) are shown in Fig. 2. The generally lower line was obtained by joining the points representing the results for software based on a stable algorithm, A, which uses a normalised variable, the resulting overdetermined linear system being solved using QR decomposition ([6], p. 239) (a stable algorithm for linear least-squares problems). The generally higher line corresponds to values obtained from the use of a less stable algorithm, B, which uses the untransformed variable and forms and solves the normal equations.
Note that Algorithm A performed near-optimally, whereas the software based on Algorithm B steadily lost accuracy as $K$ increased.
5.3. Principal components analysis
In building a good calibration model in analytical chemistry to predict concentration levels of the analyte of interest, a key step is to assemble a representative range of samples [16]. Principal components
analysis (PCA) is a tool that assists in determining
this model. It is valuable because of its ability to
reduce the number of variables in a model to the
minimum number that provide the required information. Mathematically, the PCA problem is as follows.
Given an $n \times p$ observation matrix $X$, determine a
number, l