Assignment title: Information

Chapter 10 Collecting data for statistical analysis

Learning objectives
When you have studied this chapter, you should be able to:
• select a random sample
• classify variables according to their level of measurement
• describe the main methods for collecting data for statistical analysis
• discuss the strengths and weaknesses of different methods
• design questions for questionnaire and interview surveys.

10.1 Introduction
You may be reading this chapter because you are designing a positivist study and you need to identify and discuss your intended method(s) of data collection to finalize your proposal. Alternatively, you may be reading this chapter because your proposal has been accepted, and you are now ready to start collecting primary research data for statistical analysis. In either case, this chapter will help guide you. We would emphasize that we do not recommend that you use the terms 'quantitative methods' or 'qualitative methods', as it is the data rather than the means of collecting the data that are in numerical or non-numerical form.

You can read this chapter quite independently from Chapter 7, which focuses on the collection of qualitative data. In this chapter, we focus on methods used to collect data for statistical analysis. If you have already studied Chapter 7, you will notice that some of the methods we describe are similar, as they can also be adapted for use under a positivist paradigm.

You will remember from Chapter 4 that the two main methodologies associated with positivism are experimental studies and surveys. Since experimental studies are not widely used in business research for practical and ethical reasons, we focus on the methods used to collect primary research data when a survey methodology is adopted. We start by examining the main issues, which include methods for selecting a sample. This is followed by a section that explains the different types of variables about which data will be collected.
This paves the way for a detailed discussion of the use of self-completion questionnaires and interviews. We also describe critical incident technique, which can be incorporated in either method. The close relationship between collecting and analysing the research data means it is important to think ahead to the type of statistical analysis you will use when designing the actual questions for self-completion questionnaires and interviews. Therefore, we examine the issues relating to designing questions separately.

10.2 Main issues in collecting data for statistical analysis
Researchers are interested in collecting data about the phenomena they are studying. You will remember that in Chapter 1 we defined data as known facts or things used as a basis for inference or reckoning. Some authors distinguish between data and information, by defining information as the knowledge created by organizing data into a useful form. This obviously depends on how items of data are perceived and how they are used. For example, if you are a positivist, you may have collected data relating to the variables under study via a questionnaire survey, which you subsequently analysed using statistics. You probably consider that this process allowed you to turn data into information that makes a small contribution to knowledge. On the other hand, your respondents may consider that what they gave was information in the first place.

Data are known facts or things used as a basis for inference or reckoning. Information is the knowledge created by organizing data into a useful form. Secondary data are collected from an existing source, such as publications, databases and internal records. Primary data are generated from an original source, such as your own experiments, surveys, interviews or focus groups.

Your research data can be quantitative (in numerical form) or qualitative (in non-numerical form, such as text or images). Data can also be classified by source.
Your study may be based on an analysis of secondary data (data collected from an existing source) or on an analysis of primary data (data you have generated by collecting them from an original source, such as an experiment or survey). Typical sources of secondary research data include archives, commercial databases, government and commercially produced statistics and industry data,

• Multi-stage sampling is used where the groups selected in a cluster sample are so large that a sub-sample must be selected from each group. For example, first select a sample of companies. From each company, select a sample of departments and from each department select a sample of managers to survey.

10.3 Variables
Once you have determined which method you will use to select a sample, you will need to turn your attention to the variables about which you will collect data. You will remember that under positivism, research is deductive and one of the purposes of the literature review is to identify a theory or set of theories (a theoretical framework) for your study. As explained in Chapter 3, a theory is a set of interrelated variables, definitions and propositions that specifies relationships among the variables. A variable is an attribute or characteristic of the phenomenon under study that can be observed and measured. Researchers collect data relating to each variable and use this empirical evidence to test their hypotheses.

A theory is a set of interrelated variables, definitions and propositions that specifies relationships among the variables. A variable is a characteristic of a phenomenon that can be observed or measured. Empirical evidence is data based on observation or experience. A hypothesis is a proposition that can be tested for association or causality against empirical evidence.

Before you can collect any research data, you need to understand the properties of the variables relating to the phenomena you are studying.
We have just described a variable as an attribute or characteristic of the phenomenon under study that can be observed and measured. You can see from this definition that variables are usually taken to be numerical and this is because any non-numerical observations can be quantified by allocating a numerical code (Upton and Cook, 2006). For example, the responses to open questions in a survey can be examined to identify the main themes and then a number given to each theme or category.

10.3.1 Measurement levels
The level at which a variable is measured has important implications for your subsequent choice of statistical methods. 'A level of measurement is the scale that represents a hierarchy of precision on which a variable might be assessed' (Salkind, 2006, p. 100).

A ratio variable is measured on a mathematical scale with equal intervals and a fixed zero point. An interval variable is measured on a mathematical scale with equal intervals and an arbitrary zero point.

There are four levels of measurement, which we will examine in decreasing order of precision:
• A ratio variable is a quantitative variable measured on a mathematical scale with equal intervals between points and a fixed zero point. The fixed zero point permits the highest level of precision in the measurement and allows us to say how much of the variable exists (it could be none) and compare one value with another. For example, using sea level as the fixed zero point, we can measure altitude in feet or metres. This means we can say that one aeroplane is flying at an altitude measured in metres that is twice as high as another aeroplane. If we use kilometres as the measurement scale, we can measure the distance by train from London to Brussels. If we use time as the measurement scale, we would designate the time of departure from London as the fixed zero point and compare the average time of the journey by high speed train with the time by air.
This allows us to say that the mean (average) train journey is only 10% longer than by air.

9780230301832_11_cha10.indd 201 29/10/2013 11:55

• An interval variable is a grouped quantitative variable measured on a mathematical scale that has equal intervals between points and an arbitrary zero point. This means you can place each data item precisely on the scale and compare the values. For example,

can compare your results with others based on the same construct. Examples include social stratification categories, frequency categories, ranking and rating scales (see section 10.5.4).

Dependent and independent variables
In many statistical tests it is necessary to identify the dependent variable (DV) and the independent variable (IV). A dependent variable is a variable whose values are influenced by one or more independent variables. Conversely, an independent variable is a variable that influences the values of a dependent variable. For example, in an experimental study, the intensity of lighting (IV) in the workplace might be manipulated to observe the effect on the productivity levels (DV), or a stressful situation might be created by generating random loud noises (IV) outside the workplace window to observe the effect on the completion of complex tasks (DV).

An extraneous variable is any variable other than the independent variable that might have an effect on the dependent variable. For example, if your study involves an investigation of the relationship between productivity and motivation, you may find it difficult to exclude the effect of other factors, such as a heatwave, a work-to-rule, a takeover or anxiety caused by personal problems. A confounding variable is one that obscures the effect of another variable.
For example, employees' behaviour may be affected by the novelty of being the centre of the researcher's attention or by working in an unfamiliar place for the purposes of a controlled experiment.

10.4 Data collection methods
The two main data collection methods we discuss in this section are self-completion questionnaires and interviews. We also describe critical incident technique, which can be incorporated in either method. These are widely used methods in positivist studies, but you should also explore other methods mentioned in previous studies in your field.

Before you start collecting any data, you need to have a list of the population of people or collection of items under consideration. If the population is too large to include them all in your questionnaire or interview survey, you will need to decide on a method for selecting a suitable sample. Remember that you must also obtain ethical approval if your study involves human participants.

In Chapter 7, we drew attention to the importance of using rigorous methods for recording research data that also provide evidence of the source. If the participant is not providing written responses, you will need to jot down the main points in a notebook. This necessarily means leaving out items and all the details, which can lead to distortions, errors and bias. Even shorthand writers sometimes have a problem in deciphering their notes afterwards and you need to be aware that relying on your notes will be inadequate. Audio and/or video recording overcomes these problems and leaves you free to concentrate on taking notes of other aspects, such as attitude, behaviour and body language, if these are relevant to your understanding of the phenomena under study. You can use a specific recording device or the facilities on your telephone or laptop. The important thing to remember is that you need to obtain the participant's agreement to being recorded.
lishing the general aims of an activity, the training of the interviewers and the manner in which observations should be made are all predetermined. What is of prime interest to researchers is the way in which Flanagan concentrates on an observable activity (the incident), where the intended purpose seems to be clear and the effect appears to be logical; hence, the incident is critical.

We showed Flanagan's example of a form for collecting effective critical incidents in Chapter 7. In this chapter we will look at an example taken from a questionnaire survey of householders (MacKinlay, 1986), which contained six open questions. The questionnaire allowed a third of an A4 page per question for the reply, but some respondents added additional sheets. The questions were preceded by an explanation, as shown in Box 10.4.

Box 10.4 Critical incident technique in a survey
These questions are open-ended and I have kept them to a few vital areas of interest. All will require you to reflect back on decisions and reasons for decisions you have made.
1 Please think about an occasion when you improved your home. What improvements did you make?
2 On that occasion what made you do it?
3 Did you receive any help? If 'yes', please explain what help you received.
4 Have you wanted to improve your home in any other way but could not?
5 What improvements did you wish to make?
6 What stopped you from doing it?
Source: MacKinlay (1986) cited in Easterby-Smith, Thorpe and Lowe (1991, p. 84).

It is likely that many researchers use this approach without realizing it. One of the benefits is that it allows the researcher to collect data about events chosen by the respondent because they are memorable, rather than general impressions of events or vicarious knowledge of events. In interviews, it can be of considerable value in generating data where there is a lack of focus or the interviewee has difficulty in expressing an opinion.
One of the problems associated with methods based on memory is that the participant may have forgotten important facts. In addition, there is the problem of post-rationalization, where the interviewee recounts the events with a degree of logic and coherence that did not exist at the time.

10.5 Designing questions
Once you have decided on the method and you have identified the variables about which you need to collect data to test your hypotheses, you are ready to start designing the actual questions you will ask. In this section, we focus on designing questions for a positivist study, where the research data generated will be analysed using statistical methods.

Before you can decide what the most appropriate questions will be, you must gain a considerable amount of knowledge about your subject to allow you to develop a theoretical or conceptual framework and formulate the hypotheses you will test. Your subject knowledge will come from your taught and/or independent studies; your theoretical framework (sometimes referred to as a conceptual framework) that underpins the hypotheses you will test will be drawn from your literature review. The statistical methods you will use will be described in your methodology chapter.

You must be alert to the possibility that some of the issues you wish to investigate may be offensive or embarrassing to the respondents. We do not recommend you ask any sensitive questions in a self-completion questionnaire. Not only is it likely to deter respondents from answering the sensitive question, but it may discourage them from participating at all.

10.6 Coding questions
Although coding is more closely related to data analysis than to data collection, it is important to consider at this stage how you will analyse your research data and what software is available to help you with this task (for example Microsoft Excel, Minitab and IBM® SPSS® Statistics).
SPSS is widely used in business research because it can process large amounts of data and we will be introducing the principles of data entry and analysis using SPSS in the next chapter.

10.6.1 Coding closed questions
Pre-coding questions for statistical analysis as part of the questionnaire design makes the subsequent data entry easier and less prone to error. Where this is not possible, it is important to remember to keep a record of the codes used for each question and what they signify. This is essential should you decide to use a third party to input your data, and also for when you start to interpret the analysed data.

It is usual to reserve certain code numbers for particular purposes. For nominal variables where only one answer can be selected, allocate a different code to each answer so that it can be identified. For nominal variables where more than one answer may apply, each variable is treated independently: use 1 to indicate the box has been ticked (the characteristic is present) and leave blank if it has not been ticked. A blank will be interpreted by SPSS as 'missing' data, which means a non-response. Depending on your planned analysis, you may wish to use 0 if the box has not been ticked (the characteristic is not present). Similarly, it is usual to code the answer 'yes' as 1 and the answer 'no' as 0. There is no need to pre-code ordinal variables because they use a numerical rating scale.

You may have noticed that the examples of questions we used in this chapter were pre-coded. Box 10.13 shows an example of a completed questionnaire. Look carefully at the way in which the potential answers have been coded. Each code is discreetly shown in brackets next to the relevant box.
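The coding conventions described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the question names, answer labels and the list of tick-box options are hypothetical and are not taken from the questionnaire in Box 10.13. The sketch keeps a codebook (the record of codes and what they signify), codes 'yes' as 1 and 'no' as 0, and treats each tick box in a multiple-answer question as its own variable.

```python
# Hypothetical raw answers transcribed from three returned questionnaires.
# Question names and answer labels are illustrative assumptions only.
raw_responses = [
    {"audited": "yes", "legal_form": "private", "users_ticked": ["bank", "tax"]},
    {"audited": "no", "legal_form": "public", "users_ticked": []},
    {"audited": "yes", "legal_form": "private", "users_ticked": ["bank"]},
]

# Codebook: the record of the codes used for each single-answer question.
CODEBOOK = {
    "audited": {"yes": 1, "no": 0},
    "legal_form": {"private": 1, "public": 2},
}

# Tick boxes for a question where more than one answer may apply.
USER_OPTIONS = ["bank", "tax", "suppliers"]


def code_response(raw, unticked_as_zero=True):
    """Return one coded record for a single questionnaire."""
    coded = {}
    # Single-answer nominal variables: look up the code for the chosen answer.
    for question, codes in CODEBOOK.items():
        coded[question] = codes[raw[question]]
    # Multiple-answer question: each tick box becomes a separate variable.
    for option in USER_OPTIONS:
        if option in raw["users_ticked"]:
            coded[f"user_{option}"] = 1
        else:
            # 0 = characteristic not present; None = left blank (missing).
            coded[f"user_{option}"] = 0 if unticked_as_zero else None
    return coded


coded_data = [code_response(r) for r in raw_responses]
for record in coded_data:
    print(record)
```

Passing unticked_as_zero=False leaves unticked boxes blank (None), which corresponds to the case where a blank is read as missing data; which choice is appropriate depends on your planned analysis.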
There are no hard and fast rules about where to place the codes and you may find that it makes more sense to put the codes at the top of a column of boxes for some sets of variables. You simply need to adopt a location that improves the accuracy and efficiency of processing the data, while not confusing the respondent. In this example, a smaller, lighter font has been used to reduce the likelihood of the respondent becoming distracted by codes.

Earlier in this chapter, we suggested that you should pilot your questions before commencing your data collection in earnest. We also recommend that once you have your test data, you pilot your coding. Amending coding errors now will save you valuable time and effort later when errors can only be painstakingly corrected by hand on every record sheet or questionnaire.

10.6.2 Coding open questions
Statistical analysis can only be conducted on quantitative data. Open questions where the answer takes a numerical value do not need to be coded (for example dates or financial data). However, open questions where you are unable to anticipate the response (including those where you provide an 'Other' category) will result in qualitative data that cannot be coded until all the replies have been received.

The task of recording and counting frequencies accurately and methodically can be helped by using tallies. A tally is just a simple stroke used to count the frequency of occurrence of a value or category in a variable. You jot down one upright stroke for each occurrence until you have four; the fifth is drawn horizontally across the group, like a five bar gate. You can then count in fives until you get to the single tallies. Box 10.14 shows tallies being used to help record the frequencies for the second part of question 3, which was designed as an open question to capture the respondents' reasons for a particular action.

Box 10.14 Using tallies to count frequencies
3. Would you have the accounts audited if not legally required to do so? (Tick one box only)
   Yes, the accounts are already audited voluntarily   (1)
   Yes, the accounts would be audited voluntarily      (2)
   No                                                  (0)
   Please give reasons for either answer

   Reason                       Tallies                                       Frequency
   Voluntary audit
     Assurance for third party  ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/        35
     Good practice              ||||/ ||||/ ||||/ ||||                           19
   No audit
     No benefit/no need         ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ |      36
     Cost savings               ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||           32

10.7 Conclusions
In this chapter, we have discussed the methods you can use to select a sample under a positivist paradigm, if the population is too large to be used. If you want to generalize the results from the sample to the population, you must select a random sample of sufficient size to represent the population and allow you to address your research questions. We have also investigated the main methods for collecting primary data under a positivist paradigm. You should now be in a position to make an informed choice, bearing in mind that some methods can be adapted for use under either paradigm and you can use more than one method. You must obtain ethical approval from your university before you collect any data if your study involves human participants.

We have also examined how you can classify variables according to their level of measurement, which has important implications for how you design your questions and the statistical tests you can use to analyse your research data. There are a number of ways in which questions can be designed, including the use of hypothetical constructs to measure abstract ideas. We have discussed these matters and explained how questions in questionnaires and other data record sheets can be pre-coded for subsequent statistical analysis. There is considerable choice in methods for distributing questionnaires.
If you are using interviews, you must use rigorous methods to record the research data that provide evidence of the source. The important thing to remember is that you need to obtain the participant's agreement if you intend to audio record the interview and take notes.

References
Alexander, D. (2006) 'The devil with Dawkins', Times Higher Education. London: TSL Education, 3 February, p. 20.
Brenner, M. (1985) 'Survey Interviewing', in Brenner, M., Brown, J. and Canter, D. (eds) The Research Interview: Uses and Approaches. New York: Academic Press, pp. 9–36.
Clegg, F. G. (1990) Simple Statistics. Cambridge: Cambridge University Press.
Collis, J. (2003) Directors' Views on Exemption from Statutory Audit, URN 03/1342, October, London: DTI. [Online]. Available at: http://www.berr.gov.uk/files/file25971.pdf (Accessed 20 February 2013).
Coolican, H. (2009) Research Methods and Statistics in Psychology, 5th edn. London: Hodder Arnold.
Czaja, R. and Blair, J. (1996) Designing Surveys: A Guide to Decisions and Procedures. Thousand Oaks, CA: Pine Forge Press.
Flanagan, J. C. (1954) 'The critical incident technique', Psychological Bulletin, 51(4), July, pp. 327–58.
Hall, R. H. (1968) 'Professionalism and bureaucratization', American Sociological Review, 33, pp. 92–104.
Kervin, J. B. (1992) Methods for Business Research. New York: HarperCollins.
Krejcie, R. V. and Morgan, D. W. (1970) 'Determining sample size for research activities', Educational and Psychological Measurement, 30, pp. 607–10.
Lee, R. M. (1993) Doing Research on Sensitive Topics. London: SAGE.
MacKinlay, T. (1986) The Development of a Personal Strategy of Management, M.Sc. thesis, Manchester Polytechnic, Department of Management, cited in Easterby-Smith, M., Thorpe, R. and Lowe, A. (1991) Management Research. London: SAGE.
Salkind, N. J. (2006) Exploring Research. Upper Saddle River, NJ: Pearson International.
Upton, G. and Cook, I. (2006) Oxford Dictionary of Statistics, 2nd edn. Oxford: Oxford University Press.

1 You are interested in environmental issues. Discuss the advantages and disadvantages of collecting secondary data, such as the newspaper or television news coverage, compared with primary data.

2 General lecture questionnaire
Think about the last lecture you attended and complete the following questionnaire. This is just an exercise and you won't be asked to identify the lecture or the lecturer or reveal your ratings. When you've finished, jot down what you like or dislike about the questionnaire from your perspective as a 'respondent'. Then form a group to discuss your views on the instructions, the layout and the questions.

GENERAL LECTURE QUESTIONNAIRE
The purpose of this questionnaire is to obtain your views and opinions about the lectures you have been given during the course from this lecturer to help him evaluate his teaching. Please ring the response that you think is the most appropriate to each statement. If you wish to make any comments in addition to those ratings please do so on the back page.

Scale: 5 = Strongly agree, 4 = Agree, 3 = Neither agree nor disagree, 2 = Disagree, 1 = Strongly disagree

The lecturer
1. Encourages student participation in lectures      5  4  3  2  1
2. Allows opportunities for asking questions          5  4  3  2  1
3. Has a good lecture delivery                        5  4  3  2  1
4. Has good rapport with students                     5  4  3  2  1
5. Is approachable and friendly with students         5  4  3  2  1
6. Is respectful towards students                     5  4  3  2  1
7. Is able to reach student level                     5  4  3  2  1
8. Enables easy note taking                           5  4  3  2  1
9. Provides useful printed notes*                     5  4  3  2  1
10. Would help students by providing printed notes    5  4  3  2  1
11. Has a good knowledge of his subject               5  4  3  2  1
12. Maintains student interest during lectures        5  4  3  2  1
13. Gives varied, lively lectures                     5  4  3  2  1
14. Is clear and comprehensible in lectures           5  4  3  2  1
15. Gives lectures which are too fast to take in      5  4  3  2  1
16. Gives audible lectures                            5  4  3  2  1
17. Gives structured, organized lectures              5  4  3  2  1
18. Appears to be enthusiastic for his subject        5  4  3  2  1
*Please answer if applicable
Source: Anon.

3 Now put on your researcher's hat and redesign the general lecture questionnaire and pilot it with two fellow students. Stay with them while they complete it so you can ask them how useful they found the instructions and how easy they found it to answer the questions. Ask them what they liked and did not like about it.

Chapter 11 Analysing data using descriptive statistics

Learning objectives
When you have studied this chapter, you should be able to:
• differentiate between descriptive statistics and inferential statistics
• enter data into SPSS, recode variables and create new variables
• generate frequency tables, charts and other diagrams
• generate measures of central tendency and dispersion
• generate measures of normality.

11.1 Introduction
If you have adopted a positivist paradigm, you will have collected quantitative data and you will need to quantify any qualitative research data. If your knowledge of statistics is somewhat rusty, you should find this chapter useful as it contains key formulae for some of the basic techniques, together with step-by-step instructions and worked examples. However, you may prefer to enter your data into a software program, such as Microsoft Excel, Minitab or SPSS. In this chapter, we introduce you to IBM® SPSS® Statistics software (SPSS), which is widely used in business research because it can process a large amount of data. SPSS provides a data file where data can be stored, which is similar to a spreadsheet.
Once the data have been entered or imported into SPSS, frequency tables, charts, cross-tabulations and a range of statistical tests can be performed quickly and accurately. The resulting output can then be pasted into your dissertation or thesis. Whether you decide to calculate the statistics yourself or use software, you will need to determine which statistics are appropriate for the data you have collected and how to interpret the results. This chapter and the next will give you guidance.

11.2 Key concepts in statistics
The term statistics was introduced by Sir Ronald Fisher in 1922 (Upton and Cook, 2006) and refers to the body of methods and theory that is applied to quantitative data. Moore et al. (2009, p. 210) define a statistic as 'a number that describes a sample'. For example, you could calculate the mean number of employees in a sample of companies to describe the average size of the sample. A statistic can be used to estimate an unknown parameter, which is a number that describes a population. Thus, if you had a random sample that was representative of the population, you could use the sample mean to estimate the average number of employees in the population of companies. A random sample is a representative subset of the population where observations are made and a population includes the totality of observations that might be made (as in a census).

A statistic is a number that describes a sample. Statistics is a body of methods and theory that is applied to quantitative data. A parameter is a number that describes a population. Descriptive statistics are a group of statistical methods used to summarize, describe or display quantitative data. Inferential statistics are a group of statistical methods and models used to draw conclusions about a population from quantitative data relating to a random sample.

Research data can be secondary data (for example a survey of a sample of annual reports using content analysis), primary data (for example a survey of a sample of companies using questionnaires) or both. In addition to quantitative data, you may have collected some qualitative data (for example themes you have identified in the narrative sections of the annual reports or categories you have identified from responses to open questions in the questionnaire survey). You can see from the definition of statistics that statistical methods can only be applied to quantitative data, so you will need to quantify any qualitative data beforehand. You can do this by identifying each nominal variable and recording the frequency of occurrence of each category it contains. You will remember that in the previous chapter we recommended using tallies to aid the counting of frequencies.

Statisticians commonly draw a distinction between descriptive statistics and inferential statistics. Descriptive statistics are used to summarize the data in a more compact form and can be presented in tables, charts and other graphical forms. This allows patterns to be discerned that are not apparent in the raw data and 'positively aids subsequent hypothesis detection/confirmation' (Lovie, 1986, p. 165). Inferential statistics are 'statistical tests that lead to conclusions about a target population based on a random sample and the concept of sampling distribution' (Kervin, 1992, p. 727).

In an undergraduate dissertation, the research may be designed as a small, descriptive study. If so, you may be able to address your research questions by using descriptive statistics to explore the data from individual variables (hence the term univariate analysis). However, at postgraduate level, you are likely to design an analytical study. Therefore, you are more likely to use descriptive statistics at the initial stage and then go on to use inferential statistics (or other techniques) in a bivariate and/or multivariate analysis.
We will examine the statistics used in bivariate analysis (analysis of data relating to two variables) and multivariate analysis (analysis of data relating to three or more variables) in the next chapter.

11.3 Getting started with SPSS

11.3.1 The research data
We are going to use real business data collected for a postal questionnaire survey of the directors of small private companies (Collis, 2003) that focused on their option to forgo the statutory audit of their accounts. Do not worry if you know nothing about this topic, as no prior knowledge is required (you may remember seeing extracts from the questionnaire as some of the questions were used as examples in the previous chapter). The survey was commissioned by the government as part of the consultation on raising the turnover threshold for audit exemption in UK company law from £1 million to £4.8 million, which would extend this regulatory relaxation to a greater number of small companies.

The literature showed that although some of the companies that already qualified for audit exemption made use of it, others apparently chose to continue having their accounts audited. This led to the following research question:

What are the factors that have a significant influence on the directors' decision to have a voluntary audit?

Very briefly, the theoretical framework for the study was that the emphasis on turnover in company law at that time implied a relationship between size and whether the cost of audit exceeded the benefits. Agency theory (Jensen and Meckling, 1976) suggests that audit would be required where there was information asymmetry between 'agent' and 'principal' (for example the directors managing the company and external owners, or between the directors and the company's lenders and creditors).

Based on this framework, a number of hypotheses were formulated. Each hypothesis is a statement about a relationship between two variables.
The null hypothesis (H0) states that the two variables are independent of one another (there is no relationship) and the alternative hypothesis (H1) states that the two variables are associated with one another (there is a relationship). Using inferential statistics, the hypotheses are tested against the empirical data and the alternative hypothesis is accepted if there is statistically significant evidence to reject the null hypothesis (in other words, the null hypothesis is the default). Here is the first hypothesis in the null and the alternative form:

H0 Voluntary audit does not increase with company size, as measured by turnover.
H1 Voluntary audit increases with company size, as measured by turnover.

You should ask your supervisor whether he or she would prefer you to state your hypotheses in the null or the alternative form. Box 11.1 lists the nine hypotheses, which are stated in the alternative form.

11.4 Frequency distributions

A frequency is the number of observations for a particular data value in a variable.
A frequency distribution is an array that summarizes the frequencies for all the data values in a particular variable.
A percentage frequency is a descriptive statistic that summarizes a frequency as a proportion of 100.

In statistics, the term frequency refers to the number of observations for a particular data value in a variable (the frequency of occurrence of a quantity in a ratio or interval variable and a category in an ordinal or nominal variable). A frequency distribution is an array that summarizes the frequencies for all the data values in a particular variable (Upton and Cook, 2006). For example, the data values in the survey for the variable TURNOVER were the figures reported in the companies' 2006 annual accounts. If no company had precisely the same figure for turnover as another, the number of observations for each data value would be 1.
If the variable is measured on an ordinal scale (for example CHECK, which is coded 1–5) or a nominal scale (for example FAMILY, which is coded 1 or 0), the data values are the codes and the number of observations is the number of companies in each category. A frequency distribution can be presented for one variable (univariate analysis) or two variables (bivariate analysis) in a table, chart or other type of diagram. Even if you only have a very small data set (say, 20 data values or fewer), an examination of how the values are distributed will aid your interpretation of the data.

11.4.1 Percentage frequencies

A percentage frequency is a familiar statistical model, which summarizes frequencies as a proportion of 100. It is calculated by dividing the frequency by the sum of the frequencies and then multiplying the answer by 100. This can be expressed as a formula:

$$\text{Percentage frequency} = \frac{f}{\sum f} \times 100$$

where
f = the frequency
∑ = the sum of

Example
The survey found that 633 companies out of 790 in the sample had a turnover of less than £1 million. Putting these figures into the formula:

$$\frac{633}{790} \times 100 \approx 80\%$$

The formula we have used is not difficult to understand, but if you are not a statistician, you may find the mathematical notation somewhat mysterious. However, it is merely a kind of shorthand that speeds up the process of writing the formulae and, once you know what the symbols represent, you can decipher the message. As we are going to show you how to use SPSS to generate the statistics you require, we will not examine the mathematical side.

11.4.2 Creating interval variables

In a large sample, you may find it useful to recode ratio variables into non-overlapping groups and create a new variable measured on an equal-interval scale.
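The percentage frequency formula and the idea of recoding a ratio variable into equal-width groups can also be sketched outside SPSS. The following Python fragment is only an illustration: the function names, the band convention (each band includes its lower limit and excludes its upper limit) and the list of turnover figures are our assumptions, not part of the survey.

```python
def percentage_frequency(frequency, total):
    """Return a frequency as a percentage of the total number of observations."""
    return frequency / total * 100


def recode_to_band(value, width):
    """Return the 1-based number of the equal-width band that contains value.

    Assumed convention: band 1 is [0, width), band 2 is [width, 2 * width),
    and so on. The survey may have drawn its group boundaries differently.
    """
    return int(value // width) + 1


# Figures from the example: 633 of the 790 companies had turnover below £1m.
print(round(percentage_frequency(633, 790)))  # 80

# Hypothetical turnover figures in £k, recoded into bands of £1,000k (£1m).
turnovers_k = [54.0, 980.5, 1000.0, 2750.0, 4738.271]
bands = [recode_to_band(t, 1000) for t in turnovers_k]
print(bands)  # [1, 1, 2, 3, 5]
```

Whatever convention you choose, the groups must not overlap, so a value that falls exactly on a boundary (such as 1000.0 above) has to belong to one band only.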
For example, the original variable TURNOVER was recoded into a different variable named TURNOVERCAT with five groups containing equal intervals of £1m. You need to take care that

11.5 Measuring central tendency

We are now going to look at a group of statistical models that are concerned with measuring the central tendency of a frequency distribution. Measures of central tendency provide a convenient way of summarizing a large frequency distribution by describing it with a single statistic. The three measures are the mean, the median and the mode.

11.5.1 The mean

The mean is a measure of central tendency based on the arithmetic average of a set of data values.

The mean ($\bar{x}$) is the arithmetic average of a set of data in a sample and can only be calculated for ratio or interval variables. It is found by dividing the sum of the observations by the number of observations, as shown in the following formula:

$$\text{Mean} = \frac{\sum x}{n}$$

where
x = each observation
n = the total number of observations
∑ = the sum of

The advantages of the mean are:
• it can be calculated exactly
• it takes account of all the data
• it can be used as the basis of other statistical models.

The disadvantages of the mean are:
• it is greatly affected by outliers (extreme values that are very high or very low)
• it is a hypothetical value and may not be one of the actual values
• it can give an impossible figure for discrete data (for example the average number of owners in the sample of small companies was 5.8)
• it cannot be calculated for ordinal or nominal data.

Example
A student's exam marks were as follows:

| Module 1 | Module 2 | Module 3 | Module 4 | Module 5 | Module 6 |
| 82% | 78% | 80% | 64% | 70% | 64% |

Inserting the data into the formula:

$$\frac{82 + 78 + 80 + 64 + 70 + 64}{6} = \frac{438}{6} = 73\%$$

11.5.2 The median

The median is a measure of central tendency based on the mid-value of a set of data values arranged in size order.
The median (M) is the mid-value of a set of data that has been arranged in size order (in other words, it has been ranked). It can be calculated for variables measured on a ratio, interval or ordinal scale. Its position in the ranked data is found by adding 1 to the number of observations and dividing by 2. The formula is:

$$\text{Position of the median} = \frac{n + 1}{2}$$

where
n = the total number of observations

The results show that the mean for the grouped data in the interval variable TURNOVERCAT is £0.94m compared to the mean of £0.69m that we calculated earlier using the precise data contained in the ratio variable TURNOVER. The grouped data can only give an approximation of this important statistic. Moreover, this approximation is larger than the actual mean because it is based on the median in each category rather than every data value (observation). This helps demonstrate the superiority of ratio data over interval or ordinal data when it comes to measuring the mean, which lies at the heart of the most powerful statistical models used in inferential statistics. We will discuss this further in Chapter 12.

11.6 Measuring dispersion

Measures of central tendency are useful for providing statistics that summarize the location of the 'middle' of the data, but they do not tell us anything about the spread of the data values. Therefore, we are now going to look at measures of dispersion, which should only be calculated for variables measured on a ratio or interval scale. The two measures are the range and the standard deviation.

11.6.1 Range

The range is a measure of dispersion that represents the difference between the maximum value and the minimum value in a frequency distribution arranged in size order.
The interquartile range is a measure of dispersion that represents the difference between the upper quartile and the lower quartile (the middle 50%) of a frequency distribution arranged in size order.
The range is a simple measure of dispersion that describes the difference between the maximum value (the upper extreme or $E_U$) and the minimum value (the lower extreme or $E_L$) in a frequency distribution arranged in size order. You will remember from the previous section that the median is the mid-point, but in a large set of data (say, 30 observations or more) it can be useful to divide the frequency distribution into quartiles, each containing 25% of the data values. This allows us to measure the interquartile range, which is the difference between the upper quartile ($Q_3$) and the lower quartile ($Q_1$) and therefore describes the spread of the middle 50% of the data values. When comparing two distributions, the interquartile range is often preferred to the range, because the latter is more easily affected by outliers (extreme values). The formulae are:

$$\text{Range} = E_U - E_L$$

$$\text{Interquartile range} = Q_3 - Q_1$$

Example
Inserting the data for Turnover (£k) into the formulae:

$$\text{Range} = 4{,}738.271 - 0.054 = 4{,}738.217$$

$$\text{Interquartile range} = 742.76625 - 52.74525 = 690.021$$

Unfortunately, the drawback of using the range is that it only takes account of two items of data and the drawback of the interquartile range is that it only takes account of half the values. What we really want is a measure of dispersion that will take account of all the values and we discuss such an alternative next.

11.6.2 Standard deviation

The standard deviation (sd) should only be calculated for ratio or interval variables, but it overcomes the deficiencies of the range and the interquartile range discussed in the

11.7 Normal distribution

A normal distribution is a theoretical frequency distribution that is bell-shaped and symmetrical, with tails extending indefinitely either side of the centre. The mean, median and mode coincide at the centre.

We mentioned in the previous section that the standard deviation is related to the normal distribution.
This term was introduced in the late 19th century by Sir Francis Galton, cousin of Charles Darwin who published The Origin of Species in 1859 (Upton and Cook, 2006), and refers to a theoretical frequency distribution that is bell-shaped and symmetrical, with tails extending indefinitely either side of the centre. In a normal distribution, the mean, the median and the mode coincide at the centre (see Figure 11.14). It is described as a theoretical frequency distribution because it is a mathematical model representing perfect symmetry, against which empirical data can be compared.

[Figure 11.14 A normal frequency distribution. Horizontal axis: data values; vertical axis: frequency; the mean, median and mode are marked at the centre.]

It may surprise you that the output files from SPSS often contain a large amount of information. This is because the program provides the entire analysis to allow you to make a full interpretation of the results. Your next task is to decide how to summarize all your results. You will have seen many examples of how researchers do this when reviewing previous studies for your literature review. Tables 11.8 and 11.9 show examples of tables that are suitable for summarizing descriptive statistics for continuous and categorical variables respectively.
Table 11.8 Descriptive statistics for continuous variable

| Variable | N | Min | Max | Median | Mode | Mean | Std dev | Skewness: Statistic | Skewness: Std error | Kurtosis: Statistic | Kurtosis: Std error |
| TURNOVER | 790 | .054 | 4738.271 | 158.0645 | 8.000 | 691.07062 | 1119.44891 | 2.042 | .087 | 3.170 | .174 |

Table 11.9 Frequency distributions for categorical variables

| Variable | N | Number coded 5 | Number coded 4 | Number coded 3 | Number coded 2 | Number coded 1 | Number coded 0 |
| CHECK | 697 | 348 | 166 | 103 | 40 | 40 | - |
| QUALITY | 687 | 197 | 142 | 158 | 95 | 95 | - |
| CREDIBILITY | 688 | 300 | 182 | 126 | 40 | 40 | - |
| CREDITSCORE | 681 | 206 | 158 | 183 | 63 | 71 | - |
| FAMILY | 790 | - | - | - | - | 537 | 253 |
| EXOWNERS | 785 | - | - | - | - | 127 | 658 |
| BANK | 722 | - | - | - | - | 400 | 322 |
| EDUCATION | 790 | - | - | - | - | 553 | 237 |

11.8 Conclusions

In this chapter, we have demonstrated how to conduct a typical exploratory analysis of research data, how to generate tables, charts and other graphical forms, and how to summarize data using descriptive statistics. All students designing a study that includes the analysis of quantitative data need this knowledge to explore their data and decide how to summarize them using appropriate descriptive statistics. It does not matter whether you use IBM® SPSS® Statistics software (SPSS) or another software program to which you have access. If you have a relatively small data set, you could enter it into a Microsoft Excel spreadsheet, which also has facilities for generating statistics and charts. Although it is possible to calculate percentage frequencies, measures of central tendency and dispersion using a calculator, when time and accuracy are at a premium you will find it invaluable to learn how to use the statistical package at your disposal. These are transferable skills that will enhance your employability.

Table 11.10 summarizes the descriptive statistics we have examined in this chapter and helps you select those that are appropriate for the measurement level of your variables.
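To illustrate the point that the software you use is a matter of convenience, here is a minimal sketch of the measures of central tendency and dispersion discussed in this chapter, applied to the six exam marks from the example for the mean. It assumes Python 3.8 or later and its standard statistics module, which is our choice for illustration and not a tool used in the survey. Software packages use different conventions for locating quartiles, so the interquartile range shown here (Python's default 'exclusive' method) may differ slightly from the figure another package reports for the same data.

```python
import statistics

# The student's six exam marks (%) from the example for the mean.
marks = [82, 78, 80, 64, 70, 64]

mean = statistics.mean(marks)          # arithmetic average
median = statistics.median(marks)      # mid-value of the ranked marks
mode = statistics.mode(marks)          # most frequently occurring mark
value_range = max(marks) - min(marks)  # upper extreme minus lower extreme

# Quartiles using the module's default ('exclusive') method.
q1, q2, q3 = statistics.quantiles(marks, n=4)
interquartile_range = q3 - q1

# Sample standard deviation (divides by n - 1).
sd = statistics.stdev(marks)

print(mean)                 # 73
print(median)               # 74.0
print(mode)                 # 64
print(value_range)          # 18
print(interquartile_range)  # 16.5
print(round(sd, 2))         # 8.07
```

With an even number of observations, the position (n + 1)/2 = 3.5 falls between the third and fourth ranked marks (70 and 78), which is why the median is reported as their average.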
In addition to time constraints and your skills, your choice of statistics will depend on research questions, which may require the use of inferential statistics in addition to the descriptive statistics we have explained in this chapter. We discuss inferential statistics in the next chapter, but if these are not required for your study, you may find the checklist in Box 11.5 helps ensure the successful completion of your analysis.

Table 11.10 Choosing appropriate descriptive statistics

| Exploratory analysis | Measurement level |
| Frequency distribution: Percentage frequency | Ratio, interval, ordinal, nominal |
| Measures of central tendency: Mean | Ratio, interval |
| Measures of central tendency: Median | Ratio, interval, ordinal |
| Measures of central tendency: Mode | Ratio, interval, ordinal, nominal |
| Measures of dispersion: Range | Ratio, interval |
| Measures of dispersion: Standard deviation | Ratio, interval |
| Measures of normality: Skewness | Ratio, interval |
| Measures of normality: Kurtosis | Ratio, interval |

Box 11.5 Checklist for conducting quantitative data analysis
1 Are you confident that your research design was sound?
2 Have you been systematic and rigorous in the collection of your data?
3 Is your identification of variables adequate?
4 Are your measurements of the variables reliable?
5 Is the analysis suitable for the measurement scale (nominal, ordinal, interval or ratio)?

References

BIS (2012) Statistical Release, URN12/92, 17 October. [Online]. Available at: http://www.bis.gov.uk/analysis/statistics/business-population-estimates (Accessed 20 February 2013).
Collis, J. (2003) Directors' Views on Exemption from Statutory Audit, URN 03/1342, October, London: DTI. [Online]. Available at: http://www.berr.gov.uk/files/file25971.pdf (Accessed 20 February 2013).
Field, A. (2000) Discovering Statistics Using SPSS for Windows. London: SAGE.
Jensen, M. C. and Meckling, W. H. (1976) 'Theory of the firm: Managerial behavior, agency costs and the ownership structure', Journal of Financial Economics, 3, pp. 305–60.
Kervin, J. B. (1992) Methods for Business Research. New York: HarperCollins.
Lovie, P. (1986) 'Identifying Outliers', in Lovie, A. D. (ed.) New Developments in Statistics for Psychology and the Social Sciences 1. London: Methuen.
Moore, D., McCabe, G. P., Duckworth, W. M. and Alwan, L. C. (2009) The Practice of Business Statistics, 2nd edn. New York: W.H. Freeman and Company.
Upton, G. and Cook, I. (2006) Oxford Dictionary of Statistics, 2nd edn. Oxford: Oxford University Press.

This chapter is entirely activity-based. If you have access to SPSS, start at the beginning of the chapter and work your way through. If SPSS is not available, do the same activities using an alternative software package following the on-screen tutorials and help facilities.

Visit the companion website to try the progress test and for access to the data file referred to in this chapter at www.palgrave.com/business/collis/br4/

Have a look at the chapter and sections 14.2, 14.5, 14.7, 14.10, 14.12, 14.13 in particular, which relate specifically to this chapter.