CRICOS Provider Code: 00113B

Simple Linear Regression Analysis

LEARNING OBJECTIVES
Upon completing this session, you should be able to do the following:
• Calculate and interpret the correlation between two variables.
• Recognize regression analysis applications for purposes of description and prediction.
• Calculate the simple linear regression equation for a set of data and know the basic assumptions behind regression analysis.
• Determine whether a regression model is significant.
• Calculate and interpret confidence intervals for the regression analysis.
• Recognize some potential problems if regression analysis is used incorrectly.

PURPOSE OF REGRESSION AND CORRELATION
Explanation (Description)
• Regression helps to explain or understand the variation in a (dependent) variable.
• We do this by finding other (independent) variables that are related to the dependent variable. We wish to know:
  ─ The direction of that relationship
  ─ The strength of that relationship
Prediction
• We can make use of the explanatory (independent) variables to help predict the likely outcome of the dependent variable.
• For example, knowing the number of customers a fast food restaurant has… may enable management to forecast sales.
Control
• In a situation where we have some control over the value of the independent variable, this in turn gives us some control over the dependent variable.
• For example, by varying advertising expenditure up or down, we may be able to control, to a certain extent, the movement in sales.

CAUSATION V. CORRELATION
• Correlation does NOT imply cause and effect. Just because two variables are correlated does not mean that one causes the other.
• Be wary of causality v. correlation.

CONCEPTS IN REGRESSION AND CORRELATION
1. Scatter diagram: Graphical representation of the possible relationship between two variables.
2. Correlation: Measures the strength and direction of a linear relationship between two variables.
3. Regression: Gives the mathematical model of the relationship.

SIMPLE VS MULTIPLE LINEAR REGRESSION
• Simple linear regression: The model involves only one independent variable.
• Multiple regression: Involves the use of more than one independent variable to help explain the variation in the dependent variable (covered next week).

SCATTER DIAGRAMS
Scatter Plot: A two-dimensional plot showing the values for the joint occurrence of two quantitative variables. The scatter plot may be used to graphically represent the relationship between two variables. It is also known as a scatter diagram.
• By convention, the vertical (y) axis shows the dependent variable.
• Look for:
  ─ No relationships
  ─ Linear relationships
  ─ Non-linear relationships

TWO-VARIABLE (BI-VARIATE) RELATIONSHIPS
[Figure]

CORRELATION COEFFICIENT
Correlation Coefficient r: A quantitative measure of the strength of the linear relationship between two variables. The correlation ranges from -1.0 to +1.0. A correlation of ±1.0 indicates a perfect linear relationship, whereas a correlation of 0 indicates no linear relationship. The sign of r gives the direction of the relationship.

CORRELATION BETWEEN TWO VARIABLES
[Figure]

CAUTION: NON-LINEAR RELATIONSHIPS
• Before interpreting r, a scatter plot must always be drawn.
• In the following, r would be a poor indicator of the actual strength of each relationship.
[Figure]

EXAMPLE: SALES V. YEARS OF EXPERIENCE
• BLITZ is studying the relationship between sales (on which commissions are paid) and the number of years a salesperson has been with the company. A random sample of 12 sales representatives is collected.

Sales ($,000)      487  445  272  641  187  440  346  238  312  269  655  563
Years with BLITZ     3    5    2    8    2    6    7    1    4    2    9    6

SALES V. YEARS OF EXPERIENCE: ARE SALES AND YEARS OF EXPERIENCE RELATED?
[Figure: scatter plot of Sales ($,000) against Years of Experience]
r = +0.832
There appears to be a strong positive linear relationship between years of experience and sales.

REGRESSION: LINE OF BEST FIT
• Is there one line or curve that is closest to the set of data?
• The "method of least squares" provides the line of best fit through the points on a scatter diagram.
[Figure: Sales ($,000) against Years of Experience]

THE REGRESSION MODEL
The estimated simple linear regression equation is given by:
  \( \hat{y} = b_0 + b_1 x \)
Where:
  \( \hat{y} \) is the estimated value of the dependent variable
  \( x \) is the independent variable
  \( b_0 \) is the Y-intercept (i.e. where the line cuts the vertical axis)
  \( b_1 \) is the slope of the line

MEANING OF REGRESSION COEFFICIENTS
The regression coefficients "b0" and "b1" can be interpreted in three ways:
• Geometrically (i.e. graphically)
• Algebraically (i.e. in equation form)
• Practically (i.e. practical interpretation)
We explain using the previous example:
  \( \hat{y} = 175.83 + 49.91x \)

INTERPRETING INTERCEPT COEFFICIENT
Geometrically: On the graph, b0 is where the line cuts the vertical axis. Our example: The line cuts the Y axis at 175.83.
Algebraically: b0 is the value of Y when X = 0. Our example: Y = 175.83 when X = 0 years of experience.
Practically: b0 will not always have a useful interpretation, as X = 0 may be well outside the range of X values used for the regression equation. Sometimes it is useful. Our example: The estimated average sales for an agent with no years of experience is 175.83 ($,000).

INTERPRETING REGRESSION COEFFICIENT
Geometrically: On the graph, b1 is the slope of the regression line. Our example: The slope is 49.91.
Algebraically: b1 is the change in the value of Y when X changes by one unit. Our example: If X increases by 1, Y increases by 49.91.
Practically: b1 indicates the impact on Y of a change in X. Our example: For each year of experience gained by an agent, the amount of sales increases by an average of 49.91 ($,000).

HOW WELL DOES THE LINE FIT THE DATA?
In practice, the regression line does not fit the data perfectly. There will be variations (errors) between the fitted values on the line and the actual points:
  \( y_i - \hat{y}_i \)
These variations are called residuals (error terms).

RESIDUALS
[Figure: Sales ($,000) against Years of Experience]

GOODNESS OF FIT
• We need measures of these residuals, and hence of how well the line fits the data.
• To measure the variation around the line, we use the Standard Error of the Estimate, s_yx.
• To measure how well the line fits the data, we use the Coefficient of Determination, R².

STANDARD ERROR
[Figure: two panels, "Large Standard Error" and "Small Standard Error"]

STANDARD ERROR OF THE ESTIMATE
Example: Sales vs. Years of Experience
s_yx = 92.10
• Interpretation: "We estimate that individual sales figures typically vary around the regression line by about 92.10 ($,000)."
• As a rough approximation using the empirical rule, we could also say that nearly all deviations from the line will fall within ± (3 × 92.10), or ± 276.3 ($,000).

COEFFICIENT OF DETERMINATION
• The proportion of the total variation in the dependent variable that is explained by its relationship with the independent variable.
• Normally expressed as a percentage.
• It provides an absolute measure of the strength of the relationship.
NOTE: For the single independent variable case, the coefficient of determination is R² = r².

COEFFICIENT OF DETERMINATION
Example: Sales vs. Years of Experience
R² = 0.693 or 69.3%
• Interpretation: "Approximately 69% of the variation in sales is explained by, or attributed to, variation in the years of experience. The remaining 31% of the variation would be the result of other factors not included in the model, e.g. negotiation skills, education, gender, etc."

LINEAR REGRESSION ASSUMPTIONS
• Linearity: The underlying relationship between X and Y is linear.
• Normality of errors: The error ε is normally distributed.
• Homoscedasticity (Constant Variance): The variance of ε is the same for all values of the independent variable.
• Independence of error terms: The values of ε are independent.

LINEAR REGRESSION ASSUMPTIONS
[Figure]

CONSEQUENCES OF VIOLATING ASSUMPTIONS
• Non-normality: Errors are not normally distributed.
• Heteroscedasticity: Variance is not constant. Usually happens in cross-sectional data.
• Autocorrelation: Errors are not independent.
Usually happens in time-series data.
• Consequences of any violation of the assumptions:
  ─ Predictions and estimates obtained from the sample regression line may not be accurate.
  ─ Hypothesis testing results will not be reliable.
• It is important to verify the assumptions.

RESIDUAL ANALYSIS
• Purposes
  ─ Examine linearity
  ─ Evaluate violations of assumptions
• Graphical Analysis of Residuals
  ─ Plot residuals vs. X and time

CHECKING FOR LINEARITY
[Figure: two panels, "Nonlinear Pattern" and "Linear Pattern"]

CONSTANT AND NON-CONSTANT VARIANCES
[Figure: two panels, "Constant Variances" and "Non-constant Variances"]

ARE THE RESIDUALS INDEPENDENT?
[Figure: two panels, "Independent Residuals" and "Residuals NOT Independent"]

NORMALITY ASSUMPTION
[Figure: two panels, "Histogram of Standardized Residuals" and "Normal Probability Plot of Residuals"]

CORRECTIVE ACTIONS
• Approaches that may work if the model is not appropriate:
  ─ Transforming the independent variable (x):
    o Raising x to a power
    o Taking the square root of x
    o Taking the log of x
  ─ If the normality assumption is not met, transforming the dependent variable (y) may help.
  ─ Removing some variables from the model (only when performing multiple regression)

STANDARDISED RESIDUALS
• Standardized Residual for Observation i:
  \( \dfrac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}} \)
  where the denominator is the standard deviation of residual i.
• Standardized residuals can also be used to detect bivariate outliers, as well as to examine the assumptions of regression.
• If the error term is normally distributed, approximately 95% of the standardized residuals will lie between -2 and +2.

USING THE REGRESSION EQUATION
• The regression equation coefficients (b0 and b1) define the nature of the relationship between the variables.
• The regression equation is also used for estimation or prediction.
• Example: For an agent with 5 years of work experience, the estimated sales figure would be:
  \( \hat{y} = 175.83 + 49.91 \times 5 = 425.38 \) ($,000)
  This is a point estimate.

PREDICTIONS VS. EXTRAPOLATION
• Prediction is when we use the regression model with a value of X inside the range of X values in the sample.
• Extrapolation is when we use the model with a value of X outside the range of the original X values.
• Extrapolation should be used with caution, as there is no guarantee that the same model holds outside the original range of data.

INFERENCES IN CORRELATION AND REGRESSION
• From sample data we obtain a sample regression line and associated results.
• This provides an estimate of the population regression line and other parameters.
• We use hypothesis tests and confidence intervals to make inferences about these population parameters.

INFERENCES IN CORRELATION AND REGRESSION
• The sample correlation coefficient is r. The corresponding population value is ρ ("rho").
• The estimated simple linear regression equation based on sample data is given by:
  \( \hat{y} = b_0 + b_1 x \)
• The corresponding equation for the population is given by:
  \( y = \beta_0 + \beta_1 x + \varepsilon \)
  where ε is the error term.

INFERENCES IN CORRELATION AND REGRESSION
• As we did in topics two and three for μ and π, we can use confidence intervals and hypothesis tests to estimate or test the corresponding population parameters.

INFERENCES ABOUT THE SLOPE: T-TEST
• We want to determine whether the population slope β1 is different from zero.
• That is, whether a linear relationship exists between Y (Sales) and X (Years of Experience) in the population.
H0: β1 = 0 (NO linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

INFERENCES ABOUT THE SLOPE: T-TEST

                     Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept                  175.83           54.98    3.19    0.009      53.30     298.35
Years of Experience         49.91           10.50    4.75    0.000      26.50      73.31

This is a two-tail test; the p-value for the slope is below 0.001 (shown as 0.000 in the output).
Decision: p-value < α (0.05), so reject H0.
Conclusion: There is sufficient evidence that years of experience predicts the amount of sales ($,000), i.e. a linear relationship exists (β1 ≠ 0).

USES FOR REGRESSION ANALYSIS
• Earlier we calculated point estimates.
• We can improve on the point estimate by calculating intervals:
  ─ Confidence interval for an average value of Y, given X
  ─ Prediction interval for a particular value of Y, given X
Question: Which of the interval estimates would you expect to be wider?

CONFIDENCE INTERVAL FOR E(Y) | X_P
  \( \hat{y} \pm t \, s_e \sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} \)
Where:
  \( \hat{y} \) = point estimate of the DV
  \( t \) = critical value with n - 2 df
  \( n \) = sample size
  \( x_p \) = specific value of the IV
  \( \bar{x} \) = mean of the IV observations in the sample
  \( s_e \) = standard error of the estimate (s_yx)
A rough approximation to the confidence interval estimate can be obtained by:
  \( \hat{y} \pm t \times \dfrac{s_e}{\sqrt{n}} \)

CONFIDENCE INTERVAL FOR E(Y) | X_P: EXAMPLE
• Calculate the 95% confidence interval for the average, or expected value E(Y), of the amount of sales for all agents with 5 years of work experience:
  s_e = 92.10 ($,000)
  n = 12
  t (df = 10) = 2.228
  Point estimate: \( \hat{y} = 175.83 + 49.91 \times 5 = 425.38 \) ($,000)
  \( \hat{y} \pm t \times \dfrac{s_e}{\sqrt{n}} = 425.38 \pm 2.228 \times \dfrac{92.10}{\sqrt{12}} = 425.38 \pm 59.24 \)
  OR 366.14 to 484.62 ($,000)

PREDICTION INTERVAL FOR A PARTICULAR Y | X_P
  \( \hat{y} \pm t \, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} \)
Where:
  \( \hat{y} \) = point estimate of the DV
  \( t \) = critical value with n - 2 df
  \( n \) = sample size
  \( x_p \) = specific value of the IV
  \( \bar{x} \) = mean of the IV observations in the sample
  \( s_e \) = standard error of the estimate (s_yx)
A rough approximation to the prediction interval estimate can be obtained by:
  \( \hat{y} \pm t \times s_e \)

PREDICTION INTERVAL FOR A PARTICULAR Y | X_P: EXAMPLE
• Calculate the 95% prediction interval for the amount of sales made by a particular agent with 5 years of work experience:
  s_e = 92.10 ($,000)
  n = 12
  t (df = 10) = 2.228
  Point estimate: \( \hat{y} = 175.83 + 49.91 \times 5 = 425.38 \) ($,000)
  \( \hat{y} \pm t \times s_e = 425.38 \pm 2.228 \times 92.10 = 425.38 \pm 205.20 \)
  OR 220.18 to 630.58 ($,000)

CONFIDENCE AND PREDICTION INTERVALS
[Figure: "Confidence Interval" and "Prediction Interval"]

PITFALLS OF REGRESSION ANALYSIS
• Causation vs. correlation
• Extrapolation
• Lacking an awareness of the assumptions underlying least-squares regression
• Not knowing how to evaluate the assumptions
• Not knowing what the alternatives to least-squares regression are if a particular assumption is violated
• Using a regression model without knowledge of the subject matter

STRATEGY FOR AVOIDING THE PITFALLS
• Start with a scatter plot to observe the possible relationship between X and Y.
• Perform residual analysis to check the assumptions.
• If any assumption is violated, use alternative methods or take corrective actions.
• If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals.
• NOTE: Confidence and prediction intervals may simply be too wide for the model to be useful in many situations.

QUESTIONS?
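
The BLITZ quantities discussed in these slides (r, b0, b1, s_yx, R², the slope t statistic, and the intervals at 5 years of experience) can be checked from the 12 sample observations. The following is a minimal sketch in plain Python, not the method used to produce the slides. The names fit_simple_regression and interval are illustrative assumptions, and the critical value t = 2.228 (df = 10) is taken from the slides rather than computed. The intervals use the full formulas rather than the rough approximations.

```python
from math import sqrt

# BLITZ sample from the slides: sales ($,000) and years with the company.
SALES = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
YEARS = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]


def fit_simple_regression(x, y):
    """Least-squares fit of y on x, returning the quantities used in the slides."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    r = sxy / sqrt(sxx * syy)           # correlation coefficient
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = sqrt(sse / (n - 2))          # standard error of the estimate
    se_b1 = s_yx / sqrt(sxx)            # standard error of the slope
    return {
        "n": n, "x_bar": x_bar, "sxx": sxx,
        "b0": b0, "b1": b1, "r": r, "r2": r ** 2,
        "s_yx": s_yx, "se_b1": se_b1, "t_slope": b1 / se_b1,
    }


def interval(fit, x_p, t_crit, particular=False):
    """Interval at x_p: for E(Y) by default, for a particular Y if particular=True."""
    y_hat = fit["b0"] + fit["b1"] * x_p
    extra = 1.0 if particular else 0.0  # the "1 +" term of the prediction interval
    half_width = t_crit * fit["s_yx"] * sqrt(
        extra + 1 / fit["n"] + (x_p - fit["x_bar"]) ** 2 / fit["sxx"]
    )
    return y_hat - half_width, y_hat + half_width


if __name__ == "__main__":
    fit = fit_simple_regression(YEARS, SALES)
    for name in ("b0", "b1", "r", "r2", "s_yx", "t_slope"):
        print(f"{name:8s} {fit[name]:.3f}")
    T_CRIT = 2.228  # assumption: 95% two-tail critical value for df = 10, from the slides
    print("CI for E(Y) at x = 5:", interval(fit, 5, T_CRIT))
    print("PI for Y at x = 5:   ", interval(fit, 5, T_CRIT, particular=True))
```

Because the prediction interval adds 1 under the square root, it is always wider than the confidence interval at the same value of X.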