CRICOS Provider Code: 00113B
Simple Linear Regression
Analysis
LEARNING OBJECTIVES
Upon completing this session, you should be able to do the
following:
• Calculate and interpret the correlation between two variables.
• Recognize regression analysis applications for purposes of
description and prediction.
• Calculate the simple linear regression equation for a set of data and
know the basic assumptions behind regression analysis.
• Determine whether a regression model is significant.
• Calculate and interpret confidence intervals for the regression analysis.
• Recognize some potential problems if regression analysis
is used incorrectly.
PURPOSE OF REGRESSION AND
CORRELATION
Explanation (Description)
• Regression helps to explain or understand the variation in a
(dependent) variable.
• We do this by finding other (independent) variables that are
related to the dependent variable.
We wish to know:
• The direction of that relationship
• The strength of that relationship
Prediction
• We can make use of the explanatory (independent) variables to
help predict the likely outcome of the dependent variable.
• For example, knowing the number of customers a fast food
restaurant has… may enable management to forecast sales.
PURPOSE OF REGRESSION AND
CORRELATION
Control
• In a situation where we have some control over the value
of the independent variable, this in turn enables some
form of control over the dependent variable.
• For example, by varying advertising expenditure up or
down, we may be able to control the movement in sales
to a certain extent.
CAUSATION V. CORRELATION
• Correlation does NOT imply cause and effect. Just because
two variables are correlated, it does not mean one causes the
other.
Be wary of Causality v. Correlation
CONCEPTS IN REGRESSION AND
CORRELATION
1. Scatter diagram:
Graphical representation of the possible relationship between
two variables.
2. Correlation:
Measures the strength and direction of a linear relationship
between two variables.
3. Regression:
Gives the mathematical model of the relationship.
SIMPLE VS MULTIPLE
LINEAR REGRESSION
• Simple Linear regression:
The model involves only one independent variable.
• Multiple regression:
Involves the use of more than one independent variable
to help explain the variation in the dependent variable
(covered next week).
SCATTER DIAGRAMS
Scatter Plot
A two-dimensional plot showing the values for the joint
occurrence of two quantitative variables. The scatter plot may
be used to graphically represent the relationship between two
variables. It is also known as a scatter diagram.
• The vertical (y) axis always contains the dependent
variable.
• Look For
─ No relationships
─ Linear relationships
─ Non-linear relationships
TWO-VARIABLE (BI-VARIATE)
RELATIONSHIPS
CORRELATION COEFFICIENT
Correlation Coefficient r
A quantitative measure of the strength of the linear
relationship between two variables. The correlation ranges
from -1.0 to +1.0. A correlation of ±1.0 indicates a perfect
linear relationship, whereas a correlation of 0 indicates no
linear relationship.
The sign of r provides the direction of the relationship.
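For reference, the usual formula for the sample correlation coefficient is sketched below. The symbols x_i, y_i (individual observations) and x̄, ȳ (sample means) are notation assumed here, not taken from the slide.

```latex
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum (x_i - \bar{x})^2 \; \sum (y_i - \bar{y})^2}}
```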
CORRELATION BETWEEN TWO
VARIABLES
CAUTION: NON-LINEAR RELATIONSHIPS
• Before interpreting r, a scatter plot must always be
drawn.
• In the following, r would be a poor indicator of the actual
strength of each relationship.
[Figure: scatter plots of non-linear relationships]
EXAMPLE:
SALES V. YEARS OF EXPERIENCE
• BLITZ is studying the relationship between sales (on
which commissions are paid) and the number of years a
salesperson is with the company. A random sample of 12 sales
representatives is collected.
Sales ($,000)      487 445 272 641 187 440 346 238 312 269 655 563
Years with BLITZ     3   5   2   8   2   6   7   1   4   2   9   6
SALES V. YEARS OF EXPERIENCE
ARE SALES AND YEARS OF EXP RELATED?
r = +0.832
There appears to be a strong positive linear relationship between
years of experience and sales.
[Figure: scatter plot of Sales ($,000) against Years of Experience]
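The correlation for the BLITZ sample can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the names `years`, `sales` and `pearson_r` are illustrative and not part of the course material.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(years, sales)
print(f"r = {r:+.4f}")
```

The result is positive and close to 1, in line with the strong positive linear relationship described above.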
REGRESSION: LINE OF BEST FIT
• Is there one line or curve which is closest to the set of data?
• The “method of least squares” provides us with the line of best fit
through the points on a scatter diagram.
[Figure: Sales ($,000) against Years of Experience]
THE REGRESSION MODEL
The estimated simple linear regression equation is given by:
$\hat{y} = b_0 + b_1 x$
Where:
$\hat{y}$ is the estimated value of the dependent variable
x is the independent variable
b0 is the Y-intercept (i.e. where the line cuts the
vertical axis)
b1 is the slope of the line
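For reference, the least-squares values of the slope and intercept are usually written as below. The symbols x̄ and ȳ (sample means) and the index i are notation assumed here.

```latex
b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```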
MEANING OF REGRESSION COEFFICIENTS
The regression coefficients "b0" and "b1" can be interpreted
in three ways.
• Geometrically (i.e. graphically)
• Algebraically (i.e. in equation form)
• Practically (i.e. practical interpretation)
We explain using the previous example:
$\hat{y} = 175.83 + 49.91x$
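The fitted equation for the BLITZ example can be reproduced from the raw data. The sketch below applies the standard least-squares formulas; `fit_line` is a hypothetical helper name used only for illustration.

```python
# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx               # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1


b0, b1 = fit_line(years, sales)
print(f"y-hat = {b0:.2f} + {b1:.2f}x")  # prints: y-hat = 175.83 + 49.91x
```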
INTERPRETING INTERCEPT COEFFICIENT
Geometrically:
On the graph, b0 is where the line cuts the vertical axis.
Our example: The line cuts the Y axis at 175.83.
Algebraically:
b0 is the value of Y when X = 0.
Our example: Y = 175.83 when X = 0 years of experience.
Practically:
b0 will not always have a useful interpretation, as X = 0 may be well
outside the range of X values used for the regression equation.
Sometimes it is useful.
Our example: The average sales for an agent with no years of
experience is 175.83 ($,000).
INTERPRETING REGRESSION COEFFICIENT
Geometrically:
On the graph, b1 is the slope of the regression line.
Our example: The slope is 49.91.
Algebraically:
b1 is the change in the value of Y when X changes by one unit.
Our example: If X increases by 1, Y increases by 49.91.
Practically:
b1 indicates the impact on Y from a change in X.
Our example: For each year of experience gained by an agent, the
amount of sales increases by an average of 49.91 ($,000).
HOW WELL DOES THE LINE FIT THE DATA?
In all practical situations the regression line does not fit
the data perfectly.
There will be small variations (errors) between the line $\hat{y}$ and the
actual points $y_i$:
$y_i - \hat{y}_i$
These variations are called residuals (error terms).
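As an illustration, the residuals for the BLITZ sample can be listed as follows. This is a sketch only; the variable names are assumptions. A useful property of a least-squares line with an intercept is that its residuals sum to zero (up to rounding).

```python
# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
s_xx = sum((x - mean_x) ** 2 for x in years)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual = actual value minus the value on the fitted line.
fitted = [b0 + b1 * x for x in years]
residuals = [y - y_hat for y, y_hat in zip(sales, fitted)]

for x, y, e in zip(years, sales, residuals):
    print(f"years={x}  sales={y}  residual={e:+.2f}")
print(f"sum of residuals = {abs(sum(residuals)):.6f}")
```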
RESIDUALS
[Figure: Sales ($,000) against Years of Experience, showing the residuals]
GOODNESS OF FIT
• We need to obtain measures of these residuals and hence
how well the line fits the data.
• To measure the variation around the line, we use the
Standard Error of the Estimate, $s_{yx}$.
• For how well the line fits the data we use the Coefficient
of Determination, $R^2$.
STANDARD ERROR
[Figure: two scatter plots, one with a large standard error and one with a small standard error]
STANDARD ERROR OF THE ESTIMATE
Example:
Sales vs. Years of Experience
$s_{yx} = 92.10$
• Interpretation:
“We estimate that the average variation of the individual sales
figures around the regression line is 92.10 ($,000).”
• As a rough approximation using the empirical rule, we could also
say that almost all deviations from the line will fall within:
± (3 × 92.10) or ± 276.3 ($,000)
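The standard error of the estimate can be computed from the residuals as the square root of SSE / (n - 2). The sketch below assumes that standard definition; `sse` and `s_yx` are illustrative names.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
s_xx = sum((x - mean_x) ** 2 for x in years)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Sum of squared residuals around the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, sales))

# Two coefficients were estimated, so n - 2 degrees of freedom remain.
s_yx = math.sqrt(sse / (n - 2))
print(f"s_yx = {s_yx:.3f} ($,000)")
```

The value comes out at about 92.1, matching the figure quoted for the example.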
COEFFICIENT OF DETERMINATION
• The portion of the total variation in the dependent
variable that is explained by its relationship with the
independent variable.
• Normally expressed as a percentage.
• It provides an absolute measure of the strength of the
relationship.
NOTE: Coefficient of Determination for the Single
Independent Variable Case
$R^2 = r^2$
COEFFICIENT OF DETERMINATION
Example:
Sales vs. Years of Experience
$R^2 = 0.693$ or 69.3%
• Interpretation:
“Approximately 69% of the variation in sales is explained by or
attributed to variation in the years of experience.
The remaining 31% of variation would be the result of other factors
not included in the model, e.g. negotiation skills, education, gender
etc.”
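A short check of the coefficient of determination, computed as 1 - SSE/SST, and of the identity R² = r² for a single independent variable. This is a sketch; the names `sst`, `sse` and `r_squared` are assumptions.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
s_xx = sum((x - mean_x) ** 2 for x in years)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

sst = sum((y - mean_y) ** 2 for y in sales)                         # total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, sales))   # unexplained variation
r_squared = 1 - sse / sst

r = s_xy / math.sqrt(s_xx * sst)  # sample correlation coefficient
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")  # prints: R^2 = 0.693, r^2 = 0.693
```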
LINEAR REGRESSION ASSUMPTIONS
• Linearity
The underlying relationship between X and Y is linear.
• Normality of error
The error ε is normally distributed.
• Homoscedasticity (Constant Variance)
The variance of ε is the same for all values of the independent
variable.
• Independence of error terms
The values of ε are independent.
CONSEQUENCES OF VIOLATING
ASSUMPTIONS
• Non-normality
Error not normally distributed.
• Heteroscedasticity
Variance not constant.
Usually happens in cross-sectional data.
• Autocorrelation
Errors are not independent.
Usually happens in time-series data.
• Consequences of Any Violation of the Assumptions
Predictions and estimations obtained from the sample
regression line will not be accurate.
Hypothesis testing results will not be reliable.
• It is Important to Verify the Assumptions
RESIDUAL ANALYSIS
• Purposes
─ Examine linearity
─ Evaluate violations of assumptions
• Graphical Analysis of Residuals
─ Plot residuals vs. X and time
CHECKING FOR LINEARITY
[Figure: residual plots showing a nonlinear pattern and a linear pattern]
CONSTANT AND NON-CONSTANT
VARIANCES
[Figure: residual plots showing constant variance and non-constant variance]
ARE THE RESIDUALS INDEPENDENT?
[Figure: residual plots showing independent residuals and residuals that are not independent]
NORMALITY ASSUMPTION
[Figure: histogram of standardized residuals and normal probability plot of residuals]
CORRECTIVE ACTIONS
• Approaches that may work if the model is not appropriate:
─ Transforming some of the independent variables
o Raising x to a power
o Taking the square root of x
o Taking the log of x
o If the normality assumption is not met, transforming the dependent
variable (y) may help.
─ Remove some variables from the model
(only when performing multiple regression)
STANDARDISED RESIDUALS
• Standardized Residual for Observation i
$\dfrac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}}$
• Standardized residuals can also be used to detect bivariate outliers as well as to examine the assumptions of
regression.
• If the error term is normally distributed, about 95% of the
standardized residuals will be between -2 and +2.
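The sketch below computes standardized residuals for the BLITZ sample and flags any outside ±2. It assumes the common textbook definition in which the standard deviation of residual i is s_yx × sqrt(1 - h_i), with leverage h_i = 1/n + (x_i - x̄)² / Σ(x - x̄)². That definition and all variable names are assumptions; software packages may use slightly different variants.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
s_xx = sum((x - mean_x) ** 2 for x in years)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

residuals = [y - (b0 + b1 * x) for x, y in zip(years, sales)]
s_yx = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each observation (assumed definition, see the note above).
leverage = [1 / n + (x - mean_x) ** 2 / s_xx for x in years]
std_residuals = [e / (s_yx * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Indices of observations whose standardized residual lies outside +/- 2.
flagged = [i for i, z in enumerate(std_residuals) if abs(z) > 2]
for i in flagged:
    print(f"years={years[i]}, sales={sales[i]}, standardized residual={std_residuals[i]:.2f}")
```

Under these assumptions one observation is flagged: the agent with 7 years of experience and sales of 346, whose sales are well below the fitted line.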
USING THE REGRESSION EQUATION
• The regression equation coefficients (b0 and b1) define
the nature of the relationship between the variables.
• The regression equation is also used for estimation or
prediction.
• Example:
For an agent with 5 years of work experience, the
estimated sales figure would be:
$\hat{y} = 175.83 + 49.91 \times 5 = 425.38$ ($,000)
This is a point estimate.
PREDICTIONS VS. EXTRAPOLATION
• Prediction is when we use the regression model with a
value of X contained in the range of X values from the
sample.
• Extrapolation is when we use the model with a value of X
outside the range of original X values.
• Extrapolation should be used with caution as there is no
guarantee that the same model holds outside the original
range of data.
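One way to keep the distinction in view is to have the prediction function report whether the requested X lies outside the sample range. This is a sketch; `predict_sales` is a hypothetical helper, the coefficients are the rounded values from the fitted equation, and the range 1 to 9 is the minimum and maximum years of experience in the BLITZ sample.

```python
B0, B1 = 175.83, 49.91   # fitted intercept and slope (rounded, from the example)
X_MIN, X_MAX = 1, 9      # range of years of experience observed in the sample


def predict_sales(years_experience):
    """Return (point estimate in $,000, True if the estimate is an extrapolation)."""
    y_hat = B0 + B1 * years_experience
    is_extrapolation = not (X_MIN <= years_experience <= X_MAX)
    return y_hat, is_extrapolation


for x in (5, 15):
    y_hat, extrapolated = predict_sales(x)
    note = "extrapolation, use with caution" if extrapolated else "within sample range"
    print(f"{x} years: {y_hat:.2f} ($,000) [{note}]")
```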
INFERENCES IN CORRELATION AND
REGRESSION
• From sample data we obtain a sample regression line and
associated results.
• This provides an estimate of the population regression
line and other parameters.
• We use hypothesis tests and confidence intervals to
make inferences about these population parameters.
INFERENCES IN CORRELATION AND
REGRESSION
• The sample correlation coefficient is r. The corresponding
population value is ρ (“rho”).
• The estimated simple linear regression equation based on
sample data is given by:
$\hat{y} = b_0 + b_1 x$
• The corresponding equation for the population is given
by:
$y = \beta_0 + \beta_1 x$
INFERENCES IN CORRELATION AND
REGRESSION
• As we did in topics two and three for μ and π, we can use
confidence intervals and hypothesis tests to estimate/test
the corresponding population parameters.
INFERENCES ABOUT THE SLOPE
T-TEST
• We want to determine whether the population
slope parameter β1 is different from zero.
• That is, whether a linear relationship exists between Y (Sales) and
X (Years of Experience) in the population.
H0: β1 = 0 (NO linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
INFERENCES ABOUT THE SLOPE
T-TEST
                      Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept                   175.82           54.98    3.19    0.009      53.30     298.35
Years of Experience         49.910           10.50    4.75    0.000      26.50      73.31
This is a two-tail test; the p-value is 0.000 (to three decimal places).
Decision:
P-value < α (0.05), so reject H0.
Conclusion:
There is sufficient evidence that years of experience predicts the
amount of sales ($,000), i.e. a linear relationship exists, as β1 ≠ 0.
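The slope row of the output can be reproduced by hand. The sketch below computes the standard error of the slope as s_yx / sqrt(Σ(x - x̄)²) and the t statistic as b1 divided by that standard error. The critical value 2.228 (two-tail, α = 0.05, 10 df) is taken from tables rather than computed, and the variable names are assumptions.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
T_CRIT = 2.228  # two-tail critical t value for alpha = 0.05 with n - 2 = 10 df

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
s_xx = sum((x - mean_x) ** 2 for x in years)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, sales))
s_yx = math.sqrt(sse / (n - 2))

se_b1 = s_yx / math.sqrt(s_xx)    # standard error of the slope
t_stat = b1 / se_b1               # test statistic for H0: beta1 = 0
reject_h0 = abs(t_stat) > T_CRIT  # two-tail decision rule

# 95% confidence interval for the population slope.
ci_low = b1 - T_CRIT * se_b1
ci_high = b1 + T_CRIT * se_b1

print(f"se(b1) = {se_b1:.2f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"95% CI for slope: {ci_low:.2f} to {ci_high:.2f}")
```

Comparing the t statistic with the critical value leads to the same decision as comparing the p-value with α.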
USES FOR REGRESSION ANALYSIS
• Earlier we calculated point estimates
• We can improve on the point estimate by calculating
intervals
─ Confidence intervals for an Average value of Y, given X
─ Prediction interval for a Particular value of Y, given X
Question: Which of the interval estimates would you expect
to be wider?
CONFIDENCE INTERVAL FOR E(Y) | Xp
$\hat{y} \pm t \, s_e \sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}}$
Where:
$\hat{y}$ = Point estimate of the DV
t = Critical value with n-2 df
n = Sample size
$x_p$ = Specific value of the IV
$\bar{x}$ = Mean of the IV observations in the sample
$s_e$ = Estimate of the standard error of the estimate
A rough approximation to the confidence interval estimate can be
obtained by:
$\hat{y} \pm t \times \dfrac{s_e}{\sqrt{n}}$
CONFIDENCE INTERVAL FOR E (Y) | XP
EXAMPLE
• Calculate the 95% confidence interval for the average, or
expected value E(Y), of the amount of sales for all agents
with 5 years of work experience:
$s_e = s_{yx} = 92.10$ ($,000)
n = 12
t (df = 10) = 2.228
Point estimate → $\hat{y} = 175.83 + 49.91 \times 5 = 425.38$
Using the rough approximation:
$\hat{y} \pm t \times \dfrac{s_e}{\sqrt{n}} = 425.38 \pm 2.228 \times \dfrac{92.10}{\sqrt{12}}$
or 366.14 to 484.62 ($,000)
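The full confidence interval formula, including the (x_p - x̄)² term, can be evaluated directly from the sample. The sketch below does this for x_p = 5, with the standard error of the estimate computed from the data (about 92.1) and the critical value 2.228 taken from tables. `mean_response_interval` is a hypothetical helper name.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
T_CRIT = 2.228  # two-tail critical t value for 95% confidence with 10 df


def mean_response_interval(x, y, x_p, t_crit):
    """Confidence interval for E(Y) at x_p from a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))          # standard error of the estimate
    y_hat = b0 + b1 * x_p                   # point estimate
    half_width = t_crit * s_e * math.sqrt(1 / n + (x_p - mean_x) ** 2 / s_xx)
    return y_hat - half_width, y_hat + half_width


ci_low, ci_high = mean_response_interval(years, sales, 5, T_CRIT)
print(f"95% CI for E(Y) at 5 years: {ci_low:.2f} to {ci_high:.2f} ($,000)")
```

Because 5 years is close to the sample mean of X, the exact interval is only slightly wider than the rough approximation.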
PREDICTION INTERVAL FOR A
PARTICULAR Y | Xp
$\hat{y} \pm t \, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}}$
Where:
$\hat{y}$ = Point estimate of the DV
t = Critical value with n-2 df
n = Sample size
$x_p$ = Specific value of the IV
$\bar{x}$ = Mean of the IV observations in the sample
$s_e$ = Estimate of the standard error of the estimate
A rough approximation to the prediction interval estimate can be
obtained by:
$\hat{y} \pm t \times s_e$
PREDICTION INTERVAL FOR A PARTICULAR Y | XP
EXAMPLE
• Calculate the 95% prediction interval for the amount of
sales made by a particular agent with 5 years of work
experience:
$s_e = s_{yx} = 92.10$ ($,000)
n = 12
t (df = 10) = 2.228
Point estimate → $\hat{y} = 175.83 + 49.91 \times 5 = 425.38$
Using the rough approximation:
$\hat{y} \pm t \times s_e = 425.38 \pm 2.228 \times 92.10$
or 220.18 to 630.58 ($,000)
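The sketch below evaluates the full prediction interval formula at x_p = 5 and compares its width with the confidence interval for the mean at the same point. The critical value 2.228 is taken from tables, and `intervals_at` is a hypothetical helper name.

```python
import math

# BLITZ sample: years with the company (x) and sales in $,000 (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
T_CRIT = 2.228  # two-tail critical t value for 95% with 10 df


def intervals_at(x, y, x_p, t_crit):
    """Return (confidence interval for E(Y), prediction interval for Y) at x_p."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_p
    core = 1 / n + (x_p - mean_x) ** 2 / s_xx
    ci_half = t_crit * s_e * math.sqrt(core)      # mean response
    pi_half = t_crit * s_e * math.sqrt(1 + core)  # individual response
    return (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)


ci, pi = intervals_at(years, sales, 5, T_CRIT)
print(f"95% CI for E(Y): {ci[0]:.2f} to {ci[1]:.2f} ($,000)")
print(f"95% PI for Y:    {pi[0]:.2f} to {pi[1]:.2f} ($,000)")
```

The extra 1 under the square root accounts for the variation of an individual agent around the mean, so the prediction interval is always wider than the confidence interval at the same X.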
CONFIDENCE AND PREDICTION
INTERVALS
[Figure: confidence interval and prediction interval]
PITFALLS OF REGRESSION ANALYSIS
• Causation vs. Correlation
• Extrapolation
• Lacking an awareness of the assumptions underlying
least-squares regression
• Not knowing how to evaluate the assumptions
• Not knowing what the alternatives to least-squares
regression are if a particular assumption is violated
• Using a regression model without knowledge of the
subject matter
STRATEGY FOR AVOIDING THE
PITFALLS
• Start with a scatter plot to observe the possible relationship
between X and Y
• Perform residual analysis to check the assumptions
• If there is a violation of any assumption, use alternative
methods or take corrective actions
• If there is no evidence of assumption violation, then test
for the significance of the regression coefficients and
construct confidence intervals and prediction intervals
• NOTE:
Confidence and prediction intervals may simply be too
wide for the model to be used in many situations
QUESTIONS?