Part D: Regression / Regression with categorical variables / Logistic regression

Question 1
A law firm has been asked to investigate whether employees' experience and quality of work were taken into consideration when salary increases were awarded by the supervisor. A random sample of 17 employees was drawn from the company files, and three variables were recorded:
• Salary increase: the employee's salary increase (in 100 AED)
• Quality of work: employee performance rating (scale 1-100)
• Experience: the number of years since the employee was hired
Based on the data collected, a multiple regression of salary increase on the two predictors, quality of work and experience, was conducted. The regression results are provided below:

a) Write the multiple regression equation. [2 marks]
Salary increase^ = -117.85 + 4.4352 Quality of work + 24.4403 Experience
p-values: Quality of work (0.004), Experience (0.094)

b) Interpret the meaning of the coefficient of determination.
R2 = 0.4665: 46.65% of the variation in salary increase among the employees in the sample is explained by the two independent variables (quality of work and experience).

c) Interpret the two partial slope coefficients.
- As quality of work increases by one scale point, the salary increase rises on average by 4.4352 units (4.4352 × 100 = 443.52 AED), holding experience constant.
- As experience increases by one year, the salary increase rises on average by 24.4403 units (24.4403 × 100 = 2,444.03 AED), holding all other variables fixed (in this case, quality of work).

d) Write down another potential variable that could be included in the analysis.
Education level, position, total years of experience.

e) Predict the salary increase of a person with quality of work = 60 and 6 years of experience.
Predicted salary increase^ = -117.85 + 4.4352(60) + 24.4403(6) ≈ 294.9 units, i.e. about 294.9 × 100 = 29,490 AED.

f) Determine which regression coefficients are significant at the 5% level of significance.
Quality of work is a significant variable at α = 0.05 (p-value 0.004 < 0.05).
Experience is not a significant variable at α = 0.05 (p-value 0.094 > 0.05).

g) Interpret the lower 95% and upper 95% bounds corresponding to quality of work.
The 95% confidence interval is (1.713, 7.157): we are 95% confident that the true slope coefficient of quality of work lies within this interval. Because the interval does not contain 0, this agrees with quality of work being significant at the 5% level.

Question 2
What are the assumptions of regression?
See notes.

What is the role of the disturbance term in regression analysis?
The error term captures the effects of all omitted variables. It absorbs any non-linearity left out of the model (e.g. omitted quadratic terms), all of the variation in Y that cannot be explained by the included Xs, and measurement errors.

Distinguish between the types of error that can be made in a hypothesis-test decision.
A Type I error is rejecting the null hypothesis when it is true; its probability is alpha. A Type II error is failing to reject the null hypothesis when it is false; its probability is beta.

Define what is meant by perfect multicollinearity. What are the consequences?
Perfect multicollinearity occurs when there is an exact linear relationship between the explanatory variables, such as 2x1 - 3.5x2 - 5x3 = 12.
Consequences: the regression cannot be estimated (it will not run); SPSS may remove some of the highly correlated variables.

Define what is meant by severe multicollinearity. What are the consequences?
A very close (but not exact) linear relationship between the explanatory variables. That is, the same variance in the dependent variable is shared between more than one explanatory variable.
Common sources:
• Improper use of dummy variables (e.g. failure to exclude one category), which gives perfect collinearity
• Including a variable that is computed from other variables in the equation (e.g. family income = husband's income + wife's income, and the regression includes all 3 income measures)
• Including quadratic terms
• Including dummy variables
• Including interaction variables
• In effect, including the same or almost the same variable twice (height in feet and height in inches; or, more commonly, two different operationalizations of the same concept)
Consequences (one or more of the following):
• The sign of a coefficient is not as expected
• The p-value of a variable expected to be significant is too high
• The p-values of the explanatory variables are too high while the F statistic is significant

State the various techniques for detecting multicollinearity.
• Check the expected sign of each variable against its estimated sign in the model
• Correlations between the independent variables
• VIF (variance inflation factor)
• Stepwise regression

In which case may the dummy variable trap occur when conducting multiple regression?
This occurs when a categorical variable has m levels and m dummy variables are introduced in the regression. In that case there is an exact linear relationship among the m dummy variables (together with the intercept, since the dummies always sum to 1), so perfect multicollinearity exists and the regression fails to run.

Question 3. Working with categorical variables: interaction models
Model without interaction, including a binary variable:
Y = β0 + β1 colgpa + β2 Athlete
Model with interaction:
Y = β0 + β1 colgpa + β2 Athlete + β3 Athlete × colgpa
Y is the response variable, which is not specified in this problem.

Explain in which case an interaction model is necessary.
An interaction model is necessary when the colgpa slope differs significantly between athletes and non-athletes.

Explain how you would interpret the interaction model.
Y^athlete = (820.49 - 269.12) + (80.86 + 71.42) colgpa = 551.37 + 152.28 colgpa
Y^non-athlete = 820.49 + 80.86 colgpa
As colgpa increases by one unit, the response variable Y (not specified, e.g.
SAT score; see Q4) would increase on average by 80.86 units for a non-athlete.
As colgpa increases by one unit, the response variable (not specified) would increase on average by 80.86 + 71.42 = 152.28 units for an athlete.
All the variables are significant based on t-tests (p-values reported as 0 < α = 0.05).
The overall model is significant based on an F-test (p-value reported as 0).
The model (two variables plus the interaction term) explains 19% of the variation in the dependent variable.

Q4. Interpreting regression results
A research study aims to find the relationship between SAT scores and two other factors, namely gender and athlete status. To this end, based on past data on 4136 students who took the SAT, a multiple regression of SAT on all the variables was conducted. The regression results are given below:
1. sat: combined SAT score
2. tothrs: total hours through fall semester
3. colgpa: GPA after fall semester
4. verbmath: verbal/math SAT score
5. Athlete: =1 if athlete
6. hsize: size of graduating class, in 100s
7. hsrank: rank in graduating class

a) Test the significance of colgpa using the confidence interval approach at α = 0.05.
Step 1: H0: β_colgpa = 0 (colgpa is not significant)
        Ha: β_colgpa ≠ 0 (colgpa is significant)
Step 2: Significance level α = 0.05
Step 3: This is a t-test. We can use either the p-value approach (p = 0) or the 95% confidence interval approach (79.257 to 91.018). Since the interval does not contain 0, we reject the null hypothesis.
Step 4: Conclusion: colgpa is a significant variable.

b) Which variables are significant (explain)? Use α = 0.05. Interpret your model.
Based on the p-values, tothrs, colgpa and the binary variable Athlete are significant at α = 0.05.
The variable verbmath is not significant at α = 0.05. (Note: there is no need to carry out the test steps, as they are not requested.)

c) Interpret the coefficient corresponding to the independent variable verbmath.
As verbmath increases by one unit, the SAT score decreases on average by 19.817 points, holding all other variables constant.

d) Interpret the coefficient corresponding to the independent variable Athlete.
On average, an athlete scores 98.23 points lower on the SAT than a non-athlete, holding the other 3 variables constant.

e) Interpret the R-square.
The model (with 4 variables) explains 19.2% of the variation in the dependent variable sat.

g) Predict the SAT score for an athlete with colgpa = 3.4, tothrs = 78 and verbmath = 0.95.
SAT^ = 837.697 - 0.211(78) + 85.138(3.4) - 98.231(1) - 19.817(0.95) = …

Q5. Answer

Q6.
• Is it true that, to meet the equal variance assumption, you want Levene's test to be significant? Answer: False. A significant Levene's test rejects the hypothesis of equal variances, so a non-significant result is what supports the assumption.
• Write the hypotheses related to Levene's test.
What is the purpose of logistic regression?
Explain what the response variable is in a logistic regression.
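
As a numerical check on the point predictions in Question 1 e) and Q4 g), the short Python sketch below plugs the given values into the two fitted equations. The helper name predict and the dictionary keys are hypothetical, chosen only for illustration; the coefficients are the rounded ones quoted in the answers above, so results may differ slightly from software output based on unrounded estimates.

```python
def predict(intercept, coefs, values):
    """Point prediction from a fitted linear model: b0 + sum of b_j * x_j."""
    return intercept + sum(coefs[name] * values[name] for name in coefs)

# Question 1 e): salary increase, measured in units of 100 AED
salary_units = predict(
    -117.85,
    {"quality": 4.4352, "experience": 24.4403},
    {"quality": 60, "experience": 6},
)
salary_aed = salary_units * 100  # convert units of 100 AED to AED

# Q4 g): SAT prediction for an athlete (Athlete dummy = 1)
sat_hat = predict(
    837.697,
    {"tothrs": -0.211, "colgpa": 85.138, "athlete": -98.231, "verbmath": -19.817},
    {"tothrs": 78, "colgpa": 3.4, "athlete": 1, "verbmath": 0.95},
)

print(f"Salary increase: {salary_units:.2f} units = {salary_aed:.0f} AED")
print(f"Predicted SAT for the athlete: {sat_hat:.2f}")
```

Setting the athlete value to 0 in the second call gives the corresponding prediction for a non-athlete, which differs only by the Athlete coefficient.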