# The data that we will use for this analysis will be from our labor dataset and t

The data for this analysis come from our labor dataset, and the analysis we will perform is called a linear regression. But first, you must filter the dataset to include both Position 1 and Position 2 employees.

We have explored the dataset and have developed a question: What factors influence \$/hr of Position 1 and 2 employees? To answer this question, we will analyze the dataset and perform a linear regression (LR) using \$/hr as the dependent variable (DV), and explore other variables of the employee as independent variables (IV).

First, a brief refresher on linear regression. This is only a simplified review. For a better understanding, please refer to DAL 5 and review other statistical methods and linear regression materials as needed.

A linear regression is a way to model the statistical relationship between a response (or dependent) variable and one or more explanatory (or independent) variables. The linear relationship between the two variables may be represented by a straight line, often called the regression line. Simply put, we want to see if the IV value can be used to accurately predict the DV value. Often, we can visually see if there is a relationship between the IV and DV by creating a scatterplot based on the data.
See the following figures for examples of a positive, neutral, and negative relationship/correlation.

In Figure 1 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to increase in value, although the increase is not consistent due to error or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. It also tests for the linearity of the relationship. Based on our observation of this scatterplot, we might suspect a strong positive relationship between the IV and DV.

Figure 1. Positive relationship between IV and DV (scatterplot titled "Positive Relationship"; horizontal axis: Independent Variable)

In Figure 2 we can see that as the independent variable increases on the horizontal axis from left to right, the value of the dependent variable seems to vary randomly above and below the regression line. The slope of the regression line might be close to zero. This would indicate that for any value of the independent variable, the value of the dependent variable is equal to the constant plus some random error value that we do not know. There may be other explanatory variables (IV) that have a stronger relationship with the response variable (DV).

Figure 2. Neutral relationship between IV and DV (scatterplot titled "Neutral Relationship"; horizontal axis: Independent Variable)

In Figure 3 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to decrease in value, although the decrease is not consistent due to error or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. Based on our observation of this scatterplot, we might suspect a strong negative relationship between the IV and DV.

Figure 3. Negative relationship between IV and DV (scatterplot titled "Negative Relationship"; horizontal axis: Independent Variable)

Again, a linear regression is a way to model a hypothesized statistical relationship between a predictor variable (IV) and a response variable (DV). What is the difference between a deterministic relationship and a statistical relationship?

In a deterministic relationship, an equation exactly describes the relationship between two or more variables. Examples are the relationship between Fahrenheit and Celsius ($^\circ\mathrm{F} = \tfrac{9}{5}\,^\circ\mathrm{C} + 32$) and the relationship between circumference and diameter ($\text{Circumference} = \pi \times \text{diameter}$). In a deterministic relationship, there are no error terms to consider, and we can create a simple linear equation to model the relationship, as depicted in Figure 4. This will have an R-squared of 1.

$$\text{Y value} = \text{constant} + (\text{slope} \times \text{X value})$$

Figure 4. Deterministic linear equation

In a statistical linear relationship, there is a trend (positive, neutral, or negative), plus a constant, plus some error that we see as the “scatter” in a scatterplot.
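
The contrast between the two kinds of relationship can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Celsius values, the noise level, and the names `fahrenheit_from_celsius`, `exact_errors`, and `observed_errors` are made up here and are not part of the course materials.

```python
import random


def fahrenheit_from_celsius(celsius):
    """Deterministic relationship: the equation gives the exact answer."""
    return 9 / 5 * celsius + 32


celsius_values = [0, 5, 10, 15, 20, 25, 30]

# Deterministic case: every value lies exactly on the line, so every error is zero.
exact = [fahrenheit_from_celsius(c) for c in celsius_values]
exact_errors = [y - fahrenheit_from_celsius(c) for c, y in zip(celsius_values, exact)]

# Statistical case: same constant and slope, plus random error (the "scatter").
# The noise level (standard deviation of 4) is an arbitrary assumption.
random.seed(0)
observed = [fahrenheit_from_celsius(c) + random.gauss(0, 4) for c in celsius_values]
observed_errors = [y - fahrenheit_from_celsius(c) for c, y in zip(celsius_values, observed)]

print("deterministic errors:", exact_errors)
print("statistical errors:  ", [round(e, 2) for e in observed_errors])
```

In the deterministic case the line reproduces every value exactly. In the statistical case the points scatter around the same trend, which is why a best fitting line has to be estimated.
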
So, we must modify our linear equation to find the line that best “describes” the relationship between the predictor variable and the response variable. Data analysis software like SAS and Excel do this by adjusting the position of the line and the slope until the sum of all the squared errors (the differences between the predicted and observed responses) has been minimized.

- $\hat{y}_i$ = the predicted response (or fitted value)
- $b_0$ = the estimated Y-axis intercept of the best fitting line
- $b_1$ = the estimated slope of the best fitting line
- $x_i$ = the predictor variable value (IV value)
- $y_i$ = the observed response value (DV value)
- $\beta_0$ = the population regression line constant (estimated by $b_0$)
- $\beta_1$ = the population regression line slope (estimated by $b_1$)
- $\varepsilon_i$ = the error term; its estimate, the difference between $y_i$ and $\hat{y}_i$, is the residual

$$\hat{y}_i = b_0 + b_1 x_i$$

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Figure 5. Statistical linear equation

So, as we can see in the equations in Figure 5, the statistical linear relationship approximately describes the relationship between the predictor value and the response value, instead of the exact relationship described in a deterministic linear equation. Thus, we need to determine if $\beta_1$ is not equal to zero ($\beta_1 \neq 0$).

In testing the null hypothesis for a simple linear regression, we should generally follow these steps:

1. State the plain language research question, e.g. What factors influence \$/hr for Position 1 and 2 employees?
2. State the hypotheses:
   - Null hypothesis: $H_0: \beta_{\text{Performance}} = 0$
   - Alternative hypothesis: $H_A: \beta_{\text{Performance}} \neq 0$
3. State the criteria for rejecting $H_0$:
   - $\alpha = 0.05$
4. Consider the assumptions for linear regression:
   - Assumption that there is a linear relationship between the response variable and the predictor variable. (You should use scatterplots of the individual continuous independent variables compared to the dependent variable.)
   - Assumption that the errors, $\varepsilon_i$, are independent (research design).
     1. Is your data a “snapshot” or a “video” of your observations? If your data is more of a “video”, consider a time series analysis.
     2. Non-significant Chi Square
   - Assumption that the errors, $\varepsilon_i$, at each value of the predictor, $x_i$, are normally distributed (not skewed, with a mean of zero). A non-significant Shapiro-Wilk statistic indicates a normal distribution of the error terms. (Examine the residual plots.)
   - Assumption that the errors, $\varepsilon_i$, at each value of the predictor, $x_i$, have equal variances ($\sigma^2$).
     1. No triangular-looking patterns between the response variable and the standardized residuals, and
     2. Non-significant Chi Square
   - Other items to consider:
     1. Outliers can cause erroneous results (Cook’s D > 2).
     2. The linear regression may not be the best fit (curvilinear, quadratic, etc.).
     3. Large data sets can produce a statistically significant p-value even when the slope is not really different from 0.
     4. Averages of raw data (e.g. summing a region) can overstate the strength of the correlation, so be mindful of what you are trying to prove with your analysis.
5. Compute the appropriate statistics:
   - Pearson correlation coefficient (remember that correlation does not imply causation!)
   - F-Value
   - Prob > F
   - Did you observe any problematic outliers? What (if anything) can you do about them?
6. Decide whether to retain or reject your null hypothesis:
   - If p > α, then retain the null hypothesis.
   - If p < α, then reject the null hypothesis and accept the alternative hypothesis.
   - Remember that statistical significance does not imply practical or meaningful significance!
7. Interpret the parameters ($\beta_0$ and $\beta_1$):
   - What change in the expected response variable results from a one-unit increase in the predictor variable (what is the slope of the regression line)? Is it positive or negative? Is it meaningful?
   - Is zero within your predictor variable (IV) value range? What does that mean?

Please watch the videos for detailed instructions:

- Filtered Dataset Video
- Running Correlation
- Interpreting Correlation
- Running the Regression
- Interpreting the Regression

Requirements: 7 questions
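
As a rough illustration of steps 5 through 7, the sketch below fits a simple linear regression by hand in plain Python. Several things here are assumptions rather than course material: the course itself works in SAS and Excel, the `performance` and `dollars_per_hour` values are made up and are not from the labor dataset, and `fit_simple_regression` and `prob_greater_f` are hypothetical helper names. The slope uses the standard closed-form least-squares solution, and Prob > F is obtained from the equivalent two-sided t test by numerical integration, which is adequate for a sketch but not a substitute for a statistics package.

```python
import math

# Hypothetical data for illustration only (not from the labor dataset).
performance = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]             # IV
dollars_per_hour = [12.1, 12.9, 13.2, 14.8, 14.5,
                    15.9, 16.2, 17.8, 17.5, 19.0]          # DV


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns b0, b1, Pearson r, and F."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                  # slope that minimizes the sum of squared errors
    b0 = y_bar - b1 * x_bar         # intercept of the best fitting line
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    ss_error = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_model = syy - ss_error
    f_value = ss_model / (ss_error / (n - 2))  # 1 and n - 2 degrees of freedom
    return b0, b1, r, f_value


def prob_greater_f(f_value, df_error, steps=2000):
    """Prob > F for 1 and df_error degrees of freedom.

    With one predictor, F equals t squared, so this integrates the
    Student's t density from 0 to sqrt(F) with Simpson's rule.
    """
    t = math.sqrt(f_value)
    const = math.exp(math.lgamma((df_error + 1) / 2) - math.lgamma(df_error / 2))
    const /= math.sqrt(df_error * math.pi)

    def density(u):
        return const * (1 + u * u / df_error) ** (-(df_error + 1) / 2)

    h = t / steps
    area = density(0) + density(t)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * density(k * h)
    area *= h / 3
    return max(0.0, 1 - 2 * area)


alpha = 0.05
b0, b1, r, f_value = fit_simple_regression(performance, dollars_per_hour)
p_value = prob_greater_f(f_value, df_error=len(performance) - 2)
decision = "reject H0" if p_value < alpha else "retain H0"

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, r = {r:.3f}")
print(f"F = {f_value:.1f}, Prob > F = {p_value:.2g}, decision: {decision}")
```

With these made-up numbers the slope is positive and Prob > F falls below α, so the sketch rejects the null hypothesis. On real data, the assumption checks in step 4 would still need to be examined before trusting that result.
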