Quick Look: Logistic Regression
This article discusses Generalized Linear Models (GLMs), focusing on one common example: logistic regression.
When to use GLMs
A generalized linear model is preferred over a standard linear regression model when we want to analyze a linear relationship between our independent and dependent variables but the response variable, Y, does not follow a normal distribution. This is the case when the response variable follows a distribution from the exponential family, for example the Poisson (count) or binomial (binary) distributions.
What’s a link function?
When modeling a GLM, you will set the link function in R. The link function says how the expected value of the response relates to the linear predictor of explanatory variables.
The link function is necessary because we are not predicting the values of Y directly; instead, we describe Y in terms of a probabilistic model and estimate the parameters of the conditional distribution of Y given X.
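As a quick sketch of what the logit link (used below) actually does, here are two small helper functions. The names `logit` and `inv_logit` are illustrative, not from the article; base R's `qlogis`/`plogis` do the same thing.

```r
# The logit link maps a probability in (0, 1) onto the whole real line,
# so a linear predictor b0 + b1*x can take any value.
logit <- function(p) log(p / (1 - p))

# Its inverse (the logistic function) maps any real number back to (0, 1).
inv_logit <- function(eta) 1 / (1 + exp(-eta))

logit(0.5)             # 0: even odds correspond to a linear predictor of zero
inv_logit(0)           # 0.5
inv_logit(logit(0.8))  # 0.8: the two functions undo each other
```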
Example
Using R, let’s walk through an example where we estimate the effect of study hours on student performance. We will use the following parameters:
Random Component: binomial (family=binomial)
Systematic Component: the X’s are explanatory variables, linear in the parameters: β0 + β1xi
Link Function: logit (link='logit')
# Fit a logistic regression of performance on weekly study hours
data <- read.csv("data/Data2.csv")
model <- glm(Student.performance ~ Weekly.study.hours,
             family = binomial(link = "logit"), data = data)

# Show the results
summary(model)
##
## Call:
## glm(formula = Student.performance ~ Weekly.study.hours, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7751 -0.9146 -0.6732 0.8978 1.7865
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.48656 0.59174 2.512 0.01200 *
## Weekly.study.hours -0.14279 0.04618 -3.092 0.00199 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 87.194 on 62 degrees of freedom
## Residual deviance: 76.013 on 61 degrees of freedom
## AIC: 80.013
##
## Number of Fisher Scoring iterations: 4
To interpret the coefficients, we exponentiate them:
exp(coef(model))
## (Intercept) Weekly.study.hours
## 4.421854 0.866938
Intercept interpretation: The estimated intercept is the log odds of passing the exam for a student with zero study hours. Equivalently, the odds of passing when study hours are zero are exp(1.48656) ≈ 4.42. Note that the data contain no students with zero study hours, so this is a hypothetical baseline value.
β1 interpretation: With a one-unit increase in study hours, the odds of passing are lower by a factor of exp(−0.14279) ≈ 0.8669.
Generally speaking, as students study more, they are less likely to pass the exam.
Thus, our regression equation looks like this (where p represents the probability of passing):
logit(p) = ln(p/(1 − p)) = 1.48656 − 0.14279 × Hours
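To see the equation in action, we can plug the estimated coefficients into the inverse-logit by hand. The value of 10 hours is a hypothetical example chosen for illustration:

```r
# Coefficients estimated by the model above
b0 <- 1.48656
b1 <- -0.14279

hours <- 10                # hypothetical number of weekly study hours
eta <- b0 + b1 * hours     # linear predictor: the log odds of passing
p <- 1 / (1 + exp(-eta))   # inverse-logit converts log odds to a probability
p                          # roughly 0.51
```

This is exactly what `predict(model, type = "response")` computes for each observation.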
Plotting the regression line
Let’s plot our results:
attach(data)
plot(Weekly.study.hours, Student.performance,
     xlab = "Study Hours", ylab = "Probability of Success")
curve(predict(model, data.frame(Weekly.study.hours = x), type = "response"),
      col = "red", add = TRUE)
Comparing linear and logistic regressions
In linear regression, the line of best fit is the best-fitting straight line for the predicted value of Y at each possible value of X. For higher levels of study hours, the linear model predicts a probability of passing the exam that is less than zero, which is not possible.
For logistic regression, we plot the sigmoid (logistic) function, which is S-shaped and not a straight line.
There are two main differences to note in the logistic regression graph, besides the S-shape vs. straight line:
The function outputs the probability that y = 1, and
For any value of x, the predicted Y values fall between 0 and 1.
Here we can see the results if we used a linear regression to model the difference, and can compare it to our graph above:
plot(Weekly.study.hours, Student.performance,
     xlab = "Study Hours", ylab = "Probability of Success")
abline(lm(Student.performance ~ Weekly.study.hours, data = data), col = "red")
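We can also check the bounded-range difference numerically. The sketch below uses simulated data (the article’s CSV is not reproduced here, so the variable names and numbers are illustrative):

```r
set.seed(1)
# Simulated binary outcomes whose pass probability falls with study hours,
# mimicking the pattern in the article's data
hours <- runif(60, min = 1, max = 30)
pass  <- rbinom(60, size = 1, prob = plogis(1.5 - 0.14 * hours))

logit_fit  <- glm(pass ~ hours, family = binomial(link = "logit"))
linear_fit <- lm(pass ~ hours)

# Logistic predictions are always strictly between 0 and 1...
range(predict(logit_fit, type = "response"))
# ...while a straight line has no such constraint and can exit [0, 1]
range(predict(linear_fit))
```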
Wrap Up
Generalized Linear Models are an essential tool in a data scientist’s toolkit. Logistic regression is applied in all types of statistical analysis, including machine learning. Once you learn the parameters to model these equations, there’s no limit to what you can explore.