In the past couple of labs, we learned how to run and interpret linear regression models. Linear regression is used when your outcome is a continuous numeric variable. But what if your outcome is categorical? Specifically, what if it has two classes, Yes or No? In this case, we use the logistic regression model. The objectives of this lab are to learn how to run, interpret, and assess the fit of logistic regression models in R.
To help us accomplish these learning objectives, we will examine the association between landslide occurrence and various environmental factors in the San Pedro Creek Watershed in Pacifica, CA.
We introduce one new package in today’s lab
install.packages("lmtest")
Load this package along with others we need.
library(tidyverse)
library(lmtest)
The reasons why you run a logistic regression are the same as the reasons for running a regular linear regression. However, there is one additional important motivation: we want to model the relationship between a set of independent variables and a binary outcome.
In many situations in your work as a Geographer, your outcome will be of a qualitative nature: the presence or absence of some event or condition, such as whether a location experienced a landslide. Such outcomes are coded as a binary variable (1 or 0; Yes or No). You can use a linear regression to model a binary outcome, but you’ll typically violate the assumption of homoskedastic residuals and you may get predictions outside of 0 or 1. That’s why you’ll need to turn to logistic regression to model the relationship.
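To see why logistic regression avoids the out-of-bounds problem, note that it passes the linear predictor through the inverse logit (logistic) function, which maps any real number into the interval (0, 1). Here is a minimal sketch using base R’s plogis():

# the inverse logit function, 1 / (1 + exp(-x)), maps any real number into (0, 1)
plogis(c(-10, -1, 0, 1, 10))
# returns values strictly between 0 and 1: roughly 0.0000454, 0.269, 0.5, 0.731, 0.99995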
Download the file landslides.csv from Canvas in the Week 4 Lab and Assignment folder. Bring the file into R using read_csv().
landslides <- read_csv("landslides.csv")
The data contain various locations in the San Pedro Creek Watershed in Pacifica, CA. The main goal of the analysis is to examine the characteristics that are associated with landslide occurrence. Landslides were detected on aerial photography from 1941, 1955, 1975, 1983, and 1997, and confirmed in the field during a study in the early 2000s. We’ll use logistic regression to relate landslide probability to various environmental factors: elevation elev, slope slope, hillshade hillshade, curvature curv, distance to the nearest stream stD, distance to the nearest trail trD, and the results of a physical model predicting slope stability as a stability index SI.
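Before modelling, it’s worth taking a quick look at the data to confirm these variables came in as expected:

# compact overview of each variable's type and first few values
glimpse(landslides)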
Our research question is: What environmental characteristics are associated with landslide occurrence?
We will be diving into spatial data next lab, but that should not stop us from doing some rudimentary mapping. Specifically, we have latitude and longitude points for each location in the dataset, so we can map the data using our good friend ggplot(). We’ll indicate which areas experienced a landslide using the variable IsSlide. We need to convert it to a factor to “color” it as a categorical variable.
ggplot(landslides) +
  geom_point(mapping = aes(x = lon,
                           y = lat,
                           col = as.factor(IsSlide))) +
  xlab("Longitude") +
  ylab("Latitude") +
  labs(color = "Landslide")
We first examine the distribution of our binary dependent variable IsSlide. We create a bar chart showing the distribution of the landslide indicator.
landslides %>%
  group_by(IsSlide) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = as.factor(IsSlide), y = freq), stat = "identity") +
  xlab("Landslide")
Let’s now run a logistic regression model. We’ll start simple, regressing IsSlide on slope. Instead of the function lm() that we used to run linear regression models, we will use the function glm(), which stands for Generalized Linear Models. glm() is similar to lm(), but gives us the option of a variety of families to use in fitting the model (the shape that we hypothesize represents the shape of the function f that defines the relationship between Y and X).
We specify a family by using the argument family =. If we wanted a standard linear regression, which assumes a normal distribution, family would equal gaussian (a fancy word for normal). For a list of glm families, check the help documentation ?glm. We use family = binomial for a logistic regression.
logit1.fit <- glm(IsSlide ~ slope,
                  family = binomial,
                  data = landslides)
We can summarize the modelling results using summary(). The resulting output is very similar to the output from lm().
summary(logit1.fit)
##
## Call:
## glm(formula = IsSlide ~ slope, family = binomial, data = landslides)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.39142 0.37008 -9.164 <2e-16 ***
## slope 0.13471 0.01387 9.715 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 668.29 on 483 degrees of freedom
## Residual deviance: 513.54 on 482 degrees of freedom
## AIC: 517.54
##
## Number of Fisher Scoring iterations: 5
Question 1: What is the interpretation of the slope coefficient?
Let’s compare our results to those from an OLS regression model. An OLS regression for a binary response variable is known as a linear probability model. We use glm() again, but this time use the (default) Gaussian distribution.
ols.fit <- glm(IsSlide ~ slope,
               family = gaussian,
               data = landslides)
and a summary
summary(ols.fit)
##
## Call:
## glm(formula = IsSlide ~ slope, family = gaussian, data = landslides)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.076552 0.044594 -1.717 0.0867 .
## slope 0.023734 0.001767 13.429 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.181679)
##
## Null deviance: 120.331 on 483 degrees of freedom
## Residual deviance: 87.569 on 482 degrees of freedom
## AIC: 552.06
##
## Number of Fisher Scoring iterations: 2
Question 2: What is the interpretation of the slope coefficient in ols.fit?
You can create a plot like the one shown in Handout 5 (right-hand plot) by predicting the probability of a landslide for given values of slope. The minimum and maximum slope for our data set are 0 and 42.29, respectively, so let’s predict landslide occurrence for slopes between 0 and 43 using the predict() function. The code below tells R to give predicted landslide occurrence for values of slope between 0 and 43. Note that the new slope values must be supplied as a data frame whose column name matches the variable name used in the model.
pfit1 <- predict(logit1.fit,
                 data.frame(slope = c(0:43)))
In predicting using a regression model, you can either predict landslides for the 484 observations in the original data set or predict for a new set of observations. In the code above, we are predicting for a new set of observations - areas with slopes between 0 and 43 - i.e. 0, 1, 2, 3 … 41, 42, and 43.
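To make that distinction concrete, here is a minimal sketch of both modes (pfit_original and pfit_new are hypothetical names):

# no new data supplied: predictions for the 484 observations used to fit the model
pfit_original <- predict(logit1.fit)

# new data supplied as a data frame: predictions for slopes 0, 1, ..., 43
pfit_new <- predict(logit1.fit, data.frame(slope = c(0:43)))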
Let’s get a summary of our predicted values
summary(pfit1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.3914 -1.9433 -0.4952 -0.4952 0.9530 2.4011
We get values ranging from -3.4 to 2.4. But these are not probabilities. Remember, as described in the handout, the response variable is modelled as a logit, so R will give us logits in return. To convert the logits to probabilities, use the argument type = "response" inside predict().
pfit1 <- predict(logit1.fit,
                 data.frame(slope = c(0:43)),
                 type = "response")
summary(pfit1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03256 0.12543 0.37882 0.42496 0.72158 0.91692
The predicted probability of a landslide ranges from 3.3% to 91.7%.
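Equivalently, you can keep the logits and convert them yourself with plogis(), base R’s inverse logit function; type = "response" simply does this conversion for you. A quick check:

# convert the logits to probabilities by hand with the inverse logit function
logits <- predict(logit1.fit, data.frame(slope = c(0:43)))
summary(plogis(logits))   # matches the summary of pfit1 above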
Question 3: Create a plot similar to the one shown in this week’s handout (right hand plot) showing the predicted probabilities from logit1.fit and the observed data.
Question 4: Create a plot similar to the one shown in this week’s handout (left hand plot) showing the predicted probabilities from ols.fit and the observed data.
We now move to the multiple logistic regression framework by adding more than one independent variable. Let’s add the variable hillshade, which is a categorical variable (High, Mid, and Low).
logit2.fit <- glm(IsSlide ~ slope + hillshade,
                  family = binomial,
                  data = landslides)
summary(logit2.fit)
##
## Call:
## glm(formula = IsSlide ~ slope + hillshade, family = binomial,
## data = landslides)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.18010 0.44451 -7.154 8.42e-13 ***
## slope 0.14545 0.01582 9.195 < 2e-16 ***
## hillshadeLow -0.92891 0.30165 -3.079 0.00207 **
## hillshadeMid -0.49244 0.27044 -1.821 0.06862 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 668.29 on 483 degrees of freedom
## Residual deviance: 503.68 on 480 degrees of freedom
## AIC: 511.68
##
## Number of Fisher Scoring iterations: 5
Let’s calculate the predicted probability of a landslide at each value of hillshade, holding the slope at its mean. We need to create a data frame containing the values for slope and hillshade that we want to predict for. What are the categories of hillshade?
table(landslides$hillshade)
##
## High Low Mid
## 121 121 242
Let’s save these categories in a vector.
hillshade <- c("High", "Mid", "Low")
Now you need to create a data frame containing the vector hillshade we created above as one column and the overall mean of slope as another column (name the column slope because it needs to match the variable name used in the prediction model). So you should have a 3 x 2 data frame. Then plug this data frame into the predict() function following what we did earlier, as in the sketch below.
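As a sketch (newdat is just a hypothetical name for the prediction data frame):

# 3 x 2 data frame: each hillshade category paired with the overall mean slope
newdat <- data.frame(hillshade = hillshade,
                     slope = mean(landslides$slope))

# predicted probability of a landslide for each hillshade category
predict(logit2.fit, newdata = newdat, type = "response")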
Question 5: What is the difference in the probability of a landslide between a Low hillshade area and a High hillshade area holding the slope at its mean?
Handout 5 goes through the various ways we can interpret logistic regression coefficients. We already went through a few above. What about the odds ratio interpretation?
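In R, exponentiating the coefficients with exp() converts them from the log-odds scale to odds ratios, as a quick sketch:

# exponentiate the log-odds coefficients to obtain odds ratios
exp(coef(logit2.fit))

An odds ratio above 1 means the odds of a landslide increase with the variable; below 1, they decrease.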
Question 6: Convert the logit2.fit coefficients to interpret them as the change in the odds ratio with a one unit increase in the independent variables. For a one unit increase in slope, the odds of a landslide (versus no landslide) increase by a factor of what amount?
The Handout goes through measures of best fit for a logistic regression model. Fortunately, some of these measures are reported in the model summary. Let’s run a multiple logistic regression model adding more variables to the model. Below, we’ve added curvature, elevation, distance to the nearest stream, and distance to the nearest trail.
logit3.fit <- glm(IsSlide ~ slope + hillshade + curv + elev + stD + trD,
                  family = binomial,
                  data = landslides)
How does this model compare to one that also includes the slope stability index?
logit4.fit <- glm(IsSlide ~ slope + hillshade + curv + elev + stD + trD + SI,
                  family = binomial,
                  data = landslides)
Executing summary() on these results will give us some, but not all, of the fit measures discussed in the handout. The likelihood ratio test can be run using the function lrtest(), which is part of the lmtest package. The function allows you to compare the fit of two different models.
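For example, to compare logit3.fit (without SI) to logit4.fit (with SI):

# likelihood ratio test: does adding SI significantly improve the fit?
lrtest(logit3.fit, logit4.fit)

# AIC for each model (lower indicates better fit)
AIC(logit3.fit, logit4.fit)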
Question 7: What are the null and alternative hypotheses for the Likelihood ratio test?
Question 8: Based on the fit measures discussed in the handout, which model, logit3.fit or logit4.fit, provides the best fit? Explain why.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil