10. Multiple linear regression models#
本节需要的包:
require(s20x)
Show code cell output
Loading required package: s20x
10.1. Example: Modelling birth weights using several explanatory variables#
我们学习了如何使用线性模型来建模数值和/或因子解释变量的影响。更一般地,原则上我们可以拟合任意数量的解释变量。然而,我们将看到这并不总是一个好主意。需要谨慎处理。举例来说,让我们研究可能解释婴儿出生体重的变量是什么。
10.2. Exploring relationships between the variables#
Let us first inspect the relationships between the numerical explanatory variables and the response variable.
The five variables are in columns 1,2,4,5 and 6 in the data frame Babies.df.
## Invoke the s20x library
library(s20x)
## Importing data into R
Babies.df <- read.table("../data/babies_data.txt", header = T)
## Create the pairs plot of the five numeric variables
pairs20x(Babies.df[, c(1, 2, 4, 5, 6)])
plot(bwt ~ height,
data = Babies.df,
xlab = "Mother’s height (inch)", ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$height, Babies.df$bwt))
summary(lm(bwt ~ height, data = Babies.df))$r.squared
cor(Babies.df$bwt, Babies.df$height)^2 # R 方是差异的平方,所以跟上边那个是一样的
Looking at the pairs plot again, we also see a somewhat weak relationship between bwt
and mother’s weight
.
plot(bwt ~ weight,
data = Babies.df,
xlab = "Mother’s weight (lb)",
ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$weight, Babies.df$bwt), col = "red")
胎儿孕育时间与其出生体重之间存在更强的关系,这并不令人意外,因为孩子在母亲子宫内的时间越长,孩子就有更多的时间来获得营养和生长。但是,在某个特定的胎龄后,这种关系显然会变得平缓起来 - 有些人称其为“曲棍球杆形状的曲线”。
plot(bwt ~ gestation, data = Babies.df, ylab = "Birth weight (oz)")
lines(lowess(Babies.df$gestation, Babies.df$bwt), col = "red")
There does not seem to be any relationship between a mother’s age and her child’s bwt
.
plot(bwt ~ age,
data = Babies.df,
xlab = "Mother’s age",
ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$age, Babies.df$bwt), col = "red")
Note: There seem to be some outlying data points(一些偏远的数据点) in these plots. There does not appear to be much of a relationship between the x variables, except between height
and weight
.
pairs20x(Babies.df[, c(1, 3, 7)])
让我们从理解“解释”的角度开始,因为它是最强大的关系之一。那些不典型的数据点已经用问号标记了。我们稍后会添加其他的解释变量。
plot(bwt ~ gestation, data = Babies.df, ylab = "Birth weight (oz)")
lines(lowess(Babies.df$gestation, Babies.df$bwt), col = "red")
text(c(152, 185), c(120, 115), "?", col = "red")
abline(v = 294, lty = 3)
They look extremely implausible as they have typical birth-weight but have a gestational age that is extremely low for these data. 他们看起来非常难以置信,因为他们典型的出生体重但有孕龄,对这些数据是极低的。
id <- (Babies.df$gestation < 200)
Babies.df[id, ]
bwt | gestation | not.first.born | age | height | weight | smokes | |
---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | <int> | |
239 | 116 | 148 | 0 | 28 | 66 | 135 | 0 |
820 | 110 | 181 | 0 | 27 | 64 | 133 | 0 |
Relationship between birth weight and gestational age...
For gestation
\(\leq 294\) days we’ll use the familiar simple linear regression model
We’d like to extend this model by adding an extra term so that the slope changes when gestation
\(> 294\). That is,
where v is some suitable explanatory variable. What should v be?
For
gestation
\(\leq 294\) the extended model is just the simple linear regression model, so that means v = 0 when gestation ≤ 294.For
gestation
\(> 294\) we need another slope effect for gestational age. In fact, we need \(v =\)gestation
\(- 294\).
Let’s create the new explanatory v = 294 that is described gestation above. We’ll give it the name because it is the number of days ODdays that the baby is overdue.
Babies.df$ODdays <- ifelse(
Babies.df$gestation < 294,
0,
Babies.df$gestation - 294
)
head(Babies.df, 12) # Print first 12 lines of dataframe
bwt | gestation | not.first.born | age | height | weight | smokes | ODdays | |
---|---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | |
1 | 120 | 284 | 0 | 27 | 62 | 100 | 0 | 0 |
2 | 113 | 282 | 0 | 33 | 64 | 135 | 0 | 0 |
3 | 128 | 279 | 0 | 28 | 64 | 115 | 1 | 0 |
4 | 108 | 282 | 0 | 23 | 67 | 125 | 1 | 0 |
5 | 136 | 286 | 0 | 25 | 62 | 93 | 0 | 0 |
6 | 138 | 244 | 0 | 33 | 62 | 178 | 0 | 0 |
7 | 132 | 245 | 0 | 23 | 65 | 140 | 0 | 0 |
8 | 120 | 289 | 0 | 25 | 62 | 125 | 0 | 0 |
9 | 143 | 299 | 0 | 30 | 66 | 136 | 1 | 5 |
10 | 140 | 351 | 0 | 27 | 68 | 120 | 0 | 57 |
11 | 144 | 282 | 0 | 32 | 64 | 124 | 1 | 0 |
12 | 141 | 279 | 0 | 23 | 63 | 128 | 1 | 0 |
10.3. Fitting the initial model#
bwt.fit <- lm(bwt ~ gestation + ODdays, data = Babies.df)
plot(bwt.fit, which = 1, add.smooth = FALSE)
normcheck(bwt.fit)
cooks20x(bwt.fit)
Let us refit with observation 239 removed.
bwt.fit2 <- lm(bwt ~ gestation + ODdays, data = Babies.df[-239, ])
cooks20x(bwt.fit2)
We refit the model using the reduced data.
# This time we demonstrate using the subset argument to remove points
bwt.fit3 <- lm(bwt ~ gestation + ODdays,
data = Babies.df,
subset = -c(239, 820)
)
cooks20x(bwt.fit3)
plot(bwt.fit3, which = 1)
Let’s take a look at our fitted hockey stick model.
gestation.seq <- 201:360 # Explanatory values at which to get predictions
ODdays.seq <- ifelse(gestation.seq <= 294, 0, gestation.seq - 294)
fit.seq <- predict(bwt.fit3, new = data.frame(
gestation = gestation.seq,
ODdays = ODdays.seq
))
plot(bwt ~ gestation,
data = Babies.df[-c(239, 820), ],
ylab = "Birth weight (oz)"
)
lines(gestation.seq, fit.seq, col = "red")
abline(v = 294, lty = 2, col = "blue")
模型检查是好的,没有影响力的点依然存在,所以我们可以相信这个。让我们解释输出。
summary(bwt.fit3)
Call:
lm(formula = bwt ~ gestation + ODdays, data = Babies.df, subset = -c(239,
820))
Residuals:
Min 1Q Median 3Q Max
-50.664 -10.993 -0.308 9.795 52.336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -66.95336 10.42810 -6.42 1.97e-10 ***
gestation 0.67124 0.03757 17.87 < 2e-16 ***
ODdays -0.90783 0.11745 -7.73 2.31e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.23 on 1169 degrees of freedom
Multiple R-squared: 0.2188, Adjusted R-squared: 0.2174
F-statistic: 163.7 on 2 and 1169 DF, p-value: < 2.2e-16
The fitted model is:
10.4. Multiple linear regression model: Adding more terms to the model and the peril of multi-collinearity#
bwt.fit4 = lm(bwt ~ gestation + ODdays + height,
data = Babies.df,
subset = -c(239, 820)
)
plot(bwt.fit4, which = 1)
All seems okay. Let us make sure that this makes sense in terms of output.
summary(bwt.fit4)
Call:
lm(formula = bwt ~ gestation + ODdays + height, data = Babies.df,
subset = -c(239, 820))
Residuals:
Min 1Q Median 3Q Max
-53.999 -10.393 -0.050 9.772 51.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -139.20571 15.05961 -9.244 < 2e-16 ***
gestation 0.65219 0.03703 17.613 < 2e-16 ***
ODdays -0.89039 0.11543 -7.714 2.61e-14 ***
height 1.21083 0.18495 6.547 8.79e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.94 on 1168 degrees of freedom
Multiple R-squared: 0.2464, Adjusted R-squared: 0.2445
F-statistic: 127.3 on 3 and 1168 DF, p-value: < 2.2e-16
Let us add weight
to the model. We’re going to save some typing and use the `update`` function to update our model.
bwt.fit5 <- update(bwt.fit4, ~ . + weight)
summary(bwt.fit5)
Call:
lm(formula = bwt ~ gestation + ODdays + height + weight, data = Babies.df,
subset = -c(239, 820))
Residuals:
Min 1Q Median 3Q Max
-53.053 -10.540 0.121 10.076 47.746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -131.68169 15.14974 -8.692 < 2e-16 ***
gestation 0.65624 0.03688 17.795 < 2e-16 ***
ODdays -0.90868 0.11502 -7.900 6.41e-15 ***
height 0.90486 0.20453 4.424 1.06e-05 ***
weight 0.08535 0.02485 3.434 0.000615 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.87 on 1167 degrees of freedom
Multiple R-squared: 0.254, Adjusted R-squared: 0.2514
F-statistic: 99.32 on 4 and 1167 DF, p-value: < 2.2e-16