Multiple linear regression models

10. Multiple linear regression models#

本节需要的包：

require(s20x)

10.1. Example: Modelling birth weights using several explanatory variables#

我们学习了如何使用线性模型来建模数值和/或因子解释变量的影响。更一般地，原则上我们可以拟合任意数量的解释变量。然而，我们将看到这并不总是一个好主意。需要谨慎处理。举例来说，让我们研究可能解释婴儿出生体重的变量是什么。

10.2. Exploring relationships between the variables#

Let us first inspect the relationships between the numerical explanatory variables and the response variable.

The five variables are in columns 1,2,4,5 and 6 in the data frame Babies.df.

## Invoke the s20x library
library(s20x)
## Importing data into R
Babies.df <- read.table("../data/babies_data.txt", header = T)
## Create the pairs plot of the five numeric variables
pairs20x(Babies.df[, c(1, 2, 4, 5, 6)])

_images/dfa599f5f3972935789bd7d8c3894749faa31d288c3b25ccd268abf9d594f505.png

plot(bwt ~ height,
    data = Babies.df,
    xlab = "Mother’s height (inch)", ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$height, Babies.df$bwt))

_images/5d5cd6c8128a3f42953a63b64ef39fe7eefe14293a490a1425b1e30ad839402e.png

summary(lm(bwt ~ height, data = Babies.df))$r.squared
cor(Babies.df$bwt, Babies.df$height)^2 # R 方是差异的平方，所以跟上边那个是一样的

0.0414953918045247

Looking at the pairs plot again, we also see a somewhat weak relationship between bwt and mother’s weight.

plot(bwt ~ weight,
    data = Babies.df,
    xlab = "Mother’s weight (lb)",
    ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$weight, Babies.df$bwt), col = "red")

_images/9cf8f6c21aa0c2287d82c288edcb2959ad6bdd32063c4072912c4187e95d581d.png

胎儿孕育时间与其出生体重之间存在更强的关系，这并不令人意外，因为孩子在母亲子宫内的时间越长，孩子就有更多的时间来获得营养和生长。但是，在某个特定的胎龄后，这种关系显然会变得平缓起来 - 有些人称其为“曲棍球杆形状的曲线”。

plot(bwt ~ gestation, data = Babies.df, ylab = "Birth weight (oz)")
lines(lowess(Babies.df$gestation, Babies.df$bwt), col = "red")

_images/84b2ba099a15de516edda9d6883ffc9e8a139ce7b5ccd75b90c1fcf2335c1dc8.png

There does not seem to be any relationship between a mother’s age and her child’s bwt.

plot(bwt ~ age,
    data = Babies.df,
    xlab = "Mother’s age",
    ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$age, Babies.df$bwt), col = "red")

_images/25c365bb59b4db8858e87d38685ad992d6bd2a33f9af22b3cd59de7ad4de942d.png

Note: There seem to be some outlying data points(一些偏远的数据点) in these plots. There does not appear to be much of a relationship between the x variables, except between height and weight.

pairs20x(Babies.df[, c(1, 3, 7)])

_images/e8e3ed07d30d8b1443e20cb24b5f9be184304844ad4550fd7748ff51699e45b9.png

让我们从理解“解释”的角度开始，因为它是最强大的关系之一。那些不典型的数据点已经用问号标记了。我们稍后会添加其他的解释变量。

plot(bwt ~ gestation, data = Babies.df, ylab = "Birth weight (oz)")
lines(lowess(Babies.df$gestation, Babies.df$bwt), col = "red")
text(c(152, 185), c(120, 115), "?", col = "red")
abline(v = 294, lty = 3)

_images/7c04e5161e4511202bb9b9f0ec31c7bd6395443ada133dde8d38e6d31e1e860d.png

They look extremely implausible as they have typical birth-weight but have a gestational age that is extremely low for these data. 他们看起来非常难以置信,因为他们典型的出生体重但有孕龄,对这些数据是极低的。

id <- (Babies.df$gestation < 200)
Babies.df[id, ]

A data.frame: 2 × 7
	bwt	gestation	not.first.born	age	height	weight	smokes
	<int>	<int>	<int>	<int>	<int>	<int>	<int>
239	116	148	0	28	66	135	0
820	110	181	0	27	64	133	0

Relationship between birth weight and gestational age...

For gestation \(\leq 294\) days we’ll use the familiar simple linear regression model

\[ \text{E[bwt]}=\beta_0+\text{gestation}\times\beta_1 \]

We’d like to extend this model by adding an extra term so that the slope changes when gestation \(> 294\). That is,

\[ \text{E[bwt]}=\beta_0+\text{gestation}\times\beta_1+\text{v}\times\beta_2 \]

where v is some suitable explanatory variable. What should v be?

For gestation \(\leq 294\) the extended model is just the simple linear regression model, so that means v = 0 when gestation ≤ 294.
For gestation \(> 294\) we need another slope effect for gestational age. In fact, we need \(v =\) gestation \(- 294\).

Let’s create the new explanatory v = 294 that is described gestation above. We’ll give it the name because it is the number of days ODdays that the baby is overdue.

Babies.df$ODdays <- ifelse(
    Babies.df$gestation < 294,
    0,
    Babies.df$gestation - 294
)
head(Babies.df, 12) # Print first 12 lines of dataframe

A data.frame: 12 × 8
	bwt	gestation	not.first.born	age	height	weight	smokes	ODdays
	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<dbl>
1	120	284	0	27	62	100	0	0
2	113	282	0	33	64	135	0	0
3	128	279	0	28	64	115	1	0
4	108	282	0	23	67	125	1	0
5	136	286	0	25	62	93	0	0
6	138	244	0	33	62	178	0	0
7	132	245	0	23	65	140	0	0
8	120	289	0	25	62	125	0	0
9	143	299	0	30	66	136	1	5
10	140	351	0	27	68	120	0	57
11	144	282	0	32	64	124	1	0
12	141	279	0	23	63	128	1	0

10.3. Fitting the initial model#

bwt.fit <- lm(bwt ~ gestation + ODdays, data = Babies.df)
plot(bwt.fit, which = 1, add.smooth = FALSE)
normcheck(bwt.fit)
cooks20x(bwt.fit)

_images/6790b2198eeaf58a09e7cf99a37f166c0be4f603fd48cf8ed4063d94830d0a55.png

_images/988b56f423edf4507ccb567740e18a64442add6075d489741eb1a3093c7e7d12.png

_images/2c8dca382555137014bd31383a6b0cabdccaf4044f42efa301ed025ab7ec8fd8.png

Let us refit with observation 239 removed.

bwt.fit2 <- lm(bwt ~ gestation + ODdays, data = Babies.df[-239, ])
cooks20x(bwt.fit2)

_images/bf9ef019d4f783b0e36d59e4bc82e60f073b27c57eae1209af81aa5ec4601d9e.png

We refit the model using the reduced data.

# This time we demonstrate using the subset argument to remove points
bwt.fit3 <- lm(bwt ~ gestation + ODdays,
    data = Babies.df,
    subset = -c(239, 820)
)
cooks20x(bwt.fit3)
plot(bwt.fit3, which = 1)

_images/8b0b640a772f6df24cc7128502f029badf4da20cef4a170361670cd6e7d89c81.png

_images/b23b7a9d96f382f6d896d4d911e25d87590678bad5796195385659c4f8431058.png

Let’s take a look at our fitted hockey stick model.

gestation.seq <- 201:360 # Explanatory values at which to get predictions
ODdays.seq <- ifelse(gestation.seq <= 294, 0, gestation.seq - 294)
fit.seq <- predict(bwt.fit3, new = data.frame(
    gestation = gestation.seq,
    ODdays = ODdays.seq
))
plot(bwt ~ gestation,
    data = Babies.df[-c(239, 820), ],
    ylab = "Birth weight (oz)"
)
lines(gestation.seq, fit.seq, col = "red")
abline(v = 294, lty = 2, col = "blue")

_images/6bd768c11d50e17fb0ee1fa64ce106ce24e92ced9fa8d94b8de3306fa8f0c35d.png

模型检查是好的，没有影响力的点依然存在，所以我们可以相信这个。让我们解释输出。

summary(bwt.fit3)

Call:
lm(formula = bwt ~ gestation + ODdays, data = Babies.df, subset = -c(239, 
    820))

Residuals:
    Min      1Q  Median      3Q     Max 
-50.664 -10.993  -0.308   9.795  52.336 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -66.95336   10.42810   -6.42 1.97e-10 ***
gestation     0.67124    0.03757   17.87  < 2e-16 ***
ODdays       -0.90783    0.11745   -7.73 2.31e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.23 on 1169 degrees of freedom
Multiple R-squared:  0.2188,	Adjusted R-squared:  0.2174 
F-statistic: 163.7 on 2 and 1169 DF,  p-value: < 2.2e-16

The fitted model is:

\[ \text{E[bwt]}=-66.95+0.67\times\text{gestation}-0.91\times\text{ODdays} \]

10.4. Multiple linear regression model: Adding more terms to the model and the peril of multi-collinearity#

bwt.fit4 = lm(bwt ~ gestation + ODdays + height,
    data = Babies.df,
    subset = -c(239, 820)
)
plot(bwt.fit4, which = 1)

_images/d2e539d5668b540ac4be0df654d5715f75cb48563ffc291bd199fd04c85cbce9.png

All seems okay. Let us make sure that this makes sense in terms of output.

summary(bwt.fit4)

Call:
lm(formula = bwt ~ gestation + ODdays + height, data = Babies.df, 
    subset = -c(239, 820))

Residuals:
    Min      1Q  Median      3Q     Max 
-53.999 -10.393  -0.050   9.772  51.514 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -139.20571   15.05961  -9.244  < 2e-16 ***
gestation      0.65219    0.03703  17.613  < 2e-16 ***
ODdays        -0.89039    0.11543  -7.714 2.61e-14 ***
height         1.21083    0.18495   6.547 8.79e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.94 on 1168 degrees of freedom
Multiple R-squared:  0.2464,	Adjusted R-squared:  0.2445 
F-statistic: 127.3 on 3 and 1168 DF,  p-value: < 2.2e-16

Let us add weight to the model. We’re going to save some typing and use the `update`` function to update our model.

bwt.fit5 <- update(bwt.fit4, ~ . + weight)
summary(bwt.fit5)

Call:
lm(formula = bwt ~ gestation + ODdays + height + weight, data = Babies.df, 
    subset = -c(239, 820))

Residuals:
    Min      1Q  Median      3Q     Max 
-53.053 -10.540   0.121  10.076  47.746 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -131.68169   15.14974  -8.692  < 2e-16 ***
gestation      0.65624    0.03688  17.795  < 2e-16 ***
ODdays        -0.90868    0.11502  -7.900 6.41e-15 ***
height         0.90486    0.20453   4.424 1.06e-05 ***
weight         0.08535    0.02485   3.434 0.000615 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.87 on 1167 degrees of freedom
Multiple R-squared:  0.254,	Adjusted R-squared:  0.2514 
F-statistic: 99.32 on 4 and 1167 DF,  p-value: < 2.2e-16