# Packages used by the pipe, the plot, and the dollar axis labels
library(dplyr)
library(ggplot2)
library(scales)

ames <- AmesHousing::make_ames()
ames %>%
  ggplot(aes(Gr_Liv_Area, Sale_Price)) +
  geom_point(alpha = 0.3) +
  labs(x = "Living area (sq ft)", y = "Sale price ($)") +
  scale_y_continuous(labels = dollar_format(prefix = "$", big.mark = ","))
AmesHousing::ames
\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,\qquad \mathbb{E}[\varepsilon]=0,\ \operatorname{Var}(\varepsilon)=\sigma^2 \]
Let \(\mathbf{y}\in\mathbb{R}^n\), \(\mathbf{X}\in\mathbb{R}^{n\times (p+1)}\) (with a leading column of ones), and \(\boldsymbol{\beta}\in\mathbb{R}^{p+1}\).
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \widehat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} \]
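As a quick check of the closed-form estimator above, the following sketch solves the normal equations directly and compares the result with lm(). The data are simulated and the names x1, x2, beta_hat, and ols_matches_lm are illustrative assumptions, not part of the workshop material.

```r
# Simulated data (hypothetical; not the Ames data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# Design matrix with a leading column of ones for the intercept
X <- cbind(`(Intercept)` = 1, x1 = x1, x2 = x2)

# Solve (X'X) beta = X'y; this avoids forming an explicit inverse
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Reference fit with lm(), which uses a QR decomposition internally
fit <- lm(y ~ x1 + x2)

# The two estimates should agree up to floating-point error
ols_matches_lm <- isTRUE(all.equal(as.numeric(beta_hat), as.numeric(coef(fit))))
print(beta_hat)
print(ols_matches_lm)
```

The formula requires \(\mathbf{X}^\top \mathbf{X}\) to be invertible, which fails when predictors are perfectly collinear.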
d <- ames %>%
select(Sale_Price, Gr_Liv_Area, Overall_Qual, Neighborhood) %>%
mutate(Neighborhood = droplevels(Neighborhood))
levels(d$Overall_Qual)
 [1] "Very_Poor"      "Poor"           "Fair"           "Below_Average"
 [5] "Average"        "Above_Average"  "Good"           "Very_Good"
 [9] "Excellent"      "Very_Excellent"
m_cat <- lm(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Neighborhood, data = d)
broom::tidy(m_cat) %>% slice(1:8)
# A tibble: 8 × 5
  term                      estimate std.error statistic   p.value
  <chr>                        <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                 14165.    16778.     0.844  3.99e- 1
2 Gr_Liv_Area                   52.1      1.61    32.3    4.14e-196
3 Overall_QualPoor            28242.    19005.     1.49   1.37e- 1
4 Overall_QualFair            33761.    17412.     1.94   5.26e- 2
5 Overall_QualBelow_Average   46906.    16770.     2.80   5.19e- 3
6 Overall_QualAverage         59023.    16675.     3.54   4.07e- 4
7 Overall_QualAbove_Average   71350.    16704.     4.27   2.00e- 5
8 Overall_QualGood            88325.    16778.     5.26   1.51e- 7
mod_std <- lm(
  scale(Sale_Price) ~ scale(Gr_Liv_Area) + scale(as.numeric(Overall_Qual)) + scale(Year_Built),
  data = ames
)
broom::tidy(mod_std)
# A tibble: 4 × 5
  term                            estimate std.error statistic     p.value
  <chr>                              <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)                     5.61e-16   0.00916  6.13e-14  1.000e+ 0
2 scale(Gr_Liv_Area)              3.99e- 1   0.0113   3.54e+ 1  1.71 e-228
3 scale(as.numeric(Overall_Qual)) 4.59e- 1   0.0137   3.36e+ 1  4.17 e-210
4 scale(Year_Built)               1.88e- 1   0.0116   1.62e+ 1  8.10 e- 57
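Standardized slopes like those above can also be recovered from a raw-scale fit by rescaling each slope by sd(x) / sd(y). The sketch below illustrates this on simulated data; the variables x1 and x2 and the object names are hypothetical stand-ins, not the Ames columns.

```r
# Simulated data on very different scales (hypothetical stand-ins)
set.seed(2)
n  <- 300
x1 <- rnorm(n, mean = 1500, sd = 500)  # an area-like predictor
x2 <- rnorm(n, mean = 1970, sd = 30)   # a year-like predictor
y  <- 50 * x1 + 400 * x2 + rnorm(n, sd = 20000)
sim <- data.frame(y, x1, x2)

# Raw-scale fit and fully standardized fit
fit_raw <- lm(y ~ x1 + x2, data = sim)
fit_std <- lm(scale(y) ~ scale(x1) + scale(x2), data = sim)

# Rescale raw slopes: beta_std_j = beta_j * sd(x_j) / sd(y)
beta_std_manual <- as.numeric(coef(fit_raw))[-1] *
  c(sd(sim$x1), sd(sim$x2)) / sd(sim$y)

# Should match the slopes of the standardized fit
std_matches <- isTRUE(all.equal(beta_std_manual, as.numeric(coef(fit_std))[-1]))
print(beta_std_manual)
print(std_matches)
```

Each standardized slope reads as the expected change in the response, in standard deviations, per one-standard-deviation change in the predictor. Note that as.numeric() on an ordered factor such as Overall_Qual treats its levels as equally spaced scores, which is a modelling assumption.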
m0 <- lm(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Built + Full_Bath + Garage_Cars, data=ames)
par(mfrow=c(2,2)); plot(m0); par(mfrow=c(1,1))
car::vif(m0)
                 GVIF Df GVIF^(1/(2*Df))
Gr_Liv_Area  2.286185  1        1.512013
Overall_Qual 2.628358  9        1.055154
Year_Built   2.003266  1        1.415368
Full_Bath    2.119535  1        1.455862
Garage_Cars  1.859696  1        1.363707
plot(cooks.distance(m0), type = "h", main = "Cook's distance")
# Rule-of-thumb cutoff at 4/n
abline(h = 4 / nrow(ames), col = "red", lty = 2)
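Beyond the plot, it can help to list which observations cross the 4/n line. This is a minimal sketch using the built-in mtcars data as a stand-in for ames (an assumption made so the example runs without extra packages); the 4/n cutoff is a screening heuristic, not a formal test.

```r
# mtcars is a stand-in dataset here, not the workshop data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Same rule-of-thumb cutoff as the dashed line in the plot
threshold <- 4 / nrow(mtcars)

# Observations above the cutoff, most influential first
flagged <- sort(cd[cd > threshold], decreasing = TRUE)
print(threshold)
print(flagged)
```

Flagged points are candidates for a closer look (data entry errors, unusual cases), not automatic deletions.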
Mission 1 objectives:
Explore the data quickly (EDA).
Build a multiple regression model (≤ 6 predictors).
Check the diagnostics (residuals, normality, homoscedasticity, VIF).
Interpret the raw and standardized coefficients.
Open missions/M1_ames.qmd: EDA → lm() → diagnostics → interpretation.
Call:
lm(formula = Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Built +
    Full_Bath + Garage_Cars, data = ames)

Residuals:
    Min      1Q  Median      3Q     Max
-442829  -17192    -901   14649  225492

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                -9.190e+05  6.013e+04 -15.284  < 2e-16 ***
Gr_Liv_Area                 5.690e+01  1.897e+00  29.993  < 2e-16 ***
Overall_QualPoor            1.965e+04  1.964e+04   1.000 0.317338
Overall_QualFair            2.906e+04  1.802e+04   1.612 0.107034
Overall_QualBelow_Average   3.808e+04  1.733e+04   2.198 0.028046 *
Overall_QualAverage         5.348e+04  1.723e+04   3.105 0.001923 **
Overall_QualAbove_Average   6.070e+04  1.727e+04   3.515 0.000446 ***
Overall_QualGood            7.858e+04  1.735e+04   4.530 6.12e-06 ***
Overall_QualVery_Good       1.230e+05  1.744e+04   7.053 2.17e-12 ***
Overall_QualExcellent       2.013e+05  1.773e+04  11.356  < 2e-16 ***
Overall_QualVery_Excellent  2.424e+05  1.862e+04  13.022  < 2e-16 ***
Year_Built                  4.686e+02  2.968e+01  15.789  < 2e-16 ***
Full_Bath                  -4.496e+03  1.670e+03  -2.692 0.007140 **
Garage_Cars                 1.318e+04  1.136e+03  11.600  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34330 on 2916 degrees of freedom
Multiple R-squared:  0.8162,	Adjusted R-squared:  0.8154
F-statistic: 996.1 on 13 and 2916 DF,  p-value: < 2.2e-16
Mission 2 objectives:
Compare the simple and full models.
Evaluate predictive performance (RMSE/MAE).
Test the coefficients and compare models via ANOVA.
Build and interpret confidence intervals (CI) and prediction intervals (PI).
Check calibration and coverage.
Open missions/M2_interpretation.qmd: Train/Test → RMSE/MAE → tests → intervals → calibration.
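The Mission 2 steps listed above can be sketched end to end in base R. This is one possible outline under stated assumptions, not the content of the mission file: the data are simulated, the 80/20 split is an arbitrary choice, and the helper names rmse, mae, conf_int, and pred_int are hypothetical.

```r
# Simulated data (hypothetical; replace with the real data in the mission)
set.seed(42)
n  <- 500
x1 <- runif(n, 500, 3000)
x2 <- rnorm(n)
y  <- 20000 + 60 * x1 + 15000 * x2 + rnorm(n, sd = 25000)
sim <- data.frame(y, x1, x2)

# 80/20 train/test split (the ratio is an assumption)
idx   <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[idx, ]
test  <- sim[-idx, ]

# Simple vs full model, both fitted on the training set only
m_simple <- lm(y ~ x1, data = train)
m_full   <- lm(y ~ x1 + x2, data = train)

# Error metrics evaluated on held-out data
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))

pred_simple <- predict(m_simple, newdata = test)
pred_full   <- predict(m_full, newdata = test)

perf <- data.frame(
  model = c("simple", "full"),
  rmse  = c(rmse(test$y, pred_simple), rmse(test$y, pred_full)),
  mae   = c(mae(test$y, pred_simple), mae(test$y, pred_full))
)
print(perf)

# F test comparing the nested models
print(anova(m_simple, m_full))

# 95% confidence intervals (mean response) and prediction intervals (new observation)
conf_int <- predict(m_full, newdata = test, interval = "confidence", level = 0.95)
pred_int <- predict(m_full, newdata = test, interval = "prediction", level = 0.95)

# Empirical coverage: share of test responses inside their prediction interval
coverage <- mean(test$y >= pred_int[, "lwr"] & test$y <= pred_int[, "upr"])
print(coverage)
```

If the model assumptions hold, the empirical coverage should land near the nominal 95%; a large gap suggests miscalibrated intervals, for example because of heteroscedastic residuals.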
step(): useful for exploration, but treat its results with caution.
m_log <- lm(log(Sale_Price) ~ log(Gr_Liv_Area) + Overall_Qual + Year_Built, data = ames)
broom::glance(m_log)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     0.814         0.813 0.176     1161.       0    11   937. -1848. -1770.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
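One way to read the log-log specification fitted above, writing \(\gamma\) for the coefficient on log(Gr_Liv_Area) (a symbol introduced here for illustration only): with the other predictors held fixed, multiplying the living area by 1.01 multiplies the back-transformed prediction by
\[ \exp\{\gamma\,[\log(1.01\,a) - \log(a)]\} = 1.01^{\gamma} \approx 1 + 0.01\,\gamma \quad \text{for moderate } \gamma, \]
so \(\gamma\) can be read approximately as the percentage change in predicted price per 1% change in living area. The reported sigma is likewise on the log scale, so it is not directly comparable with the residual standard error of a model fitted to Sale_Price in dollars.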
Open missions/Challenge_voitures.qmd
Ateliers EIOM — Université Laval