Supervised: Prediction Models
Examples of supervised data mining
- In supervised data mining, the target variable is identified
- The goal is to predict or explain a certain outcome
- Common applications:
  - prediction models
  - classification models
Prediction model: the response variable is numerical
- Regression methods
- e.g., stock return, housing price, temperature, spending of a customer
Give me two real-world business applications that use prediction models.
Application: Predicting median house value using Boston Housing Data
We use the Boston Housing Data as an illustrative example in this lab. The Boston housing data is a built-in dataset in the MASS package, so you do not need to download it externally. The MASS package comes with R when you install R, so there is no need to run install.packages("MASS") to download and install it, but you do need to load the package.
Load Data
library(MASS)
data(Boston); #this data is in MASS package
colnames(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
You can find details of the dataset in the help document.
?Boston
The original data are 506 observations on 14 variables, with medv being the response variable \(y\).
EDA
We have introduced many EDA techniques before. We will only briefly go through some of them here.
dim(Boston)
[1] 506 14
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
We skip the Exploratory Data Analysis (EDA) in these notes, but you should not omit it in your real-world cases. EDA is very important and is always the first analysis to do before any modeling. You should also provide some visualization to present your exploratory analysis.
Give me sample code in R to do exploratory analysis for the "Boston" dataset.
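For reference, here is a minimal sketch of what such exploratory code might look like; the particular plots and variables chosen below are illustrative, not required.
library(MASS)
data(Boston)
# Distribution of the response variable
hist(Boston$medv, breaks = 30, main = "Median house value", xlab = "medv")
# Pairwise scatterplots of the response against a few predictors
pairs(Boston[, c("medv", "rm", "lstat", "crim")])
# Correlation matrix of all variables, rounded for readability
round(cor(Boston), 2)
# medv by the Charles River indicator
boxplot(medv ~ chas, data = Boston, xlab = "chas", ylab = "medv")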
Splitting data to training and testing samples
Next we sample 90% of the original data and use it as the training set. The remaining 10% is used as the test set. The regression model will be built on the training set, and the future performance of your model will be evaluated on the test set.
sample_index <- sample(nrow(Boston), nrow(Boston)*0.90)
Boston_train <- Boston[sample_index, ]
Boston_test <- Boston[-sample_index, ]
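Note that sample() draws a random subset, so each run gives a different split. If you want a reproducible split, set a random seed first; the seed value below is arbitrary.
set.seed(123)  # any fixed value works; chosen here for illustration
sample_index <- sample(nrow(Boston), nrow(Boston)*0.90)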
(Optional) Standardization
If we want our results to be invariant to the units and the parameter estimates \(\beta_i\) to be comparable, we can standardize the variables. Essentially we are replacing the original values with their z-scores.
1st Way: create new variables manually.
Boston$sd.crim <- (Boston$crim - mean(Boston$crim))/sd(Boston$crim)
This does the same thing.
Boston$sd.crim <- scale(Boston$crim)
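If you want to standardize all predictors at once, one possible sketch (assuming you keep the response medv on its original scale) is:
# Standardize every column except the response medv (column 14)
Boston_std <- as.data.frame(scale(Boston[, -14]))
Boston_std$medv <- Boston$medv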
Model Building
Your task is to build the best model you can with the training data. You can refer to the regression and variable selection code on the slides for a more detailed description of linear regression.
The following model includes all \(x\) variables in the dataset.
model_1 <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = Boston_train)
To include all variables in the model, you can write the statement in this simpler way.
model_1 <- lm(medv ~ ., data = Boston_train)
summary(model_1)
Call:
lm(formula = medv ~ ., data = Boston_train)
Residuals:
Min 1Q Median 3Q Max
-15.3851 -2.7159 -0.4623 1.7193 24.4277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.680511 5.411542 7.333 1.09e-12 ***
crim -0.104331 0.033181 -3.144 0.001777 **
zn 0.053725 0.014355 3.743 0.000206 ***
indus 0.005883 0.063642 0.092 0.926387
chas 2.940600 0.885140 3.322 0.000968 ***
nox -18.665111 3.993603 -4.674 3.94e-06 ***
rm 3.355129 0.438383 7.653 1.24e-13 ***
age 0.012661 0.014074 0.900 0.368840
dis -1.581658 0.213393 -7.412 6.41e-13 ***
rad 0.305914 0.068344 4.476 9.69e-06 ***
tax -0.011811 0.003871 -3.051 0.002420 **
ptratio -0.925290 0.137805 -6.714 5.83e-11 ***
black 0.009106 0.002872 3.170 0.001628 **
lstat -0.602475 0.054080 -11.140 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.711 on 441 degrees of freedom
Multiple R-squared: 0.7487, Adjusted R-squared: 0.7413
F-statistic: 101.1 on 13 and 441 DF, p-value: < 2.2e-16
But, is this the model you want to use?
(Optional) Interaction terms in model
If you suspect the effect of one predictor x1 on the response y depends on the value of another predictor x2, you can add interaction terms to the model. To specify an interaction in the model, you put : between the two variables with the interaction effect. For example:
lm(medv~crim+zn+crim:zn, data=Boston_train)
# The following way automatically adds the main effects of crim and zn
lm(medv~crim*zn, data=Boston_train)
For now we will not investigate the interactions of variables. But you can try to do these analyses with the help of EducateUsBot.
Show me sample code in R to add interaction terms between age and rm to a linear regression model for the "Boston" dataset.
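As a rough sketch, such a model could look like the following; the model names model_int and model_int2 are just for illustration.
# Main effects of age and rm plus their interaction
model_int <- lm(medv ~ age * rm, data = Boston_train)
summary(model_int)
# Equivalent explicit specification of the same model
model_int2 <- lm(medv ~ age + rm + age:rm, data = Boston_train)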
Model Assessment
Suppose that everything in the model diagnostics is okay; in other words, the model we have built is a fairly valid model. Then we need to evaluate the model's performance in terms of different metrics.
Commonly used metrics include MSE, (adjusted) \(R^2\), AIC, BIC for in-sample performance, and MSPE for out-of-sample performance.
In-sample model evaluation (train error)
MSE of the regression, which is the square of the 'Residual standard error' in the above summary. It is the sum of squared residuals (SSE) divided by the degrees of freedom: \(\text{MSE} = SSE/(n-p-1)\). In some textbooks the sum of squared residuals (SSE) is called the residual sum of squares (RSS). The MSE of the regression is an unbiased estimator for the variance of \(\epsilon\), the error term in the regression model.
model_summary <- summary(model_1)
(model_summary$sigma)^2
[1] 22.18996
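You can verify this number by computing SSE divided by the residual degrees of freedom directly; this is just a sanity check, not a required step.
# SSE / (n - p - 1); df.residual stores n - p - 1 for an lm fit
sse <- sum(residuals(model_1)^2)
sse / model_1$df.residual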
\(R^2\) of the model
model_summary$r.squared
[1] 0.748731
Adjusted-\(R^2\) of the model. If you add a variable (or several in a group), SSE will decrease and \(R^2\) will increase, but Adjusted-\(R^2\) could go either way.
model_summary$adj.r.squared
[1] 0.741324
AIC and BIC of the model. These are information criteria; smaller values indicate a better fit.
AIC(model_1)
[1] 2717.35
BIC(model_1)
[1] 2779.155
BIC, AIC, and Adjusted \(R^2\) include a complexity penalty in their definitions; thus, when comparing across different models, they are better indicators of how well the model will perform on future data.
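For example, you could compare the full model against a smaller candidate; model_2 below uses an arbitrary illustrative subset of predictors, not a recommended specification.
# Compare a smaller candidate model by AIC/BIC (smaller is better)
model_2 <- lm(medv ~ lstat + rm + ptratio, data = Boston_train)
AIC(model_1); AIC(model_2)
BIC(model_1); BIC(model_2)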
Out-of-sample prediction (test error)
To evaluate how the model performs on future data, we use predict() to get the predicted values from the test set.
# medv_pred is a vector that contains predicted values for the test set.
medv_pred <- predict(object = model_1, newdata = Boston_test)
As with any other function, you can write the above statement the following way, as long as the arguments are in the right order.
medv_pred <- predict(model_1, Boston_test)
The most common measure is the Mean Squared Prediction Error (MSPE): the average of the squared differences between the predicted and actual values, \(\text{MSPE} = \frac{1}{n^*}\sum_{i=1}^{n^*}(y_i - \hat{y}_i)^2\), where \(n^*\) is the size of the test set.
mean((medv_pred - Boston_test$medv)^2)
[1] 15.04145
A less popular measure is the Mean Absolute Prediction Error, usually just called the Mean Absolute Error (MAE). As you can probably guess, instead of taking the average of the squared errors, MAE is the average of the absolute values of the errors.
mean(abs(medv_pred - Boston_test$medv))
[1] 3.04876
Note that if you omit the second argument of predict(), it gives you the in-sample predictions on the training set:
predict(model_1)
which is the same as
model_1$fitted.values
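If you want to check this equivalence yourself, one quick way is:
# Should return TRUE: in-sample predictions equal the fitted values
all.equal(predict(model_1), model_1$fitted.values)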