Supervised: Prediction Models
Examples of supervised data mining
- In supervised data mining, the target variable is identified
- The goal is to predict or explain a certain outcome
- Common applications:
  - prediction models
  - classification models
Prediction model: the response variable is numerical
- Regression methods
- e.g., stock return, housing price, temperature, spending of a customer
Give me two real-world business applications that use prediction models.
Application: Predicting median house value using Boston Housing Data
We use the Boston Housing Data as an illustrative example in this lab. The Boston housing data is a built-in dataset in the MASS package, so you do not need to download it externally. The MASS package comes with R when you install R, so there is no need to run install.packages("MASS") to download and install it, but you do need to load the package.
Load Data
library(MASS)
data(Boston); #this data is in MASS package
colnames(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
You can find details of the dataset in the help document.
?Boston
The original data are 506 observations on 14 variables, with medv being the response variable \(y\).
EDA
We have introduced many EDA techniques before. We will only briefly go through some of them here.
dim(Boston)
[1] 506 14
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
We skip the Exploratory Data Analysis (EDA) in these notes, but you should not omit it in your real-world cases. EDA is very important and is always the first analysis to do before any modeling. You should also provide some visualization to present your exploratory analysis.
Give me sample code in R to do exploratory analysis for the "Boston" dataset.
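For reference, here is a minimal sketch of what such exploratory code might look like; the particular plots and variables chosen below are illustrative, not required.
library(MASS)
data(Boston)
# Distribution of the response variable
hist(Boston$medv, breaks = 30, main = "Median house value", xlab = "medv")
# Pairwise scatterplots of the response against a few predictors
pairs(Boston[, c("medv", "rm", "lstat", "crim")])
# Correlation matrix of all variables, rounded for readability
round(cor(Boston), 2)
# medv by the Charles River indicator
boxplot(medv ~ chas, data = Boston, xlab = "chas", ylab = "medv")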
Splitting data to training and testing samples
Next we sample 90% of the original data and use it as the training set. The remaining 10% is used as the test set. The regression model will be built on the training set, and the future performance of your model will be evaluated on the test set.
sample_index <- sample(nrow(Boston), nrow(Boston)*0.90)
Boston_train <- Boston[sample_index, ]
Boston_test <- Boston[-sample_index, ]
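Note that sample() draws a random subset, so each run gives a different split. If you want a reproducible split, set a random seed first; the seed value below is arbitrary.
set.seed(123)  # any fixed value works; chosen here for illustration
sample_index <- sample(nrow(Boston), nrow(Boston)*0.90)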
(Optional) Standardization
If we want our results to be invariant to the units and the parameter estimates \(\beta_i\) to be comparable, we can standardize the variables. Essentially we are replacing the original values with their z-scores.
1st Way: create new variables manually.
Boston$sd.crim <- (Boston$crim - mean(Boston$crim))/sd(Boston$crim)
This does the same thing.
Boston$sd.crim <- scale(Boston$crim)
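If you want to standardize all predictors at once, one possible sketch (assuming you keep the response medv on its original scale) is:
# Standardize every column except the response medv (column 14)
Boston_std <- as.data.frame(scale(Boston[, -14]))
Boston_std$medv <- Boston$medv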
Model Building
Your task is to build the best model you can with the training data. You can refer to the regression and variable selection code on the slides for a more detailed description of linear regression.
The following model includes all \(x\) variables in the dataset.
model_1 <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = Boston_train)
To include all variables in the model, you can write the statement in this simpler way.
model_1 <- lm(medv ~ ., data = Boston_train)
summary(model_1)
Call:
lm(formula = medv ~ ., data = Boston_train)
Residuals:
Min 1Q Median 3Q Max
-15.3851 -2.7159 -0.4623 1.7193 24.4277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.680511 5.411542 7.333 1.09e-12 ***
crim -0.104331 0.033181 -3.144 0.001777 **
zn 0.053725 0.014355 3.743 0.000206 ***
indus 0.005883 0.063642 0.092 0.926387
chas 2.940600 0.885140 3.322 0.000968 ***
nox -18.665111 3.993603 -4.674 3.94e-06 ***
rm 3.355129 0.438383 7.653 1.24e-13 ***
age 0.012661 0.014074 0.900 0.368840
dis -1.581658 0.213393 -7.412 6.41e-13 ***
rad 0.305914 0.068344 4.476 9.69e-06 ***
tax -0.011811 0.003871 -3.051 0.002420 **
ptratio -0.925290 0.137805 -6.714 5.83e-11 ***
black 0.009106 0.002872 3.170 0.001628 **
lstat -0.602475 0.054080 -11.140 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.711 on 441 degrees of freedom
Multiple R-squared: 0.7487, Adjusted R-squared: 0.7413
F-statistic: 101.1 on 13 and 441 DF, p-value: < 2.2e-16
But, is this the model you want to use?
(Optional) Interaction terms in model
If you suspect the effect of one predictor x1 on the response y depends on the value of another predictor x2, you can add interaction terms to the model. To specify an interaction in the model, you put : between the two variables with the interaction effect. For example:
lm(medv~crim+zn+crim:zn, data=Boston_train)
# The following way automatically adds the main effects of crim and zn
lm(medv~crim*zn, data=Boston_train)
For now we will not investigate the interactions of variables. But you can try to do these analyses with the help of EducateUsBot.
Show me sample code in R to add interaction terms between age and rm to a linear regression model for the "Boston" dataset.
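As a rough sketch, such a model could look like the following; the model names model_int and model_int2 are just for illustration.
# Main effects of age and rm plus their interaction
model_int <- lm(medv ~ age * rm, data = Boston_train)
summary(model_int)
# Equivalent explicit specification of the same model
model_int2 <- lm(medv ~ age + rm + age:rm, data = Boston_train)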
Model Assessment
Suppose that everything in the model diagnostics is okay; in other words, the model we have built is a fairly valid model. Then we need to evaluate the model's performance in terms of different metrics.
Commonly used metrics include MSE, (adjusted) \(R^2\), AIC, BIC for in-sample performance, and MSPE for out-of-sample performance.
In-sample model evaluation (train error)
MSE of the regression, which is the square of the 'Residual standard error' in the above summary. It is the sum of squared residuals (SSE) divided by the degrees of freedom: \(\text{MSE} = SSE/(n-p-1)\). In some textbooks the sum of squared residuals (SSE) is called the residual sum of squares (RSS). The MSE of the regression is an unbiased estimator for the variance of \(\epsilon\), the error term in the regression model.
model_summary <- summary(model_1)
(model_summary$sigma)^2
[1] 22.18996
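You can verify this number by computing SSE divided by the residual degrees of freedom directly; this is just a sanity check, not a required step.
# SSE / (n - p - 1); df.residual stores n - p - 1 for an lm fit
sse <- sum(residuals(model_1)^2)
sse / model_1$df.residual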
\(R^2\) of the model
model_summary$r.squared
[1] 0.748731
Adjusted-\(R^2\) of the model. If you add a variable (or several in a group), SSE will decrease and \(R^2\) will increase, but Adjusted-\(R^2\) could go either way.
model_summary$adj.r.squared
[1] 0.741324
AIC and BIC of the model. These are information criteria; smaller values indicate a better fit.
AIC(model_1)
[1] 2717.35
BIC(model_1)
[1] 2779.155
BIC, AIC, and Adjusted \(R^2\) include a complexity penalty in their definitions; thus, when comparing across different models, they are better indicators of how well the model will perform on future data.
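For example, you could compare the full model against a smaller candidate; model_2 below uses an arbitrary illustrative subset of predictors, not a recommended specification.
# Compare a smaller candidate model by AIC/BIC (smaller is better)
model_2 <- lm(medv ~ lstat + rm + ptratio, data = Boston_train)
AIC(model_1); AIC(model_2)
BIC(model_1); BIC(model_2)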
Out-of-sample prediction (test error)
To evaluate how the model performs on future data, we use predict() to get the predicted values from the test set.
# medv_pred is a vector that contains predicted values for the test set.
medv_pred <- predict(object = model_1, newdata = Boston_test)
As with any other function, you can write the above statement the following way, as long as the arguments are in the right order.
medv_pred <- predict(model_1, Boston_test)
The most common measure is the Mean Squared Prediction Error (MSPE): the average of the squared differences between the predicted and actual values, \(\text{MSPE} = \frac{1}{n^*}\sum_{i=1}^{n^*}(y_i - \hat{y}_i)^2\), where \(n^*\) is the size of the test set.
mean((medv_pred - Boston_test$medv)^2)
[1] 15.04145
A less popular measure is the Mean Absolute Prediction Error, usually just called the Mean Absolute Error (MAE). As you can probably guess, instead of taking the average of the squared errors, MAE is the average of the absolute values of the errors.
mean(abs(medv_pred - Boston_test$medv))
[1] 3.04876
Note that if you omit the second argument of predict(), it gives you the in-sample predictions on the training set:
predict(model_1)
which is the same as
model_1$fitted.values
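If you want to check this equivalence yourself, one quick way is:
# Should return TRUE: in-sample predictions equal the fitted values
all.equal(predict(model_1), model_1$fitted.values)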