6. Validation
Let’s go through an end-to-end process for training and validating a classification model. After understanding this process, you may modify it for your own purposes. In general, the steps are as follows.
Data: acquire the data relevant to the problem and goal you are trying to achieve; here, we generate synthetic data.
Missingness: typically, some data are missing and we need a way to resolve this issue; here, we randomly remove values so that they are missing completely at random.
Splitting: when we train a model, we want it to generalize into the future against unseen data; here, we split the data into training and validation sets, where the modeling pipeline will only see the training data and will be tested for performance against the unseen validation set.
Training: the training pipeline depends on the nature and properties of your data; here, our pipeline is one-hot encoding, imputation, scaling and, finally, training a model.
Validation: after we are done training a model, we validate it against unseen data; here, the unseen data is the holdout validation set.
6.1. Generate data
Below, we simulate data D.
\(X_1 \sim \mathcal{N}(0, 1)\)
\(X_2 \sim \mathcal{N}(0, 1)\)
\(X_3\) is
red when \(X_1 < 0\) and \(X_2 < 0\)
green when \(X_1 > 0\) and \(X_2 > 0\)
blue otherwise
\(Y_0 = 1 + 2X_1 + 3X_2 + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, 1)\)
\(Y_1 = \frac{1}{1 + \exp(-Y_0)}\)
\(y \sim \operatorname{Bernoulli}(Y_1)\)
[1]:
suppressMessages({
  library('dplyr')
  library('purrr')
})

set.seed(37)

# Assign a color based on the signs of a and b.
getColor <- function(a, b) {
  if (a < 0 && b < 0) {
    return('red')
  } else if (a > 0 && b > 0) {
    return('green')
  } else {
    return('blue')
  }
}

# Simulate N observations according to the model above.
getData <- function(N=1000) {
  x1 <- rnorm(N, mean=0, sd=1)
  x2 <- rnorm(N, mean=0, sd=1)
  x3 <- map2(x1, x2, getColor) %>% flatten_chr()

  y <- 1 + 2.0 * x1 + 3.0 * x2 + rnorm(N, mean=0, sd=1)
  y <- 1.0 / (1.0 + exp(-y))
  y <- rbinom(n=N, size=1, prob=y)

  df <- data.frame(x1=x1, x2=x2, x3=x3, y=y)
  df <- df %>%
    mutate(y=ifelse(y == 0, 'neg', 'pos')) %>%
    mutate_at('y', as.factor)
  return(df)
}

D <- getData()
[2]:
print(sapply(D, is.factor))
x1 x2 x3 y
FALSE FALSE TRUE TRUE
[3]:
print(head(D))
x1 x2 x3 y
1 0.1247540 0.4817417 green pos
2 0.3820746 0.8176848 green pos
3 0.5792428 -0.3254274 blue pos
4 -0.2937481 -0.2674061 red pos
5 -0.8283492 1.7163252 blue pos
6 -0.3327136 1.2376945 blue pos
[4]:
print(summary(D))
x1 x2 x3 y
Min. :-2.8613 Min. :-3.28763 blue :502 neg:394
1st Qu.:-0.6961 1st Qu.:-0.59550 green:257 pos:606
Median :-0.0339 Median : 0.06348 red :241
Mean :-0.0184 Mean : 0.03492
3rd Qu.: 0.6836 3rd Qu.: 0.69935
Max. : 3.8147 Max. : 3.17901
[5]:
print(dim(D))
[1] 1000 4
6.2. Generate missing data
Here, we make 10% of the data missing in D and store the results in D.M.
[6]:
suppressMessages({
  library('missForest')
})

# Introduce 10% missing values, completely at random, into the predictors.
D.M <- prodNA(D[c('x1', 'x2', 'x3')], noNA=0.1)
D.M$y <- D$y
[7]:
print(summary(D.M))
x1 x2 x3 y
Min. :-2.86131 Min. :-3.28763 blue :457 neg:394
1st Qu.:-0.70096 1st Qu.:-0.61616 green:225 pos:606
Median :-0.02752 Median : 0.06288 red :213
Mean :-0.03416 Mean : 0.03213 NA's :105
3rd Qu.: 0.65422 3rd Qu.: 0.69815
Max. : 3.81473 Max. : 3.17901
NA's :102 NA's :93
[8]:
print(dim(D.M))
[1] 1000 4
6.3. Split data
Now we split D.M into 80% training D.T and 20% validation D.V.
[9]:
suppressMessages({
  library('caret')
})

# createDataPartition performs stratified sampling on y.
trIndices <- createDataPartition(D.M$y, times=1, p=0.8, list=FALSE)
D.T <- D.M[trIndices, ]
D.V <- D.M[-trIndices, ]
[10]:
print(head(D.T))
x1 x2 x3 y
1 0.1247540 0.4817417 green pos
2 0.3820746 0.8176848 green pos
3 0.5792428 -0.3254274 blue pos
5 NA 1.7163252 blue pos
7 -0.1921595 -1.1223299 red neg
8 1.3629827 1.5202646 <NA> pos
[11]:
print(head(D.V))
x1 x2 x3 y
4 -0.2937481 -0.2674061 red pos
6 -0.3327136 1.2376945 blue pos
26 -0.4590474 1.5723978 blue pos
29 -2.1039691 1.0775911 blue pos
33 0.6553955 NA blue neg
36 0.1297151 0.2434078 green pos
[12]:
print(dim(D.T))
[1] 801 4
[13]:
print(dim(D.V))
[1] 199 4
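Since the split is stratified on y, both splits should preserve the roughly 60/40 pos/neg class balance of the full data. A quick sanity check:

# Class proportions in the training and validation splits.
print(prop.table(table(D.T$y)))
print(prop.table(table(D.V$y)))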
6.4. Profile data
6.4.1. Statistical profiles
Now, let’s quickly look at the data profiles.
6.4.1.1. Training profile
[14]:
library('skimr')
skimmed <- skim(D.T[,1:2])
s <- skimmed[, c(2:6, 12)]
print(s)
# A tibble: 2 x 6
skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.hist
<chr> <int> <dbl> <dbl> <dbl> <chr>
1 x1 87 0.891 -0.0156 0.994 ▁▅▇▃▁
2 x2 73 0.909 0.0337 1.04 ▁▃▇▅▁
[15]:
skimmed <- skim(D.T[,3:4])
s <- skimmed[, c(2:7)]
print(s)
# A tibble: 2 x 6
skim_variable n_missing complete_rate factor.ordered factor.n_unique
<chr> <int> <dbl> <lgl> <int>
1 x3 81 0.899 FALSE 3
2 y 0 1 FALSE 2
# … with 1 more variable: factor.top_counts <chr>
6.4.1.2. Validation profile
[16]:
skimmed <- skim(D.V[,1:2])
s <- skimmed[, c(2:6, 12)]
print(s)
# A tibble: 2 x 6
skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.hist
<chr> <int> <dbl> <dbl> <dbl> <chr>
1 x1 15 0.925 -0.106 1.05 ▂▇▇▂▁
2 x2 20 0.899 0.0257 0.977 ▁▂▇▅▁
[17]:
skimmed <- skim(D.V[,3:4])
s <- skimmed[, c(2:7)]
print(s)
# A tibble: 2 x 6
skim_variable n_missing complete_rate factor.ordered factor.n_unique
<chr> <int> <dbl> <lgl> <int>
1 x3 24 0.879 FALSE 3
2 y 0 1 FALSE 2
# … with 1 more variable: factor.top_counts <chr>
6.4.2. Missingness profiles
Below, we visually inspect the data missingness.
[18]:
suppressMessages({
library('VIM')
})
options(repr.plot.width=4, repr.plot.height=4)
p <- aggr(
D.M,
col=c('navyblue','yellow'),
numbers=TRUE,
sortVars=TRUE,
labels=names(D.M),
cex.axis=.7,
gap=3,
ylab=c('Missing data', 'Pattern')
)
Variables sorted by number of missings:
Variable Count
x3 0.105
x1 0.102
x2 0.093
y 0.000
[19]:
p <- aggr(
  D.T,
  col=c('navyblue', 'yellow'),
  numbers=TRUE,
  sortVars=TRUE,
  labels=names(D.T),
  cex.axis=.7,
  gap=3,
  ylab=c('Missing data', 'Pattern')
)
Warning message in plot.aggr(res, ...):
“not enough horizontal space to display frequencies”
Variables sorted by number of missings:
Variable Count
x1 0.10861423
x3 0.10112360
x2 0.09113608
y 0.00000000
[20]:
p <- aggr(
  D.V,
  col=c('navyblue', 'yellow'),
  numbers=TRUE,
  sortVars=TRUE,
  labels=names(D.V),
  cex.axis=.7,
  gap=3,
  ylab=c('Missing data', 'Pattern')
)
Variables sorted by number of missings:
Variable Count
x3 0.12060302
x2 0.10050251
x1 0.07537688
y 0.00000000
6.5. One-hot encoding
We perform one-hot encoding (OHE) on the categorical variables and store the result in D.O. Note how we start the training pipeline at OHE: the reason is to avoid data leakage. Had we fit the OHE (and the downstream preprocessing steps) on the whole dataset, we might bias the training process towards better results.
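The general pattern is to fit each transform on the training rows only and then apply the fitted transform, unchanged, to any other rows. A minimal sketch using caret’s range transform (illustrative only; the names pre, T.r and V.r are made up, and the actual pipeline below uses OHE, imputation and scaling):

# Sketch: fit the transform on the training split only...
pre <- preProcess(D.T[, c('x1', 'x2')], method='range')
# ...then apply the same fitted transform to both splits.
T.r <- predict(pre, newdata=D.T[, c('x1', 'x2')])
V.r <- predict(pre, newdata=D.V[, c('x1', 'x2')])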
[21]:
ohe <- dummyVars(~ ., data=D.T[,-4], sep='_')
D.O <- data.frame(predict(ohe, newdata=D[,-4]))
D.O$y <- D$y
[22]:
print(head(D.O))
x1 x2 x3_blue x3_green x3_red y
1 0.1247540 0.4817417 0 1 0 pos
2 0.3820746 0.8176848 0 1 0 pos
3 0.5792428 -0.3254274 1 0 0 pos
4 -0.2937481 -0.2674061 0 0 1 pos
5 -0.8283492 1.7163252 1 0 0 pos
6 -0.3327136 1.2376945 1 0 0 pos
[23]:
print(summary(D.O))
x1 x2 x3_blue x3_green
Min. :-2.8613 Min. :-3.28763 Min. :0.000 Min. :0.000
1st Qu.:-0.6961 1st Qu.:-0.59550 1st Qu.:0.000 1st Qu.:0.000
Median :-0.0339 Median : 0.06348 Median :1.000 Median :0.000
Mean :-0.0184 Mean : 0.03492 Mean :0.502 Mean :0.257
3rd Qu.: 0.6836 3rd Qu.: 0.69935 3rd Qu.:1.000 3rd Qu.:1.000
Max. : 3.8147 Max. : 3.17901 Max. :1.000 Max. :1.000
x3_red y
Min. :0.000 neg:394
1st Qu.:0.000 pos:606
Median :0.000
Mean :0.241
3rd Qu.:0.000
Max. :1.000
6.6. Imputation
We will preprocess the one-hot encoded data using k-nearest neighbors imputation and store the result in D.I. Note that caret’s knnImpute also centers and scales the data as part of finding neighbors, which is why D.I below is standardized.
[24]:
library('RANN')

# Fill each missing entry from its k nearest complete rows (k=5 by default).
imp <- preProcess(D.O, method='knnImpute')
D.I <- predict(imp, newdata=D.O)
print(anyNA(D.I))
[1] FALSE
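The number of neighbors can be changed through preProcess’s k argument; for example, a variant not used in this pipeline (the name imp.7 is illustrative):

# Sketch: impute using 7 nearest neighbors instead of the default 5.
imp.7 <- preProcess(D.O, method='knnImpute', k=7)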
[25]:
print(head(D.I))
x1 x2 x3_blue x3_green x3_red y
1 0.1408176 0.4391669 -1.0035059 1.6994586 -0.563210 pos
2 0.3939391 0.7693556 -1.0035059 1.6994586 -0.563210 pos
3 0.5878898 -0.3541760 0.9955098 -0.5878343 -0.563210 pos
4 -0.2708553 -0.2971486 -1.0035059 -0.5878343 1.773761 pos
5 -0.7967326 1.6526029 0.9955098 -0.5878343 -0.563210 pos
6 -0.3091849 1.1821708 0.9955098 -0.5878343 -0.563210 pos
[26]:
print(summary(D.I))
x1 x2 x3_blue x3_green
Min. :-2.79652 Min. :-3.26564 Min. :-1.0035 Min. :-0.5878
1st Qu.:-0.66661 1st Qu.:-0.61962 1st Qu.:-1.0035 1st Qu.:-0.5878
Median :-0.01525 Median : 0.02807 Median : 0.9955 Median :-0.5878
Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.69053 3rd Qu.: 0.65305 3rd Qu.: 0.9955 3rd Qu.: 1.6995
Max. : 3.77058 Max. : 3.09023 Max. : 0.9955 Max. : 1.6995
x3_red y
Min. :-0.5632 neg:394
1st Qu.:-0.5632 pos:606
Median :-0.5632
Mean : 0.0000
3rd Qu.:-0.5632
Max. : 1.7738
6.7. Rescale
Let’s rescale the variables into the range [0, 1] and store the result in D.R. The range method maps each value to \((x - x_{\min}) / (x_{\max} - x_{\min})\), with the minimum and maximum learned from the data passed to preProcess.
[27]:
ran <- preProcess(D.I, method='range')
D.R <- predict(ran, newdata=D.I)
[28]:
print(head(D.R))
x1 x2 x3_blue x3_green x3_red y
1 0.4472809 0.5828955 0 1 0 pos
2 0.4858248 0.6348457 0 1 0 pos
3 0.5153584 0.4580750 1 0 0 pos
4 0.3845938 0.4670474 0 0 1 pos
5 0.3045163 0.7738113 1 0 0 pos
6 0.3787572 0.6997959 1 0 0 pos
[29]:
print(summary(D.R))
x1 x2 x3_blue x3_green
Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
1st Qu.:0.3243 1st Qu.:0.4163 1st Qu.:0.000 1st Qu.:0.000
Median :0.4235 Median :0.5182 Median :1.000 Median :0.000
Mean :0.4258 Mean :0.5138 Mean :0.502 Mean :0.257
3rd Qu.:0.5310 3rd Qu.:0.6165 3rd Qu.:1.000 3rd Qu.:1.000
Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.000
x3_red y
Min. :0.000 neg:394
1st Qu.:0.000 pos:606
Median :0.000
Mean :0.241
3rd Qu.:0.000
Max. :1.000
[30]:
print(dim(D.R))
[1] 1000 6
6.8. Feature importance
We may visualize feature importance to see which features (variables) are informative about the class variable.
[31]:
featurePlot(
  x=D.R[,1:5],
  y=D.R$y,
  plot='box',
  strip=strip.custom(par.strip.text=list(cex=.7)),
  scales=list(x=list(relation='free'), y=list(relation='free'))
)
[32]:
featurePlot(
  x=D.R[,1:5],
  y=D.R$y,
  plot='density',
  strip=strip.custom(par.strip.text=list(cex=.7)),
  scales=list(x=list(relation='free'), y=list(relation='free'))
)
6.9. Feature selection
Feature selection may help us use only the features (variables) that matter to the classification. Below, we run recursive feature elimination (RFE) with random forest functions.
[33]:
ctrl <- rfeControl(
  functions=rfFuncs,    # random forest functions drive the elimination
  method='repeatedcv',  # 10-fold cross-validation, repeated 5 times
  repeats=5,
  verbose=FALSE
)

# Evaluate subsets of 2 and 3 predictors; the full set of 5 is always included.
profile <- rfe(y ~ ., data=D.R, sizes=c(2, 3), rfeControl=ctrl)
[34]:
profile
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
2 0.8028 0.5861 0.03459 0.07315
3 0.8102 0.5908 0.03654 0.08220
5 0.8214 0.6186 0.03640 0.08056 *
The top 5 variables (out of 5):
x2, x1, x3_red, x3_green, x3_blue
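All five variables are retained here. Had RFE selected a subset, the chosen variables could be read off the profile object:

# The predictors selected for the best subset size.
print(predictors(profile))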
6.10. Training
We are now ready to train a classification model. We will use a random forest to classify the data. Use the modelLookup function to see which hyperparameters are tunable; below, only the mtry parameter (the number of randomly selected predictors) is available. The model will be bound to M.
[35]:
modelLookup('rf')
  model parameter                         label forReg forClass probModel
1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
[36]:
M <- train(y ~ ., data=D.R, method='rf')
[37]:
print(M)
Random Forest
1000 samples
5 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8192793 0.6111993
3 0.8136866 0.6049471
5 0.7961271 0.5705189
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
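By default, train tunes mtry over a small grid using bootstrap resampling. For more control, a trainControl object and an explicit tuning grid may be supplied; a sketch (the names tc, grid and M.cv are illustrative):

# Sketch: 10-fold cross-validation over an explicit mtry grid.
tc <- trainControl(method='cv', number=10)
grid <- expand.grid(mtry=c(2, 3, 5))
M.cv <- train(y ~ ., data=D.R, method='rf', trControl=tc, tuneGrid=grid)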
We may visualize how the resampled accuracy varies with the number of randomly selected predictors (mtry).
[38]:
plot(M)
We may visualize the variable importance.
[39]:
vi <- varImp(M)
plot(vi)
6.11. Predicting on validation data
6.11.1. Transform validation data
We will now run the validation data D.V through the fitted pipeline:

one-hot encoding, storing the result in O
imputation, storing the result in I
scaling (to the range [0, 1]), storing the result in R
[40]:
O <- predict(ohe, newdata=D.V)
[41]:
I <- predict(imp, newdata=O)
[42]:
R <- predict(ran, newdata=I)
[43]:
print(head(R))
x1 x2 x3_blue x3_green x3_red
4 0.3845938 0.4670474 0 0 1
6 0.3787572 0.6997959 1 0 0
26 0.3598337 0.7515544 1 0 0
29 0.1134421 0.6750376 1 0 0
33 0.5267653 0.4183815 1 0 0
36 0.4480240 0.5460396 0 1 0
[44]:
print(dim(R))
[1] 199 5
6.11.2. Confusion matrix
Let’s look at the confusion matrix.
[45]:
y <- predict(M, R)

# 'pos' is declared as the positive class for the derived statistics.
cm <- confusionMatrix(reference=D.V$y, data=y, positive='pos')
[46]:
print(cm)
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 62 16
pos 16 105
Accuracy : 0.8392
95% CI : (0.7806, 0.8873)
No Information Rate : 0.608
P-Value [Acc > NIR] : 1.118e-12
Kappa : 0.6626
Mcnemar's Test P-Value : 1
Sensitivity : 0.8678
Specificity : 0.7949
Pos Pred Value : 0.8678
Neg Pred Value : 0.7949
Prevalence : 0.6080
Detection Rate : 0.5276
Detection Prevalence : 0.6080
Balanced Accuracy : 0.8313
'Positive' Class : pos
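These statistics follow directly from the table above. With pos as the positive class: \(\text{sensitivity} = 105 / (105 + 16) \approx 0.8678\), \(\text{specificity} = 62 / (62 + 16) \approx 0.7949\) and \(\text{accuracy} = (62 + 105) / 199 \approx 0.8392\).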
6.11.3. Receiver operating characteristic curve
Let’s look at the area under the curve (AUC) for the receiver operating characteristic (ROC) curve. Note that we score with the probability of the neg class below, so the curves treat neg as the positive class; for ROC, the AUC is the same either way.
[47]:
library('PRROC')

# type='prob' returns one column per class level; column 1 holds the
# probability of the first level, 'neg'.
y <- predict(M, R, type='prob')[,1]

results <- data.frame(
  y = D.V$y,
  y.pred = y
)

# PRROC expects the scores of the "positive" (class 0) and "negative"
# (class 1) observations as two separate vectors.
class.0 <- results %>% filter(y == 'neg') %>% select(y.pred)
class.1 <- results %>% filter(y == 'pos') %>% select(y.pred)
[48]:
roc <- roc.curve(
  scores.class0=class.0$y.pred,
  scores.class1=class.1$y.pred,
  curve=TRUE,
  rand.compute=TRUE
)
plot(roc, legend=1, rand.plot=TRUE)
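The AUC is stored on the returned object and may be printed directly:

# Area under the ROC curve computed by roc.curve.
print(roc$auc)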
6.11.4. Precision-recall curve
The precision-recall curve is shown below. Unlike the ROC baseline, the random baseline of a PR curve is a horizontal line at the prevalence of the class treated as positive (here, neg).
[49]:
pr <- pr.curve(
  scores.class0=class.0$y.pred,
  scores.class1=class.1$y.pred,
  curve=TRUE,
  rand.compute=TRUE
)
plot(pr, legend=1, rand.plot=TRUE)
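pr.curve computes two AUC estimates, available on the returned object:

# Integral and Davis-Goadrich estimates of the area under the PR curve.
print(pr$auc.integral)
print(pr$auc.davis.goadrich)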