3. Classification
3.1. Simulate data
Below, we simulate training and validation data according to the following model.
\(X_1 \sim \mathcal{N}(0, 1)\)
\(X_2 \sim \mathcal{N}(0, 1)\)
\(Y_0 = 1 + 2X_1 + 3X_2 + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)\)
\(Y_1 = \frac{1}{1 + \exp(-Y_0)}\)
\(y \sim \mathrm{Binomial}(1, Y_1)\)
[1]:
suppressMessages({
  library('dplyr')
})

set.seed(37)

getData <- function(N=1000) {
  # simulate the predictors
  x1 <- rnorm(N, mean=0, sd=1)
  x2 <- rnorm(N, mean=0, sd=1)
  # linear predictor with Gaussian noise, squashed through the sigmoid
  y <- 1 + 2.0 * x1 + 3.0 * x2 + rnorm(N, mean=0, sd=1)
  y <- 1.0 / (1.0 + exp(-y))
  # draw the binary outcome and relabel it as a factor
  y <- rbinom(n=N, size=1, prob=y)

  df <- data.frame(x1=x1, x2=x2, y=y)
  df <- df %>%
    mutate(y=ifelse(y == 0, 'neg', 'pos')) %>%
    mutate_if(is.character, as.factor)
  return(df)
}

T <- getData()  # training set
V <- getData()  # validation set

print(summary(T))
x1 x2 y
Min. :-2.8613 Min. :-3.28763 neg:394
1st Qu.:-0.6961 1st Qu.:-0.59550 pos:606
Median :-0.0339 Median : 0.06348
Mean :-0.0184 Mean : 0.03492
3rd Qu.: 0.6836 3rd Qu.: 0.69935
Max. : 3.8147 Max. : 3.17901
3.2. Classification methods
There are many classification models available in R. We show only a few here and use the pROC package to visualize their performance.
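Each method below follows the same recipe: fit on the training set T, predict the probability of the 'pos' class on the validation set V, and build the ROC curve with pROC. A minimal helper capturing that recipe could look as follows (the name evalRoc is ours, purely illustrative):
suppressMessages({
  library('pROC')
})

# y_pred must be the predicted probability of the 'pos' class;
# direction='<' means higher scores indicate 'pos'
evalRoc <- function(y_true, y_pred) {
  r <- roc(y_true, y_pred, levels=c('neg', 'pos'), direction='<')
  plot(r, print.auc=TRUE)
  return(r)
}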
3.2.1. Random forest
Using a random forest.
[2]:
suppressMessages({
  library('pROC')
  library('randomForest')
})

# mtry=2 considers both predictors at every split
m <- randomForest(y ~ ., data=T, mtry=2, ntree=1000)
[3]:
print(m)
Call:
randomForest(formula = y ~ ., data = T, mtry = 2, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 2
OOB estimate of error rate: 20.6%
Confusion matrix:
neg pos class.error
neg 287 107 0.2715736
pos 99 507 0.1633663
[4]:
p <- predict(m, newdata=V, type='prob')
y_pred <- p[,2]  # probability of the 'pos' class (second column)
r <- roc(V$y, y_pred, levels=c('neg', 'pos'), direction='<')
print(r)
Call:
roc.default(response = V$y, predictor = y_pred, levels = c("neg", "pos"), direction = "<")
Data: y_pred in 402 controls (V$y neg) < 598 cases (V$y pos).
Area under the curve: 0.9062
[5]:
options(repr.plot.width=4, repr.plot.height=4)
plot(r, print.auc=TRUE)
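As a quick aside, a fitted random forest also reports variable importance; x2 should matter more than x1 given its larger true coefficient. A minimal sketch using the model above (run before m is reassigned below):
print(importance(m))  # mean decrease in Gini impurity per predictor
varImpPlot(m)         # the same information as a dot chart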
3.2.2. Logistic regression
Using logistic regression.
[6]:
m <- glm(y ~ ., family='binomial', data=T)
[7]:
print(m)
Call: glm(formula = y ~ ., family = "binomial", data = T)
Coefficients:
(Intercept) x1 x2
0.7692 1.6568 2.3364
Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 1341
Residual Deviance: 724.4 AIC: 730.4
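Because the link is logistic, exponentiating the coefficients gives odds ratios; note the estimates sit below the simulation's true values (1, 2, 3) because the extra Gaussian noise inside the sigmoid attenuates them. A quick check:
# e.g. a one-unit increase in x2 multiplies the odds of 'pos'
# by roughly exp(2.3364), about 10.3
print(exp(coef(m)))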
[8]:
y_pred <- predict(m, newdata=V, type='response')  # P(y = 'pos')
r <- roc(V$y, y_pred, levels=c('neg', 'pos'), direction='<')
print(r)
Call:
roc.default(response = V$y, predictor = y_pred, levels = c("neg", "pos"), direction = "<")
Data: y_pred in 402 controls (V$y neg) < 598 cases (V$y pos).
Area under the curve: 0.933
[9]:
options(repr.plot.width=4, repr.plot.height=4)
plot(r, print.auc=TRUE)
3.2.3. AdaBoost
Using AdaBoost.
[10]:
library('fastAdaboost')

# 50 boosting iterations over decision-tree weak learners
m <- adaboost(y ~ ., nIter=50, data=T)
[11]:
print(m)
adaboost(formula = y ~ ., data = T, nIter = 50)
y ~ .
Dependent Variable: y
No of trees:50
The weights of the trees are: 0.9638742 0.8023524 0.7152676 0.6343958 ... (46 more values)
[12]:
p <- predict(m, newdata=V)
y_pred <- p$prob[,2]  # probability of the 'pos' class
r <- roc(V$y, y_pred, levels=c('neg', 'pos'), direction='<')
print(r)
Call:
roc.default(response = V$y, predictor = y_pred, levels = c("neg", "pos"), direction = "<")
Data: y_pred in 402 controls (V$y neg) < 598 cases (V$y pos).
Area under the curve: 0.8936
[13]:
options(repr.plot.width=4, repr.plot.height=4)
plot(r, print.auc=TRUE)
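The ROC curve summarizes all thresholds at once; to inspect one operating point, we can threshold the probabilities ourselves. A quick base-R sketch at the conventional 0.5 cutoff:
# hard labels at a 0.5 cutoff, tabulated against the truth
y_hat <- ifelse(y_pred > 0.5, 'pos', 'neg')
print(table(actual=V$y, predicted=y_hat))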
3.2.4. Support vector machine
Using a support vector machine (SVM).
[14]:
library('e1071')
# probability=TRUE fits the additional model needed for class probabilities
m <- svm(y ~ ., probability=TRUE, data=T)
[15]:
print(m)
Call:
svm(formula = y ~ ., data = T, probability = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 405
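The cost of 1 above is just the e1071 default, and the radial kernel's gamma defaults to 1/(number of predictors); both are usually tuned in practice. A sketch with e1071's tune(), using an illustrative grid rather than tuned recommendations:
# grid search by cross-validation (10-fold by default)
tuned <- tune(svm, y ~ ., data=T,
              ranges=list(cost=c(0.1, 1, 10), gamma=c(0.1, 0.5, 1)))
print(summary(tuned))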
[16]:
p <- predict(m, newdata=V, probability=TRUE)
# select the 'pos' column by name; e1071 does not guarantee column order
y_pred <- attr(p, 'probabilities')[, 'pos']
r <- roc(V$y, y_pred, levels=c('neg', 'pos'), direction='<')
print(r)
Call:
roc.default(response = V$y, predictor = y_pred, levels = c("neg", "pos"), direction = "<")
Data: y_pred in 402 controls (V$y neg) < 598 cases (V$y pos).
Area under the curve: 0.9166
[17]:
options(repr.plot.width=4, repr.plot.height=4)
plot(r, print.auc=TRUE)
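Finally, the four ROC curves are easier to compare on one plot. In this notebook r is overwritten at each step; the sketch below assumes each curve was saved under its own (hypothetical) name, r_rf, r_glm, r_ada and r_svm:
# overlay the curves; lines() dispatches to pROC's lines.roc
plot(r_glm, col='black')
lines(r_svm, col='red')
lines(r_rf, col='blue')
lines(r_ada, col='darkgreen')
legend('bottomright', legend=c('GLM', 'SVM', 'RF', 'AdaBoost'),
       col=c('black', 'red', 'blue', 'darkgreen'), lwd=2)

# DeLong's test for a difference in AUC between two correlated curves
print(roc.test(r_glm, r_rf))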