Splitting the data into training, validation, and test sets (60%/20%/20%), I will show how to reach a classification accuracy of 96.2% using an ensemble of 20 deep-learning networks.


Following on from my previous post, I wanted to provide an example (that is probably more for me to learn from than for others) of implementing a deep-learning algorithm in R using my new favourite package, H2O, on the “Forest Cover” data-set. Essentially we start with a data-set of 581,012 observations x 13 variables (2 of which are categorical explanatory variables with 4 and 40 levels; the variable to predict, “Cover Type”, is also categorical with 7 levels).

The data-set appeared on Kaggle, where the competition was described as follows:

In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

I will not focus too much on the data-set itself (or on the exploratory analysis that should always be done before blindly making predictions), as I want to focus instead on the features of H2O in R and why it is my favourite deep-learning tool.

Step 1: Import, Brief Glance, Set-up Model

It is interesting to note that we can initialise H2O to use all available CPU threads by passing “nthreads=-1“. I import the data into R in the usual way and briefly check what it looks like.
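Condensed from the full script at the end of this post, the set-up amounts to a few lines:

# Start H2O on all available CPU threads
library(h2o)
localH2O <- h2o.init(max_mem_size = '16g', nthreads = -1)

# Import the CSV and push it into an H2O frame
train.data <- read.table("covtype.csv", header = TRUE, sep = ",", strip.white = TRUE)
data.hex <- as.h2o(x = train.data)

# Split 60/20/20 into training, validation and test frames
hex_split <- h2o.splitFrame(data.hex, ratios = c(0.6, 0.2))
train.hex <- hex_split[[1]]
validate.hex <- hex_split[[2]]
test.hex <- hex_split[[3]]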

This is the variable to predict and it contains 7 different categories of Cover:

[Figure: bar plot of the 7 Cover Type categories]

The Soil Type explanatory categorical has many levels (40):

[Figure: bar plot of the 40 Soil Type levels]

Wilderness Area is much simpler, with only 4 levels:

[Figure: bar plot of the 4 Wilderness Area levels]

We can briefly check how wilderness area is related to the cover type (using a mosaic plot):

[Figure: mosaic plot of Cover Type by Wilderness Area]

We can then see whether the variation in any of the continuous variables looks useful – for example, this is a box-whisker plot of Cover Type (x-axis) against Elevation (y-axis). To me, it looks like there is enough variation in Elevation across the cover types to make it a good predictor.

[Figure: box-whisker plot of Elevation by Cover Type]

It is not necessary for us to deal with missing values (although the data has none) or to standardise the variables, because H2O does this automatically (along with creating dummies from the categorical variables).
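If you want to make this behaviour explicit, h2o.deeplearning exposes the relevant switches; the sketch below simply restates the defaults (the models later in this post do not change them, and the frame and column names are those defined in the full script at the end):

# Shown only to make the automatic pre-processing explicit; both arguments
# are already TRUE by default and are left untouched in the models below.
model.check <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  standardize = TRUE,            # standardise numeric inputs to mean 0, s.d. 1
  use_all_factor_levels = TRUE   # 1-of-C dummy encoding of categoricals
)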

Step 2: Create some base scenarios (default models)

It is useful to see what kind of predictions we can get without doing anything (hopefully, after doing something, we do better). Running the following models (a variety of machine-learning algorithms) with their default settings, I generate a confusion matrix on the validation data to see how many observations we misclassify and whether we are over-fitting (the error counts below are read off the confusion matrix, using the small helper shown at the end of this step):

Gradient Boosted Machines -> 22,879 errors (out of 115,915)

Random Forest -> 7,551 errors

Neural Network -> 12,049 errors

To create a comparison closer to home, I also examine what a multinomial logistic regression gives us (by calling h2o.glm with family = “multinomial“):

Multinomial Logistic Regression -> 32,456 errors

It appears as if the data really lends itself to the Random Forest. We will later analyse whether this is actually the case, or whether the default parameters for the Random Forest were simply a lucky hit compared to the defaults for the Neural Network. Either way, we do not want to end up with more than 7,551 errors.
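Throughout this post the error counts are read off the last row of the validation confusion matrix. The full script at the end repeats that call inline; wrapped into a small helper of my own (purely for illustration), the pattern looks like this:

# Helper (illustrative only, not in the full script): number of mis-classified
# validation observations, taken from the last entry of the "Rate" column of
# the validation confusion matrix, e.g. "7,551 / 115,915".
validation.errors <- function(model) {
  cm <- h2o.confusionMatrix(h2o.performance(model, valid = TRUE))
  rev(cm$Rate)[1]
}

validation.errors(model.rf)   # "7,551 / 115,915"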

Step 3: Improve the NN

We can check that the basic neural network created has 54 input neurons (10 continuous variables + 4 levels for wilderness area + 40 levels for soil type), which means it is using 1-of-C dummy encoding. The default hidden layers have [200,200] ReLU neurons, and the read-out layer has 7 softmax neurons to classify (with cross-entropy loss), for a total of 53,007 parameters to estimate and roughly 350,000 observations (the 60% training split) to train on.

I start by adding L1 and L2 regularisation (and boosting the number of training epochs from 10 to 100). This reduces our errors to 8,333, which makes sense: the model is now less prone to fitting noise and better able to generalise. (If you think this could all be down to just increasing the epochs, I also ran the model without weight-decay and the errors were 11,795, so that is not the case.)

I experiment with a wide range of hidden-layer architectures and numbers of neurons (more than are listed in the R script – ideally I wanted the list of candidates to be wide enough that the chosen architecture sat somewhere in the middle). The best classification comes from [400,300,200] neurons, which is a fairly complex network, but the errors fall to 5,741.

Including drop-out does not improve the results (it actually makes them much worse) and the errors jump to 19,343. This would be interesting to investigate further – perhaps the weight-decay by itself is already enough for this data-set? To check, I ran a quick test dropping the weight-decay and keeping only drop-out; the errors were still around 21,000, so I decided not to use drop-out.
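For reference, drop-out in h2o.deeplearning is switched on by choosing a *WithDropout activation; the dropout ratios below are illustrative values of my own, not the ones used in the full script (which simply relies on H2O’s defaults):

# Illustrative sketch only: the dropout ratios here are assumptions, not the
# values used in the full script (which keeps H2O's defaults).
model.dropout <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs = 100,
  activation = "RectifierWithDropout",        # enables dropout on hidden layers
  hidden = c(400, 300, 200),
  hidden_dropout_ratios = c(0.2, 0.2, 0.2),   # per-hidden-layer dropout
  input_dropout_ratio = 0.1,                  # optional dropout on the inputs
  l1 = 1e-5,
  l2 = 1e-5
)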

I also evaluated the model with the constraint that the squared sum of incoming weights per unit should be no bigger than 10 (“max_w2=10“); however, this did not help – the errors came in at 6,429, slightly worse than without the constraint.

Finally, I briefly checked whether the Tanh activation function would do any better than ReLU; however, I ended up with 8,778 errors.

After all of this:
[Table: summary of the neural-network experiments and their validation errors]

Hence, I ended up nominating model #5, which had 5,741 errors on the validation data-set:


h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  hidden = c(400,300,200),
  epochs = 100,
  l1 = 1e-5,
  l2 = 1e-5
  #overwrite_with_best_model = TRUE, #default
  #max_categorical_features = FALSE, #default: otherwise hashing
  #loss = "CrossEntropy",  #default
  #activation = "Rectifier", #default
  #input_dropout_ratio=0, #default
  #max_w2 = Inf, #default
  #train_samples_per_iteration = -2, #default is auto-tuning, not -1: all data
  #use_all_factor_levels = TRUE #default: 1-of-C dummies
)

Step 4: Test our best model (95.1%)

The model ended up scoring a success rate of 95.1% on the test data (only 5,696 errors out of 116,476, which was even slightly lower than on the validation set). I also evaluated the other models on the test data, and their ranking was very similar to their ranking on the validation data.

Step 5: Create an Ensemble of Networks (96.2%)

“All happy families are alike; each unhappy family is unhappy in its own way.”― Leo Tolstoy, Anna Karenina

Training an ensemble of networks and using the ensemble to predict is empirically more successful than using any single model from the ensemble alone. I wanted to examine this approach with my best-performing architecture, so I trained 20 copies of it. Since these are trained on mini-batches with random weight initialisations, we should expect them to turn out a little different (this is also why, if you run my code, you will not be able to reproduce my exact errors). The hope is that the models are wrong in slightly different ways, so that when we take a vote amongst them we have a higher chance of being correct.

This turns out to be the case and we end up with only 4,434 errors on the test data (compared to 5,696 previously), which means we have an accuracy of 96.2%! This may not look like much, but consider that we have reduced the number of errors by 22%.
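The full script below computes the per-row vote with a functional Compose chain; an equivalent but more explicit majority vote – a sketch assuming ensemble.df holds one column of class predictions per network and y holds the true labels, as in the script – would be:

# Explicit per-row majority vote over the 20 sets of predictions
majority.vote <- apply(ensemble.df, 1, function(row) {
  votes <- table(row)                 # count the votes for each predicted class
  names(votes)[which.max(votes)]      # label with the most votes (ties -> first)
})
sum(as.character(majority.vote) != as.character(y$predict))  # ensemble errors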

Step 6: Extra – Grid-search to optimise parameters

A popular alternative to my approach in Step 3 is to conduct a grid-search to find the best parameters. This is useful because parameters can interact: a1 may be optimal given b1, while a3 is optimal given b5, so we need to compare combinations directly and see whether f(a1,b1) gives fewer errors than f(a3,b5) rather than tuning each parameter in isolation.

I decided to run the grid-search on the Random Forest model, however. It scored very well with its default settings, and I wanted to see whether a tuned version would be even better and overtake the NN.

hyper_params <- list(
  ntrees=c(10,20,50,75,100),  # Default=50
  max_depth=c(5,10,20,30,50),  # Default = 20
  nbins=c(10,20,50,100,150)  #Default = 20
)

forest.grid <- h2o.grid(
  "randomForest",
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  hyper_params = hyper_params
)

It turns out the grid-search returns the same ntrees and max_depth as the defaults and only increases nbins from 20 to 50. This tuned model gives 7,738 errors on the test data, which is not as good as our single optimised neural network.
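A side note: rather than taking the first element of forest.grid@model_ids as the full script does, the grid can also be retrieved sorted by a validation metric – a sketch, assuming “logloss” is an accepted sort_by value in this version of H2O:

# Retrieve the grid sorted by validation log-loss (lowest first) and take the
# top model; "logloss" as the sort metric is an assumption here.
sorted.grid <- h2o.getGrid(forest.grid@grid_id, sort_by = "logloss", decreasing = FALSE)
best.rf <- h2o.getModel(sorted.grid@model_ids[[1]])
h2o.confusionMatrix(h2o.performance(best.rf, data = test.hex))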

Code

Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_66')
ls()
rm(list = ls())
setwd("H:/H2O/")

library(h2o) #v3.6.0.8

#######################################
# SUMMARY: Predicting Forest Cover Types
# 96.2% Success Rate
#######################################
# 1. Elevation (m)
# 2. Aspect (azimuth from true north)
# 3. Slope (°)
# 4. Horizontal distance to nearest surface water feature (m)
# 5. Vertical distance to nearest surface water feature (m)
# 6. Horizontal distance to nearest roadway (m)
# 7. A relative measure of incident sunlight at 09:00 h on the summer solstice
# 8. A relative measure of incident sunlight at noon on the summer solstice
# 9. A relative measure of incident sunlight at 15:00 h on the summer solstice
# 10. Horizontal distance to nearest historic wildfire ignition point (m)
# 11. Wilderness area designation (four binary values, one for each wilderness area)
# 12. Soil type designation (40 binary values, one for each soil type)
#######################################
# Train     / Validate    / Test
# 60%       / 20%         / 20%
# 350,000   / 115,000     / 115,000 
#######################################

#######################################
# 1. Import, Quick-glance, set-up model
#######################################

# Initialise H2O with 16GB RAM and all cores
localH2O <- h2o.init(max_mem_size = '16g', nthreads= -1) # Allowed cores: 8                 

# Import Raw
train.data <- read.table("covtype.csv", header = TRUE, sep = ",", strip.white = TRUE)

# Briefly examine categorical variables
dim(train.data) #581,012 rows by 13 columns
summary(train.data)

# Missing Data?
require(Amelia)
missmap(train.data,
        main="Forest Cover - Missing Data", 
        col=c("white", "black"),
        legend=FALSE)  # Nope

# Grouping of categorical (dimensionality)
# Cover Type
table(train.data$Cover_Type)
barplot(table(train.data$Cover_Type), 
        main="Cover Type", col="red")

# Soil Type
table(train.data$Soil_Type)
barplot(table(train.data$Soil_Type), 
        main="Soil Type", col="green")

# Wilderness Area
table(train.data$Wilderness_Area)
barplot(table(train.data$Wilderness_Area), 
        main="Wilderness Area", col="blue")

mosaicplot(train.data$Wilderness_Area ~ train.data$Cover_Type, 
           main="Cover Type by Wilderness Area", shade=FALSE, 
           color=TRUE, xlab="Wilderness_Area", ylab="Cover_Type")

# Numeric variables
boxplot(train.data$Elevation ~ train.data$Cover_Type)

# Note: Breaking categoricals into 1-of-C dummies and standardisation are done automatically
# Rescale -> add or subtract a constant and then multiply or divide by a constant
# Standardise -> rescale to have a mean of 0 and s.d. of 1
# Normalise -> rescale the values to a range of e.g [0,1] or [-1,1]

# Load data into H2O frame
data.hex <- as.h2o(x=train.data)

# Split the dataset into 60:20:20 for training, validation, test
hex_split <- h2o.splitFrame(data.hex, ratios = c(0.6, 0.2))
train.hex <- hex_split[[1]]
validate.hex <- hex_split[[2]]
test.hex <- hex_split[[3]]

# Set y ~ x
predict <- "Cover_Type" #What we are guessing
explanatory <- setdiff(names(train.hex),predict)

#######################################
# 2. Benchmark models (default)
#######################################

# Vector to keep track of best Machine Learning Algo
model.errors <- list()

# [1]. Gradient Boosted Machines
model.gbm <- h2o.gbm(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex
)

# Assess performance on validation data (manual version):
# y.hat <- as.data.frame(h2o.predict(object=model.gbm, newdata=validate.hex))
# y <- as.data.frame(validate.hex[,c(predict)])
# names(y) <- c("predict")
# print(sum(y.hat$predict!=y$predict))
# Equivalent to taking the last value from the confusion matrix:
rev(h2o.confusionMatrix(h2o.performance(model.gbm, valid=T))$Rate)[1] # Errors: 22,879 / 115,915
model.errors[[1]] <- rev(h2o.confusionMatrix(h2o.performance(model.gbm, valid=T))$Rate)[1]

# [2]. Random Forest
model.rf <- h2o.randomForest(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex
)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.rf, valid=T))$Rate)[1] # 7,551
model.errors[[2]] <- rev(h2o.confusionMatrix(h2o.performance(model.rf, valid=T))$Rate)[1]
# It seems that tree classifiers work well for this data-set 

summary(train.hex)
# [3]. Neural Network
model.dl <- h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex
  #overwrite_with_best_model = TRUE, #default
  #max_categorical_features = FALSE, #default: otherwise hashing
  #loss = "CrossEntropy",  #default
  #activation = "Rectifier", #default
  #hidden = c(200,200),  #default
  #epochs = 10, #default
  #l1 = 0, #default
  #l2 = 0, #default
  #input_dropout_ratio=0, #default
  #max_w2 = Inf, #default
  #train_samples_per_iteration = -2, #default is auto-tuning, not -1: all data
  #use_all_factor_levels = TRUE #default: 1-of-C dummies
)

model.dl 
#[56] Input Neurons
#[200,200] Rectifier Hidden Neurons
#[7] Softmax output Neurons
#(no dropout, CE loss, 53,007 weights)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl, valid=T))$Rate)[1] # 12,049
model.errors[[3]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl, valid=T))$Rate)[1]


# Before we begin improving our ML algorithms
# Let's see what a standard multinomial logistic regression does

# [0]. Multinomial Logistic Regression
model.glm <- h2o.glm(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  family = "multinomial"
)
model.glm
rev(h2o.confusionMatrix(h2o.performance(model.glm, valid=T))$Rate)[1] # 32,456 

#######################################
# 3. Optimise NN Model to beat defaults
#######################################
?h2o.deeplearning

# [4]. Neural Network (L2 Regularization & L1 Regularization & Epochs boosted from 10 to 100)
model.dl2 <- h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs=100,
  l2=1e-5,
  l1=1e-5
)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl2, valid=T))$Rate)[1] # 8,333
model.errors[[4]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl2, valid=T))$Rate)[1]

# [5]. Neural Network (Experiment with Hidden-Layers)
# Random Hyper-Parameter Search (alt. method is Grid Search)
hidden = list(
  c(250),
  c(50),
  c(200,150),
  c(250, 200),
  c(200,200,200),
  c(100,100,100),
  c(300,200,100),
  c(500,400,300),
  c(600,500,400),
  c(350,250,150),
  c(100,80,40),
  c(400,300,200),  # This ends up being the best
  c(128,128,28),
  c(700,500,300)
)
models.dl3lst <- list()
errors <- list()

for (i in 1:length(hidden)) {
  print(paste("Hidden Layer: ",hidden[i]))
  model.hidden <- h2o.deeplearning(
    x=explanatory,
    y=predict,
    training_frame = train.hex,
    validation_frame = validate.hex,
    epochs=100,
    hidden = hidden[[i]],
    l2=1e-5,
    l1=1e-5
  )
  models.dl3lst[[i]] <- model.hidden
  # Assess performance on validation data
  print(rev(h2o.confusionMatrix(h2o.performance(models.dl3lst[[i]], valid=T))$Rate)[1])
  errors[[i]] <- rev(h2o.confusionMatrix(h2o.performance(models.dl3lst[[i]], valid=T))$Error)[1]
}

# Choose model with smallest error
minerror.indx <- which.min(unlist(errors))  # unlist so which.min sees a numeric vector
print(paste("Lowest Error with Hidden Layer: ",hidden[minerror.indx]))  
# c(400, 300, 200)

# Assess performance on validation data
model.dl3 <- models.dl3lst[[minerror.indx]]
rev(h2o.confusionMatrix(h2o.performance(model.dl3, valid=T))$Rate)[1] # 5,741
model.errors[[5]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl3, valid=T))$Rate)[1]

# [6]. Neural Network (Try Dropout)
model.dl4 <- h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs=100,
  activation = "RectifierWithDropout",
  hidden = hidden[[minerror.indx]], #c(400, 300, 200)
  l2=1e-5,
  l1=1e-5
)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl4, valid=T))$Rate)[1] # 19,343
# This really destroys the model!
model.errors[[6]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl4, valid=T))$Rate)[1]

# [6.5]. Neural Network (Out of interest I remove weight-decay and keep dropout)
model.dl4.5 <- h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs=100,
  activation = "RectifierWithDropout",
  hidden = hidden[[minerror.indx]] #c(400, 300, 200)
)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl4.5, valid=T))$Rate)[1] # 21,122

# [7]. Neural Network (Try adding a weight limit to Rectifier)
model.dl5 <- h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  activation = "Rectifier",
  epochs = 100,
  hidden = hidden[[minerror.indx]], #c(400, 300, 200)  
  l2=1e-5,
  l1=1e-5,
  max_w2=10  
)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl5, valid=T))$Rate)[1] # 6,429
model.errors[[7]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl5, valid=T))$Rate)[1]

# [8]. Neural Network (Try using the tanh function)
model.dl6 <- h2o.deeplearning(
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  activation = "Tanh",
  epochs = 100,
  hidden = hidden[[minerror.indx]], #c(400, 300, 200)  
  l2=1e-5,
  l1=1e-5
)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl6, valid=T))$Rate)[1] # 8,778
model.errors[[8]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl6, valid=T))$Rate)[1]

# Review all errors and choose model
print(model.errors) 

# It looks like we will use model #5 (model.dl3) which had 5,741/115,979 errors
# We could pursue this hyper-parameter optimisation further and conduct a grid search

#######################################
# 4. TEST: 95.1% Success Rate
#######################################

# Expected top performer (What we would submit)

# y.hat <- as.data.frame(h2o.predict(object=model.dl3, newdata=test.hex))
# y = as.data.frame(test.hex[,c(predict)])
# names(y) <- c("predict")
# print(sum(y.hat$predict!=y$predict))

# Equivalently:
#95.1% Success Rate#
h2o.confusionMatrix(h2o.performance(model.dl3, data=test.hex))  #Error: 0.0489 (5,696/116,476)

# Confusion Matrix
# class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error                  Rate
# class_1   40360    1951       2       0      27       7     167 0.0507 =        2,154 / 42,514
# class_2    1856   54675      82       2     164      71      15 0.0385 =        2,190 / 56,865
# class_3       0     122    6755      14       2     222       0 0.0506 =        360 / 7,115
# class_4       0       0     121     415       0      24       0 0.2589 =        145 / 560
# class_5      28     285      19       0    1576       6       1 0.1770 =        339 / 1,915
# class_6       8     126     151       5       6    3168       0 0.0855 =        296 / 3,464
# class_7     174      38       0       0       0       0    3831 0.0524 =        212 / 4,043
# Totals    42426   57197    7130     436    1775    3498    4014 0.0489 =        5,696 / 116,476

# Out of interest try others
# (Although changing the model now would be tuning on test and cheating)
h2o.confusionMatrix(h2o.performance(model.dl2, data=test.hex))  #Error: 0.0700 (8,159/116,476)
h2o.confusionMatrix(h2o.performance(model.dl4, data=test.hex))  #Error: 0.1675 (19,514/116,476)
h2o.confusionMatrix(h2o.performance(model.dl5, data=test.hex))  #Error: 0.0545 (6,349/116,476)
h2o.confusionMatrix(h2o.performance(model.dl6, data=test.hex))  #Error: 0.0750 (8,733/116,476)

# And the base-line cases
h2o.confusionMatrix(h2o.performance(model.gbm, data=test.hex))  #Error: 0.1978 (23,041/116,476)
h2o.confusionMatrix(h2o.performance(model.rf, data=test.hex))   #Error: 0.0640 (7,452/116,476)
h2o.confusionMatrix(h2o.performance(model.dl, data=test.hex))   #Error: 0.1014 (11,806/116,476)

#######################################
# 5. Ensemble (20 Nets): 96.2% Success Rate
#######################################

# Create a linear ensemble of our best network
# Since initialised randomly each one will be a bit different
# Take a simple vote (mode) between them all

# [i] Create ensemble predictions
ensemble <- list()
for (i in 1:20) {
  # [A]: Train
  ensemble.model <- h2o.deeplearning(
    x=explanatory,
    y=predict,
    training_frame = train.hex,
    validation_frame = validate.hex,
    epochs=100,
    hidden = hidden[[minerror.indx]], #c(400, 300, 200)
    l2=1e-5,
    l1=1e-5
  )
  # [B]: Perform classification on test-data
  predictions <- h2o.predict(ensemble.model, test.hex)
  y.hat = as.data.frame(predictions)
  ensemble[[i]] <- y.hat$predict
}
# Combine all the predictions into a data-frame
ensemble.df <- as.data.frame(do.call(cbind, ensemble))

# Calculate the mode value from rows -> ensemble.mode
require(functional)
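# Compose builds a pipeline applied to each row: table() counts the votes,
# i==max(i)/which/names pick out the most-voted class label(s), and
# as.numeric/mean average any ties into a single value.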
ensemble.mode <- as.data.frame(
  apply(ensemble.df, 1, Compose(table,
                                function(i) i==max(i),
                                which,
                                names,
                                as.numeric,mean)))
names(ensemble.mode) <- c("predict")

# [ii] Truth
y = as.data.frame(test.hex[,c(predict)])
names(y) <- c("predict")

# [iii] Compare
# Report accuracy of ensemble
print(paste("Ensemble Errors:",
            sum(
              as.integer(y$predict) != as.integer(ensemble.mode$predict))
)
)
##Error: 4,434/116,476
#96.2% Success Rate#

# Compared to Our best model (as before):
h2o.confusionMatrix(h2o.performance(model.dl3, data=test.hex))  #Error: 5,696/116,476 -> 95.1%
#IMPROVEMENT OF 1.1 percentage points 

#######################################
# 6. Grid Search the RF
#######################################

# As a last check:
# RF did very well as a default-case
# Could be that by chance the default-parameters were optimal for this data
# OR: Data better suited for that model and optimised parameters would be even better
# Try a grid search:

?h2o.randomForest
?h2o.grid
#5*5*5=75 different model variations submitted to grid
hyper_params <- list(
  ntrees=c(10,20,50,75,100),  # Default=50
  max_depth=c(5,10,20,30,50),  # Default = 20
  nbins=c(10,20,50,100,150)  #Default = 20
)

forest.grid <- h2o.grid(
  "randomForest",
  x=explanatory,
  y=predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  hyper_params = hyper_params
)

# Best model (first in the list):
model.rfb <- h2o.getModel(forest.grid@model_ids[[1]])
model.rfb
# Best Parameters (on validation): ntrees=50, max_depth=20, nbins=50
# Run on test data:
h2o.confusionMatrix(h2o.performance(model.rfb, data=test.hex)) #Error: 7,738

# So the default model was actually better with Error: 7,452
# Since only nbins went from 20 to 50 it seems like the default parameters
# were by chance optimal for the data

# End
h2o.shutdown(prompt=FALSE)