**Splitting the data into training, validation, and test sets (60%/20%/20%), I will show how to reach a classification accuracy of 96.2% using an ensemble of 20 deep-learning networks.**

Following on from my previous post, I wanted to provide an example (probably more for me to learn from than for others) of implementing a deep-learning algorithm in R using my new favourite package, H2O, on the “Forest Cover” data-set. Essentially we start with a data-set of 581,012 observations x 13 variables (2 are categorical explanatory variables with 4 and 40 levels, and the variable to predict, “Cover Type”, is also categorical with 7 levels).
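The 60/20/20 split itself is done later with *h2o.splitFrame*, but the idea can be sketched in base R alone (the toy data frame here is purely for illustration, not the actual Forest Cover data):

```r
# A minimal base-R sketch of a 60/20/20 random split; the post itself
# uses h2o.splitFrame(data.hex, ratios = c(0.6, 0.2)) on an H2O frame.
set.seed(42)
df <- data.frame(x = rnorm(1000), y = sample(1:7, 1000, replace = TRUE))

# Randomly assign each row to one of the three partitions
idx <- sample(c("train", "validate", "test"), nrow(df),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
train.df    <- df[idx == "train", ]
validate.df <- df[idx == "validate", ]
test.df     <- df[idx == "test", ]
```

Note that, as with *h2o.splitFrame*, the partition sizes are only approximately 60/20/20 because each row is assigned independently at random.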

The data-set appeared on Kaggle, where the competition was described as follows:

In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

I will not focus on the data-set too much (along with analysis that should always be done before blindly making predictions) as I want to instead focus on the features of H2O in R and why it is my favourite deep-learning tool.

## Step 1: Import, Brief Glance, Set-up Model

It is interesting to note that we can initialise H2O to use all available threads of the CPU by passing “*nthreads=-1*“. I import the data into R in the normal way and briefly check what it looks like.

This is the variable to predict and it contains 7 different categories of Cover:

The Soil Type explanatory categorical variable has many levels (40):

Wilderness Area is much simpler with only 4:

We can briefly check how wilderness area is related to the cover type (using a mosaic plot):

And then see if the variation in any of the continuous variables looks useful – for example, this is a box-whisker plot of Cover Type (x-axis) against Elevation (y-axis). To me, it looks like there is enough variation in the data to make this a good predictor.

It is not necessary for us to deal with missing values (although the data has none) or to standardise, because H2O does this automatically (along with creating dummies from categorical variables).

## Step 2: Create some base scenarios (default models)

It is useful to see what kind of predictions we can get without doing anything (hopefully after doing something we do better). Running the following models (a variety of machine-learning algorithms) with their default settings, I generate a confusion matrix on the **validation** data to see how many observations we misclassify (and whether we are over-training):

Gradient Boosted Machines -> 22,879 errors (out of 115,915)

Random Forest -> 7,551 errors

Neural Network -> 12,049 errors

To create a comparison closer to home, I also examine what a multinomial logistic regression gives us (by calling *h2o.glm* with family = “*multinomial*”).

Multinomial Logistic Regression -> 32,456 errors

It appears as if the data really lends itself to the Random Forest. We will later analyse whether this is actually the case, or whether the default parameters for the Random Forest were simply a lucky hit compared to the defaults for the Neural Network. Either way, we don’t want to end up with more than 7,551 errors.
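Counting errors this way needs nothing H2O-specific; a minimal base-R sketch of extracting the misclassification count from a confusion matrix (toy label vectors, not actual model output):

```r
# Hypothetical predicted and actual class labels (toy data)
actual    <- factor(c(1, 2, 2, 3, 1, 3, 2, 1))
predicted <- factor(c(1, 2, 3, 3, 1, 2, 2, 1))

# Confusion matrix: rows = actual class, columns = predicted class
cm <- table(actual, predicted)

# Misclassified observations are everything off the diagonal
errors <- sum(cm) - sum(diag(cm))
errors  # 2
```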

## Step 3: Improve the NN

We can check that the basic neural network created has 54 input neurons (10 continuous variables + 4 levels for wilderness + 40 levels for soil type), which means it is using 1-of-C dummy encoding. The default hidden layers have [200,200] ReLU neurons, and we are using a read-out layer of 7 softmax neurons to classify (along with cross-entropy loss), for a total of 53,007 parameters to estimate and around 497,000 data-points to train on.
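The 1-of-C expansion that H2O performs (with *use_all_factor_levels = TRUE*) can be reproduced in base R with *model.matrix* — a toy 4-level factor here, standing in for e.g. the wilderness-area column:

```r
# Toy categorical with 4 levels (hypothetical stand-in for Wilderness_Area)
area <- factor(c("A", "B", "C", "D", "B", "A"))

# 1-of-C encoding: one indicator column per level; the "- 1" drops the
# intercept so all C levels are kept rather than C-1 contrasts
dummies <- model.matrix(~ area - 1)
dim(dummies)  # 6 rows x 4 columns, one column per level
```

Each row has exactly one 1, so a 4-level factor contributes 4 input neurons — which is how 10 continuous variables plus the 4-level and 40-level categoricals give the input-layer width.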

I start off by adding **L1 and L2 regularization** (and boosting the **number of training rounds/epochs** from 10 to 100). This reduces our errors to 8,333, which makes sense, as we have made our model less prone to catching ‘noise’ and better able to generalise. (If you think this could all be down to just increasing the epochs: I ran the model without weight-decay and the errors were 11,795, so that is not the case.)

I experiment with a **wide range of hidden-layer configurations and neuron counts** (more than listed in the R script – ideally I wanted to construct a list of candidates so that the chosen configuration sat somewhere in the middle). The best classification comes from [**400,300,200**] neurons, which is a fairly complex network; however, the errors fall to 5,741.

Including **drop-out** doesn’t improve the results (it actually makes them worse) and errors go up to 19,343. This would be interesting to investigate further. Perhaps the weight-decay by itself is already enough for this data-set? To check, I ran a quick test dropping weight-decay and introducing drop-out; errors were still around 20,000, so I decided not to use drop-out.

I also evaluated the model with the constraint that the squared sum of **incoming weights per unit should be no bigger** than 10 (“*max_w2=10*”); however, this did not make much difference and the errors hovered around 6,429.

Finally, I briefly checked whether the **Tanh** activation function would do any better than ReLU – however, I ended up with 8,778 errors.

After all of this:

Hence, I ended up nominating model #5, with 5,741 errors on the validation data-set:

```r
h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  hidden = c(400, 300, 200),
  epochs = 100,
  l1 = 1e-5,
  l2 = 1e-5
  #overwrite_with_best_model = TRUE,  #default
  #max_categorical_features = FALSE,  #default: otherwise hashing
  #loss = "CrossEntropy",             #default
  #activation = "Rectifier",          #default
  #input_dropout_ratio = 0,           #default
  #max_w2 = Inf,                      #default
  #train_samples_per_iteration = -2,  #default is auto-tuning, not -1: all data
  #use_all_factor_levels = TRUE       #default: 1-of-C dummies
)
```

## Step 4: Test our best model (95.1%)

The model ended up scoring a success rate of 95.1% on the test-data (only 5,696 errors, which was even lower than on the validation data). I also evaluated the other models, and their ranking was very similar to the one on the validation data.

## Step 5: Create an Ensemble of Networks (96.2%)

> “All happy families are alike; each unhappy family is unhappy in its own way.” – Leo Tolstoy, *Anna Karenina*

Training an ensemble of networks and using the ensemble to predict has empirically been more successful than using any single model from the ensemble alone. I wanted to examine this approach with my best-performing model, so I trained 20 copies of it. Since these are trained using mini-batches and have random weight initialisations, we should expect them to turn out a little different (this is also why, if you run my code, you won’t be able to reproduce my exact errors). The hope is that the individual models are wrong in slightly different ways, so that when we take a vote amongst them we have a higher chance of being correct.

**This turns out to be the case and we end up with only 4,434 errors on the test-data (compared to 5,696 previously), which means we have an accuracy of 96.2%!** This may not look like much, but consider that we have reduced the number of errors by 22%.
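The majority vote itself needs no H2O; a minimal base-R sketch of taking the row-wise mode over the predictions of several models (toy labels, not actual network output):

```r
# Each column holds one model's predicted class labels (toy data)
ensemble.df <- data.frame(
  m1 = c(1, 2, 3, 1),
  m2 = c(1, 2, 2, 1),
  m3 = c(2, 2, 3, 1)
)

# Row-wise mode: the most frequent prediction wins the vote
row.mode <- function(row) {
  counts <- table(row)
  as.numeric(names(counts)[which.max(counts)])
}
votes <- apply(ensemble.df, 1, row.mode)
votes  # 1 2 3 1
```

Ties are broken here by taking the first winning level (`which.max`); the post’s own version, built with `functional::Compose`, averages tied winners instead.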

## Step 6: Extra – Grid-search to optimise parameters

A popular alternative to my approach in Step 3 is to conduct a grid-search to find the best parameters. This is useful because we can have *a1* optimal given *b1* but *a3* optimal given *b5* – complex interactions between parameters – and thus we need to check whether *f(a1,b1)* gives fewer errors than *f(a3,b5)*.
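A Cartesian grid evaluates every combination of the candidate values; the set of combinations submitted can be sketched in base R with *expand.grid* (using the same candidate lists as the grid-search below):

```r
# Candidate values per hyper-parameter
hyper_params <- list(
  ntrees    = c(10, 20, 50, 75, 100),
  max_depth = c(5, 10, 20, 30, 50),
  nbins     = c(10, 20, 50, 100, 150)
)

# Every combination a Cartesian grid-search would evaluate
grid <- do.call(expand.grid, hyper_params)
nrow(grid)  # 5 * 5 * 5 = 125 model variations
```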

I decided to run this on the Random Forest model, however. This is because it scored very well in the beginning and I wanted to see if it would be even better and overtake the NN.

```r
hyper_params <- list(
  ntrees = c(10, 20, 50, 75, 100),   # Default = 50
  max_depth = c(5, 10, 20, 30, 50),  # Default = 20
  nbins = c(10, 20, 50, 100, 150)    # Default = 20
)
forest.grid <- h2o.grid(
  "randomForest",
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  hyper_params = hyper_params
)
```

It turns out the grid-search returns the same *ntrees* and *max_depth* as before and only increases *nbins* from 20 to 50. This tuned model gives 7,738 errors on the test-data, which is not as good as our single optimised neural network.

## Code

```r
Sys.setenv(JAVA_HOME = 'C:/Program Files/Java/jre1.8.0_66')
ls()
rm(list = ls())
setwd("H:/H2O/")
library(h2o) # v3.6.0.8

#######################################
# SUMMARY: Predicting Forest Cover Types
# 96.2% Success Rate
#######################################
# 1. Elevation (m)
# 2. Aspect (azimuth from true north)
# 3. Slope (°)
# 4. Horizontal distance to nearest surface water feature (m)
# 5. Vertical distance to nearest surface water feature (m)
# 6. Horizontal distance to nearest roadway (m)
# 7. A relative measure of incident sunlight at 09:00 h on the summer solstice
# 8. A relative measure of incident sunlight at noon on the summer solstice
# 9. A relative measure of incident sunlight at 15:00 h on the summer solstice
# 10. Horizontal distance to nearest historic wildfire ignition point (m)
# 11. Wilderness area designation (four binary values, one for each wilderness area)
# 12. Soil type designation (40 binary values, one for each soil type)
#######################################
# Train / Validate / Test
# 60% / 20% / 20%
# 350,000 / 115,000 / 115,000
#######################################

#######################################
# 1. Import, Quick-glance, set-up model
#######################################

# Initialise H2O with 16GB RAM and all cores
localH2O <- h2o.init(max_mem_size = '16g', nthreads = -1) # Allowed cores: 8

# Import Raw
train.data <- read.table("covtype.csv", header = TRUE, sep = ",", strip.white = TRUE)

# Briefly examine categorical variables
dim(train.data) # 581,012 rows by 13 columns
summary(train.data)

# Missing Data?
require(Amelia)
missmap(train.data, main = "Forest Cover - Missing Data",
        col = c("white", "black"), legend = FALSE) # Nope

# Grouping of categorical (dimensionality)
# Cover Type
table(train.data$Cover_Type)
barplot(table(train.data$Cover_Type), main = "Cover Type", col = "red")
# Soil Type
table(train.data$Soil_Type)
barplot(table(train.data$Soil_Type), main = "Soil Type", col = "green")
# Wilderness Area
table(train.data$Wilderness_Area)
barplot(table(train.data$Wilderness_Area), main = "Wilderness Area", col = "blue")

mosaicplot(train.data$Wilderness_Area ~ train.data$Cover_Type,
           main = "Cover Type by Wilderness Area", shade = FALSE, color = TRUE,
           xlab = "Wilderness_Area", ylab = "Cover_Type")

# Numeric variables
boxplot(train.data$Elevation ~ train.data$Cover_Type)

# Note: Breaking categoricals into 1-of-(C-1) dummies and standardisation done automatically
# Rescale -> add or subtract a constant and then multiply or divide by a constant
# Standardise -> rescale to have a mean of 0 and s.d. of 1
# Normalise -> rescale the values to a range of e.g. [0,1] or [-1,1]

# Load data into H2O frame
data.hex <- as.h2o(x = train.data)

# Split the dataset into 60:20:20 for training, validation, test
hex_split <- h2o.splitFrame(data.hex, ratios = c(0.6, 0.2))
train.hex <- hex_split[[1]]
validate.hex <- hex_split[[2]]
test.hex <- hex_split[[3]]

# Set y ~ x
predict <- "Cover_Type" # What we are guessing
explanatory <- setdiff(names(train.hex), predict)

#######################################
# 2. Benchmark models (default)
#######################################

# Vector to keep track of best Machine Learning Algo
model.errors <- list()

# [1]. Gradient Boosted Machines
model.gbm <- h2o.gbm(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex
)
# # Assess performance on validation data
# y.hat <- as.data.frame(h2o.predict(object = model.gbm, newdata = validate.hex))
# y <- as.data.frame(validate.hex[, c(predict)])
# names(y) <- c("predict")
# print(sum(y.hat$predict == y$predict))
# # Equivalent to:
# # Get last value from confusion matrix:
# # rev(h2o.confusionMatrix(h2o.performance(model.gbm, valid = T))$Rate)[1]
# Errors: 22,879 / 115,915
model.errors[[1]] <- rev(h2o.confusionMatrix(h2o.performance(model.gbm, valid = T))$Rate)[1]

# [2]. Random Forest
model.rf <- h2o.randomForest(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex
)
# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.rf, valid = T))$Rate)[1] # 7,551
model.errors[[2]] <- rev(h2o.confusionMatrix(h2o.performance(model.rf, valid = T))$Rate)[1]
# It seems that tree classifiers work well for this data-set
summary(train.hex)

# [3]. Neural Network
model.dl <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex
  #overwrite_with_best_model = TRUE,  #default
  #max_categorical_features = FALSE,  #default: otherwise hashing
  #loss = "CrossEntropy",             #default
  #activation = "Rectifier",          #default
  #hidden = c(200,200),               #default
  #epochs = 10,                       #default
  #l1 = 0,                            #default
  #l2 = 0,                            #default
  #input_dropout_ratio = 0,           #default
  #max_w2 = Inf,                      #default
  #train_samples_per_iteration = -2,  #default is auto-tuning, not -1: all data
  #use_all_factor_levels = TRUE       #default: 1-of-C dummies
)
model.dl
# [56] Input Neurons
# [200,200] Rectifier Hidden Neurons
# [7] Softmax output Neurons
# (no dropout, CE loss, 53,007 weights)

# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl, valid = T))$Rate)[1] # 12,049
model.errors[[3]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl, valid = T))$Rate)[1]

# Before we begin improving our ML algorithms,
# let's see what a standard multinomial logistic regression does

# [0]. Multinomial Logistic Regression
model.glm <- h2o.glm(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  family = "multinomial"
)
model.glm
rev(h2o.confusionMatrix(h2o.performance(model.glm, valid = T))$Rate)[1] # 32,456

#######################################
# 3. Optimise NN Model to beat defaults
#######################################
?h2o.deeplearning

# [4]. Neural Network (L2 Regularization & L1 Regularization & Epochs boosted from 10 to 100)
model.dl2 <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs = 100,
  l2 = 1e-5,
  l1 = 1e-5
)
# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl2, valid = T))$Rate)[1] # 8,333
model.errors[[4]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl2, valid = T))$Rate)[1]

# [5]. Neural Network (Experiment with Hidden-Layers)
# Random Hyper-Parameter Search (alt. method is Grid Search)
hidden <- list(
  c(250),
  c(50),
  c(200, 150),
  c(250, 200),
  c(200, 200, 200),
  c(100, 100, 100),
  c(300, 200, 100),
  c(500, 400, 300),
  c(600, 500, 400),
  c(350, 250, 150),
  c(100, 80, 40),
  c(400, 300, 200), # This ends up being the best
  c(128, 128, 28),
  c(700, 500, 300)
)
models.dl3lst <- list()
errors <- list()
for (i in 1:length(hidden)) {
  print(paste("Hidden Layer: ", hidden[i]))
  model.hidden <- h2o.deeplearning(
    x = explanatory,
    y = predict,
    training_frame = train.hex,
    validation_frame = validate.hex,
    epochs = 100,
    hidden = hidden[[i]],
    l2 = 1e-5,
    l1 = 1e-5
  )
  models.dl3lst[[i]] <- model.hidden
  # Assess performance on validation data
  print(rev(h2o.confusionMatrix(h2o.performance(models.dl3lst[[i]], valid = T))$Rate)[1])
  errors[[i]] <- rev(h2o.confusionMatrix(h2o.performance(models.dl3lst[[i]], valid = T))$Error)[1]
}
# Choose model with smallest error
minerror.indx <- which.min(errors)
print(paste("Lowest Error with Hidden Layer: ", hidden[minerror.indx])) # c(400, 300, 200)
# Assess performance on validation data
model.dl3 <- models.dl3lst[[minerror.indx]]
rev(h2o.confusionMatrix(h2o.performance(model.dl3, valid = T))$Rate)[1] # 5,741
model.errors[[5]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl3, valid = T))$Rate)[1]

# [6]. Neural Network (Try Dropout)
model.dl4 <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs = 100,
  activation = "RectifierWithDropout",
  hidden = hidden[[minerror.indx]], # c(400, 300, 200)
  l2 = 1e-5,
  l1 = 1e-5
)
# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl4, valid = T))$Rate)[1] # 19,343
# This really destroys the model!
model.errors[[6]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl4, valid = T))$Rate)[1]

# [6.5]. Neural Network (Out of interest I remove weight-decay and keep dropout)
model.dl4.5 <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  epochs = 100,
  activation = "RectifierWithDropout",
  hidden = hidden[[minerror.indx]] # c(400, 300, 200)
)
# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl4.5, valid = T))$Rate)[1] # 21,122

# [7]. Neural Network (Try adding a weight limit to Rectifier)
model.dl5 <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  activation = "Rectifier",
  epochs = 100,
  hidden = hidden[[minerror.indx]], # c(400, 300, 200)
  l2 = 1e-5,
  l1 = 1e-5,
  max_w2 = 10
)
# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl5, valid = T))$Rate)[1] # 6,429
model.errors[[7]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl5, valid = T))$Rate)[1]

# [8]. Neural Network (Try using the tanh function)
model.dl6 <- h2o.deeplearning(
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  activation = "Tanh",
  epochs = 100,
  hidden = hidden[[minerror.indx]], # c(400, 300, 200)
  l2 = 1e-5,
  l1 = 1e-5
)
# Assess performance on validation data
rev(h2o.confusionMatrix(h2o.performance(model.dl6, valid = T))$Rate)[1] # 8,778
model.errors[[8]] <- rev(h2o.confusionMatrix(h2o.performance(model.dl6, valid = T))$Rate)[1]

# Review all errors and choose model
print(model.errors)
# It looks like we will use model #5 (model.dl3) which had 5,741/115,979 errors
# We could pursue this hyper-parameter optimisation further and conduct a grid search

#######################################
# 4. TEST: 95.1% Success Rate
#######################################

# Expected top performer (What we would submit)
# y.hat <- as.data.frame(h2o.predict(object = model.dl3, newdata = test.hex))
# y <- as.data.frame(test.hex[, c(predict)])
# names(y) <- c("predict")
# print(sum(y.hat$predict != y$predict))
# Equivalently:

# 95.1% Success Rate #
h2o.confusionMatrix(h2o.performance(model.dl3, data = test.hex)) # Error: 0.0489 (5,696/116,476)

# Confusion Matrix
#         class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error               Rate
# class_1   40360    1951       2       0      27       7     167 0.0507 =  2,154 /  42,514
# class_2    1856   54675      82       2     164      71      15 0.0385 =  2,190 /  56,865
# class_3       0     122    6755      14       2     222       0 0.0506 =    360 /   7,115
# class_4       0       0     121     415       0      24       0 0.2589 =    145 /     560
# class_5      28     285      19       0    1576       6       1 0.1770 =    339 /   1,915
# class_6       8     126     151       5       6    3168       0 0.0855 =    296 /   3,464
# class_7     174      38       0       0       0       0    3831 0.0524 =    212 /   4,043
# Totals    42426   57197    7130     436    1775    3498    4014 0.0489 =  5,696 / 116,476

# Out of interest try the others
# (Although changing the model now would be tuning on test and cheating)
h2o.confusionMatrix(h2o.performance(model.dl2, data = test.hex)) # Error: 0.0700 (8,159/116,476)
h2o.confusionMatrix(h2o.performance(model.dl4, data = test.hex)) # Error: 0.1675 (19,514/116,476)
h2o.confusionMatrix(h2o.performance(model.dl5, data = test.hex)) # Error: 0.0545 (6,349/116,476)
h2o.confusionMatrix(h2o.performance(model.dl6, data = test.hex)) # Error: 0.0750 (8,733/116,476)

# And the base-line cases
h2o.confusionMatrix(h2o.performance(model.gbm, data = test.hex)) # Error: 0.1978 (23,041/116,476)
h2o.confusionMatrix(h2o.performance(model.rf, data = test.hex))  # Error: 0.0640 (7,452/116,476)
h2o.confusionMatrix(h2o.performance(model.dl, data = test.hex))  # Error: 0.1014 (11,806/116,476)

#######################################
# 5. Ensemble (20 Nets): 96.2% Success Rate
#######################################

# Create a linear ensemble of our best network
# Since initialised randomly, each one will be a bit different
# Take a simple vote (mode) between them all

# [i] Create ensemble predictions
ensemble <- list()
for (i in 1:20) {
  # [A]: Train
  ensemble.model <- h2o.deeplearning(
    x = explanatory,
    y = predict,
    training_frame = train.hex,
    validation_frame = validate.hex,
    epochs = 100,
    hidden = hidden[[minerror.indx]], # c(400, 300, 200)
    l2 = 1e-5,
    l1 = 1e-5
  )
  # [B]: Perform classification on test-data
  predictions <- h2o.predict(ensemble.model, test.hex)
  y.hat <- as.data.frame(predictions)
  ensemble[[i]] <- y.hat$predict
}

# Combine all the predictions into a data-frame
ensemble.df <- as.data.frame(do.call(cbind, ensemble))

# Calculate the mode value from rows -> ensemble.mode
require(functional)
ensemble.mode <- as.data.frame(
  apply(ensemble.df, 1,
        Compose(table, function(i) i == max(i), which, names, as.numeric, mean)))
names(ensemble.mode) <- c("predict")

# [ii] Truth
y <- as.data.frame(test.hex[, c(predict)])
names(y) <- c("predict")

# [iii] Compare
# Report accuracy of ensemble
print(paste("Ensemble Errors:",
            sum(as.integer(y$predict) != as.integer(ensemble.mode$predict))))
## Error: 4,434/116,476
# 96.2% Success Rate #

# Compared to our best model (as before):
h2o.confusionMatrix(h2o.performance(model.dl3, data = test.hex)) # Error: 5,696/116,476 -> 95.1%
# IMPROVEMENT OF 1.1 percentage points

#######################################
# 6. Grid Search the RF
#######################################

# As a last check:
# RF did very well as a default-case
# Could be that by chance the default parameters were optimal for this data
# OR: Data better suited for that model and optimised parameters would be even better
# Try a grid search:
?h2o.randomForest
?h2o.grid

# 5*5*5 = 125 different model variations submitted to the grid
hyper_params <- list(
  ntrees = c(10, 20, 50, 75, 100),   # Default = 50
  max_depth = c(5, 10, 20, 30, 50),  # Default = 20
  nbins = c(10, 20, 50, 100, 150)    # Default = 20
)
forest.grid <- h2o.grid(
  "randomForest",
  x = explanatory,
  y = predict,
  training_frame = train.hex,
  validation_frame = validate.hex,
  hyper_params = hyper_params
)
# Best model (first in the list):
model.rfb <- h2o.getModel(forest.grid@model_ids[[1]])
model.rfb
# Best Parameters (on validation): ntrees = 50, max_depth = 20, nbins = 50
# Run on test data:
h2o.confusionMatrix(h2o.performance(model.rfb, data = test.hex)) # Error: 7,738
# So the default model was actually better, with Error: 7,452
# Since only nbins went from 20 to 50, it seems the default parameters
# were by chance optimal for the data

# End
h2o.shutdown(prompt = FALSE)
```
Grid Search the RF ####################################### # As a last check: # RF did very well as a default-case # Could be that by chance the default-parameters were optimal for this data # OR: Data better suited for that model and optimised parameters would be even better # Try a grid search: ?h2o.randomForest ?h2o.grid #5*5*5=75 different model variations submitted to grid hyper_params <- list( ntrees=c(10,20,50,75,100), # Default=50 max_depth=c(5,10,20,30,50), # Default = 20 nbins=c(10,20,50,100,150) #Default = 20 ) forest.grid <- h2o.grid( "randomForest", x=explanatory, y=predict, training_frame = train.hex, validation_frame = validate.hex, hyper_params = hyper_params ) # Best model (first in the list): model.rfb <- h2o.getModel(forest.grid@model_ids[[1]]) model.rfb # Best Parameters (on validation): ntrees=50, max_depth=20, nbins=50 # Run on test data: h2o.confusionMatrix(h2o.performance(model.rfb, data=test.hex)) #Error: 7,738 # So the default model was actually better with Error: 7,452 # Since only nbins went from 20 to 50 it seems like the default parameters # were by chance optimal for the data # End h2o.shutdown(prompt=FALSE)