1. What Neural Networks are Not

In his post “Big Data: Are we making a big mistake?”, Tim Harford mentions a quote by Spiegelhalter which I really liked:

There are a lot of small data problems that occur in big data … they don’t disappear because you’ve got lots of the stuff. They get worse

Artificial Neural Networks are extremely versatile in the form of data they can absorb (time-series, cross-sections, images, sound, movies and even games); however, that does not mean a careful econometric analysis of the supplied data is unnecessary, nor that simply throwing more data at them (especially if it is BIG DATA) makes them better. For example, LeCun mentions in his 1998 paper that input variables should, if possible, be decorrelated using PCA (principal component analysis); sampling error does not go away, and omitted-variable bias still remains a problem.

Harford mentions how many were jumping onto the bandwagon and merely finding statistical patterns in the data (which leads them to the same problems undergraduates encounter in EC101), coupled with a theory-free analysis of mere correlations. It seems that hundreds of years of statistical and econometric theory were immediately disregarded. For some reason neural networks were seen not as an extension of (or even improvement on) previous econometric models (e.g. the GLM) but as a replacement. His article prompted a brilliant follow-up, “No amount of data can replace human thought”, which quotes Bernard Levin:

The silicon chip will transform everything, except everything that matters, and the rest will still be up to us.

Even Ilya Sutskever (a co-author of the seminal paper “ImageNet Classification with Deep Convolutional Neural Networks”, which broke records in image classification) seemed to reiterate the sentiment of the article: https://www.youtube.com/watch?v=czLI3oLDe8M

The flip-side of the above is that, with careful statistical thought, neural networks are a ground-breaking tool with which the econometrician can make some real progress.

NNs have a universality property: no matter what the function, there is guaranteed to be a neural network such that for every possible input x, the value f(x) (or a close approximation to it) is output from the network.

Empirical evidence suggests that deep-networks have a hierarchical structure that makes them well-adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems.

Varian released an article “Beyond Big Data” where he mentions:

Many companies have the data: they just don’t know what to do with it. Missing ingredients: data tools (easy), knowledge (hard), and experience (very hard).

In another (“Big Data: New Tricks for Econometrics“), he writes:

I believe that these methods have a lot to offer and should be more widely known and used by economists. In fact, my standard advice to graduate students these days is “go to the computer science department and take a class in machine learning.”

Coming from an economics background, what interests me about NNs is learning what they learn: what do they choose to extract from the raw data (images, sound-waves, numbers) that helps them predict, with such accuracy, that this image is a dog or that Mr X will earn above $50k?

Feature maps are fascinating to analyse (the one below from Ilya’s paper shows what their deep network learnt to extract to be able to classify the images):

[Image: feature maps learned by the deep network in the ImageNet paper]

Google have recently released renderings of what their networks think things look like:

[Image: Google’s renderings of what their networks think various classes look like]

Perhaps closer to econometrics: many packages now try to help users measure variable importance in their neural networks, as is visible here using H2O. This brings us closer to the usual economic model where we care not just about the R-squared but also about the meaning of the estimated parameters and whether they tie in with theory. What do the final weights and biases mean?
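Variable importance can also be probed in code. Below is a minimal sketch using permutation importance from scikit-learn on a small MLP; this is a generic stand-in for (not the same method as) H2O’s built-in variable-importance output, and the toy data are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Toy data standing in for e.g. census features predicting income > $50k
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0).fit(X, y)

# Permutation importance: how much does accuracy drop when a feature is shuffled?
result = permutation_importance(net, X, y, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```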

2. Economics & Neural Networks

Neural networks are already starting to pop up in economics papers; I wanted to share a few summaries which look favourably upon them:

Discrete-Choice Modelling: “NNs, on the other hand, are able to deal with complex non-linear relationships, are fault tolerant in producing acceptable results under imperfect inputs and are suitable for modelling reactive behaviour which is often described using rules, linking a perceived situation with appropriate action.” E.g. predicting the mode of transport chosen (based on cost, duration, socio-demographic characteristics, etc.), typically done with logit models.

“The accuracy of the proposed model, in terms of classification rate, ranged between 95 and 97%, compared to 50–73% for the discrete choice models.”

“Nested logit models vs. Artificial Neural Networks for Simulating Mode Choice Behaviour”: “ANN approach may replace discrete choice methods in some particular context (intra-regional), where choice behaviour is difficult to simulate and where socio-economic variables or alternative specific attributes may have strong incidence”

“A Non-Linear Analysis of Discrete Choice Behavior by the Logit Model”: “Furthermore, it is found that the NN model can analyse the effect or phenomena that the conventional logit model cannot catch”

“Neural Networks and the Multinomial Logit for Brand Choice Modelling: a Hybrid Approach”: “… ability of neural networks to model non-linear preferences with few (if any) a priori assumptions about the nature of the underlying utility function, while the Multinomial Logit can suffer from a specification bias.” In other words: no a priori knowledge is required, but the NN parameters cannot be interpreted.

They propose that “in such an approach a neural network model would be used first, to diagnose non-linear utility functions and second, to determine the nature of the non-linear components of the function”

Predicting Prices/Demand: “… Predicting Online Auction Closing Price”: the closing price of an auction is not known in advance and depends on several factors, such as the number of auctions selling the same item, the number of bidders participating in that auction, and the behaviour of every individual bidder.

“… Grey Theory performs better than ANN in predicting the closing price on online auctions especially when there is insufficient information. However, given an environment where a lot of historical data are available, the ANN can be utilized since it requires a lot of data to make accurate prediction.”

Referenced Papers:

  1. “Evaluation of discrete choice and neural network approaches for modelling driver compliance with traffic information”
  2. “A Non-Linear Analysis of Discrete Choice Behavior by the Logit Model”
  3. “Neural Networks and the Multinomial Logit for Brand Choice Modeling: A Hybrid Approach”
  4. “Predicting Online Auction Closing Price”

3. Improving Neural Networks

Michael Nielsen has a great section where he goes into detail about improvements to the basic neural-network design.

Cross-Entropy Cost Function: Combining the logistic activation function with the traditional MSE often results in a learning slow-down. When the model is very wrong (e.g. you submit a cat video instead of your dissertation) it doesn’t learn never to do that again (something we would!); instead it actually learns more slowly than if it was only slightly wrong (a spelling mistake in the title of a dissertation). This is because the gradient of the activation function enters the gradient of the cost function (via the chain rule), and when we are at either end of the logistic-sigmoid function (very wrong or very right, depending on whether we initialised with high or low weights) the gradient is pretty much zero and learning is slow (which is not good news if we need to move in the opposite direction). We say that the neuron has saturated on the wrong side.

[Image: a saturated sigmoid neuron learning slowly under the quadratic cost]

Ideally we want a cost function whose partial derivatives capture the intuition that the greater the initial error (the difference between f(x) and y), the faster the neuron should learn; that is, the gradient should increase with the error. The cross-entropy (CE) cost function has been designed with this in mind:

[Image: the cross-entropy cost function]
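To make this concrete: the standard form of the cross-entropy cost (as given in Nielsen’s book) is C = -(1/n) Σ_x [ y·ln(a) + (1-y)·ln(1-a) ], and for a sigmoid neuron its gradient with respect to a weight is proportional to the error (a - y), because the σ′(z) term cancels out. A minimal numpy sketch for a single badly-wrong neuron (the starting weight w = 6 is a made-up value):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sigmoid neuron that is badly wrong: target y = 0,
# but the weighted input z is large and positive, so the output is near 1.
x, y = 1.0, 0.0
w, b = 6.0, 0.0              # hypothetical starting values
z = w * x + b
a = sigmoid(z)

grad_mse = (a - y) * a * (1 - a) * x   # quadratic cost: includes sigmoid'(z) = a(1-a)
grad_ce = (a - y) * x                  # cross-entropy cost: proportional to the error only

print(f"output a = {a:.4f}")
print(f"quadratic-cost gradient     = {grad_mse:.6f}")  # tiny, so learning is slow
print(f"cross-entropy cost gradient = {grad_ce:.6f}")   # roughly the size of the error
```

Despite the neuron being about as wrong as it can be, the quadratic-cost gradient is almost zero, while the cross-entropy gradient stays large.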

It is worth noting that if our neurons are linear then the quadratic cost will not give rise to any learning-slowdown.

CE loss functions are commonly used for ‘classification’ problems as opposed to ‘regression’ problems (which use the MSE) since they heavily penalise incorrect guesses between classes. Put another way, cross entropy essentially ignores all computed outputs which don’t correspond to a 1 target output.

Overfitting: There is a nice quote by Enrico Fermi:

I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

Neural networks often have more than 100,000 parameters, and with so much freedom the model may work well only on the existing data and fail to generalise (this is exacerbated if we have lots of noise in the data and a small training set).

  • Regularization (Weight Decay / L2) adds a term to the cost function which is a function of the sum of squares of the weights; this prevents any single weight from getting too big. To put it another way: large weights are only allowed if they considerably improve the first part of the cost function (prediction). A minimal sketch of the resulting update rule appears after this list.

[Image: the L2-regularised cost function]

  • L1 Regularization is also sometimes used (sometimes in conjunction with L2); it adds the sum of the absolute values of the weights and lets only strong weights survive (a constant pulling force towards zero).
  • Regularized networks are constrained to fit simple models based on patterns they see in the training data and are resistant to learning the peculiarities of the noise in the data. This does not obviously produce a better outcome (imagine we have 10 points in a straight line: do we fit a straight line through them or a 9th-degree polynomial? Is the simpler model always the best one?). However, empirically it has shown great promise. The hope is that regularization forces the network to do real learning about the phenomenon and to generalize better.
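The sketch promised above: how the two decay terms change a plain gradient-descent step, following Nielsen’s notation (learning rate eta, regularisation strength lmbda, n training examples; the example weights are made up):

```python
import numpy as np

def sgd_step(w, grad_w, eta=0.5, lmbda=0.1, n=1000, kind="l2"):
    """One gradient-descent step with weight decay.

    grad_w is the gradient of the *unregularised* cost C0 with respect to w.
    """
    if kind == "l2":
        # C = C0 + (lmbda / 2n) * sum(w^2)  =>  dC/dw = dC0/dw + (lmbda / n) * w
        return w - eta * (grad_w + (lmbda / n) * w)
    if kind == "l1":
        # C = C0 + (lmbda / n) * sum(|w|)   =>  dC/dw = dC0/dw + (lmbda / n) * sign(w)
        return w - eta * (grad_w + (lmbda / n) * np.sign(w))
    return w - eta * grad_w  # no regularisation

w = np.array([0.8, -2.5, 0.01])
grad = np.array([0.1, -0.2, 0.0])
print(sgd_step(w, grad, kind="l2"))  # each weight shrinks in proportion to its size
print(sgd_step(w, grad, kind="l1"))  # each non-zero weight is pulled a constant amount towards zero
```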

Drop-out has been a massive empirical success. It is best explained in this paper; briefly, half of the hidden neurons are randomly (and temporarily) dropped during training to reduce complex co-adaptations of neurons. Since a neuron cannot rely on the presence of particular other neurons, it is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. In my mind, a bit like convolutions. Some software packages also let you apply dropout to the input layer.

Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.

[Image: a network with and without dropout applied]
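A minimal numpy sketch of the mechanism, using “inverted” dropout (the division by p_keep, so that no rescaling is needed at test time, is a common implementation detail rather than something discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_keep=0.5, training=True):
    """Zero each hidden activation with probability (1 - p_keep) during training,
    rescaling the survivors by 1/p_keep so the expected activation is unchanged."""
    if not training:
        return activations
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

hidden = np.array([0.2, 1.5, -0.7, 0.9, 0.05, 2.1])
print(dropout_forward(hidden))                   # roughly half the neurons are silenced
print(dropout_forward(hidden, training=False))   # unchanged at test time
```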

Hold-out: Instead of simply increasing the number of epochs/rounds we train our model for and watching how the cost falls, we can introduce another data-set (apart from the training data and test data) called “validation data”. After each epoch of training we calculate a confusion matrix to see the accuracy of our model on the validation data. We can continue training for many more epochs than normal, and once we think validation accuracy has reached its maximum we use the model saved at that epoch (rather than the one at the final epoch).

[Image: validation accuracy plotted against training epochs]
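A minimal sketch of this kind of hold-out early stopping with scikit-learn’s MLPClassifier (the toy data and network size are made up; in practice you would also snapshot the weights at the best epoch, e.g. with copy.deepcopy):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                      # toy features
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)   # toy target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(20,), random_state=0)
best_acc, best_epoch = -1.0, 0
for epoch in range(100):
    net.partial_fit(X_train, y_train, classes=[0, 1])  # one optimisation pass over the training data
    acc = net.score(X_val, y_val)                      # accuracy on the held-out validation data
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch              # remember the best-performing epoch

print(f"best validation accuracy {best_acc:.3f} at epoch {best_epoch}")
```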

Optimising hyper-parameters: Sometimes the only way to select the best hyper-parameter combination (due to complex interactions) is to just try out several different values and use the ones which score highest on the validation data. For example, we can try the following grid-search:

[Image: grid of hyper-parameter combinations and their validation scores]
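A minimal sketch of a grid-search with scikit-learn (the grid values are placeholders, not the ones in the image above, and GridSearchCV uses cross-validation in place of the single validation set described earlier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # toy data

param_grid = {
    "hidden_layer_sizes": [(10,), (50,), (50, 50)],
    "alpha": [1e-4, 1e-2, 1.0],            # L2 weight-decay strength
    "learning_rate_init": [1e-3, 1e-2],
}

search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```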

Artificially expanding the data-set: perhaps bootstrapping, or applying operations that reflect real-world variations (for images, say, small shifts or flips that leave the label unchanged).
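A minimal sketch for image data stored as numpy arrays (the particular shift-and-flip transformations are just examples of label-preserving operations):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly shifted and possibly flipped copy of a 2-D image array."""
    dy, dx = rng.integers(-2, 3, size=2)              # shift by up to 2 pixels
    out = np.roll(image, shift=(dy, dx), axis=(0, 1))
    if rng.random() < 0.5:                            # horizontal flip half the time
        out = out[:, ::-1]
    return out

image = rng.random((28, 28))                          # stand-in for e.g. an MNIST digit
expanded = [augment(image) for _ in range(5)]         # five new training examples from one
print(len(expanded), expanded[0].shape)
```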

Variations of the neural-network structure: Instead of MLPs we may use convolutions to get a Convolutional Neural Network, or instead of a feed-forward network we may use a Recurrent Neural Network. More importantly, the best model is chosen for the data (convolutions do wonders for image-recognition). This can even be extended to not using NNs at all and going with Random Forests or Support Vector Machines if the data are more suited to them.

Activation Functions: Traditionally the logistic-sigmoid activation function has been used. However, the “tanh” function, which is basically a re-scaled sigmoid, has come to be favoured. The tanh function outputs values within the range [-1,1] and is centred around 0, which seems to work better with the back-propagation method. However, Nielsen mentions that:

Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better

The reasoning appears to be that if all the input activations to a neuron are positive (the sigmoid only outputs values in [0,1]) then the gradients of that neuron’s weights will either be all positive or all negative. This means that all the weights into the same neuron must increase or decrease together, which does not allow some of the weights to increase while others decrease. We end up zig-zagging towards the optimal solution (which is slow!). We can only have some of the weights increasing and some decreasing if some of the input activations have different signs. Since the tanh function is symmetric around 0 (tanh(-x) = -tanh(x)), we may even expect roughly half the activations to be negative and half positive.

Recently, Rectified Linear Units (ReLU neurons) have been used with great success. They output a value which is max(0,w.x+b).

[Image: the ReLU activation function max(0, w·x + b)]

To see why they are useful: both sigmoid and tanh neurons have a flat slope at both tails, which causes them to saturate. Increasing the weighted input to a ReLU will never cause it to saturate, since its gradient is constant for positive inputs, and so learning does not slow down. However, the gradient vanishes when the weighted input is reduced to the point where it becomes negative.
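The three activations’ derivatives side by side (a minimal numpy sketch; the sample points z are arbitrary):

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid'(z):", np.round(d_sigmoid(z), 5))  # ~0 at both tails: saturation
print("tanh'(z):   ", np.round(d_tanh(z), 5))     # also ~0 at both tails, but centred on 0
print("relu'(z):   ", d_relu(z))                  # 1 for any positive input, 0 for negative
```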

4. Practical Considerations

Neural networks are no exception when it comes to pre-processing data. Before we fit our model it is important to get the data into the correct shape for the model. Let’s clarify three definitions, loosely speaking:

Normalisation – rescaling values to a bounding range e.g. [-1,1] or [0,1]

Standardisation – rescaling values to have a mean of 0 and a standard deviation of 1 (the data is not bounded): subtract a measure of location and divide by a measure of scale

Rescaling – adding a constant and then multiplying by a constant
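The three operations in numpy (a minimal sketch with made-up values):

```python
import numpy as np

x = np.array([3.0, 7.0, 12.0, 18.0, 25.0])        # e.g. toy ages

# Normalisation: rescale to a bounded range, here [-1, 1] (min-max)
x_norm = 2 * (x - x.min()) / (x.max() - x.min()) - 1

# Standardisation: subtract a measure of location, divide by a measure of scale
x_std = (x - x.mean()) / x.std()

# Rescaling: add a constant, then multiply by a constant
x_resc = (x + 5) * 0.1

print(np.round(x_norm, 3))   # bounded in [-1, 1]
print(np.round(x_std, 3))    # mean 0, std dev 1, unbounded
print(x_resc)
```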

In true back-propagation style let’s work backwards:

The Read-Out Layer

Obviously, it is critical to have the target values within the range of the output activation function (this means if the output activation function has a range of [-1,1] e.g. Tanh then the values we are predicting must also lie within [-1,1]). However, it is advised that one should choose an activation function suited to the distribution of the target values (rather than the other way round).

Binary y-data (male/female): 1-of-C dummy-coding (i.e. as many dummies as there are values) along with the softmax activation is most common (alternatively, 1-of-(C-1) dummy-coding with the log-sigmoid activation can also be used).

The below diagram illustrates the difference between 1-of-C and 1-of-(C-1) for a binary output:

1-of-C: [Image: two output nodes, one per class]

1-of-(C-1): [Image: a single output node]

The one-node method appears to be more common (and requires fewer weights and biases); however, which method is ‘better’ is beyond the scope of this post. James D. McCaffrey writes that “if you want to use weight decay (e.g. L1/L2), you should use the two-output-node design”, although he also mentions that, in theory, both give the same result. His preference appears to be 1-of-C because: (i) it works better with weight-regularization and (ii) it is nice to interpret the output values as probabilities.

Categorical y-data (green/blue/red): 1-of-C dummy-coding with softmax activation
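A minimal sketch of 1-of-C coding for the target plus a softmax read-out (the labels and final-layer weighted inputs are made up):

```python
import numpy as np

def one_hot(labels, categories):
    """1-of-C dummy-coding: one column per category."""
    return np.array([[1.0 if lab == c else 0.0 for c in categories] for lab in labels])

def softmax(z):
    """Softmax read-out: outputs are positive and sum to 1, so they read as probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

print(one_hot(["green", "blue", "red"], ["red", "green", "blue"]))
print(np.round(softmax(np.array([2.0, 1.0, 0.1])), 3))   # e.g. the output layer's weighted inputs
```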

Continuous y-data (income): If the data has no known bounds then we have a classic ‘regression’ problem (as opposed to ‘classification’) and no scaling is needed; in this case we can use a linear activation function (which amounts to no activation function).

Continuous bounded y-data (probability): The logistic and tanh functions are commonly used (the former outputs a range of [0,1] whereas the latter outputs a range of [-1,1], so the y-data may need to be re-scaled to match).

Standardising the y-data is needed if one has two or more target variables (with a scale-sensitive error-function like the MSE) and has no a priori reason to assume that one is more important than the other. For example: if one target has a range of 0 to 1 and another has 0 to 1,000,000 then the latter will be more ‘important’ for the error-function to get right. According to Warren S. Sarle: “it is essential to rescale the targets so that their variability reflects their importance, or at least is not in inverse relation to their importance”. In practice: this means one standardises their targets to have a mean of 0 and std dev of 1 (equal importance).

The Hidden Layer

How many hidden layers? How many neurons? Warren S. Sarle writes that the “best” number of hidden units depends on many things in a complex way (the number of input and output units, the number of training cases, the amount of noise, the complexity of the function, the type of activation function, regularization, etc.). He is keen to emphasise that “rules of thumb are nonsense”.

A good way to decide is to simply try many networks (hyper-parameter optimisation, e.g. grid search) with different numbers of hidden units and estimate the generalization error on a validation data-set to pick the best one (this will be our approach in the final section). However, one should typically not have more weights than training cases, as such a large network will over-fit.

The Input Data

If possible, the input variables should be uncorrelated (a NN is not a black box to blindly throw data at); principal component analysis (the Karhunen-Loève expansion) can remove linear correlation.
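A minimal sketch of decorrelating two inputs with scikit-learn’s PCA before feeding them to a network (the correlated toy inputs are made up; whitening additionally equalises their variances):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated toy inputs
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])
print(np.round(np.corrcoef(X, rowvar=False), 3))      # large off-diagonal correlation

# PCA rotates the inputs onto uncorrelated components
X_pca = PCA(whiten=True).fit_transform(X)
print(np.round(np.corrcoef(X_pca, rowvar=False), 3))  # off-diagonals are ~0
```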

Constant inputs are not useful and should be dropped (most software packages drop them automatically).

Missing data is handled differently by different packages (sometimes the mean for the class, or for the full population, is imputed), depending on data availability and how big the problem is (if we have lots of data and one observation is missing many variables then we can simply drop it).

Binary x-data (male/female): Should be coded as (-1,1) instead of (0,1). This is because if the data are centred near the origin then the initial hyperplanes will cut through the data in a variety of directions, whereas if the data are offset from the origin, as with (0,1) or (9,10), then many of the hyperplanes will miss the data entirely. If the hyperplanes miss, the activation function will saturate and make learning difficult.

Categorical x-data (religion): Basically, since we typically use weight-decay we would use 1-of-C coding. Some details: categorical x-data has to be broken into either 1-of-C or 1-of-(C-1) dummies (rather than kept as a single variable containing all the factors). Why? Imagine we coded (red, green, blue) as (1, 2, 3) and the training data contained an observation which was definitely not green but either red or blue. Training with that single variable will result in the output being an average of 1 and 3, which is 2 (green), instead of either 1 or 3. We could use 1-of-C or 1-of-(C-1) coding instead.

We could also use 1-of-(C-1) coding with “effects”, so that red = (1,0), green = (0,1), blue = (-1,-1). “Effects” coding is just like 1-of-(C-1) coding except that the omitted category has all variables coded as -1 (not 0). Theoretically, if we have a bias term then the two methods (effects and plain 1-of-(C-1)) are equivalent. However, effects coding has the advantage that the dummy variables “require no standardizing”, according to Warren S. Sarle.
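The three coding schemes side by side for the red/green/blue example (a minimal sketch; in practice a library encoder such as pandas.get_dummies would handle the 1-of-C part):

```python
import numpy as np

categories = ["red", "green", "blue"]

def one_of_c(label):
    """1-of-C: red=(1,0,0), green=(0,1,0), blue=(0,0,1)."""
    return np.array([1.0 if label == c else 0.0 for c in categories])

def one_of_c_minus_1(label):
    """1-of-(C-1): drop the last category; red=(1,0), green=(0,1), blue=(0,0)."""
    return one_of_c(label)[:-1]

def effects(label):
    """Effects coding: like 1-of-(C-1) but the omitted category is all -1s; blue=(-1,-1)."""
    return np.full(len(categories) - 1, -1.0) if label == categories[-1] else one_of_c_minus_1(label)

for lab in categories:
    print(lab, one_of_c(lab), one_of_c_minus_1(lab), effects(lab))
```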

James D. McCaffrey again mentions that if one wishes to use weight-decay (penalising the model for having high weights, a common technique to reduce over-fitting) then it is “preferable to use 1-of-C encoding”. This is because, with 1-of-(C-1) coding, weight-decay biases the output for the (C-1) categories towards the output for the one omitted category, whereas if 1-of-C coding is used then weight decay biases the output for all C categories towards the mean of all categories. Finally, a more advanced encoding method (“feature-hashing”, a fast and space-efficient way of vectorizing features) is not discussed here.

Continuous x-data (age): In general this should be normalised to the range [-1,1] (using “min-max normalisation”). “Gaussian normalisation” (standardisation) can also be used and often produces values roughly within [-10,10]. However, the right choice depends primarily on how the network combines input variables to compute the input to the next layer.

The difference between re-scaling, normalisation and standardising is perhaps best understood here:

We would normally want to standardise variables so that they have a common range/deviation. If the range is not the same then a variable with a range of [-1000,1000] would dwarf the contribution of a variable with range [-1,1] to the distance function.

We want to rescale variables to have a mean of 0 (part of standardisation) so that more of the initial hyperplanes cut through the data in different directions.

Why would we normalise? In theory there is no need to limit the range. However, in practice it helps standard back-propagation as it makes the Hessian matrix more stable. If the range is very small, e.g. (-0.1,0.1), then many of the initial hyperplanes would miss the data. If the range is very big, e.g. (-1000000,1000000), then the neuron would likely saturate and produce a flat gradient (this obviously depends on the activation function), which reduces learning. The following link by Warren S. Sarle mentions:

The main emphasis in the NN literature on initial values has been on the avoidance of saturation, hence the desire to use small random values. How small these random values should be depends on the scale of the inputs as well as the number of inputs and their correlations. Standardizing inputs removes the problem of scale dependence of the initial weights.

This article by James McCaffrey has a neat table to summarise the above:

[Image: McCaffrey’s summary table of encoding and normalisation choices]

“Efficient BackProp” – LeCun 1998

This paper contains a section called “Practical Tricks” which I would like to paraphrase:

  1. Stochastic learning is usually faster than batch-learning (using the full set of training data) and results in better solutions because of the ‘noise’ in the updates (which helps the weights jump to another local minimum).
  2. With stochastic learning, shuffling the training set so that successive examples come from ‘different’ classes which possess ‘different’ information is a good proxy for choosing information-rich data. We also want to present examples that produce a large error more frequently, to help teach our algorithm. This improves learning.
  3. Batch-learning can be advantageous when the conditions of convergence are well understood, and many acceleration techniques (e.g. conjugate gradient) only operate in batch learning.
  4. Normalising the inputs: average of each input variable should be close to zero and inputs should be scaled so that their covariances are roughly equal. Inputs should also be uncorrelated.
  5. Non-linear activation functions are what give neural networks their non-linear capability (and their ability to approximate any non-linear function), compared to e.g. a step-function.
  6. Activation functions (like tanh) which are symmetric about the origin are preferred over those that are not (e.g. the logistic), for the same reason that inputs should be normalised: they are more likely to produce outputs (which are inputs for the next layer) that are close to zero.
  7. Target values should be chosen at the point of the maximum second derivative of the activation function (e.g. -0.5 and 0.5 for the tanh function) so as to avoid saturating the output units. I think this may have been a way to overcome the processing limitations of the back-propagation method in 1998, as it appears contrary to us using [-1,1] nowadays. I think this is why we see papers using (0.1,0.9) for the logistic sigmoid; for example, Sarle mentions: “I suspect the popularity of this gimmick is due to the slowness of the standard backprop … gives you incorrect posterior probability estimates … gimmick is unnecessary if you use an efficient training algorithm.”
  8. Weights should be chosen randomly but in such a way that the activation function is primarily activated in its linear region: if the weights are all very large then the sigmoid will saturate, resulting in small gradients that make learning slow. However, LeCun states that: “achieving this requires coordination between the training set normalisation, choice of sigmoid, choice of weight initialisation”. (A minimal initialisation sketch follows after this list.)
  9. Each weight should have its own learning rate which is proportional to the square root of the number of inputs to the unit. Weights in lower layers should typically be larger than in the higher layers.
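The sketch promised in item 8: one standard way to achieve this (and, as far as I recall, the one suggested in the paper for standardised inputs) is to draw each weight from a zero-mean distribution with standard deviation 1/sqrt(fan-in), so that the weighted input stays in the activation’s roughly linear region. The layer sizes below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Weights ~ N(0, 1/n_in): with standardised, roughly uncorrelated inputs the
    weighted input z = w.x + b then has a standard deviation of about 1, which keeps
    sigmoid/tanh neurons in their near-linear region rather than saturated."""
    W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

x = rng.normal(size=100)                  # standardised inputs (mean 0, std dev 1)
W, b = init_layer(n_in=100, n_out=30)     # hypothetical layer sizes
z = W @ x + b
print(round(float(z.std()), 2))           # ~1: inside the activation's linear region
```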