Memory in neural networks
- chosen research topic: Liquid Time-constant Neural Networks
- approximates continuous functions -> sequence modeling problem
- learned representations show a CAUSAL STRUCTURE article
- goal for now: understand the capabilities and challenges of sequence modeling starting with the basics
- simplification: discretize continuous functions -> recurrent neural networks
- Recurrent Neural Networks introduction/overview
- given a sequence, the RNN processes its elements one-by-one
- stores a hidden state, which works as a memory
- contains information on previous data it has seen
- in practice for all elements of the sequence:
- both the input and the previous hidden state are transformed by learned weight matrices
- the results are combined and passed through a tanh function, which keeps the values between -1 and 1 (see the sketch after this list)
- the result is the new hidden state
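- a minimal NumPy sketch of the update described above (all names and sizes are made up for illustration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# toy dimensions, purely for illustration
input_dim, hidden_dim, seq_len = 3, 5, 10
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # input transform
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # recurrent transform
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                            # hidden state = the network's memory
for x_t in rng.normal(size=(seq_len, input_dim)):   # process the sequence one element at a time
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.round(3))                                   # final hidden state summarizes the whole sequence
```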
- The problem with RNNs
- RNNs are trained via backpropagation (Backpropagation Through Time, to be exact)
- has to backpropagate through every update of the hidden state using the chain rule
- the chain rule multiplies repeatedly by the derivative of tanh, which is at most 1 and much smaller once the activation saturates
- going backwards, the gradients used to update the weights shrink exponentially (small numeric demo after this list)
- vanishing gradients problem
- the earliest time steps end up contributing almost nothing to the weight updates
- the RNN effectively forgets the information from the beginning of the sequence
-> with a long enough sequence RNNs have a hard time carrying information from earlier time steps to later ones
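- a small numeric sketch of the effect (dimensions and weight scale are arbitrary): repeatedly multiplying a gradient by the per-step Jacobian of the tanh update shrinks its norm roughly exponentially with sequence length

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 5, 50
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # small recurrent weights

h = rng.normal(size=hidden_dim)
grad = np.ones(hidden_dim)                   # pretend gradient arriving at the last step
for t in range(T):
    h = np.tanh(W_hh @ h)                    # simplified update (input terms omitted)
    jacobian = np.diag(1 - h**2) @ W_hh      # d h_t / d h_{t-1}: tanh' is <= 1
    grad = jacobian.T @ grad                 # what backprop would multiply by at this step
    if (t + 1) % 10 == 0:
        print(f"after {t + 1:2d} steps: |grad| = {np.linalg.norm(grad):.2e}")
```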
- Solutions
- careful weight initialization
- orthogonal recurrent kernel article
- Xavier/Glorot for the input kernel (sketch of both initializations below)
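- a NumPy sketch of both initializations (helper names are my own); an orthogonal recurrent kernel has all singular values equal to 1, so repeated multiplication by it neither shrinks nor amplifies the gradient:

```python
import numpy as np

def orthogonal(n, rng):
    """Square orthogonal matrix from the QR decomposition of a random Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: keeps activation variance roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W_hh = orthogonal(hidden_dim, rng)                      # recurrent kernel
W_xh = xavier_uniform(input_dim, hidden_dim, rng)       # input kernel
print(np.allclose(W_hh @ W_hh.T, np.eye(hidden_dim)))   # True: W_hh is orthogonal
```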
- batch normalization article
- normalizes the input to a layer by adjusting and scaling it based on the mean and standard deviation of the batch
- stable distribution of inputs
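- a minimal sketch of the normalization itself (gamma and beta are learned in a real layer, and running statistics are kept for inference; all names here are assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return gamma * (x - mean) / (std + eps) + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=10.0, size=(32, 4))   # badly scaled inputs
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1 per feature
```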
- use of activation function with non-vanishing gradients
- risks of encountering the related exploding gradient problem
- ReLU: dying ReLU problem
- …
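- a quick comparison of the two derivatives (values purely illustrative): ReLU's gradient does not shrink for positive inputs, but it is exactly zero for non-positive ones, which is what makes units "die"

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.5, 3.0])
tanh_grad = 1 - np.tanh(x) ** 2    # saturates: close to 0 for large |x|
relu_grad = (x > 0).astype(float)  # 1 for x > 0, exactly 0 otherwise
print(tanh_grad.round(3))          # small values at the extremes -> vanishing gradients
print(relu_grad)                   # a unit whose inputs stay negative gets no gradient at all
```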
- specialized architectures: GRU, LSTM
- use internal mechanisms, called gates, that can regulate the flow of information (a GRU step is sketched at the end of these notes)
- learn which data in a sequence is important to keep or throw away
- LSTM (Long Short-Term Memory)
- GRU (Gated Recurrent Unit) medium article
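- a minimal GRU step in NumPy (biases omitted, names made up) to make the gating idea concrete: the update gate z decides how much of the old memory survives, the reset gate r decides how much of it feeds the new candidate state

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update: gates regulate what is kept, forgotten, and written."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x_t @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde            # blend old memory with new content

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
params = [rng.normal(scale=0.1, size=shape)
          for shape in [(input_dim, hidden_dim), (hidden_dim, hidden_dim)] * 3]
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):         # toy sequence
    h = gru_step(x_t, h, params)
print(h.round(3))
```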