Memory in neural networks
- chosen research topic: Liquid Time-constant Neural Networks
- approximates continuous functions -> sequence modeling problem
- learned representations show a CAUSAL STRUCTURE article
- goal for now: understand the capabilities and challenges of sequence modeling starting with the basics
- simplification: discretize continuous functions -> recurrent neural networks
- Recurrent Neural Networks introduction/overview
- given a sequence, the RNN processes its elements one-by-one
- stores a hidden state, which works as a memory
- contains information on previous data it has seen
- in practice for all elements of the sequence:
- both the input and the previous hidden state are transformed by learned weight matrices
- the results are combined and passed through a tanh function, which keeps the values between -1 and 1 (see the sketch after this list)
- the result is the new hidden state
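- a minimal NumPy sketch of the update described above (all names and sizes are made up for illustration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# toy dimensions, purely for illustration
input_dim, hidden_dim, seq_len = 3, 5, 10
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # input transform
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # recurrent transform
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                            # hidden state = the network's memory
for x_t in rng.normal(size=(seq_len, input_dim)):   # process the sequence one element at a time
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.round(3))                                   # final hidden state summarizes the whole sequence
```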
- The problem with RNNs
- RNNs are trained via backpropagation (Backpropagation Through Time, to be exact)
- has to backpropagate through every update of the hidden state using the chain rule
- the chain rule multiplies repeatedly by the derivative of tanh, which is at most 1 and much smaller once the activation saturates
- going backwards, the gradients used to update the weights shrink exponentially (small numeric demo after this list)
- vanishing gradients problem
- the earliest time steps end up contributing almost nothing to the weight updates
- the RNN effectively forgets the information from the beginning of the sequence
-> with a long enough sequence RNNs have a hard time carrying information from earlier time steps to later ones
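- a small numeric sketch of the effect (dimensions and weight scale are arbitrary): repeatedly multiplying a gradient by the per-step Jacobian of the tanh update shrinks its norm roughly exponentially with sequence length

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 5, 50
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # small recurrent weights

h = rng.normal(size=hidden_dim)
grad = np.ones(hidden_dim)                   # pretend gradient arriving at the last step
for t in range(T):
    h = np.tanh(W_hh @ h)                    # simplified update (input terms omitted)
    jacobian = np.diag(1 - h**2) @ W_hh      # d h_t / d h_{t-1}: tanh' is <= 1
    grad = jacobian.T @ grad                 # what backprop would multiply by at this step
    if (t + 1) % 10 == 0:
        print(f"after {t + 1:2d} steps: |grad| = {np.linalg.norm(grad):.2e}")
```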
- Solutions
- careful weight initialization
- orthogonal recurrent kernel article
- Xavier/Glorot for the input kernel (sketch of both initializations below)
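- a NumPy sketch of both initializations (helper names are my own); an orthogonal recurrent kernel has all singular values equal to 1, so repeated multiplication by it neither shrinks nor amplifies the gradient:

```python
import numpy as np

def orthogonal(n, rng):
    """Square orthogonal matrix from the QR decomposition of a random Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: keeps activation variance roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W_hh = orthogonal(hidden_dim, rng)                      # recurrent kernel
W_xh = xavier_uniform(input_dim, hidden_dim, rng)       # input kernel
print(np.allclose(W_hh @ W_hh.T, np.eye(hidden_dim)))   # True: W_hh is orthogonal
```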
- batch normalization article
- normalizes the input to a layer by adjusting and scaling it based on the mean and standard deviation of the batch
- stable distribution of inputs
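- a minimal sketch of the normalization itself (gamma and beta are learned in a real layer, and running statistics are kept for inference; all names here are assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return gamma * (x - mean) / (std + eps) + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=10.0, size=(32, 4))   # badly scaled inputs
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1 per feature
```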
- use of activation function with non-vanishing gradients
- risks of encountering the related exploding gradient problem
- ReLU: dying ReLU problem
- …
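- a quick comparison of the two derivatives (values purely illustrative): ReLU's gradient does not shrink for positive inputs, but it is exactly zero for non-positive ones, which is what makes units "die"

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.5, 3.0])
tanh_grad = 1 - np.tanh(x) ** 2    # saturates: close to 0 for large |x|
relu_grad = (x > 0).astype(float)  # 1 for x > 0, exactly 0 otherwise
print(tanh_grad.round(3))          # small values at the extremes -> vanishing gradients
print(relu_grad)                   # a unit whose inputs stay negative gets no gradient at all
```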
- specialized architectures: GRU, LSTM
- use internal mechanisms, called gates, that can regulate the flow of information (a GRU step is sketched at the end of these notes)
- learn which data in a sequence is important to keep or throw away
- LSTM (Long Short-Term Memory)
- GRU (Gated Recurrent Unit) medium article
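- a minimal GRU step in NumPy (biases omitted, names made up) to make the gating idea concrete: the update gate z decides how much of the old memory survives, the reset gate r decides how much of it feeds the new candidate state

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update: gates regulate what is kept, forgotten, and written."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x_t @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde            # blend old memory with new content

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
params = [rng.normal(scale=0.1, size=shape)
          for shape in [(input_dim, hidden_dim), (hidden_dim, hidden_dim)] * 3]
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):         # toy sequence
    h = gru_step(x_t, h, params)
print(h.round(3))
```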