
Online RL: agent takes actions in environment, gets rewards, observations
Offline RL: agent learns from experiences of other agents

Behavior Cloning (BC) -> learn to mimic the other agents
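Q-Learning -> learns a value Q(s, a) for each state-action pair: the expected discounted return of taking action a in state s and acting greedily afterwards
- off-policy, so it can learn from offline data (unlike on-policy policy gradient methods, which need fresh samples from the current policy)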
- Conservative Q-Learning (CQL) article
  - prevents Q-value overestimation
- Temporal Difference Learning medium
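
A minimal sketch of the temporal-difference update behind Q-learning, run over a fixed batch of logged transitions rather than a live environment. The two-state dataset, the function name `offline_q_learning`, and the hyperparameters are all made up for illustration.

```python
from collections import defaultdict

def offline_q_learning(transitions, n_actions, gamma=0.9, alpha=0.5, epochs=200):
    """Tabular Q-learning over a fixed list of (s, a, r, s_next, done) tuples."""
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(epochs):
        for s, a, r, s_next, done in transitions:
            # TD target: reward plus discounted value of the best next action.
            target = r if done else r + gamma * max(q[s_next])
            # Move Q(s, a) a step of size alpha toward the target.
            q[s][a] += alpha * (target - q[s][a])
    return q

# Hypothetical two-state chain logged by some other agent.
dataset = [
    (0, 1, 0.0, 1, False),  # state 0, action 1 -> state 1, no reward
    (1, 1, 1.0, 2, True),   # state 1, action 1 -> terminal, reward 1
    (0, 0, 0.0, 0, False),  # state 0, action 0 -> stays in state 0
]
q = offline_q_learning(dataset, n_actions=2)
greedy_action = {s: max(range(2), key=lambda a: q[s][a]) for s in (0, 1)}
```

Note that the max in the target also ranges over actions that never appear in the dataset (here action 0 in state 1, which keeps its initial value). With function approximation such unseen actions can get arbitrarily high estimates, which is the overestimation that CQL counters by penalizing Q-values of out-of-dataset actions.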

Reinforcement Learning Upside Down: Don’t Predict Rewards - Just Map Them to Actions link

Decision Transformer article video
- uses sequence modeling (GPT-style transformer) over trajectories of returns-to-go, states, and actions
- conditioned on the desired return (return-to-go), not on a single-step reward
- outputs action
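
A small sketch of how a logged episode could be turned into the kind of sequence a Decision Transformer is trained on: each timestep contributes a return-to-go, a state, and an action. The helper names and the string placeholders for states and actions are hypothetical; a real model would embed these tokens and feed them to a transformer.

```python
def returns_to_go(rewards):
    """Sum of rewards from each timestep to the end of the episode (undiscounted)."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) per timestep into one flat sequence."""
    seq = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("return", g), ("state", s), ("action", a)])
    return seq

rewards = [0.0, 1.0, 2.0]
rtg = returns_to_go(rewards)
seq = build_sequence(["s0", "s1", "s2"], ["a0", "a1", "a2"], rewards)
```

At test time the first return-to-go is set to the return you want the agent to achieve, and it is decremented by each reward actually received.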

RL as one big sequence modeling problem article
Q-Transformer article
Control-Oriented Learning for Dynamical Systems video

Imitation learning: learning an observation-action mapping from human demonstrations link
2 approaches:
- behavior cloning
  - directly learns from observation-action pairs link
  - dominant framework: DAgger (Dataset Aggregation) link
- indirectly via inverse reinforcement learning link
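
Behavior cloning is plain supervised learning on observation-action pairs. A minimal sketch for discrete observations, where the maximum-likelihood policy is simply the most frequent demonstrated action per observation; the demonstration data and the name `fit_behavior_cloning` are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_behavior_cloning(demonstrations):
    """Return a policy mapping each observation to its most frequent demonstrated action."""
    counts = defaultdict(Counter)
    for obs, action in demonstrations:
        counts[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}

# Hypothetical expert demonstrations as (observation, action) pairs.
demos = [
    ("red_light", "stop"),
    ("red_light", "stop"),
    ("green_light", "go"),
    ("red_light", "go"),  # a noisy demonstration
]
policy = fit_behavior_cloning(demos)
```

The policy has no entry for observations the expert never visited. DAgger addresses this by running the learned policy, asking the expert to label the states it reaches, adding those pairs to the dataset, and refitting.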
state-aware imitation learning link
- adds a secondary objective to the learning task to bias the policy towards states where more training data is available
meta-learning
- pre-train policies to adapt to a task -> one-shot learning link
generative adversarial imitation learning link
- frames imitation learning as a min-max optimization problem between a generator policy and a discriminator classifier
- end-to-end differentiable link
- incomplete demonstrations link
- imperfect demonstrations link
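
For reference, the min-max problem in generative adversarial imitation learning is commonly written as below. The symbols are assumptions of this note: $\pi$ is the learner (generator) policy, $\pi_E$ the expert policy, $D$ the discriminator, $H(\pi)$ the causal entropy of the policy, and $\lambda \ge 0$ its weight. In this convention the discriminator outputs values near 1 on learner samples and near 0 on expert samples.

```latex
\min_{\pi} \max_{D} \;
  \mathbb{E}_{(s,a) \sim \pi}\big[\log D(s,a)\big]
  + \mathbb{E}_{(s,a) \sim \pi_E}\big[\log\big(1 - D(s,a)\big)\big]
  - \lambda H(\pi)
```

The discriminator is trained to tell learner state-action pairs from expert ones, while the policy is updated to make its pairs indistinguishable from the expert's.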