The stack is basic. I develop on an old Lenovo laptop and test for a few dozen frames (you can learn a lot without a CUDA GPU) before pushing it to a desktop with a cheap NVIDIA card. It uses PyTorch and PyBoy, and the model is just a couple of Conv2d expansions and compressions before hitting a Linear layer that outputs a predicted reward for certain keypresses (basically). The training is based on deep Q-learning. I look at a PyTorch tutorial[1] when I get stuck, but I'm trying to fumble around and figure it out myself as much as possible before looking at it.
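For anyone curious, a model like that is only a handful of lines in PyTorch. This is just a sketch of the shape of it; the channel counts, the 144x160 frame size, and the 9-button action set are my assumptions, not the actual code:

```python
import torch
import torch.nn as nn

class TinyQNet(nn.Module):
    """A couple of conv compressions, then a Linear head scoring each keypress."""
    def __init__(self, n_actions=9, in_channels=1):   # 9 Game Boy buttons: an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # work out the flattened size for 144x160 Game Boy frames
            n_flat = self.features(torch.zeros(1, in_channels, 144, 160)).shape[1]
        self.head = nn.Linear(n_flat, n_actions)       # predicted reward per keypress

    def forward(self, x):
        return self.head(self.features(x))
```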
I have an idea to make Q-value propagation variable based on the amplitude of the reward, so that bigger rewards propagate more, but I haven't gotten there yet.
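If I had to guess at a first cut of that (purely my interpretation, not the author's design), it might just be a per-transition weight on the TD loss proportional to the reward magnitude:

```python
import torch

def weighted_td_loss(q_pred, q_next_max, reward, done, gamma=0.99):
    """Hypothetical 'bigger rewards propagate more': scale the TD loss by |reward|."""
    target = reward + gamma * q_next_max * (1.0 - done)
    weight = reward.abs().clamp(min=0.1, max=10.0)   # arbitrary bounds, an assumption
    return (weight * (q_pred - target.detach()).pow(2)).mean()
```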
Here is a great video on reinforcement learning[2].
I gave a talk at PyConSG this year[1], which included a demonstration of training a reinforcement learning model on a 'Bubble Breaker' game. There's also more detail available[2].
The Jupyter notebook is included in the GitHub repo[3], and includes a 'scaled down version' that takes ~5mins to train on a MacBook's CPU. There's also a downloadable 'full scale' model that was trained in ~7hours on a Titan X. It plays the game (on average) better than me...
You could also compile a neural net into a less Python-tied format, e.g. ONNX or TorchScript. In general, a siloed PyTorch env would be massive; I'm assuming at least a gig or two.
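Both exports are only a couple of lines, roughly like this (the toy model below is just a placeholder):

```python
import torch
import torch.nn as nn

# Placeholder network, standing in for whatever model you actually trained
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 82 * 82, 4),
)
model.eval()
example = torch.zeros(1, 1, 84, 84)  # dummy input with the deployment shape

# TorchScript: load later with torch.jit.load(); no Python class definitions needed
torch.jit.trace(model, example).save("model_ts.pt")

# ONNX: can be run from onnxruntime without a PyTorch install at all
torch.onnx.export(model, example, "model.onnx")
```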
I recently completed and open-sourced my PyTorch implementation of a Deep Q-Network (DQN) to play Atari Pong. The implementation follows the papers Playing Atari with Deep Reinforcement Learning (the DQN_neurips implementation) and Human-level control through deep reinforcement learning (the DQN_nature implementation).
You can train your agent from scratch or load a trained policy from a checkpoint file, and see videos as your agent is training.
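In case it helps anyone picture the workflow, evaluating a saved policy tends to look roughly like this; the file name, the skipped frame preprocessing, and the old (pre-0.26) gym step API are all assumptions on my part, not this repo's exact interface:

```python
import gym
import torch

env = gym.wrappers.RecordVideo(gym.make("PongNoFrameskip-v4"), video_folder="videos")
policy = torch.load("checkpoint.pt")   # hypothetical checkpoint file
policy.eval()

obs, done = env.reset(), False         # note: frame preprocessing/stacking omitted here
while not done:
    with torch.no_grad():
        q = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
    obs, reward, done, info = env.step(int(q.argmax(dim=1)))   # gym<0.26 4-tuple API
env.close()
```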
While trying to learn the latest in deep reinforcement learning, I was able to take advantage of many excellent resources (see credits [1]), but I couldn't find one that provided the right balance between theory and practice for my level of experience. So I decided to create something myself and open-source it for the community, in case it might be useful to someone else.
None of that would have been possible without all the resources listed in [1], but I rewrote all of the algorithms in this series of Python notebooks from scratch, with a "pedagogical approach" in mind. It is a hands-on, step-by-step tutorial on deep reinforcement learning techniques (up to ~2018/2019 SOTA), guiding you through theory and coding exercises on the most widely used algorithms (Q-Learning, DQN, SAC, PPO, etc.).
I shamelessly stole the title from a hero of mine, Andrej Karpathy, and his "Neural Networks: Zero to Hero" [2] work. I also meant to work on a series of YouTube videos, but haven't had the time yet. If this post gets any kind of interest, I might go back to it. Thank you.
P.S.: A friend of mine suggested I post here, so I followed their advice: this is my first post, and I hope it properly abides by the rules of the community.
Yes, and the only reason it doesn't work is that no one has written truly fast GPU implementations of them. Don't let anyone here teach you otherwise: even small-scale, crappy versions (like what I could code in NumPy) can successfully solve reinforcement learning problems rather quickly. Naysayers might tell you that it doesn't work, but they are wrong. Global optimization is strictly superior to local optimization in general, and we in the AI field are stuck deep in a local minimum right now.
I'll contrast with my anecdote: I built a simulation of 10x10 city blocks with a population of 30 pedestrians. I made a reward function based on distance to a random "target point", with a penalty for walking in the street vs. the sidewalk, and a penalty for walking into walls. The inputs were a 360-degree raycast of 16 samples plus "distance to target", and the outputs were WASD keyboard inputs. I left it running overnight, and by morning the bots were pretty efficiently walking around the city to their random targets via the sidewalks. It felt like magic. I could have coded the behavior directly, but the learned version seemed somewhat noisier and more organic. This was done a couple of years ago using a JavaScript deep Q-learning library. It feels like a squishier version of A* or something.
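A reward function like that really is only a few lines. Here's a rough Python reconstruction; the progress term and the penalty sizes are my guesses, not the original JavaScript:

```python
import math

def reward(agent_xy, target_xy, on_sidewalk, hit_wall, prev_dist):
    """Reward shaping sketch: progress toward the target, penalties for street/walls."""
    dist = math.dist(agent_xy, target_xy)
    r = prev_dist - dist                 # positive when the bot gets closer
    if not on_sidewalk:
        r -= 0.5                         # walking in the street
    if hit_wall:
        r -= 1.0                         # bumping into walls
    return r, dist                       # carry dist forward as next step's prev_dist
```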
You just use the rewards to optimize the function that tells you the predicted reward at any stage for a given action, and then take the best actions.
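In symbols, that function is Q(s, a), and the tabular version of the whole idea fits in a few lines (minimal sketch):

```python
import numpy as np

n_states, n_actions, alpha, gamma = 16, 4, 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """One Q-learning step: nudge Q(s,a) toward reward + discounted best next value."""
    target = r + (0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])

def act(s):
    """Act greedily w.r.t. the learned values (add some exploration in practice)."""
    return int(Q[s].argmax())
```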
Just as in my parent post: Q-learning is kind of old tech for AI. It works rather well, though, and is also computationally cheap to run. An old laptop could do it with ease.
Thanks! Coding would go a bit beyond the target audience here, but I do have some examples from experience and the internet. Whenever I start on a new problem, I've found there are two steps to repeat (neither is really coding time, more so training time). The first is to run some training to see if any hyperparameters of the RL agent need to be significantly adjusted (discount factor, learning rate, etc.), and the second is to actually train the best combination of agents. For initiating the training, there's very little new coding to do.
Now, if you also had to code a simulation environment for the agent to interact with, then that could be significant coding as you move to solve a new problem. Updating the state features/action space is minimal code, though. Hopefully that helps clarify!
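To make the first step concrete, the sweep is usually just a loop like this, where `train_agent` is a hypothetical stand-in for whatever training entry point you already have:

```python
import itertools

# Step 1: quick, short runs to see which hyperparameters even need attention.
gammas = [0.9, 0.99]          # discount factor
lrs = [1e-3, 1e-4]            # learning rate

results = {}
for gamma, lr in itertools.product(gammas, lrs):
    # train_agent is a placeholder for your own training function
    score = train_agent(gamma=gamma, lr=lr, steps=50_000)
    results[(gamma, lr)] = score

# Step 2: take the best combination and train it for real.
best_gamma, best_lr = max(results, key=results.get)
train_agent(gamma=best_gamma, lr=best_lr, steps=2_000_000)
```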
This repository focuses on a clean and minimal implementation of reinforcement learning algorithms. The highlight features of this repo are:
* Most algorithms are self-contained in single files with a common dependency file common.py that handles different gym spaces.
* Easy logging of training runs using TensorBoard, and integration with wandb.com to log experiments in the cloud. Check out [https://app.wandb.ai/costa-huang/cleanrltest]
* Easily customizable, and able to be debugged directly in Python’s interactive shell.
* Convenient use of command-line arguments for hyperparameter tuning.
Currently I support A2C, PPO, and DQN. If you are interested, please consider giving it a try :)
Motivation:
There are two types of RL library at the two ends of the spectrum. The first is the demo kind that really just demos what the algorithm is doing: it only deals with one gym environment, and it's hard to record experiments or tune parameters.
On the other end of the spectrum, we have OpenAI/baselines, ray-project/ray, and a couple of Google repos. My personal experience with them is that I could only run benchmarks with them. They try to write modular code and employ good software engineering practices, but the problem is that Python is a dynamic language with limited IDE support. As a result, I had no idea what the variable types in different files were, and it was very difficult to do any kind of customization. I had to read through dozens of files before even being able to try some experiments.
That’s why I created this repo, which leans towards the first kind but has more actual experimental support. I support multiple gym spaces (still working on it), command-line arguments to tune parameters, and very seamless experiment logging, all of which I believe are essential characteristics for building a research pipeline.
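The command-line part is plain argparse, something in this spirit (simplified; not the repo's exact flags):

```python
import argparse

parser = argparse.ArgumentParser(description="Minimal RL experiment launcher")
parser.add_argument("--env-id", type=str, default="CartPole-v1")
parser.add_argument("--learning-rate", type=float, default=2.5e-4)
parser.add_argument("--gamma", type=float, default=0.99)
parser.add_argument("--total-timesteps", type=int, default=100_000)
parser.add_argument("--seed", type=int, default=1)
parser.add_argument("--track", action="store_true", help="log the run to wandb")
args = parser.parse_args()

print(vars(args))  # every hyperparameter is now tweakable from the shell
```

So a run becomes something like `python dqn.py --learning-rate 1e-4 --gamma 0.95 --track` (hypothetical file name), and the parsed args double as the experiment config you log.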
The paper, “Hybrid Reward Architecture for Reinforcement Learning”[0], describes an evolution of DeepMind's Deep Q-Network (DQN) design[1], presented in early 2015, which was able to play many Atari games.
The HRA design requires more preprocessing (DQN worked on the raw pixels and the score alone), but the charts showing how much faster HRA learns and how much stronger it plays are convincing.
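The gist of the architecture is simple to sketch: split the reward into components, learn a Q-head per component, and act on the sum. A toy illustration of the idea (not the paper's exact network):

```python
import torch
import torch.nn as nn

class HybridQ(nn.Module):
    """Toy hybrid reward head: one Q estimate per reward component, summed to act."""
    def __init__(self, n_features, n_actions, n_components):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(n_features, n_actions) for _ in range(n_components)]
        )

    def forward(self, x):
        per_component = torch.stack([head(x) for head in self.heads])  # (k, B, A)
        return per_component, per_component.sum(dim=0)                 # act on the sum

q = HybridQ(n_features=32, n_actions=4, n_components=3)
per_head_q, total_q = q(torch.randn(1, 32))   # each head is trained on its own reward
action = total_q.argmax(dim=1)                # the agent acts on the aggregate
```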
I think Q-learning is really interesting; yesterday I posted a simple implementation/demo of Q-learning in JavaScript. This paper goes way beyond Q-learning by deducing states from the actual game rendering with a deep neural network, which is really cool. Regardless, as a first intro to Q-learning I had fun putting this together: https://news.ycombinator.com/item?id=9105818
I've been playing with reinforcement learning on simulated robots, using OpenAI Gym[0] and PyBullet[1].
I find it interesting because, in theory, the learned policies transfer to real-world robots, and the time is nigh for development in that area. Plus you get to watch it do funky robot shit and slowly get better.
If you want a quick way to dive in, I have a repo[2] I've been using to train on a variety of PCs and Google Colab[3]. The task is a Panda robot pushing a randomly placed object. I had some trouble getting it all working with recent Keras at first, so my repo might save you time.
I really like this line of work and I expect it will grow quite substantially over the next few years. Of course, reinforcement learning has been around for a long time. Similarly, Q-learning (the core model in this paper) has been around a very long time. What is new is that normally you see these models applied to toy MDP problems with simple dynamics, with linear Q-function approximations for fear of non-convergence, etc. What's novel about this work is that they fully embrace a complex non-linear Q function (a ConvNet) looking at the raw pixels, and get it to actually work in (relatively speaking) complex environments (games). This requires several important tricks, as is discussed at length in their Nature paper (e.g. experience replay, updating the Q function only once in a while, etc.).
I implemented the DQN algorithm (used in this work) in Javascript a while ago as well (http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo...) if people are interested in poking around, but my version does not implement all the bells and whistles.
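For anyone wondering what those bells and whistles amount to in code, the two big ones are roughly this (a sketch, not the Nature paper's exact setup):

```python
import random
from collections import deque

import torch

replay = deque(maxlen=100_000)   # experience replay buffer

def remember(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def sample_batch(batch_size=32):
    # uniform random minibatch -> breaks the correlation between consecutive frames
    batch = random.sample(replay, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return (torch.stack(states), torch.tensor(actions), torch.tensor(rewards),
            torch.stack(next_states), torch.tensor(dones, dtype=torch.float32))
    # (assumes states were stored as tensors)

# The second trick: compute targets from a frozen copy of the network,
# refreshed only every few thousand steps, e.g.:
# target_net.load_state_dict(online_net.state_dict())
```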
The results in this work are impressive, but also too easy to anthropomorphize. If you know what's going on under the hood, you can start to easily list off why this is unlike anything humans/animals do. Some of the limitations include:
- Most crucially, the exploration used is random. You button-mash random things and hope to receive a reward at some point, or you're completely lost. If anything at any point requires a precise sequence of actions to get a reward, exponentially more training time is necessary.
- Experience replay, which performs the model updates, samples uniformly at random, instead of using some kind of importance sampling. This one is easier to fix.
- A discrete set of actions is assumed. Any real-valued output (e.g. torque on a joint) is a non-obvious problem in the current model.
- There is no transfer learning between games. The algorithm always starts from scratch. This is very much unlike what humans do in their own problem solving.
- The agent's policy is reactive. It's as if you always forgot what you did 1 second ago. You keep repeatedly "waking up" to the world and get 1 second to decide what to do.
- Q-learning is model-free, meaning that the agent builds no internal model of the world/reward dynamics. Unlike us, it doesn't know what will happen to the world if it performs some action. This also means that it does not have any capacity to plan.
Of these, the biggest and most insurmountable problem is the first one: Random exploration of actions. As humans we have complex intuitions and an internal model of the dynamics of the world. This allows us to plan out actions that are very likely to yield a reward, without flailing our arms around greedily, hoping to get rewards at random at some point.
Games like StarCraft will significantly challenge an algorithm like this. You could expect the model to develop superhuman micro, but have difficulties with overall strategy. For example, performing an air drop into the enemy base would be impossible with the current model: you'd have to plan it out over many actions: "load the marines into the ship, fly the ship in stealth around the map, drop it at the precise location of the enemy base".
Hence, DQN is best at games that provide immediate rewards, and where you can afford to "live in the moment" without much planning. Shooting things in Space Invaders is a good example. Despite all these shortcomings, these are exciting results!
But consider the multi-modal Q-distribution in Fig. 3. That graph, and the variance of its peaks, is the key to understanding why training deep RL algorithms becomes so intractable in stochastic, real-world settings.
Max-entropy estimation and approximate Bellman equations will do very well in Atari games, for example, or in creating unbeatable MOBA competitors. But for robotics, what's needed is a breakthrough that allows quick generalization from sparse input data.
One-Shot Visual Imitation Learning via Meta-Learning
[1] https://pytorch.org/tutorials/intermediate/mario_rl_tutorial...
[2] https://www.youtube.com/watch?v=93M1l_nrhpQ&t=3381