Reflection: Deep RL with Hugging Face

Certificate of Excellence for the Hugging Face Deep Reinforcement Learning Course issued to Ryan Ruff on 3/8/2023
Look! I did a thing!

Back in January, I learned about the Hugging Face Deep Reinforcement Learning Course from Thomas Simonini and decided it would be fun to participate. Now that I’ve finished the course, I thought it would be a good idea to reflect back on what I learned from the experience and offer some feedback from an educational perspective. For brevity’s sake, here’s a high level summary of what I found to be the pros and cons:

Pros:

  • Using Google Colab made it very quick and easy to get started.
  • The sequence of problems was very well thought out and flows nicely from one to the next. I felt like each unit does an excellent job of setting up a need that gets addressed by the following unit.
  • It finds a nice balance between the mathematical and natural language descriptions of the discussed algorithms.
  • I enjoyed the use of video games as a training environment. Watching the animation playback provided helpful feedback on agent progress and kept the lessons engaging.
  • I developed an appreciation for how gym environments and wrappers could be used to standardize my code for training experiments.
  • I feel I now have a much better understanding of how the Hugging Face Hub might be useful to me as a developer.

Cons:

  • The usage limits on Google’s Colab eventually became a hindrance. It might be difficult to pass some of these lessons if that’s your only resource, and some of the units suffer from nightmarish dependencies if you try to install locally.
  • I really disliked the use of Google Drive links in lessons, particularly when they contained binaries. I’d feel a lot safer about trusting this content if it came from a “huggingface.co” source.
  • Some of the later units felt a little unpolished in comparison to the early ones. It was a little frustrating to spend increasing amounts of time debugging issues that turned out to be mere typos (“SoccerTows”) while also having drastically less scaffolding to work with.
  • Accessibility of these resources seemed very limited. Some of the text in images was difficult to read and lacking alt text. Some of the video content would benefit from a larger font and slower narration.
  • While training bots for Atari games was certainly fun, the lax attitude towards attribution is concerning from an ethical and legal standpoint.

Overall I had an enjoyable time going through the course. I found myself looking forward to each new unit and cheering on my model as I habitually refreshed the leaderboards. I did not, however, join the Hugging Face community on Discord, so I can’t comment on what the social elements of the course were like. Nothing personal, I just dislike Discord.

It’s probably also important to note that I came into the course with some prior experience with both Python and Deep Learning already. For me, I felt this course made a nice follow-up to Andrew Ng’s Deep Learning Specialization on Coursera in terms of content. While it might be possible to make it through the course without having used PyTorch or TensorFlow before, I feel like you’ll probably want to at least have a conceptual understanding of how neural networks work and some basic familiarity with Python package management before starting the course.

My favorite lesson in the course was the “Pixelcopter” problem in Unit 4. This was the first time in the course where hitting the cut-off score seemed like a real accomplishment. I probably spent more time here than on the rest of the course combined, but it was this productive struggle that taught me some important lessons. Up until this point I felt like I had just been running somebody else’s scripts. Here, it felt like my choice of hyperparameters made a huge difference in the outcome.

Part of the problem with choosing hyperparameters for Pixelcopter was keeping the training time within the constraints of running in Google Colab. If the training time was too short the bot wouldn’t produce a viable strategy, and if the training time was too long then Colab would time out. At this point, I went back to the bonus unit on Optuna to manage my hyperparameter optimization. I was able to get a model that produced what I thought was a high score, but the variance was so high that I didn’t quite reach the cut-off score.
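For anyone curious, my search ended up looking roughly like the sketch below. The hyperparameter names mirror the ones in the Unit 4 notebook as I remember them, and `train_and_evaluate` is a stand-in for the notebook’s Reinforce training loop and evaluation function, so treat this as a sketch rather than working course code:

```python
import optuna

def objective(trial):
    # Sample the knobs that seemed to matter most for Pixelcopter.
    hyperparameters = {
        "h_size": trial.suggest_categorical("h_size", [16, 32, 64]),
        "gamma": trial.suggest_float("gamma", 0.95, 0.999),
        "lr": trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "n_training_episodes": 20000,  # capped so one trial fits in a Colab session
    }
    # train_and_evaluate is a placeholder for the Reinforce training loop and
    # evaluation function from the Unit 4 notebook.
    mean_reward, std_reward = train_and_evaluate(hyperparameters)
    # Score trials the way the course leaderboard does (mean minus standard
    # deviation), so high-variance flukes don't win.
    return mean_reward - std_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```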

Eventually I got so frustrated with the situation that I set up a local JupyterLab server on an old laptop so I could train unattended for longer periods of time. However, this came with its own set of problems because I mistakenly tried to install more recent versions of the modules. Apparently the “gym” module had become “gymnasium” and undergone changes that made it incompatible with the sample code. In an effort to keep things simple, I rolled gym back to an earlier version so I could concentrate on what the existing code was doing.
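For anyone who hits the same wall: the breaking changes that bit me were mostly in the reset() and step() signatures. Roughly:

```python
import gym        # the older API the course notebooks expect
import gymnasium  # the newer API I had mistakenly installed

# gym (pre-0.26 API): reset() returns only the observation, step() returns 4 values
env = gym.make("CartPole-v1")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())

# gymnasium: reset() also returns an info dict, step() splits "done" into two flags
env = gymnasium.make("CartPole-v1")
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
done = terminated or truncated
```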

Once I got my local development environment going, I let Optuna run a bunch of tests with different hyperparameters. This gave me some insight into how sensitive these variables really were. Some bots would never find a strategy that worked. Other bots would find a strategy that seemed to work at first, then something would go wrong in a training episode and the performance would start dropping instead. With this in mind, I decided to add an extra layer to my model and started looking more closely at the videos produced by my agents.
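For reference, my modified policy network looked roughly like this, reconstructed from memory rather than copied from the notebook:

```python
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """The Unit 4 policy network as I remember it, with one extra hidden layer."""
    def __init__(self, s_size, a_size, h_size):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size)  # the extra layer I added
        self.fc3 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)
```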

What I noticed from the videos was that some bots would attempt to take the “safe” route and some would take the “risky” route. The ones that took the safe route tended to do well in the early parts of an episode, but started to crash once the blocks sped up. The ones that took the risky route did much better in later stages, but their early crashes made overall performance unpredictable.

In an effort to stabilize my agent’s performance, I started playing around with different environment wrappers. The “Time-Aware” Observation Wrapper seemed to help a little, but I ran into problems with gym again when I attempted to implement a “Frame Stack”. Apparently there was a bug in the specific version I had rolled back to, and explicitly pinning my gym version to 0.21 resolved the issue. With a flattened, multi-frame, time-aware observation, the bot was able to come up with a more viable strategy.
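With gym pinned to 0.21, the wrapping ended up looking something like this. The environment id and the exact order I stacked the wrappers in are from memory, so take this as a sketch rather than the notebook’s exact code:

```python
import gym
import gym_pygame  # registers Pixelcopter-PLE-v0
from gym.wrappers import TimeAwareObservation, FrameStack, FlattenObservation

env = gym.make("Pixelcopter-PLE-v0")
env = TimeAwareObservation(env)     # append the current timestep to the observation
env = FrameStack(env, num_stack=4)  # keep the last 4 observations
env = FlattenObservation(env)       # flatten the stack into a single vector for the MLP

obs = env.reset()
print(env.observation_space.shape, obs.shape)
```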

Video showing my Pixelcopter bot’s growth with more input data

What really drove this lesson home was that it set up a real need for the actor-critic methods in Unit 6. I knew precisely what was meant by “significant variance in policy gradient estimation”. I also learned in Unit 6 why the Normalization wrapper I had tried before wasn’t working: I didn’t know I had to load the saved normalization statistics into my evaluation environment. All of these small elements came together at the same time. My extensive trial and error with Pixelcopter in Unit 4 had shown me precisely why that approach would be insufficient for the robotics applications in Unit 6. I felt like understanding this need really solidified the driving ideas behind the actor-critic model.
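For the Unit 6 setup with Stable-Baselines3, that means loading the saved VecNormalize statistics into the evaluation environment. A minimal sketch, with `make_env` standing in for whatever function builds the underlying environment:

```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Training: wrap the vectorized env so observations and rewards are normalized.
train_env = VecNormalize(DummyVecEnv([make_env]), norm_obs=True, norm_reward=True)
# ... train the model on train_env ...
train_env.save("vec_normalize.pkl")

# Evaluation: load the SAME running statistics into a fresh env, otherwise the
# agent sees observations on a completely different scale than it trained on.
eval_env = VecNormalize.load("vec_normalize.pkl", DummyVecEnv([make_env]))
eval_env.training = False     # don't update the statistics while evaluating
eval_env.norm_reward = False  # report raw episode rewards
```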

I also thoroughly enjoyed the “SoccerTwos” problem in Unit 7. However, the part where I had to download the Unity binaries was very disconcerting. Not only was the link hosted on “drive.google.com”, but the files inside the zip folder were misnamed “SoccerTows” instead of “SoccerTwos”. It looks like this issue may have been corrected since then, but I won’t deny it caused a moment of panic when I couldn’t find the model I’d been training because it wound up in a slightly different location than I expected. I feel like Hugging Face should have the resources to host these files themselves, and the fact that there were typos in the filenames makes me wonder whether enough attention is being paid to potential supply chain vulnerabilities.

My least favorite unit had to be Unit 8 Part 1. I felt like I was being expected to recreate the Mona Lisa after simply watching a video of someone painting it. I didn’t really feel like I was learning anything except how to copy and paste code into the right locations. And this might be a sign of my age, but it was extremely frustrating not to be able to clearly read the code in the video. Some of the commands run in the console are only on screen for a second, and it’s not always clear where the cursor is. The information may be good, but the presentation leaves much to be desired. As a suggestion to the authors, I’d consider splitting this content up and showing how to set up a good development environment earlier in the course, so this unit can focus more on the PPO details.

While the Atari games made for fun examples, I felt a little uneasy with the way they were included in this course. Specifically, the course presents Space Invaders in a manner that seems to attribute it to Atari when it was technically made by Taito. This is more a complaint for OpenAI, as the primary maintainer of gym, than it is for Hugging Face, but I got the distinct impression that the Atari games in the RL Zoo are technically being pirated. After finding this arXiv paper on the subject, it looks like OpenAI erroneously assumed that the liberal license to the code in that paper gave them justification to use the Atari ROMs as benchmarks for large-scale development. Given that these ROMs are being used to derive new commercial products, what might have been “fair use” by the original paper is now potentially copyright infringement on a massive scale. I strongly believe the developers of Space Invaders deserve to be both cited and paid for their work if it’s going to be used by AI companies in this way.

In conclusion, I think completing this course gave me a better understanding of what Hugging Face is attempting to build. The course taught me the struggle of producing reproducible machine learning experiments and demonstrated the need for a standardized process for sharing pre-trained models. This free course is a great introduction to these resources. At the same time, the course also drew my attention to the ways this hub might be misused. I think I would feel more comfortable using models from the Hugging Face Hub if I knew that the models hosted there were sourced from ethically collected data. A good starting point might be to add a clearly identified “code license” and “data license” on project home pages. While Hugging Face says this should be included in the model’s README, a lot of the projects I saw on the hub didn’t readily include this information. I sincerely hope Hugging Face will take appropriate efforts to enforce a level of community standards that prevents it from turning into “the wild west” of AI.
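For what it’s worth, checking whether a repo declares any license at all is straightforward with the huggingface_hub library; a quick sketch, with the repo id as a placeholder:

```python
from huggingface_hub import ModelCard

# "some-user/some-model" is a placeholder repo id, not a real project.
card = ModelCard.load("some-user/some-model")
print(card.data.license)  # None if the author never declared a license
```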

In any event, thank you to Thomas Simonini and Hugging Face for putting this course together. I really did have a fun time and learned a good deal in the process!

The models I built during this course can be found through my Hugging Face profile here.
