Maze: Applied Reinforcement Learning for Real-World Problems
We are excited to announce Maze, a new framework for applied reinforcement learning (RL). Beyond introducing Maze, this blog post also outlines the motivation behind it, what distinguishes it from other RL frameworks, and how Maze can support you when applying RL — and hopefully prevent some headaches along the way.
tl;dr: What is Maze and (why) do we need another RL framework?
Maze is a framework for applied reinforcement learning. It focuses on solutions for practical problems arising when dealing with use cases outside of academic and toy settings. To this end, Maze provides tooling and support for:
- Scalable state-of-the-art algorithms capable of handling multi-step/auto-regressive, multi-agent, and hierarchical RL settings with dictionary action and observation spaces.
- Building powerful perception models with building blocks for (graph-)convolution, self- and graph attention, recurrent architectures, action and observation masking, and more.
- Complex workflows like imitation learning from teacher policies and fine-tuning.
- Customizable event logging, enabling insights into agents’ behavior and facilitating debugging.
- Customization of trainers, wrappers, environments, and other components.
- Leveraging best practices like reward scaling, action masking, observation normalization, KPI monitoring, and more.
- Easy and flexible configuration.
This breadth of features reflects a holistic approach to the RL life-cycle and development process. In contrast, many other RL frameworks follow a more narrow approach by prioritizing algorithms above all other, potentially just as crucial, aspects of building RL-driven applications.
Maze offers both a CLI and an API. Thorough documentation is available. To get you started, we provide Jupyter notebooks you can run locally or in Colab (and further examples in the documentation). Maze utilizes PyTorch for its networks.
We aim to achieve a high degree of configurability while maintaining an intuitive interface and resorting to sane defaults whenever appropriate. Running Maze can be as easy as this:
from maze.api.run_context import RunContext
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

rc = RunContext(env=lambda: GymMazeEnv('CartPole-v0'))
rc.train(n_epochs=50)

# Run the trained policy.
env = GymMazeEnv('CartPole-v0')
obs = env.reset()
done = False
while not done:
    action = rc.compute_action(obs)
    obs, reward, done, info = env.step(action)
However, every realistic, practical RL use case will be considerably more complex than CartPole-v0. Problems will need to be debugged, behavior recorded and interpreted, components (like policies) configured and customized. This is where Maze shines: It offers several features that support you in mitigating and resolving those issues. Among others:
- An extensive, event-based logging system for recording and understanding your agent’s behavior in detail.
- A versatile environment structure allowing for flexibility in how an environment is represented in the action and observation space.
- A powerful configuration system (based on Hydra) enabling to quickly change the agent and environment composition without changing a single line of code.
Maze is completely independent of other RL frameworks, but plays nicely with some of them — like RLlib: agents can be trained with RLlib's algorithm implementations while still operating within Maze. This way, the full extent of Maze's features can be leveraged while RLlib's algorithms complement the native trainers.
If this has caught your interest, check out the links mentioned in the introduction — and, of course: read on, there is more to come.
Introduction: Solving Real-World Problems with RL Is (Often) Hard
Reinforcement learning (RL) is, together with supervised and unsupervised learning, one of the three branches of modern machine learning. Set apart by its abilities to learn from approximate reward signals and to devise long-term strategies, it is a great fit for many complex, challenging real-world problems. This includes supply chain management, design optimization, chip design, and plenty of others — including, of course, many popular games.
Why, then, haven’t we seen more widespread adoption of RL in the industry so far? Compared to the more established un- and supervised learning, traditional RL is rather sample-inefficient and in almost all cases requires interaction with a simulation of the problem domain. However, simulators both accurate and fast enough are often unavailable, and creating them is usually not trivial. On top of that, RL has a reputation for being difficult to implement, train and debug — after all, more moving components have to work in harmony compared to an isolated supervised model.
We have worked on several reinforcement learning projects in recent years, and experienced just that: due to their complexity, real-world problem settings often require sophisticated tooling support to develop, evaluate, and debug successfully. We are big fans of existing frameworks like RLlib and Stable Baselines and draw a lot of inspiration from them. However, we have encountered enough hiccups and limitations during our projects to start wondering how to design a framework that mitigates those recurring issues in the first place.
This motivated us to start working on Maze: a reinforcement learning framework that puts practical concerns in the development and productionization of RL applications front and center.
Maze emphasizes being an applied and practical RL framework supporting users in building RL solutions to real-world problems. In our opinion, this necessitates adherence to some guiding principles and software engineering best practices:
- Extensive, up-to-date documentation. Our exhaustive documentation describes underlying concepts, presents code examples, and includes tutorials as well as tips & tricks for effectively and efficiently training RL agents. We strive to keep the documentation updated and the presented code snippets and examples runnable.
- Test-driven. Plenty of unit tests assert that breaking changes and unexpected behavior are quickly noticed.
- Hyperparameters: Sane defaults & transparency. RL, even more than un- and supervised ML, involves lots of hyperparameters. It can be a daunting task to find the right combination of values enabling agents to learn efficiently. We provide sane defaults as configuration files. For each run, Maze logs the hyperparameter values used, ensuring transparency at all times.
- Code quality. Maze aspires to be production-ready. This includes a certain level of code quality — you won’t find any arcane, undocumented snippets here.
- Usability. We aim to provide a smooth user experience with Maze, despite its complexity and amount of configurability. The CLI allows training and rollouts in a single command each. The API requires only two statements to train an agent and offers additional convenience functions for frequent tasks like policy evaluation. Both can be initialized with a minimal amount of information.
- Reproducibility. Due to its many moving parts, RL can be tricky to reproduce. We use consistent seeding in our environments and algorithms to ensure that identically configured runs yield the same results.
- Customizability. A framework not flexible enough to support your ideas can be frustrating. We designed Maze to be modular and loosely coupled to enable the customization of individual components without having to interact with low-level code or worry about inter-dependencies with other components.
Beyond environments, policies, and wrappers this also applies to e.g. trainers and runners, which launch and coordinate training and rollout jobs. This allows the customization of training algorithms and/or deployment scenarios.
Maze in Action
You’ve seen the minimal code setup to get Maze up and running. What does a realistic project with all the nuts and bolts look like? As a case study, we tried our hand at the “Learning to run a power network” (L2RPN) challenge, in which the specified power grid has to be kept running by avoiding blackouts. We re-implemented last year’s winning solution with Maze and have published the code on GitHub along with a write-up on Medium.
Another use case is supply chain management: we utilize Maze to optimize the stock replenishment process for the steel producer voestalpine. As this is not an open-source project, we cannot provide its source code. However, if this is up your alley, we strongly recommend giving the joint talk by voestalpine and us on this topic at the applied AI conference for logistics a listen.
More open-source projects are in our pipeline and will be announced shortly.
Discussing every one of Maze's features is beyond the scope of this article, so we focus on a few of our favorites here. The title of each paragraph links to the corresponding page in the documentation.
Maze offers a CLI via the maze-run command-line script. As maze-run is built on Hydra, it can process configuration settings both directly and via .yaml files. A simple training run on CartPole-v0 might look like this:
maze-run -cn conf_train env.name=CartPole-v0
If you’d like to train with PPO:
maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo
While a CLI can be convenient, many use cases require more direct interaction with the framework and its components. To this end, we provide a high-level API in Python handling all of the complexity involved in configuring and running training and rollout, so that you don’t need to worry about any of it. We aim to keep API and CLI usage as congruent as possible — if you know how to use one, you should be good to go with the other.
Debugging code is hard. Debugging RL systems is even harder. With Maze, you can fire events whenever something notable happens in your environment or with your agent. Events can then be aggregated and inspected in Tensorboard to get an overview across many episodes. You can also log them to a CSV file and inspect them with a tool of your choice.
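The core idea can be illustrated with a minimal, framework-independent sketch (the class and event names below are hypothetical, not Maze's actual API): events are fired with arbitrary payloads during an episode and aggregated afterwards.

```python
from collections import defaultdict

class EventLog:
    """Minimal stand-in for an event-based logging system (illustrative only)."""

    def __init__(self):
        self.events = defaultdict(list)

    def fire(self, name, **payload):
        # Record one occurrence of a named event with an arbitrary payload.
        self.events[name].append(payload)

    def aggregate(self, name, key, fn):
        # Aggregate one payload field across all occurrences of an event.
        values = [e[key] for e in self.events[name]]
        return fn(values) if values else None

log = EventLog()
# Fire events whenever something notable happens in the environment ...
log.fire("order_rejected", size=3)
log.fire("order_rejected", size=5)
log.fire("inventory_restocked", amount=10)

# ... and inspect aggregated statistics per episode afterwards.
print(log.aggregate("order_rejected", "size", sum))  # 8
print(log.aggregate("order_rejected", "size", len))  # 2
```

In Maze itself, such aggregated statistics end up in Tensorboard or CSV files rather than on stdout.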
The setup for an RL-centered application can be quite complex — especially when experimenting with and comparing different configurations. Hydra is designed with such use cases in mind. We leverage its abilities to enable users to quickly compose and modify their configurations without having to change a single line of code.
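To make the override idea concrete, here is a simplified stdlib-only sketch of how Hydra-style `group=option` and `dotted.key=value` command-line overrides compose a configuration (the group names and values are made up for illustration; Hydra's real machinery is far more capable):

```python
import copy

# Hypothetical default configuration groups, loosely mimicking Hydra's
# "config group" idea (keys and values are invented for this example).
CONFIG_GROUPS = {
    "algorithm": {
        "a2c": {"name": "a2c", "lr": 0.0005},
        "ppo": {"name": "ppo", "lr": 0.0003, "clip_range": 0.2},
    },
}

DEFAULTS = {"env": {"name": "CartPole-v0"}, "algorithm": CONFIG_GROUPS["algorithm"]["a2c"]}

def compose(overrides):
    """Apply 'group=option' and 'dotted.key=value' overrides to the defaults."""
    cfg = copy.deepcopy(DEFAULTS)
    for ov in overrides:
        key, value = ov.split("=", 1)
        if key in CONFIG_GROUPS:
            # Swap in a whole config group, e.g. `algorithm=ppo`.
            cfg[key] = copy.deepcopy(CONFIG_GROUPS[key][value])
        else:
            # Set a single nested value, e.g. `env.name=CartPole-v0`.
            # (For simplicity, values stay strings here.)
            node = cfg
            *path, leaf = key.split(".")
            for part in path:
                node = node[part]
            node[leaf] = value
    return cfg

# Mirrors `maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo`:
cfg = compose(["env.name=CartPole-v0", "algorithm=ppo"])
print(cfg["algorithm"]["name"])  # ppo
```

The payoff is exactly what the paragraph describes: swapping the algorithm or environment composition requires changing an override string, not a line of code.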
To learn an effective policy, an agent first has to be able to make sense of its environment, e.g. by utilizing neural networks to learn useful feature representations. That's why Maze's perception module includes extendable, configurable building blocks to conveniently build PyTorch-based neural networks. It also allows visualizing the assembled networks.
The OpenAI gym spaces interface is the unofficial standard for environments in the RL community. Maze adheres to this standard but allows a larger degree of freedom with its structured environments. This is done by decoupling the environment from the converter, which transforms the environment's state into a gym-compatible observation space object. This allows for easier experimentation with different state/observation representations: you can write and configure different converters that change the representation the policy observes without touching the environment's core logic.
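A minimal sketch of this decoupling, with hypothetical class names (not Maze's actual API): the same raw state is turned into two different observation representations by swapping the converter, while the environment logic stays untouched.

```python
from dataclasses import dataclass

@dataclass
class GridState:
    """Raw environment state (illustrative): line loads of a small power grid."""
    line_loads: list  # load per transmission line, in percent

class FlatConverter:
    """Expose the raw loads directly as the observation."""

    def to_observation(self, state: GridState) -> dict:
        return {"loads": list(state.line_loads)}

class ThresholdConverter:
    """Expose only binary overload flags -- same state, different representation."""

    def __init__(self, limit: float = 90.0):
        self.limit = limit

    def to_observation(self, state: GridState) -> dict:
        return {"overloaded": [1 if load > self.limit else 0 for load in state.line_loads]}

state = GridState(line_loads=[45.0, 97.5, 88.0])
print(FlatConverter().to_observation(state))       # {'loads': [45.0, 97.5, 88.0]}
print(ThresholdConverter().to_observation(state))  # {'overloaded': [0, 1, 0]}
```

Which converter to use then becomes a configuration choice rather than a code change.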
Structured environments also define all interfaces necessary to support more complex scenarios such as multi-step/auto-regressive, multi-agent, and hierarchical RL.
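To illustrate what "multi-step/auto-regressive" means here, consider this simplified, framework-independent sketch (all names are hypothetical): one flat decision is split into two sub-steps, where the valid choices in the second sub-step depend on the first — a structure that also enables action masking.

```python
import random

random.seed(0)

# Hypothetical two-sub-step action selection: first choose a machine, then
# choose a quantity whose valid range depends on the chosen machine.
CAPACITY = {"press": 4, "lathe": 2}

def select_machine() -> str:
    # Sub-step 1: pick a machine (uniformly, as a stand-in for a policy).
    return random.choice(sorted(CAPACITY))

def select_quantity(machine: str) -> int:
    # Sub-step 2 is conditioned on sub-step 1: only quantities the chosen
    # machine can actually handle are valid (a simple form of masking).
    return random.randint(1, CAPACITY[machine])

machine = select_machine()
quantity = select_quantity(machine)
assert 1 <= quantity <= CAPACITY[machine]
print({"machine": machine, "quantity": quantity})
```

Structured environments formalize exactly this kind of conditioned, multi-part decision so that policies, converters, and trainers can all work with it.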
Tensorboard is widely used for monitoring the training and evaluation of deep learning models, also independently of TensorFlow (many libraries, e.g. PyTorch, integrate Tensorboard). Maze hooks its event logging system up to Tensorboard: everything that is logged will be visible there, including the run configuration, your custom events, action sampling statistics, and the observation distribution.
We aim for Maze to include state-of-the-art, scalable, and practically relevant algorithms. As of now, the following algorithms are supported: A2C, PPO, IMPALA, SAC, behavioral cloning, and evolutionary strategies. More are in the pipeline, e.g. SACfD, AlphaZero, and MuZero. Maze also offers out-of-the-box support for advanced training workflows such as imitation learning from teacher policies and policy fine-tuning.
If you want to explore Maze, we recommend giving our "Getting Started" notebooks a shot — you can spin up a Google Colab notebook by clicking the Colab button at the top of the notebook. This way you can try out Maze without having to install anything locally. Alternatively, you can install Maze with:
pip install maze-rl
If you enjoyed this article — stay tuned! We plan to regularly release articles revolving around RL and featuring Maze from now on. Some of the upcoming articles will introduce the soon-to-be-released model zoo. Others will be part of a series dedicated to exploring challenging aspects of using RL in practice and how to tackle them with Maze.
We are continuously developing Maze, making it more mature and adding new features. We love community contributions — no matter whether they are new features, bug reports, feature requests, or simply questions. Don't hesitate to open an issue on GitHub, drop us a line on GitHub discussions, or ask a question tagged maze-rl on StackOverflow. And if you decide to use Maze in one of your projects — tell us, we might feature your project in our documentation.
For further information, we encourage you to check out Maze on GitHub, its documentation on readthedocs.io, and learning resources such as Jupyter notebooks runnable on Google Colab. We hope that Maze is as useful to others as it is to us, and look forward to seeing what the RL community at large will build with it. Give it a try and let us know what you think!