Reinforcing Power Grids: A Baseline Implementation for L2RPN

Jun 25, 2021


The “Learning to run a power network” (L2RPN) challenge is a series of competitions proposed by Kelly et al. (2020) with the aim of testing the potential of reinforcement learning to control electrical power transmission. The challenge is motivated by the fact that existing methods cannot handle real-time network operations on short temporal horizons within a reasonable compute budget. Moreover, power networks face a steadily growing share of renewable energy, requiring faster responses. This raises the need for highly robust and adaptive power grid controllers.
In 2020, one such competition was run at the IEEE World Congress on Computational Intelligence (WCCI). The winners published their novel approach, combining a Semi-MDP with an afterstate representation, at ICLR 2021 and made their implementation publicly available. The latest iteration of the L2RPN challenge offers a welcome opportunity to introduce our RL framework Maze and to replicate the winning approach with it. The corresponding code is available here.

Maze is an application-oriented reinforcement learning framework that aims to enable AI-based optimization for a wide range of industrial decision processes and to make RL as a technology accessible to industry and developers.

Maze covers the complete development life cycle of RL applications — ranging from simulation engineering to agent development, training, and deployment. If this piqued your interest, we’d encourage you to check out our GitHub repository and the official documentation.
You can try Maze by installing the pip package, pulling the Docker image, or running notebooks on Google Colab.

SMAAC with Maze

Figure: Conceptual overview of the Semi-Markov Decision Process (SMDP) with afterstate representation implemented in Maze.

We adapt the original implementation of the authors to make it usable within the Maze framework. Above is a conceptual draft of the control flow in Maze when using it to solve a Semi-Markov Decision Process (SMDP) with afterstate representation. The SMDP logic is implemented via an environment wrapper, while the transformation into the afterstate representation is performed in the Observation Conversion Interface.
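To make the control flow concrete, here is a minimal, self-contained sketch of the SMDP wrapper idea. This is not Maze's actual API; `ToyGridEnv`, `SMDPWrapper`, and all names are hypothetical stand-ins illustrating how a high-level "goal topology" action is executed as several low-level steps with discounted reward accumulation, returning the afterstate (the observation after the goal has been applied):

```python
class ToyGridEnv:
    """Hypothetical stand-in for a power-grid simulation."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def step(self, topology):
        self.t += 1
        obs = {"topology": topology, "t": self.t}  # state after the change
        reward = 1.0                               # reward for surviving a step
        done = self.t >= self.horizon
        return obs, reward, done


class SMDPWrapper:
    """Executes one high-level goal topology as several low-level steps,
    accumulating the discounted reward, and returns the afterstate."""

    def __init__(self, env, gamma=0.99):
        self.env = env
        self.gamma = gamma

    def step(self, goal_topology, n_substeps=3):
        total_reward, discount, done = 0.0, 1.0, False
        for _ in range(n_substeps):
            obs, reward, done = self.env.step(goal_topology)
            total_reward += discount * reward
            discount *= self.gamma
            if done:
                break
        # obs is the afterstate: the observation once the goal is in place
        return obs, total_reward, done
```

In the actual Maze implementation, the low-level stepping lives in an environment wrapper and the afterstate transformation in the Observation Conversion Interface, so the agent itself only ever sees the high-level decision points.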

As proposed in the original paper, we train the agent with Soft Actor-Critic (SAC).

Training with Maze

Training an agent with Maze comes with a few nice features, which we will highlight here. The configuration of Maze and its experiments is done via Hydra. For the present example, we provide all the necessary configuration files for environment, models, wrappers, and the experiment (which includes the hyperparameters and algorithm to use). They can be found here. With these, you can start training simply with:

maze-run -cn conf_train env=maze_smaac model=maze_smaac_nets wrappers=maze_smaac +experiment=sac_train

(Note that the parent folder of the hydra_config folder has to be in the PYTHONPATH.)

Logging and Monitoring

During training, Maze outputs a variety of useful statistics and information in TensorBoard and on the command line. Here is a screenshot of a typical output you will get on your command line when training with Maze.

You will also see the evolution of these training and task-specific events and KPIs over the course of agent optimization in Tensorboard (SCALARS tab).

To give a glimpse of why this is useful, we show train_ActionEvents, which already provides some valuable insights: although the agent performs substantially more NOOP actions (no topology changes), as expected for this problem, it also performs quite a few topology change actions. This gives a nice intuition about the behaviour of the agent and shows already at training time that it is not simply converging to the trivial NOOP policy.
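The underlying idea of event-based logging can be sketched with a few lines. The class below is purely illustrative (not Maze's event system); it shows how counting action events per training epoch yields the NOOP vs. topology-change ratios discussed above:

```python
from collections import Counter


class ActionEventLog:
    """Illustrative sketch of per-epoch action-event counting
    (not Maze's actual event system)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, action):
        # Treat None as a NOOP (no topology change), anything else as a change.
        key = "noop" if action is None else "topology_change"
        self.counts[key] += 1

    def summary(self):
        """Return the relative frequency of each event type."""
        total = sum(self.counts.values())
        return {key: count / total for key, count in self.counts.items()}
```

Aggregating such counts per epoch and writing them to TensorBoard is what makes a collapsing-to-NOOP policy visible early in training.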

To confirm this intuition we can further inspect the corresponding action distributions.

Here we see an example of the distributions of sampled high-level actions (named goal_topology) for an evaluation environment at the start of training. They are all nicely distributed around zero, which is exactly what we want at the beginning of training to foster exploration of all possible actions.

We can also check out the observation distributions telling us if our observation normalization strategy works as expected and remains stable during the course of training.
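A common way to implement such a normalization strategy is to track running statistics of each observation dimension and standardize incoming observations with them. The sketch below (Welford's online algorithm; not Maze's actual normalization wrapper, all names are illustrative) shows the principle being monitored:

```python
import numpy as np


class RunningNormalizer:
    """Tracks a running mean/variance per observation dimension
    (Welford's online algorithm) and standardizes observations."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)   # sum of squared deviations
        self.count = 0
        self.eps = eps

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        var = self.m2 / max(self.count - 1, 1)
        return (obs - self.mean) / np.sqrt(var + self.eps)
```

Plotting the distributions of the normalized observations over training, as Maze does, makes it easy to spot dimensions that drift or were normalized with unrepresentative statistics.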

If the Perception Module was used to create your models, you will also get images that depict the respective network architectures.


To produce a rollout of a trained policy you simply need to run:

maze-run policy=torch_policy env=maze_smaac model=maze_smaac_nets wrappers=maze_smaac_rollout runner=evaluation input_dir=experiment_logdir

To get a GIF of a rollout, like the one shown at the top of this post, you have to change the export flag to true in conf/wrappers/smaac_l2rpn_rollout.yaml and just rerun the command above.

export: true
duration: 0.5

This can be quite nice for a visual inspection of your agent’s behaviour.


To evaluate whether our implementation learns a policy that actually improves over the baseline (NOOP policy), we employ the original evaluation script of the SMAAC authors in a custom Maze RolloutEvaluator. Results are shown in the table below. 10 chronics were chosen by the SMAAC authors as test cases. These chronics are forwarded for a given number of steps (given in parentheses in the Test Chronic (ffw) column), after which the agent takes control and performs up to 864 steps on its own. The environment is manually set to done after 864 steps, so 864 is the maximum number of achievable steps.
We compare the results of the official SMAAC repo, our Maze implementation, and the NOOP baseline and provide the increase in performance over the NOOP policy.
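The evaluation protocol described above can be sketched as a simple loop. This is a hypothetical, self-contained illustration (function and class names are ours, not from the SMAAC evaluation script): fast-forward the chronic, hand control to the policy, and count the steps it survives up to the 864-step cap:

```python
MAX_STEPS = 864  # the episode is manually cut after 864 steps


class ToyChronicEnv:
    """Hypothetical stand-in for a chronic-based grid environment that
    fails (done=True) once an internal time horizon is reached."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, self.t >= self.horizon


def evaluate_chronic(env, policy, ffw_steps, max_steps=MAX_STEPS):
    """Fast-forward the chronic by ffw_steps, then count how many
    steps the policy survives on its own (capped at max_steps)."""
    obs = env.reset()
    for _ in range(ffw_steps):          # forwarding phase: do-nothing actions
        obs, done = env.step(None)
        if done:
            return 0
    survived = 0
    while survived < max_steps:
        obs, done = env.step(policy(obs))
        if done:
            break
        survived += 1
    return survived
```

A perfect score for a chronic is thus 864 survived steps, which is the yardstick against which both the SMAAC agents and the NOOP baseline are measured.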

Evaluation results

We can see that while our implementation does not exactly match the original on every test chronic, the results are comparable and show a clear improvement over the NOOP baseline; they could likely be improved further with additional fine-tuning.


This blog post presents a re-implementation of the winning approach of last year's WCCI L2RPN challenge with the RL framework Maze. It also provides some first insights into the capabilities of Maze through examples of Event and KPI Logging, Observation and Action Distribution Monitoring, and the powerful Perception Module.

If this caught your attention, we invite you to check out the Maze SMAAC repository, the Maze repository, or the official documentation of Maze as there is a lot more to explore.
We’d be happy if this repository is useful to you as a starting point for your own power grid (and any other) RL experiments — or maybe even for winning the next challenge!

Stay tuned for upcoming blog posts, check out Maze on GitHub and visit us on LinkedIn.




enliteAI is a technology provider for Artificial Intelligence specialized in Reinforcement Learning and Computer Vision/geoAI.