Commit

Merge pull request #489 from assume-framework/463-general-tutorial-notebook-fixes

463 general tutorial notebook fixes
kim-mskw authored Nov 21, 2024
2 parents ef325c0 + 446ec25 commit 18a8b62
Showing 15 changed files with 3,340 additions and 9,232 deletions.
22 changes: 11 additions & 11 deletions docs/source/learning.rst
@@ -9,7 +9,7 @@ Reinforcement Learning Overview
One unique characteristic of ASSUME is the usage of Reinforcement Learning (RL) for the bidding of the agents.
To enable this the architecture of the simulation is designed in a way to accommodate the learning process. In this part of
the documentation, we give a short introduction to reinforcement learning in general and then pinpoint you to the
relevant parts of the code. the descriptions are mostly based on the following paper
relevant parts of the code. The descriptions are mostly based on the following paper
Harder, Nick & Qussous, Ramiz & Weidlich, Anke. (2023). Fit for purpose: Modeling wholesale electricity markets realistically with multi-agent deep reinforcement learning. Energy and AI. 14. 100295. `10.1016/j.egyai.2023.100295 <https://doi.org/10.1016/j.egyai.2023.100295>`_.

If you want a hands-on introduction check out the prepared tutorial in Colab: https://colab.research.google.com/github/assume-framework/assume
@@ -18,7 +18,7 @@ If you want a hands-on introduction check out the prepared tutorial in Colab: https://colab.research.google.com/github/assume-framework/assume
The Basics of Reinforcement Learning
=====================================

In general RL and deep reinforcement learning (DRL), in particular, open new prospects for agent-based electricity market modeling.
In general, RL and deep reinforcement learning (DRL) in particular, open new prospects for agent-based electricity market modeling.
Such algorithms offer the potential for agents to learn bidding strategies in the interplay between market participants.
In contrast to traditional rule-based approaches, DRL allows for a faster adaptation of the bidding strategies to a changing market
environment, which is impossible with fixed strategies that a market modeller explicitly formulates. Hence, DRL algorithms offer the
@@ -105,8 +105,8 @@ Similar to TD3, the smaller value of the two critics and target action noise :ma
.. math::

    y_{i,k} = r_{i,k} + \gamma \cdot \min_{j=1,2} Q_{i,\theta'_j}\left(S'_k, a_{1,k}, \ldots, a_{N,k}, \pi'(o_{i,k})\right)
where r_i,k is the reward obtained by agent i at time step k, γ is the discount factor, S′_k is the next state of the
environment, and π′(o_i,k) is the target policy of agent i.
where :math:`r_i,k` is the reward obtained by agent :math:`i` at time step :math:`k`, :math:`\gamma` is the discount factor, :math:`S'_k` is the next state of the
environment, and :math:`\pi'(o_i,k)` is the target policy of agent :math:`i`.
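
To make this target computation concrete, the following is a minimal PyTorch-style sketch of the clipped double-Q target with target action noise. All names, the noise bounds and the action range are illustrative assumptions, and the sketch omits the other agents' actions that the centralised critics additionally receive; it is not the ASSUME implementation.

.. code-block:: python

    import torch

    def clipped_double_q_target(reward, next_state, next_obs_i,
                                critic_1_target, critic_2_target, actor_target,
                                gamma=0.99, noise_std=0.2, noise_clip=0.5):
        """Illustrative computation of the target y for one agent (not ASSUME code)."""
        with torch.no_grad():
            # target action of agent i, perturbed by clipped noise (target action smoothing)
            next_action = actor_target(next_obs_i)
            noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(0.0, 1.0)

            # both target critics evaluate the next state and the smoothed target action
            q1 = critic_1_target(next_state, next_action)
            q2 = critic_2_target(next_state, next_action)

            # clipped double Q-learning: use the smaller estimate to curb overestimation
            return reward + gamma * torch.min(q1, q2)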

The critics are trained using the mean squared Bellman error (MSBE) loss:

@@ -120,8 +120,8 @@ The actor policy of each agent is updated using the deterministic policy gradient
.. math::

    \nabla_a Q_{i,\theta_j}\left(S_k, a_{1,k}, \ldots, a_{N,k}, \pi(o_{i,k})\right)\Big|_{a_{i,k}=\pi(o_{i,k})} \cdot \nabla_{\theta} \pi(o_{i,k})
The actor is updated similarly using only one critic network Q_{θ1}. These changes to the original DDPG algorithm allow increased stability and convergence of the TD3 algorithm. This is especially relevant when approaching a multi-agent RL setup, as discussed in the foregoing section.
Please note that the actor and critics are updated by sampling experience from the buffer where all interactions of the agents are stored, namley the observations, actions and rewards. There are more complex buffers possible, like those that use importance sampling, but the default buffer is a simple replay buffer. You can find a documentation of the latter in :doc:`buffers`
The actor is updated similarly using only one critic network :math:`Q_{θ1}`. These changes to the original DDPG algorithm allow increased stability and convergence of the TD3 algorithm. This is especially relevant when approaching a multi-agent RL setup, as discussed in the foregoing section.
Please note that the actor and critics are updated by sampling experience from the buffer where all interactions of the agents are stored, namely the observations, actions and rewards. There are more complex buffers possible, like those that use importance sampling, but the default buffer is a simple replay buffer. You can find documentation of the latter in :doc:`buffers`
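
As a rough, self-contained illustration of these two updates, the following sketch samples a batch from a buffer, regresses both critics onto the target from the sketch above, and performs the deterministic policy gradient step through the first critic. The buffer interface and all names are hypothetical, not the ASSUME API.

.. code-block:: python

    import torch.nn.functional as F

    def update_step(buffer, critic_1, critic_2, actor,
                    critic_optimizer, actor_optimizer, targets, batch_size=128):
        """One illustrative critic and actor update from sampled experience (sketch)."""
        # sample stored interactions: observations, actions, rewards, next observations
        obs, actions, rewards, next_obs = buffer.sample(batch_size)

        # critic update: regress both critics onto the shared clipped double-Q target,
        # reusing the sketch above; targets = (critic_1_target, critic_2_target, actor_target)
        y = clipped_double_q_target(rewards, next_obs, next_obs, *targets)
        critic_loss = F.mse_loss(critic_1(obs, actions), y) + F.mse_loss(critic_2(obs, actions), y)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # actor update through the first critic only (in TD3 this step is additionally delayed)
        actor_loss = -critic_1(obs, actor(obs)).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()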


The Learning Implementation in ASSUME
@@ -136,15 +136,15 @@ The Actor
We will explain the way learning works in ASSUME starting from the interface to the simulation, namely the bidding strategy of the power plants.
The bidding strategy, per definition in ASSUME, defines the way we formulate bids based on the technical restrictions of the unit.
In a learning setting, this is done by the actor network, which maps the observation to an action. The observation thereby is managed and collected by the units operator as
summarized in the following picture. As you can see in the current working version the observation space contains of a residula load forecast for the next 24 h and aprice forecast for 24 h as well as the
the current capacity of the powerplant and its marginal costs.
summarized in the following picture. As you can see in the current working version, the observation space contains a residual load forecast for the next 24 hours and a price
forecast for 24 hours, as well as the current capacity of the power plant and its marginal costs.

.. image:: img/ActorTask.jpg
:align: center
:width: 500px
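
Purely for illustration, such an observation vector could be assembled along the following lines; the variable names are made up, and the actual collection happens in the units operator.

.. code-block:: python

    import numpy as np

    def build_observation(residual_load_forecast, price_forecast, current_capacity, marginal_cost):
        """Stack the 24 h forecasts and the unit state into one flat observation vector."""
        return np.concatenate([
            np.asarray(residual_load_forecast, dtype=np.float32),  # 24 hourly values
            np.asarray(price_forecast, dtype=np.float32),          # 24 hourly values
            np.array([current_capacity, marginal_cost], dtype=np.float32),
        ])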

The action space is a continuous space, which means that the actor can choose any price between 0 and the maximum bid price defined in the code. It gives two prices for two different party of its capacity.
One, namley :math:`p_inlfex` for the minimum capacity of the power plant and one for the rest ( :math:`p_flex`). The action space is defined in the config file and can be adjusted to your needs.
The action space is a continuous space, which means that the actor can choose any price between 0 and the maximum bid price defined in the code. It gives two prices for two different parts of its capacity.
One, namely :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest (:math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs.
After the bids are formulated in the bidding strategy they are sent to the market via the units operator.
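
The following is a hedged sketch of how the two actor outputs could be turned into the two bids; the scaling to the maximum bid price and all names are assumptions for illustration, not the bidding strategy shipped with ASSUME.

.. code-block:: python

    def actions_to_bids(action, max_bid_price, min_power, max_power):
        """Map two actor outputs in [0, 1] to prices for the inflexible and flexible capacity."""
        p_inflex = float(action[0]) * max_bid_price  # price for the minimum (must-run) capacity
        p_flex = float(action[1]) * max_bid_price    # price for the remaining capacity
        return [
            {"price": p_inflex, "volume": min_power},
            {"price": p_flex, "volume": max_power - min_power},
        ]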

.. image:: img/ActorOutput.jpg
Expand All @@ -171,7 +171,7 @@ You can read more about the different algorithms and the learning role in :doc:`
The Learning Results in ASSUME
=====================================

Similarly, to the other results, the learning progress is tracked in the database, either with postgresql or timescale. The latter, enables the usage of the
Similarly to the other results, the learning progress is tracked in the database, either with postgresql or timescale. The latter enables the usage of the
predefined dashboards to track the learning process in the "Assume:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting.
A more detailed description is given in the dashboard itself.

33 changes: 17 additions & 16 deletions docs/source/learning_algorithm.rst
@@ -6,22 +6,23 @@
Reinforcement Learning Algorithms
##################################

In the chapter :doc:`learning` we got an general overview about how RL is implements for a multi-agent setting in Assume. In the case one wants to apply these RL algorithms
to a new problem, one does not necessarily need to understand how the RL algorithms are are working in detail. The only thing needed is the adaptation of the bidding strategies,
which is covered in the tutorial. Yet, for the interested reader we will give a short overview about the RL algorithms used in Assume. We start with the learning role which is the core of the leanring implementation.

In the chapter :doc:`learning` we got a general overview of how RL is implemented for a multi-agent setting in Assume.
If you want to apply these RL algorithms to a new problem, you do not necessarily need to understand how the RL algorithms work in detail.
All that is needed is to adapt the bidding strategies, which is covered in the tutorial.
However, for the interested reader, we will give a brief overview of the RL algorithms used in Assume.
We start with the learning role, which is the core of the learning implementation.

The Learning Role
=================

The learning role orchestrates the learning process. It initializes the training process and manages the experiences gained in a buffer.
Furthermore, it schedules the policy updates and, hence, brings the critic and the actor together during the learning process.
Particularly this means, that at the beginning of the simulation, we schedule recurrent policy updates, where the output of the critic is used as a loss
of the actor, which then updates its weights using backward propagation.
The learning role orchestrates the learning process. It initializes the training process and manages the experience gained in a buffer.
It also schedules policy updates, thus bringing critic and actor together during the learning process.
Specifically, this means that at the beginning of the simulation we schedule recurrent policy updates, where the output of the critic
is used as a loss for the actor, which then updates its weights using backward propagation.
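
Conceptually, this schedule can be pictured as in the sketch below; the loop and the method names are placeholders rather than the actual learning-role interface.

.. code-block:: python

    def run_training_episode(learning_role, buffer, n_steps, train_freq=24):
        """Recurrent policy updates interleaved with the simulation steps (sketch)."""
        for step in range(n_steps):
            # ... the market simulation advances and new experience is written to the buffer ...
            if step % train_freq == 0 and len(buffer) > 0:
                # the critic's output provides the loss signal for the actor update
                learning_role.update_policy()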

With the learning role, we can also choose which RL algorithm should be used. The algorithm and the buffer have base classes and can be customized if needed.
But without touching the code, there are easy adjustments to the algorithms that can, and in some cases need to, be made in the config file.
The following table shows the options that can be adjusted and gives a short explanation. For more advanced users is the functionality of the algorithm also documented below.
The following table shows the options that can be adjusted and gives a short explanation. For more advanced users, the functionality of the algorithm is also documented below.



@@ -43,7 +44,7 @@ The following table shows the options that can be adjusted and gives a short explanation
batch_size The batch size of experience considered from the buffer for an update.
gamma The discount factor, with which future expected rewards are considered in the decision-making.
device The device to use.
noise_sigma The standard deviation of the distribution used to draw the noise, which is added to the actions and forces exploration. noise_scale
noise_sigma The standard deviation of the distribution used to draw the noise, which is added to the actions and forces exploration.
noise_dt Determines how quickly the noise weakens over time.
noise_scale The scale of the noise, which is multiplied by the noise drawn from the distribution.
early_stopping_steps The number of steps considered for early stopping. If the moving average reward does not improve over this number of steps, the learning is stopped.
@@ -58,15 +59,15 @@ TD3 (Twin Delayed DDPG)
-----------------------

TD3 is a direct successor of DDPG and improves it using three major tricks: clipped double Q-Learning, delayed policy update and target policy smoothing.
We recommend reading OpenAI Spinning guide or the original paper to understand the algorithm in detail.
We recommend reading the OpenAI Spinning guide or the original paper to understand the algorithm in detail.

Original paper: https://arxiv.org/pdf/1802.09477.pdf

OpenAI Spinning Guide for TD3: https://spinningup.openai.com/en/latest/algorithms/td3.html

Original Implementation: https://github.com/sfujim/TD3

In general the TD3 works in the following way. It maintains a pair of critics and a single actor. For each step so after every time interval in our simulation, we update both critics towards the minimum
In general, the TD3 works in the following way. It maintains a pair of critics and a single actor. For each step (after every time interval in our simulation), we update both critics towards the minimum
target value of actions selected by the current target policy:


@@ -77,7 +78,7 @@ target value of actions selected by the current target policy:
Every :math:`d` iterations, which is implemented with the train_freq, the policy is updated with respect to :math:`Q_{\theta_1}` following the deterministic policy gradient algorithm (Silver et al., 2014).
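
A minimal sketch of this delayed schedule, together with the soft update of the target networks that follows it, is given below; the value of :math:`\tau` and the data structures are illustrative assumptions, not the ASSUME implementation.

.. code-block:: python

    import torch

    def delayed_actor_and_target_update(step, d, actor_update_fn, online_target_pairs, tau=0.005):
        """Update the actor and soft-update all targets only every d-th critic update (sketch)."""
        if step % d != 0:
            return
        actor_update_fn()
        with torch.no_grad():
            for online, target in online_target_pairs:
                for p, p_t in zip(online.parameters(), target.parameters()):
                    # Polyak averaging: the targets slowly track the online networks
                    p_t.mul_(1.0 - tau).add_(tau * p)
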
TD3 is summarized in the following picture from the others of the original paper (Fujimoto, Hoof and Meger, 2018).
TD3 is summarized in the following picture from the authors of the original paper (Fujimoto, Hoof and Meger, 2018).


.. image:: img/TD3_algorithm.jpeg
@@ -88,17 +89,17 @@ TD3 is summarized in the following picture from the others of the original paper
The steps in the algorithm are translated to implementations in ASSUME in the following way.
The initialization of the actors and critics is done by the :func:`assume.reinforcement_learning.algorithms.matd3.TD3.initialize_policy` function, which is called
in the learning role. The replay buffer needs to be stable across different episodes, which correspond to runs of the entire simulation, hence it needs to be detached from the
entities of the simulation that are killed after each episode, like the elarning role. Therefore, it is initialized independently and given to the learning role
entities of the simulation that are killed after each episode, like the learning role. Therefore, it is initialized independently and given to the learning role
at the beginning of each episode. For more information regarding the buffer see :doc:`buffers`.
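
A much reduced, illustrative sketch of such an episode-spanning buffer is shown below; the interface is an assumption and far simpler than the documented buffer classes.

.. code-block:: python

    import random
    from collections import deque

    class SimpleReplayBuffer:
        """A plain FIFO buffer that outlives the simulation entities of a single episode."""

        def __init__(self, max_size=100_000):
            self.storage = deque(maxlen=max_size)

        def add(self, obs, action, reward, next_obs):
            self.storage.append((obs, action, reward, next_obs))

        def sample(self, batch_size):
            # uniform sampling; an importance-sampling buffer would weight transitions instead
            return random.sample(self.storage, min(batch_size, len(self.storage)))

    # created once before the first episode and handed to the learning role at each episode start
    buffer = SimpleReplayBuffer()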

The core of the algorithm is embodied by the :func:`assume.reinforcement_learning.algorithms.matd3.TD3.update_policy` in the learning algorithms. Here the critic and the actor are updated according to the algorithm.
The core of the algorithm is embodied by the :func:`assume.reinforcement_learning.algorithms.matd3.TD3.update_policy` in the learning algorithms. Here, the critic and the actor are updated according to the algorithm.

The network architecture for the actor in the RL algorithm can be customized by specifying the network architecture used.
In stablebaselines3 they are also referred to as "policies". The architecture is defined as a list of names that represent the layers of the neural network.
For example, to implement a multi-layer perceptron (MLP) architecture for the actor, you can set the "actor_architecture" config item to ["mlp"].
This will create a neural network with multiple fully connected layers.
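
For orientation, an MLP actor of this kind could look like the following sketch; the layer sizes and the output scaling are arbitrary choices and not the classes defined in the code base.

.. code-block:: python

    import torch
    import torch.nn as nn

    class MLPActor(nn.Module):
        """A small multi-layer perceptron mapping an observation to a bounded action."""

        def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Sigmoid(),  # bid-price actions scaled to [0, 1]
            )

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            return self.net(obs)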

Other available options for the "policy" include Long-Short-Term Memory (LSTMs). The architecture for the observation handling is implemented from [2].
Note that the specific implementation of each network architecture is defined in the corresponding classes in the codebase. You can refer to the implementation of each architecture for more details on how they are implemented.
Note, that the specific implementation of each network architecture is defined in the corresponding classes in the codebase. You can refer to the implementation of each architecture for more details on how they are implemented.

[2] Y. Ye, D. Qiu, J. Li and G. Strbac, "Multi-Period and Multi-Spatial Equilibrium Analysis in Imperfect Electricity Markets: A Novel Multi-Agent Deep Reinforcement Learning Approach," in IEEE Access, vol. 7, pp. 130515-130529, 2019, doi: 10.1109/ACCESS.2019.2940005.
3 changes: 2 additions & 1 deletion docs/source/release_notes.rst
@@ -13,7 +13,8 @@ Upcoming Release
The features in this section are not released yet, but will be part of the next release! To use the features already you have to install the main branch,
e.g. ``pip install git+https://github.com/assume-framework/assume``

**Bugfixes:**
**Bugfixes:**
- **Tutorials**: General fixes of the tutorials to align with updated functionalities of Assume
- **Tutorial 07**: Aligned Amiris loader with changes in format in Amiris compare (https://gitlab.com/fame-framework/fame-io/-/issues/203 and https://gitlab.com/fame-framework/fame-io/-/issues/208)

v0.4.3 - (11th November 2024)
