Add a code "map" section to the developer documentation #965

Open: wants to merge 5 commits into base: develop (changes from 3 commits)
1 change: 1 addition & 0 deletions docs/src/code/client.rst
@@ -9,6 +9,7 @@ Client helper functions
client/cli
client/experiment
client/manual
client/runner

.. automodule:: orion.client
:members:
5 changes: 5 additions & 0 deletions docs/src/code/client/runner.rst
@@ -0,0 +1,5 @@
Runner client
=============

.. automodule:: orion.client.runner
:members:
1 change: 1 addition & 0 deletions docs/src/developer/overview.rst
@@ -9,6 +9,7 @@ The documentation for developers is organized in the following, easy to read, se

* :doc:`Getting Started <installing>`. Installing the development environment
* :doc:`Conventions <standards>`. Get familiar with the project's standards and guidelines
* :doc:`Source code map <plan>`. Get familiar with the interactions in the code.
* :doc:`Testing <testing>`. Implementing your changes and how to test your code
* :doc:`Documenting <documenting>`. Documenting your changes and updating the documentation
* :doc:`Continuous Integration <ci>`. Get familiar with our continuous integration setup
136 changes: 136 additions & 0 deletions docs/src/developer/plan.rst
@@ -0,0 +1,136 @@
***************
Source code map
***************

This document walks the path of an Orion experiment through the
code. Not every detail is explained, but there are ample links to the
classes and methods involved if you want to dig further into a
particular section.

Departure
---------

You start an experiment by running ``orion hunt <script> <params>``.

The code in :py:func:`orion.core.cli.main` will parse the command line
arguments and route to :py:func:`orion.core.cli.hunt.main`.
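
The routing step can be pictured with a minimal ``argparse`` sketch. This
is not Orion's actual parser: ``hunt_main``, ``build_parser`` and the way
the script parameters are collected are assumptions made for illustration.

```python
import argparse


def hunt_main(args, script_params):
    """Hypothetical stand-in for the ``hunt`` sub-command entry point."""
    return {"command": "hunt", "script": args.script, "params": script_params}


def build_parser():
    """Build a parser with one sub-command that records its handler."""
    parser = argparse.ArgumentParser(prog="orion")
    subparsers = parser.add_subparsers(dest="command", required=True)
    hunt = subparsers.add_parser("hunt")
    hunt.add_argument("script")
    # Each sub-command stores the function that should handle it.
    hunt.set_defaults(func=hunt_main)
    return parser


def main(argv):
    """Parse ``argv`` and route to the handler chosen by the sub-command."""
    # Options the parser does not know are set aside; in this sketch they
    # stand for the parameters of the user script.
    args, script_params = build_parser().parse_known_args(argv)
    return args.func(args, script_params)
```

Calling ``main(["hunt", "train.py", "--lr", "0.1"])`` dispatches to the
``hunt`` handler with the script name and the leftover parameters.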

The command line arguments are passed to
:py:func:`orion.core.io.experiment_builder.build_from_args`, which
does some setup and hands the arguments over to
:py:func:`orion.core.io.experiment_builder.build`. This in turn passes
the configuration to
:py:func:`orion.core.io.experiment_builder.consolidate_config`, which
looks up the experiment in the configured storage to see if it already
exists and merges the loaded configuration with the provided one,
using various helpers from :py:mod:`orion.core.io.resolve_config`, to
build the final configuration. The result is eventually handed off to
:py:func:`orion.core.io.experiment_builder.create_experiment` to
create an :py:class:`orion.core.worker.experiment.Experiment` and set
its properties.

If the experiment is new, meaning it has no storage id, the builder
attempts to save it to storage. This may conflict if another
instance of ``orion hunt`` is doing the same thing. The storage is
responsible for reporting conflicts, and
:py:func:`orion.core.io.experiment_builder.build` is called again
recursively in that case to retry the whole operation.
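
The load-or-create logic with a retry on conflict can be sketched as
follows. This is a toy model, not the code of ``experiment_builder``:
``ToyStorage``, ``DuplicateKeyError`` and the merge order (provided values
override stored ones) are all assumptions.

```python
class DuplicateKeyError(Exception):
    """Hypothetical error raised by the storage when the experiment exists."""


class ToyStorage:
    """In-memory stand-in for the configured storage (illustration only)."""

    def __init__(self):
        self.experiments = {}

    def fetch(self, name):
        return self.experiments.get(name)

    def create(self, name, config):
        # The storage, not the caller, is the one that detects the conflict.
        if name in self.experiments:
            raise DuplicateKeyError(name)
        self.experiments[name] = dict(config, _id=len(self.experiments) + 1)
        return self.experiments[name]


def build(storage, name, config):
    """Load or create an experiment, retrying if another process won the race."""
    stored = storage.fetch(name)
    if stored is not None:
        # Assumed merge order: stored values are the base, provided ones override.
        return {**stored, **config}
    try:
        return storage.create(name, config)
    except DuplicateKeyError:
        # Someone registered it between fetch() and create(): start over.
        return build(storage, name, config)
```

On the second pass the experiment is found in storage, so the recursion
ends after one retry.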

The created experiment finds its way back to
:py:func:`orion.core.cli.hunt.main` and is handed off to
:py:func:`orion.core.cli.hunt.workon` along with some more
configuration for the workers.

This function sets up a few more objects to manage the optimization
process: a :py:class:`orion.core.worker.consumer.Consumer` to act as
the bridge to the user script and an
:py:class:`orion.client.experiment.ExperimentClient` to coordinate
everything. It then calls
:py:meth:`orion.client.experiment.ExperimentClient.workon`, which
mostly creates a :py:class:`orion.client.runner.Runner` and calls its
:py:meth:`orion.client.runner.Runner.run` method.


The Run Loop
------------

We are finally in the main run loop. It is composed of three main
phases that repeat.


The first phase
~~~~~~~~~~~~~~~

In the first phase we call
:py:meth:`orion.client.runner.Runner.sample`. This will check if new
trials are required using
:py:meth:`orion.client.runner.Runner.should_sample` and request those
trials using :py:meth:`orion.client.experiment.ExperimentClient.suggest`.

The ``suggest`` call first checks whether any trials are available in
the storage using
:py:meth:`orion.core.worker.experiment.Experiment.reserve_trial`.

If none are available, it will produce new trials using
:py:meth:`orion.core.worker.producer.Producer.produce`, which loads
the state of the algorithm from the storage, runs it to suggest new
:py:class:`orion.core.worker.trial.Trial` objects and saves both the
new trials and the new algorithm state to the storage. This is
protected from concurrent access by other instances of ``orion hunt``
by locking the storage for the duration of that operation.
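
A minimal sketch of this reserve-or-produce pattern is shown here. It
uses an in-process ``threading.Lock`` in place of the storage lock, and a
counter in place of a real algorithm; ``ToyTrialStore`` and the function
bodies are assumptions, not Orion's implementation.

```python
import threading


class ToyTrialStore:
    """In-memory stand-in for the storage (illustration only)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pending = []  # trials waiting to be reserved
        self.algo_state = {"n_suggested": 0}


def produce(store, pool_size):
    """Load the algorithm state, suggest trials, save both under one lock."""
    with store.lock:
        state = dict(store.algo_state)  # load a copy of the algorithm state
        new_trials = []
        for _ in range(pool_size):
            state["n_suggested"] += 1  # toy "algorithm": just count
            new_trials.append({"id": state["n_suggested"]})
        store.pending.extend(new_trials)  # save the new trials
        store.algo_state = state  # save the new algorithm state
    return len(new_trials)


def reserve_trial(store):
    """Take one pending trial, or return None if there is none."""
    with store.lock:
        return store.pending.pop(0) if store.pending else None


def suggest(store, pool_size=2):
    """Reserve an existing trial, producing new ones only when needed."""
    trial = reserve_trial(store)
    if trial is None:
        produce(store, pool_size)
        trial = reserve_trial(store)
    return trial
```

Holding the lock across load, suggest and save is what prevents two
producers from starting from the same algorithm state.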


The second phase
~~~~~~~~~~~~~~~~

In the second phase we call
:py:meth:`orion.client.runner.Runner.scatter` with the trials
generated in the first phase, if any.

This schedules each trial to be executed using the configured executor
and registers the futures that the executor returns. Execution is
handled asynchronously and the futures enable us to keep track of the
state of the trials.
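
The scatter step can be illustrated with the standard library's
``concurrent.futures``, used here as a stand-in for the configured
executor. ``run_trial`` and the ``pending`` dictionary are hypothetical
names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_trial(trial):
    """Stand-in for the user script: returns an objective value."""
    return (trial["x"] - 1.0) ** 2


def scatter(executor, trials, pending):
    """Submit each trial and remember which future belongs to which trial."""
    for trial in trials:
        future = executor.submit(run_trial, trial)
        pending[future] = trial  # register the future
    return pending
```

Submitting returns immediately; the futures are the only handle we keep
on the trials while they run.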


The third phase
~~~~~~~~~~~~~~~

In the third phase we call
:py:meth:`orion.client.runner.Runner.gather` which will wait on all
currently registered futures with a timeout to get some results.

Once we get those results we de-register the futures and record the
results with
:py:meth:`orion.client.experiment.ExperimentClient.observe` or update
the count of broken trials if they did not finish successfully.

Finally we monitor the total amount of time spent waiting for trials
to finish.
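
The gather step can be sketched in the same spirit, again with
``concurrent.futures`` standing in for the configured executor. The
``pending`` and ``observed`` containers and the returned broken count are
assumptions of this sketch.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def gather(pending, observed, timeout=1.0):
    """Wait up to ``timeout`` seconds, then record the finished trials.

    Returns the number of trials that raised instead of returning a result.
    """
    done, _ = wait(list(pending), timeout=timeout, return_when=FIRST_COMPLETED)
    broken = 0
    for future in done:
        trial = pending.pop(future)  # de-register the future
        if future.exception() is not None:
            broken += 1  # the trial did not finish successfully
        else:
            observed.append((trial, future.result()))  # record the result
    return broken
```

If nothing completes before the timeout, ``done`` is empty and the call
returns without touching the registered futures.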


Stopping criteria
~~~~~~~~~~~~~~~~~

There are multiple criteria that are monitored to stop the
experiment.

The first obvious one is the configured maximum number of trials to
run. If this is reached, then we stop running more. This is checked at
the beginning of the loop with
:py:attr:`orion.client.runner.Runner.is_running`.

The experiment can also stop if too many trials fail, either because
they fail to start, crash, are killed (for example by an external job
scheduler) or take too much time to complete. This is checked in
:py:meth:`orion.client.runner.Runner.gather` with
:py:attr:`orion.client.runner.Runner.is_broken`.

There are 2 levels of max_trials/max_broken. One is at the level of the experiment: if we reach either max_trials or max_broken, all Runners will stop. The other is at the level of the Runner (under the config name worker, which is a bit confusing since the introduction of the Runner, which now controls multiple workers). If max_trials or max_broken is reached within the execution of this Runner, it will stop, but the other Runners working on the same experiment may continue.

See for instance in doc:
https://orion.readthedocs.io/en/stable/user/config.html#max-trials
vs
https://orion.readthedocs.io/en/stable/user/config.html#config-worker-max-trials


If one of the workers returns an unexpected result, the experiment is
also stopped immediately because it is assumed that something is wrong
with either the code or the configuration and spending more time
computing will not fix it. This is also checked for in
:py:meth:`orion.client.runner.Runner.gather`.

Finally, if the loop spends too much time waiting and nothing happens,
the experiment is considered stalled and will also stop. This is
checked at the end of :py:meth:`orion.client.runner.Runner.run`.
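
The criteria above can be summarised in a small sketch. The field names,
the ``>=`` comparisons and the order of the checks are assumptions; the
real checks are spread over ``is_running``, ``is_broken``, ``gather`` and
``run``.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    """Hypothetical bookkeeping for one run loop (illustration only)."""

    max_trials: int
    max_broken: int
    idle_timeout: float
    completed: int = 0
    broken: int = 0
    idle_time: float = 0.0
    unexpected_result: bool = False


def stop_reason(state):
    """Return the name of the first criterion that is met, or None."""
    if state.completed >= state.max_trials:
        return "max_trials"
    if state.broken >= state.max_broken:
        return "max_broken"
    if state.unexpected_result:
        return "unexpected_result"
    if state.idle_time >= state.idle_timeout:
        return "stalled"
    return None
```

A loop built on this sketch would keep iterating over the three phases
while ``stop_reason(state)`` returns ``None``.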
1 change: 1 addition & 0 deletions docs/src/index.rst
@@ -67,6 +67,7 @@
developer/overview
developer/installing
developer/standards
developer/plan
developer/testing
developer/documenting
developer/ci