Add support for virtual env directory flag #611
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
- Coverage   93.28%   93.06%   -0.23%
==========================================
  Files          55       54       -1
  Lines        2502     2163     -339
==========================================
- Hits         2334     2013     -321
+ Misses        168      150      -18
```
Thank you very much for creating this PR so quickly, @LennartKloppenburg ! This is looking very good.
I added some comments inline, and I have a gut feeling we may need to add some additional tests to cover the possible behaviours in `_get_or_create_venv_py_interpreter`.
We can aim to release this change as part of 1.2.1 (if we consider it a bugfix) or 1.3 (if we consider it a new feature) 🎉
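Regarding the additional tests mentioned above, the behaviours worth covering are roughly "interpreter already present, so reuse it" and "interpreter missing, so provision a new virtualenv". A hedged, hypothetical sketch of such tests (using a stand-in function rather than the real operator, whose constructor arguments are not reproduced here):

```python
from pathlib import Path
from typing import Callable


# Hypothetical stand-in for the interpreter-lookup behaviour under test.
def get_or_create_interpreter(venv_dir: Path, create: Callable[[Path], None]) -> str:
    py = venv_dir / "bin" / "python"
    if py.is_file():
        return str(py)
    create(venv_dir)
    return str(py)


def test_reuses_existing_interpreter(tmp_path):
    (tmp_path / "bin").mkdir()
    (tmp_path / "bin" / "python").touch()
    created = []
    assert get_or_create_interpreter(tmp_path, created.append) == str(tmp_path / "bin" / "python")
    assert created == []  # no new virtualenv was provisioned


def test_creates_missing_interpreter(tmp_path):
    created = []
    get_or_create_interpreter(tmp_path, created.append)
    assert created == [tmp_path]  # provisioning was triggered exactly once
```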
cosmos/operators/virtualenv.py (outdated)
self.log.info(f"Checking for venv interpreter: {py_interpreter_path} : {py_interpreter_path.is_file()}") | ||
if py_interpreter_path.is_file(): | ||
self.log.info(f"Found Python interpreter in cached virtualenv: `{str(py_interpreter_path)}`") | ||
return str(py_interpreter_path) |
In this case, does it still make sense to install or update any potential dependencies, if there were requirement changes?
Yeah, ideally we'd be able to clean up the virtual env after the DAG run, but for the reasons you mentioned before, this can be tricky.
One way to perhaps invalidate the virtualenv is to check when it was created and, after say 24 or 48 hours, have this operator clean it up and recreate it?
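As a rough sketch of that idea (a hypothetical helper, not part of this PR), the operator could compare the virtualenv directory's age against a cutoff and remove it when it is too old, so the next run recreates it:

```python
import shutil
import time
from pathlib import Path

# Hypothetical time-based invalidation; the 24-hour cutoff is an assumption.
MAX_VENV_AGE_SECONDS = 24 * 60 * 60


def venv_is_stale(venv_dir: Path, max_age: float = MAX_VENV_AGE_SECONDS) -> bool:
    """Return True if the cached virtualenv is older than max_age seconds."""
    if not venv_dir.exists():
        return False
    age = time.time() - venv_dir.stat().st_mtime
    return age > max_age


def invalidate_if_stale(venv_dir: Path) -> None:
    """Delete a stale cached virtualenv so it gets recreated on the next run."""
    if venv_is_stale(venv_dir):
        shutil.rmtree(venv_dir)
```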
The time-based approach could lead to some strange scenarios and be tricky to troubleshoot. How feasible would it be for us to run a `pip install` in an existing virtualenv? It should be very quick if it was previously set up, and it would make the operator reliable.
Regarding the cleanup - I know - ideally we'd be able to set up the venv only once during the DAG setup and delete it during teardown. Unfortunately - to my knowledge - even the latest Airflow (2.7) does not allow us to have a setup/teardown per worker node during the lifecycle of a DAG. But this can be an improvement for the future - in a separate PR!
Sorry for the late response here!
So the underlying `prepare_virtualenv` that we are "avoiding" after determining it's already there is imported from Airflow core (`airflow.utils.python_virtualenv`). That little helper also takes the Python requirements into account, so if we bypass it, we can't inject requirements unless we repeat the logic over here:
```python
...
pip_cmd = None
if requirements is not None and len(requirements) != 0:
    pip_cmd = _generate_pip_install_cmd_from_list(venv_directory, requirements, pip_install_options)
if requirements_file_path is not None and requirements_file_path:
    pip_cmd = _generate_pip_install_cmd_from_file(
        venv_directory, requirements_file_path, pip_install_options
    )
if pip_cmd:
    execute_in_subprocess(pip_cmd)
...
```
What do you think?
In our case, we probably would only need to do part of the logic:
```python
if requirements is not None and len(requirements) != 0:
    pip_cmd = _generate_pip_install_cmd_from_list(venv_directory, requirements, pip_install_options)
```
Since we don't support `requirements_file_path`.
If we don't want to add this call unnecessarily, we'd probably need a `pip freeze` call to confirm whether the desired dependencies are already installed, which may be more work.
We probably need one or both of these. Otherwise, we risk an Airflow worker having partial or outdated dependencies that are incompatible with the dependencies the user requested.
I'm in favour of caching for performance reasons, but we should still aim to keep the task idempotent.
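A rough sketch of the `pip freeze` idea (hypothetical helper functions, not part of this PR): compare the requested requirements against what the cached virtualenv already has before re-running the install.

```python
import subprocess
from pathlib import Path

# Hypothetical sketch: only reinstall when the requested requirements are not
# already present in the cached virtualenv.


def installed_packages(venv_directory: str) -> set:
    """Return the normalized `name==version` lines reported by the venv's pip."""
    pip_path = str(Path(venv_directory) / "bin" / "pip")
    output = subprocess.check_output([pip_path, "freeze"], text=True)
    return {line.strip().lower() for line in output.splitlines() if line.strip()}


def requirements_already_satisfied(venv_directory: str, requirements: list) -> bool:
    """Naive check: every pinned requirement appears verbatim in `pip freeze` output."""
    frozen = installed_packages(venv_directory)
    return all(req.strip().lower() in frozen for req in requirements)
```

A naive check like this only works for fully pinned requirements (`package==1.2.3`); anything looser would still need a real `pip install` run, which is probably why unconditionally re-running the install is the simpler option.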
```python
    **kwargs: Any,
) -> None:
    self.py_requirements = py_requirements or []
    self.py_system_site_packages = py_system_site_packages
    super().__init__(**kwargs)
    self._venv_dir = virtualenv_dir
    self._venv_tmp_dir: None | TemporaryDirectory[str] = None

@cached_property
```
A general thought: do we still want to cache this property? Is there any risk that we could end up caching the incorrect path?
How is this property cached? If people are debugging or want to pass in more dynamically configured directories, I don't know how this decorator behaves :) Is it per `task_id`, per `dagrun_id`, or is it more persistent?
The property is cached while the Python process is alive.
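To illustrate what that means in practice, here is a minimal, self-contained sketch (a toy class, not the operator itself) of how the decorator behaves, assuming it is the standard-library `functools.cached_property`: the value is computed once per instance and then reused for as long as that instance's Python process is alive.

```python
from functools import cached_property


class VenvPathHolder:
    """Toy example only; not the operator's actual code."""

    calls = 0

    @cached_property
    def venv_path(self) -> str:
        # The body runs only on the first access for a given instance.
        VenvPathHolder.calls += 1
        return f"/tmp/venv-{VenvPathHolder.calls}"


holder = VenvPathHolder()
assert holder.venv_path == holder.venv_path  # second access returns the cached value
assert VenvPathHolder.calls == 1             # the property body ran exactly once

fresh = VenvPathHolder()                     # a new instance gets its own cache
assert fresh.venv_path != holder.venv_path
assert VenvPathHolder.calls == 2
```

In Airflow, each task execution typically constructs a fresh operator instance in its own worker process, so the cached value should not leak across DAG runs; it is only reused across repeated accesses within the same task execution.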
@tatiana I've updated the PR with some changes you've requested :) One lingering issue: The issue can then be resolved by retrying these tasks with some retry_delay, which will achieve the same result as "waiting" for the virtual env to be provisioned would have done.
If `execution_config` was reused, Cosmos 1.2.2 would raise:

```
astronomer-cosmos/dags/basic_cosmos_task_group.py
Traceback (most recent call last):
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/venv-38/lib/python3.8/site-packages/airflow/models/dagbag.py", line 343, in parse
    loader.exec_module(new_module)
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/dags/basic_cosmos_task_group.py", line 74, in <module>
    basic_cosmos_task_group()
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/venv-38/lib/python3.8/site-packages/airflow/models/dag.py", line 3817, in factory
    f(**f_kwargs)
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/dags/basic_cosmos_task_group.py", line 54, in basic_cosmos_task_group
    orders = DbtTaskGroup(
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/cosmos/airflow/task_group.py", line 26, in __init__
    DbtToAirflowConverter.__init__(self, *args, **specific_kwargs(**kwargs))
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/cosmos/converter.py", line 113, in __init__
    raise CosmosValueError(
cosmos.exceptions.CosmosValueError: ProjectConfig.dbt_project_path is mutually exclusive with RenderConfig.dbt_project_path and ExecutionConfig.dbt_project_path. If using RenderConfig.dbt_project_path or ExecutionConfig.dbt_project_path, ProjectConfig.dbt_project_path should be None
```

This has been raised by an Astro customer and our field engineer, who tried to run: https://github.com/astronomer/cosmos-demo
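For context, a hedged sketch of a configuration that satisfies the constraint in that error message (class names taken from the traceback; exact constructor arguments may vary between Cosmos versions): set `dbt_project_path` in exactly one place.

```python
from cosmos import ExecutionConfig, ProjectConfig

# Sketch only: dbt_project_path is set on ProjectConfig and nowhere else,
# so ProjectConfig, RenderConfig and ExecutionConfig no longer conflict.
project_config = ProjectConfig(
    dbt_project_path="/usr/local/airflow/dags/dbt/jaffle_shop",  # hypothetical path
)
execution_config = ExecutionConfig()  # deliberately no dbt_project_path here
```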
@tatiana Just completed the rebase, saw some artifacts that trip up the tests, will look at those tomorrow :) !
@tatiana When I run the tests locally they pass; maybe I missed something while rebasing? I rebased so much that I no longer know where it was introduced :D
Hi @LennartKloppenburg ! I'm sorry for the massive delay, I've been working on other projects and it has been hard to keep up with everything. I'm planning to get back to this PR next week, so we can try to release it as part of Cosmos 1.5
Hi @LennartKloppenburg ! I'm very sorry for the very long delay.
If you are happy with the proposed changes, please feel free to incorporate them into your PR.
## Description

Added `virtualenv_dir` as an option to `ExecutionConfig`, which is then propagated downstream to `DbtVirtualenvBaseOperator`. The following now happens:

- If the flag is set, the operator will attempt to locate the `venv`'s `python` binary under the provided `virtualenv_dir`.
  - If found, it will conclude that the `venv` exists and continue without creating a new one.
  - If not, it will create a new one at `virtualenv_dir`.
- If the flag is not set, it simply continues using the temporary directory solution that was already in place.

## Impact

A very basic test using a local `docker compose` set-up as per the contribution guide and the [example_virtualenv](https://github.com/astronomer/astronomer-cosmos/blob/main/dev/dags/example_virtualenv.py) DAG saw the DAG's runtime go down from **2m31s** to just **32s**. I'd expect this improvement to be even more noticeable with more complex graphs and more Python requirements.

## Related Issue(s)

Closes: #610
Partially solves: #1042
Follow-up ticket: #1157

## Breaking Change?

None, the flag is optional and is ignored (with a [warning](https://github.com/astronomer/astronomer-cosmos/compare/main...LennartKloppenburg:astronomer-cosmos:feature/cache-virtualenv?expand=1#diff-61b585fb903927b6868b9626c95e0ec47e3818eb477d795ebd13b0276d4fd76cR125)) when used outside of `VirtualEnv` execution mode.

## Important notice

Most of the changes in this PR were originally implemented in PR #611 by @LennartKloppenburg. It became stale over the last few months due to limited maintainer availability. Our sincere apologies to the original author. What was accomplished since:

1. Rebased
2. Fixed conflicts
3. Fixed failing tests
4. Introduced new tests

Co-authored-by: Lennart Kloppenburg <[email protected]>
We took this to completion in #1079, giving credit to @LennartKloppenburg and this original PR.
Description
Added `virtualenv_dir` as an option to `ExecutionConfig`, which is then propagated downstream to `DbtVirtualenvBaseOperator`. The following now happens:
- If the flag is set, the operator will attempt to locate the `venv`'s `python` binary under the provided `virtualenv_dir`.
  - If found, it will conclude that the `venv` exists and continue without creating a new one.
  - If not, it will create a new one at `virtualenv_dir`.
- If the flag is not set, it simply continues using the temporary directory solution that was already in place (see the usage sketch after this list).
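For illustration, a hedged usage sketch of the flag (naming as described above and as merged via the follow-up PR #1079; the exact signature may differ between Cosmos versions):

```python
from pathlib import Path

from cosmos import ExecutionConfig
from cosmos.constants import ExecutionMode

# Sketch only: point Cosmos at a persistent virtualenv directory so the
# interpreter (and installed py_requirements) can be reused between runs.
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.VIRTUALENV,
    virtualenv_dir=Path("/opt/airflow/cosmos_venv"),  # hypothetical location on the worker
)
```

With this set, the operator looks for `<virtualenv_dir>/bin/python` first and only provisions a new environment when that interpreter is missing, as described above.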
Impact
A very basic test using a local `docker compose` set-up as per the contribution guide and the example_virtualenv DAG saw the DAG's runtime go down from 2m31s to just 32s. I'd expect this improvement to be even more noticeable with more complex graphs and more Python requirements.
Related Issue(s)
Implements #610
Breaking Change?
None, the flag is optional and is ignored (with a warning) when used outside of `VirtualEnv` execution mode.
Checklist