Runtime Data Sources #10289

ramananayak · 2024-04-02T15:33:20Z

Describe the bug
I want to add a fluent datasource at runtime, after a FileDataContext is already defined.
context.fluent_datasources is a dictionary, but when I add a new fluent datasource to it, the entry does not appear in the existing dictionary.
Whereas it works with context.datasources.

To Reproduce

import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = gx.datasource.fluent.PostgresDatasource(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)

# Running the line below does not update the dictionary
context.fluent_datasources[runtime_datasource.name] = runtime_datasource

# whereas running the line below updates it properly, and the datasource also shows up in fluent_datasources
context.datasources[runtime_datasource.name] = runtime_datasource

Expected behavior
context.fluent_datasources should show the added runtime_datasource inside the dictionary.

Environment (please complete the following information):

Operating System: MacOS
Great Expectations Version: 0.18.12
Data Source: Redshift
Cloud environment: AWS

Kilo59 · 2024-04-11T12:43:59Z

@ramananayak
Sorry for the confusion. This happens because the context.fluent_datasources property is just a dictionary comprehension over context.datasources with all non-fluent datasources filtered out, so it returns a new dictionary on every access.

We could alter the return type annotation to an immutable Mapping[str, FluentDatasource] to help with this. But it wouldn't alter runtime behavior, and you'd have to rely on a type checker or IDE to warn about it being immutable.

great_expectations/great_expectations/data_context/data_context/abstract_data_context.py

Lines 4375 to 4381 in abcf671

    @property
    def fluent_datasources(self) -> Dict[str, FluentDatasource]:
        return {
            name: ds
            for (name, ds) in self.datasources.items()
            if isinstance(ds, FluentDatasource)
        }
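
To make the behavior concrete, here is a minimal sketch that uses toy stand-ins rather than the real GX classes. `ToyContext` and `FluentLike` are hypothetical names invented for this illustration; only the shape of the property mirrors the snippet above. It uses the `Mapping` annotation discussed earlier, which a type checker would flag on assignment but which changes nothing at runtime.

```python
from typing import Dict, Mapping


class FluentLike:
    """Hypothetical stand-in for FluentDatasource."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToyContext:
    """Hypothetical stand-in for a data context; not the real GX class."""

    def __init__(self) -> None:
        # The only real storage: every datasource lives here.
        self.datasources: Dict[str, object] = {}

    @property
    def fluent_datasources(self) -> Mapping[str, object]:
        # A new dict is built on every access, so writes to it are discarded.
        return {
            name: ds
            for name, ds in self.datasources.items()
            if isinstance(ds, FluentLike)
        }


ctx = ToyContext()
ds = FluentLike("ds_runtime")

# Writing to the derived view only mutates a throwaway dict.
ctx.fluent_datasources[ds.name] = ds
lost = "ds_runtime" in ctx.fluent_datasources

# Writing to the backing store is reflected in the derived view.
ctx.datasources[ds.name] = ds
kept = "ds_runtime" in ctx.fluent_datasources

print(lost, kept)  # prints "False True"
```

The first assignment succeeds without error, which is why the problem is easy to miss: the entry is added to a temporary dictionary that is garbage-collected right away.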

The idiomatic way to add or update a datasource is to use one of the context.sources.add_or_update_<DATASOURCE_TYPE>() methods. These methods also bootstrap the datasource with the components needed to do config substitution and to connect certain datasources to things like S3, GCS, or Databricks.

import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

ramananayak (Author) · 2024-04-17T14:04:16Z

Thanks for the clarification, @Kilo59.
I tried context.sources.add_postgres().
But for a FileDataContext this ends up updating the context file (great_expectations.yml) with the connection string details that I pass as a variable in my code.
This defeats the purpose of being runtime. Also, because of the write lock on the context file, multiple checks running against the same config will fail. I want this source to be used only at runtime, without affecting great_expectations.yml.

I did some investigation and saw that for FileDataContext the context file is opened in "w" mode:

great_expectations/great_expectations/data_context/data_context/file_data_context.py

Line 168 in abcf671

with open(config_filepath, "w") as outfile:

So is there any way to add configuration for true runtime use without changing the context file every time?

It is the same with data assets: I don't see any example showing how to create a runtime data asset. I am currently testing with a fluent datasource, and all the methods just keep adding data assets to the context file, so the config file keeps growing.
In 0.17.1, the config below would have created a runtime data asset without any update to the context file. For reference:

validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite

I don't really know how I can achieve this in the latest version.
Thanks for your help!

Kilo59 · 2024-04-18T13:23:38Z

@ramananayak
I don't think this is exactly what you are looking for, but you can use an EphemeralDataContext, which doesn't persist anything.

import great_expectations as gx
context = gx.get_context(mode="ephemeral")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

The code above ☝️ should work, but you won't have access to your file-backed checkpoints, expectations, etc.
You would need to modify the code to pull in those items.

I will pass this along to our team working on the v1.0 release (and any other feedback you have).

Kilo59 · 2024-04-18T13:49:09Z

There's a somewhat related issue where a user is creating an ephemeral context from a file context but is unable to load the fluent configs.
For you, this shouldn't be a problem, though.

EphemeralDataContext does not load fluent datasources correctly #9283

Updated example that should allow your ephemeral context to pull in the project config from your file context.

import great_expectations as gx

# Create two different contexts using THE SAME config
file_ctx = gx.get_context(mode="file")
ephm_ctx = gx.get_context(mode="ephemeral", project_config=file_ctx.config)

connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = ephm_ctx.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

ramananayak (Author) · 2024-04-19T13:48:32Z

Hi @Kilo59
Thanks for sharing this. Yes, as you mentioned, I tried the ephemeral context and it looks like it will work.

Here is my version:

import yaml
import great_expectations as gx
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import EphemeralDataContext

# Directory that contains the existing great_expectations.yml file
context_root_dir = "path to the directory containing my initial great_expectations.yml file"
with open(context_root_dir + "/great_expectations.yml", "r") as file:
    conf = yaml.safe_load(file)

# Build an in-memory context from the file-based project config
context_config = DataContextConfig(**conf)
ephm_ctx = EphemeralDataContext(project_config=context_config)

connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = ephm_ctx.sources.add_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)

print(repr(runtime_datasource))

This is working, although I have to specify the complete path for all the respective GX directories (like the plugin directory), but that's understood.

But as I mentioned above, 0.17.11 supported RuntimeBatchRequest, where I could define the datasource, data asset, and runtime query as part of the checkpoint. I can see it is also in the 0.18.9 documentation, but I am not able to get it working and am struggling with this.
https://docs.greatexpectations.io/docs/reference/api/core/batch/RuntimeBatchRequest_class
For example, the following works perfectly in 0.17.11:

validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite

Is this supported in the latest GX, or do I have to create the data asset separately, outside of the checkpoint, for the input query and then call the checkpoint as part of validation?
Is there any way to add the datasource and query as part of the checkpoint?

This is a really helpful feature for us, as we keep all the respective queries as part of the checkpoint, so they stay separate and it is easy to identify the data asset and its expectations together.

Thanks!

ramananayak (Author) · 2024-05-01T15:12:37Z

Hi @Kilo59
Do you have any information about how we can set up this type of data asset config (the one in my previous comment) in the latest GX version? In the new GX version, do we have to create a data asset first for every query and then add the required checkpoint? So is there no way to create a data asset at runtime?

If you have any ideas on this, some pointers would really help.

Thanks!
Ram

Kilo59 · 2024-05-01T15:42:48Z

@ramananayak any workflow from 0.17 should still work in 0.18.

I think the issue is that the new "Fluent Style" datasources (datasources created using the context.sources.add_<TYPE>() methods) do not support declaring queries as part of the batch request.

The documentation for the old "Block Style" datasources is no longer part of our latest version. You'll have to refer to the 0.15 docs.

You can continue to use the old "Block Style" datasources, or you can create a QueryAsset.

runtime_datasource = ephm_ctx.sources.add_postgres(
  name="ds_runtime", 
  connection_string=connection_string, 
  create_temp_table=True
)

my_query_asset = runtime_datasource.add_query_asset(name="my_query", query="select column 1 from table")

batch_request = my_query_asset.build_batch_request()

# pass batch_request to your checkpoint

Does the QueryAsset with an ephemeral context meet your needs, or are you still wanting something different?
We are actively working on 1.0 and this kind of feedback is invaluable.

ramananayak (Author) · 2024-05-16T07:21:56Z

Hi @Kilo59, thanks for your response.
If I understand you correctly, block-style datasource config is not supported in the latest version,
and I assume support for the older 0.18 and 0.17 versions will end once the next two versions are released.

Now I understand that QueryAsset is the only way to go. I think I may have to write custom code to support runtime queries from config.
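
A rough sketch of what such custom code might look like, assuming the config keeps the same shape as the validations block shown earlier in this thread. `resolve_validations` is a hypothetical helper, not part of GX; the only GX calls it relies on are `add_query_asset` and `build_batch_request`, as used in the example above. The datasource is passed in, so nothing here touches a database until the function is called.

```python
from typing import Any, Dict, List

# Same content as the YAML validations block, written as a Python literal.
validations: List[Dict[str, Any]] = [
    {
        "batch_request": {
            "data_asset_name": "runtime_asset",
            "runtime_parameters": {"query": "select column 1 from table"},
        },
        "expectation_suite_name": "appstat_suite",
    }
]


def resolve_validations(
    datasource: Any, entries: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Turn each config entry into a query asset plus its batch request.

    Hypothetical helper: `datasource` is expected to be a fluent SQL
    datasource exposing add_query_asset(name=..., query=...).
    """
    resolved = []
    for entry in entries:
        request_cfg = entry["batch_request"]
        asset = datasource.add_query_asset(
            name=request_cfg["data_asset_name"],
            query=request_cfg["runtime_parameters"]["query"],
        )
        resolved.append(
            {
                "batch_request": asset.build_batch_request(),
                "expectation_suite_name": entry["expectation_suite_name"],
            }
        )
    return resolved
```

With the ephemeral context from the earlier comments, this would be called as `resolve_validations(runtime_datasource, validations)`, and the assets it creates would exist only in memory, so great_expectations.yml should stay untouched.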

But I think runtime query config is a nice feature to have, because we have a lot of checks that a user (say, an analyst) sets up in the form of config, and all we do is wrap that config in an Airflow scheduler that runs the checks.
This lets us automate the whole flow through a config-driven framework.
Now that everything is becoming a first-class object, automating the whole flow with multiple checks in a single input will add much more friction and will only let users add a single check at a time.

As far as I know, a lot of people use this method to add multiple checks at once. Also, moving everything into a config file (in the case of FileDataContext) makes the config file very bulky, with a lot of unnecessary configs added to the context.
Hope this makes sense.
Thank you so much again!

jcampbell (Maintainer) · 2024-05-17T19:03:17Z

@ramananayak -- in your case, are you expecting to be able to use the validation results that come from these runtime assets at any time other than the immediate validation? We designed runtime assets to mean that the data would be available/provided at runtime, but the asset configuration itself was durable. The intent of that approach was to ensure that saved validation results could be identified by the asset's (durable) name. It sounds to me like that may be the gap, in that you're not looking to have the configuration of the asset persist at all.

I'd love to jump on a call with you and @Kilo59 if you'd like to make sure I understand the case fully, since we've recently been looking at the question of how to support runtime cases more clearly.
