Support for streaming (Kafka) datasource #4849
Replies: 8 comments 1 reply
-
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
-
I raised this feature request last year because it was needed for a work-related project. We later implemented an in-house custom solution in Kafka Streams using the great_expectations core package; unfortunately, we were not able to open-source it. @dingobar can you also share the functional use case for which you think this feature would be a fit? @jcampbell @abegong do you think this is still a useful and needed feature? If yes, I can contribute to its development!
-
This is a common question from members of the Great Expectations community. We'd love to see more ideas on how this could be implemented. Currently, it's not at the top of the backlog for the core team, but community ideas and contributions would be welcome.
-
@bhcastleton In my opinion, before going for implementation, we should be on the same page about the outcome/use case we are trying to achieve. Can you please help with a specific problem statement that other community members are looking to solve? @dingobar On the same note, can you be more specific about the end-to-end use case? For instance, if we are able to test streaming data in real time, what is the business value you are aiming for? What will the feedback loop be? Let me share my experience. We implemented this internally for a use case a year back, using the GE library in a Kafka Streams app. It was very cool to publish quality metrics in real time, and it helped consumers assess the quality of a streaming dataset. Over time, though, one hard reality became evident: because it was immutable streaming data, it was hard for the producer or consumer to react to any quality issue or fix it. In short, it just became a monitoring tool that raises alerts when the quality of a dataset drops below a benchmark (see the sketch below).
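A minimal sketch of the monitoring pattern described above, using the classic (pre-1.0) great_expectations Pandas API. The benchmark value, the example expectation, and the alert hook are illustrative assumptions, not part of the original implementation:

```python
import great_expectations as ge
import pandas as pd

QUALITY_BENCHMARK = 95.0  # illustrative benchmark, in percent

# One micro-batch of streaming records, already mapped to a DataFrame.
batch = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, None]}))
batch.expect_column_values_to_not_be_null("user_id")  # example expectation
result = batch.validate()

# Publish/alert on the metric instead of mutating the immutable stream.
score = result["statistics"]["success_percent"]
if score < QUALITY_BENCHMARK:
    print(f"ALERT: quality {score:.1f}% below benchmark {QUALITY_BENCHMARK}%")
```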
-
Hi @shakti-garg, thanks for sharing your experience. Part of our stream governance mandates that we be able to assess the quality of our data at each touchpoint of our data systems. I agree that, Kafka being an immutable distributed log store, we can't change the original data; on the flip side, our compliance does mandate auditable means of stream data history, measurable metrics around data quality, and lineage gathering. I must say that being able to gauge the quality of data in motion, and having ways to actually park undesired data into some set of buckets, would be desirable. GE may cover only a subset of these goals as a framework, but it would still be useful to have it implemented as an injectable library for Kafka.
-
Hey folks! This would be a super useful feature from great_expectations in our case. I think adding an integration with a Kafka source topic would be great, because it would make it easier to use great_expectations as a data quality filter that keeps unexpected data out of your lake. In my case, I would use it as one of the steps of my ETL ingestion, with some basic expectations: records that fail them would not go to the data lake but to a different topic, where they raise an alert and can be checked later (a sketch of this routing follows). Is there any chance that this topic gets reopened? @shakti-garg
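A minimal sketch of the routing step described above, assuming the kafka-python client and the classic (pre-1.0) great_expectations Pandas API. The topic names, bootstrap server, and the single expectation are illustrative placeholders:

```python
import json

import great_expectations as ge
import pandas as pd
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ingest-topic",  # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    batch = ge.from_pandas(pd.DataFrame([record]))
    batch.expect_column_values_to_not_be_null("user_id")  # example expectation
    result = batch.validate()
    # Route failing records to a quarantine topic instead of the lake.
    target = "lake-topic" if result.success else "quarantine-topic"
    producer.send(target, record)
```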
-
Hello Team |
-
Hi everyone, my name is Bobur and I am a Developer Advocate for GlassFlow. I'd like to jump into this engaging discussion, which seems quite relevant to my recent experiments. Lately I have been playing with the GX OSS Python SDK and how it can work with real-time data using Python stream-processing frameworks such as GlassFlow. From my understanding, you do not need a special data source connector for Kafka topics; you can do real-time data quality checks with Great Expectations at the transformation layer of the Python framework. As each real-time event comes in, you map it to an in-memory Data Asset such as a Pandas DataFrame, load your expectations from GX Cloud, run the validation, and send the results to downstream services (see the sketch below). The framework itself can connect to Kafka topics automatically. Let me know if you need my help to prove this concept. @martimors @shakti-garg @bhcastleton
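A framework-agnostic sketch of the per-event step described above: map an incoming event to an in-memory Pandas asset, validate it, and return the outcome so the stream processor can forward it downstream. The handler signature is hypothetical (a framework like GlassFlow would call something similar from its transformation layer), and the inline expectation stands in for a suite loaded from GX Cloud:

```python
import great_expectations as ge
import pandas as pd

def validate_event(event: dict) -> dict:
    """Validate one streaming event and report whether it passed."""
    asset = ge.from_pandas(pd.DataFrame([event]))
    asset.expect_column_values_to_not_be_null("user_id")  # example expectation
    result = asset.validate()
    return {"event": event, "valid": bool(result.success)}

# Example: a single event flowing through the transformation step.
print(validate_event({"user_id": 42, "amount": 9.99}))
```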
-
Is your feature request related to a problem? Please describe.
Some of our teams publish their data to Kafka topics. We want to run data quality validations over them to compute quality metrics at run time, which can then be aggregated over time periods to provide a historical view.
Describe the solution you'd like
Extend the data source abstraction with a Kafka-specific data source.