Support for streaming (Kafka) datasource #4849
Replies: 8 comments 1 reply
-
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
-
I raised this feature request last year because it was needed for a work-related project. We later implemented an in-house custom solution in Kafka Streams using the great_expectations core package; unfortunately, we were not able to open-source it. @dingobar can you also share the functional use case for which you think this feature would be a fit? @jcampbell @abegong do you think this is still a useful and needed feature? If yes, I can contribute to its development!
-
This is a common question from members of the Great Expectations community. We'd love to see more ideas on how this could be implemented. Currently, it's not at the top of the backlog for the core team, but community ideas and contributions would be welcome.
-
@bhcastleton In my opinion, before going for implementation, we should be on the same page about the outcome/use case we are trying to achieve. Can you please help with a specific problem statement that other community members are looking to solve? @dingobar On the same note, can you be more specific about the end-to-end use case? For instance, if we are able to test streaming data in real time, what is the business value you are aiming for? What will the feedback loop be? Let me share my experience. We implemented this internally for a use case a year back, using the GE library in a Kafka Streams app. It was very cool to publish quality metrics in real time, and it helped consumers assess the quality of a streaming dataset. Over time, though, one hard reality became evident: because it was immutable streaming data, it was hard for the producer or consumer to react to any quality issue or fix it. In short, it just became a monitoring tool that raises alerts when the quality of a dataset drops below a benchmark (see the sketch below).
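A minimal sketch of the monitoring pattern described above, using the classic (pre-1.0) great_expectations Pandas API. The benchmark value, the example expectation, and the alert hook are illustrative assumptions, not part of the original implementation:

```python
import great_expectations as ge
import pandas as pd

QUALITY_BENCHMARK = 95.0  # illustrative benchmark, in percent

# One micro-batch of streaming records, already mapped to a DataFrame.
batch = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, None]}))
batch.expect_column_values_to_not_be_null("user_id")  # example expectation
result = batch.validate()

# Publish/alert on the metric instead of mutating the immutable stream.
score = result["statistics"]["success_percent"]
if score < QUALITY_BENCHMARK:
    print(f"ALERT: quality {score:.1f}% below benchmark {QUALITY_BENCHMARK}%")
```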
-
Hi @shakti-garg, thanks for sharing your experience. Part of our stream governance mandates that we be able to assess the quality of our data at each touchpoint of our data systems. I agree that, Kafka being an immutable distributed log store, we can't change the original data; on the flip side, our compliance does mandate auditable means of stream data history, measurable metrics around data quality, and lineage gathering. I must say that being able to gauge the quality of data in motion, and having ways to actually park undesired data into some set of buckets, would be desirable. GE may cover only a subset of these goals as a framework, but it would still be useful to have it implemented as an injectable library for Kafka.
-
Hey folks! This would be a super useful feature from great_expectations in our case. I think adding an integration with a Kafka source topic would be great, because it would make it easier to use great_expectations as a data quality filter that keeps unexpected data out of your lake. In my case, I would use it as one of the steps of my ETL ingestion, with some basic expectations: records that fail them would not go to the data lake but to a different topic, where they raise an alert and can be checked later (a sketch of this routing follows). Is there any chance that this topic gets reopened? @shakti-garg
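A minimal sketch of the routing step described above, assuming the kafka-python client and the classic (pre-1.0) great_expectations Pandas API. The topic names, bootstrap server, and the single expectation are illustrative placeholders:

```python
import json

import great_expectations as ge
import pandas as pd
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ingest-topic",  # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    batch = ge.from_pandas(pd.DataFrame([record]))
    batch.expect_column_values_to_not_be_null("user_id")  # example expectation
    result = batch.validate()
    # Route failing records to a quarantine topic instead of the lake.
    target = "lake-topic" if result.success else "quarantine-topic"
    producer.send(target, record)
```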
-
Hello Team |
-
Hi everyone, my name is Bobur and I am a Developer Advocate for GlassFlow. I'd like to jump into this engaging discussion, which seems quite relevant to my recent experiments. Lately I have been playing with the GX OSS Python SDK and how it can work with real-time data using Python stream-processing frameworks such as GlassFlow. From my understanding, you do not need a special data source connector for Kafka topics; you can do real-time data quality checks with Great Expectations at the transformation layer of the Python framework. As each real-time event comes in, you map it to an in-memory Data Asset such as a Pandas DataFrame, load your expectations from GX Cloud, run the validation, and send the results to downstream services (see the sketch below). The framework itself can connect to Kafka topics automatically. Let me know if you need my help to prove this concept. @martimors @shakti-garg @bhcastleton
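A framework-agnostic sketch of the per-event step described above: map an incoming event to an in-memory Pandas asset, validate it, and return the outcome so the stream processor can forward it downstream. The handler signature is hypothetical (a framework like GlassFlow would call something similar from its transformation layer), and the inline expectation stands in for a suite loaded from GX Cloud:

```python
import great_expectations as ge
import pandas as pd

def validate_event(event: dict) -> dict:
    """Validate one streaming event and report whether it passed."""
    asset = ge.from_pandas(pd.DataFrame([event]))
    asset.expect_column_values_to_not_be_null("user_id")  # example expectation
    result = asset.validate()
    return {"event": event, "valid": bool(result.success)}

# Example: a single event flowing through the transformation step.
print(validate_event({"user_id": 42, "amount": 9.99}))
```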
-
Is your feature request related to a problem? Please describe.
Some of our teams publish their data to Kafka topics. We want to run data quality validations over them to compute quality metrics at run time, which can then be aggregated over time periods to provide a historical view.
Describe the solution you'd like
Extend the data source abstraction with a Kafka-specific data source.