tutorials shouldn't be susceptible to topic retention deleting sample data #1336
Comments
@davetroiano IIRC you had mentioned that it may be useful to have current timestamps in datagen. To be able to do that, it may be possible to extend https://github.com/confluentinc/avro-random-generator (which is used by https://github.com/confluentinc/kafka-connect-datagen) to provide that timestamp, which is part of the Avro specification: https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29 I started playing around with this today, and if we wanted to go that route, the Avro Random Generator diff could look something along these lines:
Test with:
Note that it returns a long but it's not a timestamp -- this requires further troubleshooting. But just wanted to seed the idea to see if it's worth exploring further.
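For reference, the Avro `timestamp-millis` logical type annotates a long that counts milliseconds since the Unix epoch (UTC). A minimal Python sketch of what a "current timestamp" value would look like and how to check that a generated long is plausible follows; the helper names `now_millis` and `millis_to_datetime` are made up for illustration and are not part of avro-random-generator:

```python
from datetime import datetime, timezone


def now_millis() -> int:
    """Current time as integer milliseconds since the Unix epoch (UTC)."""
    return int(datetime.now(timezone.utc).timestamp() * 1000)


def millis_to_datetime(millis: int) -> datetime:
    """Interpret a long as a timestamp-millis value (hypothetical helper)."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    value = now_millis()
    # A long that really is a current timestamp should decode to "now".
    print(value, millis_to_datetime(value).isoformat())
```

If the long coming out of the generator decodes to a date far from the present, it is probably a random long rather than an epoch-millis value.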
@ybyzek awesome 🔥 What about including for example, if an
Another idea is to randomize a bit. Maybe each iteration can have multiple events generated randomly within each iteration. For example, the first 60 timestamps here would all be between
This would generate an event in each minute but randomly instead of every 60 sec, better looking for things like clickstream:
We can think about other ways to express these ideas but control and randomness are the things that jump to my mind.
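A rough sketch of the "one event per minute, at a random point within that minute" idea, in Python. This is only an illustration of the behavior described above, not a proposed datagen schema; the function name, the `seed` parameter, and the use of epoch milliseconds are assumptions:

```python
import random


def random_minute_timestamps(start_ms: int, minutes: int, seed=None) -> list:
    """Return one epoch-millis timestamp per minute, randomly placed in it.

    start_ms is the epoch-millis start of the first minute (hypothetical input).
    """
    rng = random.Random(seed)
    timestamps = []
    for i in range(minutes):
        minute_start = start_ms + i * 60_000
        # Random offset in [0, 60000) ms keeps the event inside its minute.
        timestamps.append(minute_start + rng.randrange(60_000))
    return timestamps


if __name__ == "__main__":
    for ts in random_minute_timestamps(start_ms=0, minutes=5, seed=42):
        print(ts)
```

Because each event stays inside its own minute, the output is still ordered, but the gaps between events vary, which tends to look more natural for clickstream-style data.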
Maybe not obvious, but I think the ideas above are useful to get interesting data across time without respecting actual time, e.g., "quickly generate data across a month so that I can play with this aggregation idea in ksqlDB".
@davetroiano agree, there are different options to think through on how to control which timestamps are emitted.
@ybyzek @bbejeck this PR has some food for thought on a direction to go to make tutorials treat time more dynamically, mostly for tutorial quality. Recency feels better when running. But it opens up a bunch of problems called out in the PR. Worth saying we don't need to go down the road of that PR to fix the build or prevent users from hitting the same issue when running manually. The cheap fix to the problem that led to all of this is to just increase retention, i.e., add this here:
I’d suggest creating topics with infinite retention in each tutorial rather than setting the global default for retention in docker compose. We don’t really know where these tutorials are running, so scoping configs to topics seems like it will be the most portable. ksqlDB 0.29 will include the ability to set retention for the underlying topic inside the WITH statement, which could be a handy way to do this.
The root cause of this issue is that the sample row timestamps are beyond the default topic retention period, so the recipe flow breaks, e.g., the final table output in that recipe sometimes won't show up because the source data is gone.
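To make the failure mode concrete, here is a simplified Python sketch of the age check, assuming the Kafka broker default retention of 7 days (`log.retention.hours=168`) and treating a negative `retention.ms` as infinite. Real log cleanup works per segment and runs periodically, so this only approximates when a row becomes eligible for deletion; the function name is hypothetical:

```python
# Kafka broker default: log.retention.hours=168, i.e. 7 days.
DEFAULT_RETENTION_MS = 7 * 24 * 60 * 60 * 1000


def is_beyond_retention(row_ts_ms: int, now_ms: int,
                        retention_ms: int = DEFAULT_RETENTION_MS) -> bool:
    """Approximate check: is this row's timestamp older than retention?"""
    if retention_ms < 0:
        # retention.ms=-1 means records are never deleted based on time.
        return False
    return now_ms - row_ts_ms > retention_ms


if __name__ == "__main__":
    import time

    now_ms = int(time.time() * 1000)
    sample_row_ms = 1_600_000_000_000  # a hard-coded 2020 timestamp
    print(is_beyond_retention(sample_row_ms, now_ms))
    print(is_beyond_retention(sample_row_ms, now_ms, retention_ms=-1))
```

Any sample row with a hard-coded 2020 or 2021 timestamp fails this check today under the default, which is why either infinite topic retention or dynamic timestamps avoids the problem.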
This deserves a sweep of the repo (e.g., search for years -- 2020, 2021), and either fix (change retention, ksqlDB recipes could use FROM_UNIXTIME(UNIX_TIMESTAMP())) or open follow-on issues for complex cases. Automated tests may need to be reworked to support dynamic times.