
tutorials shouldn't be susceptible to topic retention deleting sample data #1336

Open
davetroiano opened this issue Aug 2, 2022 · 6 comments

@davetroiano (Contributor)

The root cause of this issue is that sample row timestamps fall outside the default topic retention period, so recipe flows break, e.g., the final table output in a recipe sometimes won't show up because the source data has already been deleted.

This deserves a sweep of the repo (e.g., search for hardcoded years such as 2020 and 2021), and then either a fix (change retention; ksqlDB recipes could use FROM_UNIXTIME(UNIX_TIMESTAMP())) or follow-on issues for complex cases. Automated tests may need to be reworked to support dynamic times.
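As a sketch of the FROM_UNIXTIME(UNIX_TIMESTAMP()) approach (stream and column names here are hypothetical, not from any particular recipe), a recipe's INSERT could replace a hardcoded 2021 timestamp with the current time:

```sql
-- Hardcoded timestamp, susceptible to retention deleting the row:
-- INSERT INTO clicks (user_id, ts) VALUES ('u1', '2021-01-17T14:50:43');

-- Dynamic timestamp: UNIX_TIMESTAMP() returns the current epoch millis,
-- and FROM_UNIXTIME() converts that BIGINT into a ksqlDB TIMESTAMP.
INSERT INTO clicks (user_id, ts) VALUES ('u1', FROM_UNIXTIME(UNIX_TIMESTAMP()));
```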

@ybyzek (Contributor) commented Aug 3, 2022

@davetroiano IIRC you had mentioned that it may be useful to have current timestamps in datagen. One way to do that may be to extend https://github.com/confluentinc/avro-random-generator (which is used by https://github.com/confluentinc/kafka-connect-datagen) to provide that timestamp, which is part of the Avro specification: https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29

I started playing around with this today, and if we wanted to go that route, the Avro Random Generator diff could look something along these lines:

diff --git a/src/main/java/io/confluent/avro/random/generator/Generator.java b/src/main/java/io/confluent/avro/random/generator/Generator.java
index 272476f..373ba46 100644
--- a/src/main/java/io/confluent/avro/random/generator/Generator.java
+++ b/src/main/java/io/confluent/avro/random/generator/Generator.java
@@ -58,6 +58,7 @@ import java.util.Map;
 import java.util.Objects;
 import java.util.Optional;
 import java.util.Random;
+import java.util.Date;
 
 /**
  * Generates Java objects according to an {@link Schema Avro Schema}.
@@ -198,6 +199,8 @@ public class Generator {
 
   static final String DECIMAL_LOGICAL_TYPE_NAME = "decimal";
 
+  static final String TIMESTAMP_LOGICAL_TYPE_NAME = "timestamp";
+
   private final Schema topLevelSchema;
   private final Random random;
   private final long generation;
@@ -439,6 +442,10 @@ public class Generator {
     return getLogicalType(schema, DECIMAL_LOGICAL_TYPE_NAME, LogicalTypes.Decimal.class);
   }
 
+  private LogicalTypes.TimestampMillis getTimestampLogicalType(Schema schema) {
+    return getLogicalType(schema, TIMESTAMP_LOGICAL_TYPE_NAME, LogicalTypes.TimestampMillis.class);
+  }
+
   private void enforceMutualExclusion(
       Map propertiesProp,
       String includedProp,
@@ -1224,6 +1231,10 @@ public class Generator {
     }
   }
 
+  private Long generateTimestamp(Map propertiesProp) {
+    return (new Date(System.currentTimeMillis())).getTime();
+  }
+
   private Float generateFloat(Map propertiesProp) {
     Object rangeProp = propertiesProp.get(RANGE_PROP);
     if (rangeProp != null) {

Test with:

./gradlew standalone       
java -jar bin/arg.jar -f src/test/resources/test-schemas/timestamp.json

Note that it returns a long, but not one interpreted as a timestamp; this requires further troubleshooting. I just wanted to seed the idea to see if it's worth exploring further.
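For reference, a minimal test schema for this path might look like the following. This is hypothetical: the actual contents of src/test/resources/test-schemas/timestamp.json aren't shown in this thread, but Avro's timestamp-millis logical type annotates a long exactly like this, which is also why the generated value comes back as a bare long unless the logical type is applied.

```json
{
  "type": "record",
  "name": "TimestampTest",
  "fields": [
    {
      "name": "my_timestamp",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    }
  ]
}
```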

@davetroiano (Contributor, Author) commented Aug 3, 2022

@ybyzek awesome 🔥

What about including an iteration concept or something similar? Something to give users more control over time.

For example, if iteration is missing, then generate timestamps as in your diff.

An iteration example would look like:

{
  "name": "my_timestamp",
  "type": {
    "type": "timestamp",
    "arg.properties": {
      "format": "yyyy-MM-dd HH:mm:ss", (or "unix")
      "iteration": {
        "start": "2021-01-01 00:00:00", (or "now", or expressions including "now" like "now-5days")
        "step": "1second"
      }
    }
  }
}

Another idea is to randomize a bit: maybe each step can have multiple events generated randomly within it. For example, the first 60 timestamps here would all be between 2021-01-01 00:00:00 and 2021-01-01 00:00:59, but chosen randomly within that minute:

{
  "name": "my_timestamp",
  "type": {
    "type": "timestamp",
    "arg.properties": {
      "format": "yyyy-MM-dd HH:mm:ss", (or "unix")
      "iteration": {
        "start": "2021-01-01 00:00:00", (or "now", or expressions including "now" like "now-5days")
        "step": "1minute",
        "timestamps_per_step": 60
      }
    }
  }
}

This would generate one event in each minute, placed randomly rather than exactly every 60 seconds, which looks better for things like clickstream data:

{
  "name": "my_timestamp",
  "type": {
    "type": "timestamp",
    "arg.properties": {
      "format": "yyyy-MM-dd HH:mm:ss", (or "unix")
      "iteration": {
        "start": "2021-01-01 00:00:00", (or "now", or expressions including "now" like "now-5days")
        "step": "1minute",
        "timestamps_per_step": 1 (or omit; the default is 1)
      }
    }
  }
}

We can think about other ways to express these ideas, but control and randomness are the things that jump to mind.
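A minimal Java sketch of the iteration semantics proposed above (class and field names are hypothetical, not part of Avro Random Generator): each step is a fixed-width window starting at start, and timestamps_per_step events are drawn uniformly at random within the current window before it advances.

```java
import java.util.Random;

// Sketch only: "start"/"step"/"timestamps_per_step" mirror the proposed
// arg.properties; parsing of "now"/"1minute" expressions is omitted.
public class IterationTimestampSketch {
    private final long startMillis;
    private final long stepMillis;
    private final int timestampsPerStep;
    private final Random random = new Random();

    private long window = 0;          // index of the current step window
    private int emittedInWindow = 0;  // events generated in that window so far

    public IterationTimestampSketch(long startMillis, long stepMillis, int timestampsPerStep) {
        this.startMillis = startMillis;
        this.stepMillis = stepMillis;
        this.timestampsPerStep = timestampsPerStep;
    }

    /** Next generated timestamp: uniformly random within the current window. */
    public long next() {
        if (emittedInWindow == timestampsPerStep) {
            window++;                 // advance to the next step window
            emittedInWindow = 0;
        }
        emittedInWindow++;
        long windowStart = startMillis + window * stepMillis;
        return windowStart + (long) (random.nextDouble() * stepMillis);
    }

    public static void main(String[] args) {
        // 60 timestamps per 1-minute step, starting 2021-01-01 00:00:00 UTC.
        long start = 1609459200000L;
        IterationTimestampSketch gen = new IterationTimestampSketch(start, 60_000L, 60);
        for (int i = 0; i < 61; i++) {
            System.out.println(gen.next());
        }
    }
}
```

With timestamps_per_step set to 1 this degenerates to one random event per window, matching the clickstream-style example.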

@davetroiano (Contributor, Author)

Maybe it's not obvious, but I think the ideas above are useful for generating interesting data across time without respecting actual wall-clock time: "quickly generate data across a month so that I can play with this aggregation idea in ksqlDB."

@ybyzek (Contributor) commented Aug 3, 2022

@davetroiano agree, there are different options to think through on how to control which timestamps are emitted.

@davetroiano (Contributor, Author) commented Aug 4, 2022

@ybyzek @bbejeck this PR has some food for thought on a direction for making tutorials treat time more dynamically, mostly for tutorial quality; recency feels better when running. But it opens up a bunch of problems called out in the PR.

Worth saying that we don't need to go down the road of that PR to fix the build or to prevent users from hitting the same issue when running manually. The cheap fix for the problem that led to all of this is to just increase retention, i.e., add this here:

KAFKA_LOG_RETENTION_MS: -1
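In context, that line would go in the broker service's environment block of the tutorial's Docker Compose file. A sketch (service name and image tag are illustrative, not from a specific tutorial):

```yaml
broker:
  image: confluentinc/cp-kafka:7.1.1
  environment:
    # -1 disables time-based retention, so sample data with old
    # timestamps is never deleted out from under a recipe.
    KAFKA_LOG_RETENTION_MS: -1
```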

@chuck-alt-delete commented Sep 27, 2022

I'd suggest creating topics with infinite retention in each tutorial rather than setting the global retention default in Docker Compose. We don't really know where these tutorials are running, so scoping configs to topics seems like it will be the most portable.

ksqlDB 0.29 will include the ability to set retention for the underlying topic inside the WITH clause, which could be a handy way to do this.
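A hedged sketch of the topic-scoped approach, assuming the new WITH property is named RETENTION_MS (stream and column names are hypothetical):

```sql
CREATE STREAM clicks (user_id VARCHAR, url VARCHAR)
  WITH (
    KAFKA_TOPIC = 'clicks',
    PARTITIONS = 1,
    VALUE_FORMAT = 'JSON',
    RETENTION_MS = -1   -- retain this topic's data indefinitely
  );
```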
