
tutorials shouldn't be susceptible to topic retention deleting sample data #1336

Open
davetroiano opened this issue Aug 2, 2022 · 6 comments

@davetroiano (Contributor)

The root cause of this issue is that sample row timestamps fall outside the default topic retention period, so recipe flows break, e.g., the final table output in a recipe sometimes won't show up because the source data has already been deleted.

This deserves a sweep of the repo (e.g., search for hardcoded years such as 2020 and 2021), and then either a fix (change retention; ksqlDB recipes could use FROM_UNIXTIME(UNIX_TIMESTAMP())) or follow-on issues for complex cases. Automated tests may need to be reworked to support dynamic times.
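As a sketch of the FROM_UNIXTIME(UNIX_TIMESTAMP()) approach (stream and column names here are hypothetical, not from any particular recipe), a recipe's INSERT could replace a hardcoded 2021 timestamp with the current time:

```sql
-- Hardcoded timestamp, susceptible to retention deleting the row:
-- INSERT INTO clicks (user_id, ts) VALUES ('u1', '2021-01-17T14:50:43');

-- Dynamic timestamp: UNIX_TIMESTAMP() returns the current epoch millis,
-- and FROM_UNIXTIME() converts that BIGINT into a ksqlDB TIMESTAMP.
INSERT INTO clicks (user_id, ts) VALUES ('u1', FROM_UNIXTIME(UNIX_TIMESTAMP()));
```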

@ybyzek (Contributor) commented Aug 3, 2022

@davetroiano IIRC you had mentioned that it may be useful to have current timestamps in datagen. One way to do that may be to extend https://github.com/confluentinc/avro-random-generator (which is used by https://github.com/confluentinc/kafka-connect-datagen) to provide that timestamp, which is part of the Avro specification: https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29

I started playing around with this today, and if we wanted to go that route, the Avro Random Generator diff could look something along these lines:

diff --git a/src/main/java/io/confluent/avro/random/generator/Generator.java b/src/main/java/io/confluent/avro/random/generator/Generator.java
index 272476f..373ba46 100644
--- a/src/main/java/io/confluent/avro/random/generator/Generator.java
+++ b/src/main/java/io/confluent/avro/random/generator/Generator.java
@@ -58,6 +58,7 @@ import java.util.Map;
 import java.util.Objects;
 import java.util.Optional;
 import java.util.Random;
+import java.util.Date;
 
 /**
  * Generates Java objects according to an {@link Schema Avro Schema}.
@@ -198,6 +199,8 @@ public class Generator {
 
   static final String DECIMAL_LOGICAL_TYPE_NAME = "decimal";
 
+  static final String TIMESTAMP_LOGICAL_TYPE_NAME = "timestamp";
+
   private final Schema topLevelSchema;
   private final Random random;
   private final long generation;
@@ -439,6 +442,10 @@ public class Generator {
     return getLogicalType(schema, DECIMAL_LOGICAL_TYPE_NAME, LogicalTypes.Decimal.class);
   }
 
+  private LogicalTypes.TimestampMillis getTimestampLogicalType(Schema schema) {
+    return getLogicalType(schema, TIMESTAMP_LOGICAL_TYPE_NAME, LogicalTypes.TimestampMillis.class);
+  }
+
   private void enforceMutualExclusion(
       Map propertiesProp,
       String includedProp,
@@ -1224,6 +1231,10 @@ public class Generator {
     }
   }
 
+  private Long generateTimestamp(Map propertiesProp) {
+    return (new Date(System.currentTimeMillis())).getTime();
+  }
+
   private Float generateFloat(Map propertiesProp) {
     Object rangeProp = propertiesProp.get(RANGE_PROP);
     if (rangeProp != null) {

Test with:

./gradlew standalone       
java -jar bin/arg.jar -f src/test/resources/test-schemas/timestamp.json

Note that it returns a long, but not one interpreted as a timestamp; this requires further troubleshooting. I just wanted to seed the idea to see if it's worth exploring further.
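For reference, a minimal test schema for this path might look like the following. This is hypothetical: the actual contents of src/test/resources/test-schemas/timestamp.json aren't shown in this thread, but Avro's timestamp-millis logical type annotates a long exactly like this, which is also why the generated value comes back as a bare long unless the logical type is applied.

```json
{
  "type": "record",
  "name": "TimestampTest",
  "fields": [
    {
      "name": "my_timestamp",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    }
  ]
}
```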

@davetroiano (Contributor, Author) commented Aug 3, 2022

@ybyzek awesome 🔥

What about including an iteration concept or something similar? Something to give users more control over time.

For example, if iteration is missing, then generate timestamps as in your diff.

An iteration example would look like:

{
  "name": "my_timestamp",
  "type": {
    "type": "timestamp",
    "arg.properties": {
      "format": "yyyy-MM-dd HH:mm:ss", (or "unix")
      "iteration": {
        "start": "2021-01-01 00:00:00", (or "now", or expressions including "now" like "now-5days")
        "step": "1second"
      }
    }
  }
}

Another idea is to randomize a bit: maybe each step can have multiple events generated randomly within it. For example, the first 60 timestamps here would all be between 2021-01-01 00:00:00 and 2021-01-01 00:00:59, but chosen randomly within that minute:

{
  "name": "my_timestamp",
  "type": {
    "type": "timestamp",
    "arg.properties": {
      "format": "yyyy-MM-dd HH:mm:ss", (or "unix")
      "iteration": {
        "start": "2021-01-01 00:00:00", (or "now", or expressions including "now" like "now-5days")
        "step": "1minute",
        "timestamps_per_step": 60
      }
    }
  }
}

This would generate one event in each minute, placed randomly rather than exactly every 60 seconds, which looks better for things like clickstream data:

{
  "name": "my_timestamp",
  "type": {
    "type": "timestamp",
    "arg.properties": {
      "format": "yyyy-MM-dd HH:mm:ss", (or "unix")
      "iteration": {
        "start": "2021-01-01 00:00:00", (or "now", or expressions including "now" like "now-5days")
        "step": "1minute",
        "timestamps_per_step": 1 (or omit; the default is 1)
      }
    }
  }
}

We can think about other ways to express these ideas, but control and randomness are the things that jump to mind.
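A minimal Java sketch of the iteration semantics proposed above (class and field names are hypothetical, not part of Avro Random Generator): each step is a fixed-width window starting at start, and timestamps_per_step events are drawn uniformly at random within the current window before it advances.

```java
import java.util.Random;

// Sketch only: "start"/"step"/"timestamps_per_step" mirror the proposed
// arg.properties; parsing of "now"/"1minute" expressions is omitted.
public class IterationTimestampSketch {
    private final long startMillis;
    private final long stepMillis;
    private final int timestampsPerStep;
    private final Random random = new Random();

    private long window = 0;          // index of the current step window
    private int emittedInWindow = 0;  // events generated in that window so far

    public IterationTimestampSketch(long startMillis, long stepMillis, int timestampsPerStep) {
        this.startMillis = startMillis;
        this.stepMillis = stepMillis;
        this.timestampsPerStep = timestampsPerStep;
    }

    /** Next generated timestamp: uniformly random within the current window. */
    public long next() {
        if (emittedInWindow == timestampsPerStep) {
            window++;                 // advance to the next step window
            emittedInWindow = 0;
        }
        emittedInWindow++;
        long windowStart = startMillis + window * stepMillis;
        return windowStart + (long) (random.nextDouble() * stepMillis);
    }

    public static void main(String[] args) {
        // 60 timestamps per 1-minute step, starting 2021-01-01 00:00:00 UTC.
        long start = 1609459200000L;
        IterationTimestampSketch gen = new IterationTimestampSketch(start, 60_000L, 60);
        for (int i = 0; i < 61; i++) {
            System.out.println(gen.next());
        }
    }
}
```

With timestamps_per_step set to 1 this degenerates to one random event per window, matching the clickstream-style example.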

@davetroiano (Contributor, Author)

Maybe it's not obvious, but I think the ideas above are useful for generating interesting data across time without respecting actual wall-clock time: "quickly generate data across a month so that I can play with this aggregation idea in ksqlDB."

@ybyzek (Contributor) commented Aug 3, 2022

@davetroiano agree, there are different options to think through on how to control which timestamps are emitted.

@davetroiano (Contributor, Author) commented Aug 4, 2022

@ybyzek @bbejeck this PR has some food for thought on a direction for making tutorials treat time more dynamically, mostly for tutorial quality; recency feels better when running. But it opens up a bunch of problems called out in the PR.

Worth saying that we don't need to go down the road of that PR to fix the build or to prevent users from hitting the same issue when running manually. The cheap fix for the problem that led to all of this is to just increase retention, i.e., add this here:

KAFKA_LOG_RETENTION_MS: -1
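In context, that line would go in the broker service's environment block of the tutorial's Docker Compose file. A sketch (service name and image tag are illustrative, not from a specific tutorial):

```yaml
broker:
  image: confluentinc/cp-kafka:7.1.1
  environment:
    # -1 disables time-based retention, so sample data with old
    # timestamps is never deleted out from under a recipe.
    KAFKA_LOG_RETENTION_MS: -1
```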

@chuck-alt-delete commented Sep 27, 2022

I'd suggest creating topics with infinite retention in each tutorial rather than setting the global retention default in Docker Compose. We don't really know where these tutorials are running, so scoping configs to topics seems like it will be the most portable.

ksqlDB 0.29 will include the ability to set retention for the underlying topic inside the WITH clause, which could be a handy way to do this.
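A hedged sketch of the topic-scoped approach, assuming the new WITH property is named RETENTION_MS (stream and column names are hypothetical):

```sql
CREATE STREAM clicks (user_id VARCHAR, url VARCHAR)
  WITH (
    KAFKA_TOPIC = 'clicks',
    PARTITIONS = 1,
    VALUE_FORMAT = 'JSON',
    RETENTION_MS = -1   -- retain this topic's data indefinitely
  );
```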
