Commit

Merge branch 'main' into crf-remove-exit-override

cristianfr authored Mar 1, 2024
2 parents c0fb7b3 + 1e091c1 commit e6ed39a
Showing 52 changed files with 1,705 additions and 320 deletions.
30 changes: 0 additions & 30 deletions .circleci/config.yml
@@ -41,33 +41,6 @@ jobs:
docker push houpy0829/chronon-ci:base
fi
"Scala 11 -- Spark 2 Tests":
executor: docker_baseimg_executor
steps:
- checkout
- run:
name: Run Spark 2.4.0 tests
shell: /bin/bash -leuxo pipefail
command: |
conda activate chronon_py
# Increase if we see OOM.
export SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=4G -Xmx4G -Xms2G"
sbt "++ 2.11.12 test"
- store_test_results:
path: /chronon/spark/target/test-reports
- store_test_results:
path: /chronon/aggregator/target/test-reports
- run:
name: Compress spark-warehouse
command: |
cd /tmp/ && tar -czvf spark-warehouse.tar.gz chronon/spark-warehouse
when: on_fail
- store_artifacts:
path: /tmp/spark-warehouse.tar.gz
destination: spark_warehouse.tar.gz
when: on_fail

"Scala 12 -- Spark 3 Tests":
executor: docker_baseimg_executor
steps:
@@ -167,9 +140,6 @@ workflows:
build_test_deploy:
jobs:
- "Docker Base Build"
- "Scala 11 -- Spark 2 Tests":
requires:
- "Docker Base Build"
- "Scala 12 -- Spark 3 Tests":
requires:
- "Docker Base Build"
21 changes: 0 additions & 21 deletions .github/workflows/scala.yml

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -20,6 +20,7 @@ api/py/test/sample/production/joins/quickstart/
api/py/.coverage
api/py/htmlcov/
**/derby.log
cs

# Documentation builds
docs/build/
115 changes: 61 additions & 54 deletions CONTRIBUTE.md → CONTRIBUTING.md
@@ -11,7 +11,7 @@ Everyone is welcome to contribute to Chronon. We value all forms of contribution
- Test cases to make the codebase more robust
- Tutorials, blog posts, talks that promote the project.
- Functionality extensions, new features, etc.
- Optimizations
- Support for new aggregations and data types
- Support for connectors to different storage systems and event buses

@@ -22,11 +22,11 @@ In the interest of keeping Chronon a stable platform for users, some changes are
- Changes that could break online fetching flows, including changing the timestamp watermarking or processing in the lambda architecture, or Serde logic.
- Changes that would interfere with existing Airflow DAGs, for example changing the default schedule in a way that would cause breakage on recent versions of Airflow.

There are exceptions to these general rules; however, please be sure to follow the “major change” guidelines if you wish to make such a change.

## General Development Process

Everyone in the community is welcome to send patches, documents, and propose new features to the project.

Code changes require a stamp of approval from Chronon contributors to be merged, as outlined in the project bylaws.

@@ -38,9 +38,9 @@ The process for reporting bugs and requesting smaller features is also outlined

Pull Requests (PRs) should follow these guidelines as much as possible:

### Code Guidelines

- Follow our [code style guidelines](docs/source/Code_Guidelines.md)
- Well scoped (avoid multiple unrelated changes in the same PR)
- Code should be rebased on the latest version of the master branch
- All lint checks and test cases should pass
@@ -56,18 +56,17 @@ Although these guidelines apply essentially to the PRs’ title and body message

The rules below help achieve a uniformity that benefits both review and maintenance of the code base as a whole, helping you write commit messages of a quality suitable for the Chronon project and enabling fast log searches, bisecting, and so on.

#### PR title

- Guarantee a title exists
- Don’t use Github usernames in the title, like @username (enforced)
- Include tags as a hint about what component(s) of the code the PRs / commits “touch”. For example [BugFix], [CI], [Streaming], [Spark], etc. If more than one tag exists, multiple brackets should be used, like [BugFix][CI]

#### PR body

- Guarantee a body exists
- Include a simple and clear explanation of the purpose of the change
- Include any relevant information about how it was tested

## Release Guidelines

@@ -83,23 +82,24 @@ Issues need to contain all relevant information based on the type of the issue.

- Summary of what the user was trying to achieve
- Sample data - Inputs, Expected Outputs (by the user) and Current Output
- Configuration - StagingQuery / GroupBy or Join
- Repro steps
- What commands were run and what was the full output of the command
- PR guidelines
- Includes a failing test case based on sample data

### Crash Reports

- Summary of what the user was trying to achieve
- Sample data - Inputs, Expected Outputs (by the user)
- Configuration - StagingQuery / GroupBy or Join
- Repro steps
- What commands were run and the output along with the error stack trace
- PR guidelines
- Includes a test case for the crash

## Feature requests and Optimization Requests

We expect the proposer to create a CHIP / Chronon Improvement Proposal document, as detailed below.

# Chronon Improvement Proposal (CHIP)
@@ -147,7 +147,6 @@ For the most part monitoring, command line tool changes, and configs are added w

## What should be included in a CHIP?


A CHIP should contain the following sections:

- Motivation: describe the problem to be solved
@@ -163,13 +162,13 @@ Anyone can initiate a CHIP but you shouldn't do it unless you have an intention
## Process

Here is the process for making a CHIP:

1. Create a PR in chronon/proposals with a single markdown file. Take the next available CHIP number and create a file “CHIP-42 Monoid caching for online & real-time feature fetches”. This is the document that you will iterate on.
2. Fill in the sections as described above and file a PR. These proposal document PRs are reviewed by the committer who is on-call. They usually get merged once there is enough detail and clarity.
3. Start a [DISCUSS] issue on github. Please ensure that the subject of the thread is of the format [DISCUSS] CHIP-{your CHIP number} {your CHIP heading}. In the process of the discussion you may update the proposal. You should let people know the changes you are making.
4. Once the proposal is finalized, tag the issue with the “voting-due” label. These proposals are more serious than code changes and more serious even than release votes. In the weekly committee meetings we will vote for/against the CHIP - where Yes, Veto-no, Neutral are the choices. The criterion for acceptance is a 3+ “yes” vote count by the members of the committee without a veto-no. Veto-no votes require in-depth technical justifications to be provided on the github issue.
5. Please update the CHIP markdown doc to reflect the current stage of the CHIP after a vote. This acts as the permanent record indicating the result of the CHIP (e.g., Accepted or Rejected). Also report the result of the CHIP vote to the github issue thread.


It's not unusual for a CHIP proposal to take long discussions to be finalized. Below are some general suggestions on driving CHIP towards consensus. Notice that these are hints rather than rules. Contributors should make pragmatic decisions in accordance with individual situations.

- The progress of a CHIP should not be long blocked on an unresponsive reviewer. A reviewer who blocks a CHIP with dissenting opinions should try to respond to the subsequent replies timely, or at least provide a reasonable estimated time to respond.
@@ -180,40 +179,48 @@ It's not unusual for a CHIP proposal to take long discussions to be finalized. B
# Resources

Below is a list of resources that can be useful for development and debugging.
## Docs

[Docsite](https://chronon.ai)\
[doc directory](https://github.com/airbnb/chronon/tree/main/docs/source)\
[Code of conduct](TODO)

## Links

[pip project](https://pypi.org/project/chronon-ai/)\
[maven central](https://mvnrepository.com/artifact/ai.chronon/): [publishing](https://github.com/airbnb/chronon/blob/main/devnotes.md#publishing-all-the-artifacts-of-chronon)\
[Docsite: publishing](https://github.com/airbnb/chronon/blob/main/devnotes.md#chronon-artifacts-publish-process)

## Code Pointers

### API

[Thrift](https://github.com/airbnb/chronon/blob/main/api/thrift/api.thrift#L180), [Python](https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/group_by.py)\
[CLI driver entry point for job launching](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/Driver.scala)

### Offline flows that produce hive tables or file output

[GroupBy](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupBy.scala)\
[Staging Query](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/StagingQuery.scala)\
[Join backfills](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/Join.scala)\
[Metadata Export](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/MetadataExporter.scala)

### Online flows that update and read data & metadata from the kvStore

[GroupBy window tail upload](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupByUpload.scala)\
[Streaming window head upload](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/streaming/GroupBy.scala)\
[Fetching](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Fetcher.scala)

### Aggregations

[time based aggregations](https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/base/TimedAggregators.scala)\
[time independent aggregations](https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala)\
[integration point with rest of chronon](https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala#L223)\
[Windowing](https://github.com/airbnb/chronon/tree/main/aggregator/src/main/scala/ai/chronon/aggregator/windowing)
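The aggregation pointers above share one key property: partial aggregates can be merged associatively, which is what lets batch "window tail" and streaming "window head" results combine without reprocessing raw events. A minimal pure-Python sketch of that idea, using an average built from (sum, count) pairs — illustrative only, not the Chronon aggregator API:

```python
# Partial aggregates for AVERAGE form a monoid-like structure:
# (sum, count) pairs merge associatively, so precomputed batch
# partials and streaming events can be combined incrementally.

def prepare(value):
    return (value, 1)            # partial aggregate for a single event

def merge(ir_a, ir_b):
    return (ir_a[0] + ir_b[0], ir_a[1] + ir_b[1])

def finalize(ir):
    s, n = ir
    return s / n if n else None

batch_ir = (60.0, 3)             # hypothetical precomputed offline partial
stream_events = [10.0, 30.0]     # hypothetical streaming values

ir = batch_ir
for v in stream_events:
    ir = merge(ir, prepare(v))

print(finalize(ir))              # (60 + 10 + 30) / (3 + 2) = 20.0
```

Because `merge` is associative, partials can be combined in any grouping — the property that window-tail/window-head splitting relies on.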

### Testing

[Testing - sbt commands](https://github.com/airbnb/chronon/blob/main/devnotes.md#testing)\
[Automated testing - circle-ci pipelines](https://app.circleci.com/pipelines/github/airbnb/chronon)\
[Dev Setup](https://github.com/airbnb/chronon/blob/main/devnotes.md#prerequisites)
16 changes: 8 additions & 8 deletions README.md
@@ -59,7 +59,7 @@ Does not include:

## Setup

To get started with Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/main/docker-compose.yml) file and run it locally:

```bash
curl -o docker-compose.yml https://chronon.ai/docker-compose.yml
@@ -74,7 +74,7 @@ In this example, let's assume that we're a large online retailer, and we've dete

## Raw data sources

Fabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/data) directory. It includes four tables:

1. Users - includes basic information about users such as account created date; modeled as a batch data source that updates daily
2. Purchases - a log of all purchases by users; modeled as a log table with a streaming (i.e. Kafka) event-bus counterpart
@@ -141,11 +141,11 @@ v1 = GroupBy(
)
```
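For intuition about what the windowed aggregations in such a GroupBy compute, here is a minimal pure-Python sketch — not the Chronon API, and the sample rows below are hypothetical rather than the repo's fabricated dataset:

```python
from datetime import datetime, timedelta

# Hypothetical purchase events, one dict per row.
purchases = [
    {"user_id": "u1", "ts": datetime(2023, 11, 1), "purchase_price": 30},
    {"user_id": "u1", "ts": datetime(2023, 11, 20), "purchase_price": 70},
    {"user_id": "u2", "ts": datetime(2023, 11, 25), "purchase_price": 15},
]

def windowed_sum(rows, user_id, as_of, window_days):
    """Sum purchase_price for one user over the window (as_of - N days, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(
        r["purchase_price"]
        for r in rows
        if r["user_id"] == user_id and start < r["ts"] <= as_of
    )

as_of = datetime(2023, 11, 30)
print(windowed_sum(purchases, "u1", as_of, 30))  # both u1 purchases in window -> 100
print(windowed_sum(purchases, "u1", as_of, 7))   # no u1 purchase after Nov 23 -> 0
```

The key point is that every aggregation is evaluated relative to an `as_of` timestamp, which is what makes backfilled training features point-in-time correct.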

See the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data).

**Feature set 2: Returns data features**

We perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example.

**Feature set 3: User data features**

@@ -167,7 +167,7 @@ v1 = GroupBy(
)
```

Taken from the [users GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py).


### Step 2 - Join the features together
@@ -200,7 +200,7 @@ v1 = Join(
)
```

Taken from the [training_set Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/quickstart/training_set.py).

The `left` side of the join is what defines the timestamps and primary keys for the backfill (notice that it is built on top of the `checkout` event, as dictated by our use case).
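That point-in-time semantics can be sketched in plain Python — an illustrative stand-in, not Chronon's implementation: for each left-side timestamp, take the latest feature value at or before it.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature snapshots per key, sorted by timestamp.
feature_values = {
    "u1": [(datetime(2023, 11, 1), 10), (datetime(2023, 11, 20), 25)],
}

def point_in_time_lookup(values, key, ts):
    """Return the latest feature value at or before ts, or None if none exists."""
    rows = values.get(key, [])
    times = [t for t, _ in rows]
    i = bisect_right(times, ts)  # number of snapshots with timestamp <= ts
    return rows[i - 1][1] if i > 0 else None

print(point_in_time_lookup(feature_values, "u1", datetime(2023, 11, 15)))  # 10
print(point_in_time_lookup(feature_values, "u1", datetime(2023, 11, 25)))  # 25
print(point_in_time_lookup(feature_values, "u2", datetime(2023, 11, 25)))  # None
```

Looking up values strictly at-or-before each left-side timestamp is what prevents future data from leaking into training rows.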

@@ -370,7 +370,7 @@ Using chronon for your feature engineering work simplifies and improves your ML
4. Chronon exposes easy endpoints for feature fetching.
5. Consistency is guaranteed and measurable.

For a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/main?tab=readme-ov-file#benefits-of-chronon-over-other-approaches).


# Benefits of Chronon over other approaches
@@ -417,7 +417,7 @@ With Chronon you can use any data available in your organization, including ever

# Contributing

We welcome contributions to the Chronon project! Please read [CONTRIBUTING](CONTRIBUTING.md) for details.

# Support

