Commit

Merge branch 'main' into crf-remove-exit-override

cristianfr authored Mar 1, 2024
2 parents c0fb7b3 + 1e091c1 commit e6ed39a
Showing 52 changed files with 1,705 additions and 320 deletions.
30 changes: 0 additions & 30 deletions .circleci/config.yml
@@ -41,33 +41,6 @@ jobs:
docker push houpy0829/chronon-ci:base
fi
"Scala 11 -- Spark 2 Tests":
executor: docker_baseimg_executor
steps:
- checkout
- run:
name: Run Spark 2.4.0 tests
shell: /bin/bash -leuxo pipefail
command: |
conda activate chronon_py
# Increase if we see OOM.
export SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=4G -Xmx4G -Xms2G"
sbt "++ 2.11.12 test"
- store_test_results:
path: /chronon/spark/target/test-reports
- store_test_results:
path: /chronon/aggregator/target/test-reports
- run:
name: Compress spark-warehouse
command: |
cd /tmp/ && tar -czvf spark-warehouse.tar.gz chronon/spark-warehouse
when: on_fail
- store_artifacts:
path: /tmp/spark-warehouse.tar.gz
destination: spark_warehouse.tar.gz
when: on_fail

"Scala 12 -- Spark 3 Tests":
executor: docker_baseimg_executor
steps:
@@ -167,9 +140,6 @@ workflows:
build_test_deploy:
jobs:
- "Docker Base Build"
- "Scala 11 -- Spark 2 Tests":
requires:
- "Docker Base Build"
- "Scala 12 -- Spark 3 Tests":
requires:
- "Docker Base Build"
21 changes: 0 additions & 21 deletions .github/workflows/scala.yml

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -20,6 +20,7 @@ api/py/test/sample/production/joins/quickstart/
api/py/.coverage
api/py/htmlcov/
**/derby.log
cs

# Documentation builds
docs/build/
115 changes: 61 additions & 54 deletions CONTRIBUTE.md → CONTRIBUTING.md
@@ -11,7 +11,7 @@ Everyone is welcome to contribute to Chronon. We value all forms of contribution
- Test cases to make the codebase more robust
- Tutorials, blog posts, talks that promote the project.
- Functionality extensions, new features, etc.
- Optimizations
- Support for new aggregations and data types
- Support for connectors to different storage systems and event buses

@@ -22,11 +22,11 @@ In the interest of keeping Chronon a stable platform for users, some changes are
- Changes that could break online fetching flows, including changing the timestamp watermarking or processing in the lambda architecture, or Serde logic.
- Changes that would interfere with existing Airflow DAGs, for example changing the default schedule in a way that would cause breakage on recent versions of Airflow.

There are exceptions to these general rules; however, please be sure to follow the “major change” guidelines if you wish to make such a change.

## General Development Process

Everyone in the community is welcome to send patches, documents, and propose new features to the project.

Code changes require a stamp of approval from Chronon contributors to be merged, as outlined in the project bylaws.

@@ -38,9 +38,9 @@ The process for reporting bugs and requesting smaller features is also outlined

Pull Requests (PRs) should follow these guidelines as much as possible:

### Code Guidelines

- Follow our [code style guidelines](docs/source/Code_Guidelines.md)
- Well scoped (avoid multiple unrelated changes in the same PR)
- Code should be rebased on the latest version of the master branch
- All lint checks and test cases should pass
@@ -56,18 +56,17 @@ Although these guidelines apply essentially to the PRs’ title and body message

The rules below help achieve a uniformity that benefits both review and maintenance of the code base as a whole, helping you write commit messages of a quality suitable for the Chronon project and enabling fast log searches, bisecting, and so on.

#### PR title

- Guarantee a title exists
- Don’t use Github usernames in the title, like @username (enforced)
- Include tags as a hint about what component(s) of the code the PRs / commits “touch”. For example [BugFix], [CI], [Streaming], [Spark], etc. If more than one tag exists, multiple brackets should be used, like [BugFix][CI]

#### PR body

- Guarantee a body exists
- Include a simple and clear explanation of the purpose of the change
- Include any relevant information about how it was tested

## Release Guidelines

@@ -83,23 +82,24 @@ Issues need to contain all relevant information based on the type of the issue.

- Summary of what the user was trying to achieve
- Sample data - Inputs, Expected Outputs (by the user) and Current Output
- Configuration - StagingQuery / GroupBy or Join
- Repro steps
- What commands were run and what was the full output of the command
- PR guidelines
- Includes a failing test case based on sample data

### Crash Reports

- Summary of what the user was trying to achieve
- Sample data - Inputs, Expected Outputs (by the user)
- Configuration - StagingQuery / GroupBy or Join
- Repro steps
- What commands were run and the output along with the error stack trace
- PR guidelines
- Includes a test case for the crash

## Feature requests and Optimization Requests

We expect the proposer to create a CHIP / Chronon Improvement Proposal document, as detailed below.

# Chronon Improvement Proposal (CHIP)
@@ -147,7 +147,6 @@ For the most part monitoring, command line tool changes, and configs are added w

## What should be included in a CHIP?


A CHIP should contain the following sections:

- Motivation: describe the problem to be solved
@@ -163,13 +162,13 @@ Anyone can initiate a CHIP but you shouldn't do it unless you have an intention
## Process

Here is the process for making a CHIP:

1. Create a PR in chronon/proposals with a single markdown file. Take the next available CHIP number and create a file “CHIP-42 Monoid caching for online & real-time feature fetches”. This is the document that you will iterate on.
2. Fill in the sections as described above and file a PR. These proposal document PRs are reviewed by the committer who is on-call. They usually get merged once there is enough detail and clarity.
3. Start a [DISCUSS] issue on github. Please ensure that the subject of the thread is of the format [DISCUSS] CHIP-{your CHIP number} {your CHIP heading}. In the process of the discussion you may update the proposal. You should let people know the changes you are making.
4. Once the proposal is finalized, tag the issue with the “voting-due” label. These proposals are more serious than code changes and more serious even than release votes. In the weekly committee meetings we will vote for/against the CHIP - where Yes, Veto-no, Neutral are the choices. The criterion for acceptance is a 3+ “yes” vote count by the members of the committee without a veto-no. Veto-no votes require in-depth technical justifications to be provided on the github issue.
5. Please update the CHIP markdown doc to reflect the current stage of the CHIP after a vote. This acts as the permanent record indicating the result of the CHIP (e.g., Accepted or Rejected). Also report the result of the CHIP vote to the github issue thread.


It's not unusual for a CHIP proposal to take long discussions to be finalized. Below are some general suggestions on driving CHIP towards consensus. Notice that these are hints rather than rules. Contributors should make pragmatic decisions in accordance with individual situations.

- The progress of a CHIP should not be long blocked on an unresponsive reviewer. A reviewer who blocks a CHIP with dissenting opinions should try to respond to the subsequent replies timely, or at least provide a reasonable estimated time to respond.
@@ -180,40 +179,48 @@ It's not unusual for a CHIP proposal to take long discussions to be finalized. B
# Resources

Below is a list of resources that can be useful for development and debugging.
## Docs

[Docsite](https://chronon.ai)\
[doc directory](https://github.com/airbnb/chronon/tree/main/docs/source)\
[Code of conduct](TODO)

## Links

[pip project](https://pypi.org/project/chronon-ai/)\
[maven central](https://mvnrepository.com/artifact/ai.chronon/): [publishing](https://github.com/airbnb/chronon/blob/main/devnotes.md#publishing-all-the-artifacts-of-chronon)\
[Docsite: publishing](https://github.com/airbnb/chronon/blob/main/devnotes.md#chronon-artifacts-publish-process)

## Code Pointers

### API

[Thrift](https://github.com/airbnb/chronon/blob/main/api/thrift/api.thrift#L180), [Python](https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/group_by.py)\
[CLI driver entry point for job launching](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/Driver.scala)

### Offline flows that produce hive tables or file output

[GroupBy](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupBy.scala)\
[Staging Query](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/StagingQuery.scala)\
[Join backfills](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/Join.scala)\
[Metadata Export](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/MetadataExporter.scala)

### Online flows that update and read data & metadata from the kvStore

[GroupBy window tail upload](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupByUpload.scala)\
[Streaming window head upload](https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/streaming/GroupBy.scala)\
[Fetching](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Fetcher.scala)

### Aggregations

[time based aggregations](https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/base/TimedAggregators.scala)\
[time independent aggregations](https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala)\
[integration point with rest of chronon](https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala#L223)\
[Windowing](https://github.com/airbnb/chronon/tree/main/aggregator/src/main/scala/ai/chronon/aggregator/windowing)
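The aggregation pointers above share one key property: partial aggregates can be merged associatively, which is what lets batch "window tail" and streaming "window head" results combine without reprocessing raw events. A minimal pure-Python sketch of that idea, using an average built from (sum, count) pairs — illustrative only, not the Chronon aggregator API:

```python
# Partial aggregates for AVERAGE form a monoid-like structure:
# (sum, count) pairs merge associatively, so precomputed batch
# partials and streaming events can be combined incrementally.

def prepare(value):
    return (value, 1)            # partial aggregate for a single event

def merge(ir_a, ir_b):
    return (ir_a[0] + ir_b[0], ir_a[1] + ir_b[1])

def finalize(ir):
    s, n = ir
    return s / n if n else None

batch_ir = (60.0, 3)             # hypothetical precomputed offline partial
stream_events = [10.0, 30.0]     # hypothetical streaming values

ir = batch_ir
for v in stream_events:
    ir = merge(ir, prepare(v))

print(finalize(ir))              # (60 + 10 + 30) / (3 + 2) = 20.0
```

Because `merge` is associative, partials can be combined in any grouping — the property that window-tail/window-head splitting relies on.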

### Testing

[Testing - sbt commands](https://github.com/airbnb/chronon/blob/main/devnotes.md#testing)\
[Automated testing - circle-ci pipelines](https://app.circleci.com/pipelines/github/airbnb/chronon)\
[Dev Setup](https://github.com/airbnb/chronon/blob/main/devnotes.md#prerequisites)
16 changes: 8 additions & 8 deletions README.md
@@ -59,7 +59,7 @@ Does not include:

## Setup

To get started with Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/main/docker-compose.yml) file and run it locally:

```bash
curl -o docker-compose.yml https://chronon.ai/docker-compose.yml
@@ -74,7 +74,7 @@ In this example, let's assume that we're a large online retailer, and we've dete

## Raw data sources

Fabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/data) directory. It includes four tables:

1. Users - includes basic information about users such as account created date; modeled as a batch data source that updates daily
2. Purchases - a log of all purchases by users; modeled as a log table with a streaming (i.e. Kafka) event-bus counterpart
@@ -141,11 +141,11 @@ v1 = GroupBy(
)
```
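For intuition about what the windowed aggregations in such a GroupBy compute, here is a minimal pure-Python sketch — not the Chronon API, and the sample rows below are hypothetical rather than the repo's fabricated dataset:

```python
from datetime import datetime, timedelta

# Hypothetical purchase events, one dict per row.
purchases = [
    {"user_id": "u1", "ts": datetime(2023, 11, 1), "purchase_price": 30},
    {"user_id": "u1", "ts": datetime(2023, 11, 20), "purchase_price": 70},
    {"user_id": "u2", "ts": datetime(2023, 11, 25), "purchase_price": 15},
]

def windowed_sum(rows, user_id, as_of, window_days):
    """Sum purchase_price for one user over the window (as_of - N days, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(
        r["purchase_price"]
        for r in rows
        if r["user_id"] == user_id and start < r["ts"] <= as_of
    )

as_of = datetime(2023, 11, 30)
print(windowed_sum(purchases, "u1", as_of, 30))  # both u1 purchases in window -> 100
print(windowed_sum(purchases, "u1", as_of, 7))   # no u1 purchase after Nov 23 -> 0
```

The key point is that every aggregation is evaluated relative to an `as_of` timestamp, which is what makes backfilled training features point-in-time correct.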

See the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data).

**Feature set 2: Returns data features**

We perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example.

**Feature set 3: User data features**

@@ -167,7 +167,7 @@ v1 = GroupBy(
)
```

Taken from the [users GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py).


### Step 2 - Join the features together
@@ -200,7 +200,7 @@ v1 = Join(
)
```

Taken from the [training_set Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/quickstart/training_set.py).

The `left` side of the join is what defines the timestamps and primary keys for the backfill (notice that it is built on top of the `checkout` event, as dictated by our use case).
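That point-in-time semantics can be sketched in plain Python — an illustrative stand-in, not Chronon's implementation: for each left-side timestamp, take the latest feature value at or before it.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature snapshots per key, sorted by timestamp.
feature_values = {
    "u1": [(datetime(2023, 11, 1), 10), (datetime(2023, 11, 20), 25)],
}

def point_in_time_lookup(values, key, ts):
    """Return the latest feature value at or before ts, or None if none exists."""
    rows = values.get(key, [])
    times = [t for t, _ in rows]
    i = bisect_right(times, ts)  # number of snapshots with timestamp <= ts
    return rows[i - 1][1] if i > 0 else None

print(point_in_time_lookup(feature_values, "u1", datetime(2023, 11, 15)))  # 10
print(point_in_time_lookup(feature_values, "u1", datetime(2023, 11, 25)))  # 25
print(point_in_time_lookup(feature_values, "u2", datetime(2023, 11, 25)))  # None
```

Looking up values strictly at-or-before each left-side timestamp is what prevents future data from leaking into training rows.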

@@ -370,7 +370,7 @@ Using chronon for your feature engineering work simplifies and improves your ML
4. Chronon exposes easy endpoints for feature fetching.
5. Consistency is guaranteed and measurable.

For a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/main?tab=readme-ov-file#benefits-of-chronon-over-other-approaches).


# Benefits of Chronon over other approaches
@@ -417,7 +417,7 @@ With Chronon you can use any data available in your organization, including ever

# Contributing

We welcome contributions to the Chronon project! Please read [CONTRIBUTING](CONTRIBUTING.md) for details.

# Support

