
Is it possible to do what this project does, but for S3? #10

Open
rodfsoares opened this issue Aug 22, 2023 · 0 comments

Hello,

Apologies if this issue is a bit out of scope, but I feel like you guys might have some valuable input.

I've been testing out the Profile Sync feature and studying your DBT project (really cool btw, I had no idea this kind of templating/macroing tool existed for SQL!).

Although we really like the convenience of Profile Sync, it unfortunately presents a major issue for us: we have to pay for a dedicated Redshift cluster (or other data-warehouse storage) to host a full copy of the "profile_traits" table (plus the other synced tables), and then sync that materialised data to our S3 bucket.

Our ultimate goal is really to have our Profiles on our S3 Data Lake, regardless of which Destination or feature we use.

Given the above, would it be possible to generate the "profile_traits" table (or some sort of equivalent) based on the data we get from the AWS S3 destination instead?

Here at Gympass, we would really love it if we could do something like this:

  1. Use the AWS S3 destination to get the raw event data (i.e. "Identifies" and "Tracks" events, in JSON format) into our raw S3 bucket.

  2. Use Apache Hive to fit that data into an SQL schema (e.g. "identifies" and "tracks" tables); I've put a rough sketch of what I mean right after this list.

  3. Run an Airflow job to generate "profile_traits" and the required intermediate tables (e.g. "id_graph_updates").
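
For concreteness, here is roughly what I imagine for step 2. Everything in this sketch is hypothetical: the bucket path, the column set, and the flattened MAP<STRING, STRING> representation of traits are just placeholders for whatever the real payloads dictate.

```sql
-- Hypothetical step 2: expose the raw "Identify" JSON written by the
-- AWS S3 destination as a Hive table. The bucket path and column set
-- are placeholders, and the mapping from Segment's camelCase keys
-- (userId, anonymousId, ...) to these snake_case columns is elided.
CREATE EXTERNAL TABLE IF NOT EXISTS identifies (
  message_id   STRING,
  user_id      STRING,
  anonymous_id STRING,
  received_at  STRING,              -- ISO-8601 timestamp, kept as text
  traits       MAP<STRING, STRING>  -- flattened traits payload
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://our-raw-bucket/segment/identifies/';
```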

The big question here is really what SQL transformations we need to perform in step 3. I've been studying the DBT models in this repo to try to answer this myself, but I've run into a problem: the "id_graph_updates" table is always required (it is the source of "id_graph", which is always used to materialise "profile_traits", even in the older versions of this repo), yet it is only provided by Profile Sync, not by the AWS S3 destination.
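
The closest I've come to a substitute is a one-hop approximation over the hypothetical "identifies" table sketched above. I realise full identity resolution is a connected-components problem over all (anonymous_id, user_id) edges, which normally needs an iterative job; this single pass only collapses anonymous_ids that were directly tied to a user_id:

```sql
-- One-hop id_graph substitute (hypothetical; not the repo's model):
-- map each anonymous_id to the most recently observed user_id.
CREATE TABLE id_graph_lite AS
SELECT anonymous_id, user_id AS canonical_user_id
FROM (
  SELECT
    anonymous_id,
    user_id,
    ROW_NUMBER() OVER (
      PARTITION BY anonymous_id
      ORDER BY received_at DESC   -- ISO-8601 strings sort chronologically
    ) AS rn
  FROM identifies
  WHERE user_id IS NOT NULL
    AND anonymous_id IS NOT NULL
) ranked
WHERE rn = 1;
```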

How would you guys go about it? Have you faced a similar challenge while developing this project? Do you happen to already have some golden SQL queries at hand to reconstruct Profiles based on raw "Identify" data?
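
For reference, the naive rollup I'd otherwise reach for, built on the two hypothetical sketches above, keeps the most recent value of each trait key per canonical user (and, among other simplifications, drops events that carry a user_id but no anonymous_id):

```sql
-- Naive profile_traits substitute (hypothetical): latest value of each
-- trait key per canonical user, built on the sketches above.
CREATE TABLE profile_traits_lite AS
SELECT canonical_user_id, trait_key, trait_value
FROM (
  SELECT
    g.canonical_user_id,
    e.trait_key,
    e.trait_value,
    ROW_NUMBER() OVER (
      PARTITION BY g.canonical_user_id, e.trait_key
      ORDER BY e.received_at DESC
    ) AS rn
  FROM (
    -- flatten the traits map into one row per (event, trait)
    SELECT i.anonymous_id, i.received_at, t.trait_key, t.trait_value
    FROM identifies i
    LATERAL VIEW EXPLODE(i.traits) t AS trait_key, trait_value
  ) e
  JOIN id_graph_lite g
    ON e.anonymous_id = g.anonymous_id
) ranked
WHERE rn = 1;
```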

I'm a bit of a noob to data engineering and data pipelines, so apologies in advance for any obvious questions 😅

Thank you for your time, any input you have will be highly appreciated!

Cheers,
Rodrigo
