
Is it possible to do what this project does, but for S3? #10

Open
rodfsoares opened this issue Aug 22, 2023 · 0 comments

Hello,

Apologies if this issue is a bit out of scope, but I feel like you guys might have some valuable input.

I've been testing out the Profile Sync feature and studying your DBT project (really cool btw, I had no idea this kind of templating/macroing tool existed for SQL!).

Although we really like the convenience of Profile Sync, it unfortunately presents a major issue for us: we have to pay for a dedicated Redshift cluster (or other data-warehouse storage) to host a full copy of the "profile_traits" table (plus the other synced tables), and then sync that materialised data to our S3 bucket.

Our ultimate goal is really to have our Profiles on our S3 Data Lake, regardless of which Destination or feature we use.

Given the above, would it be possible to generate the "profile_traits" table (or some sort of equivalent) based on the data we get from the AWS S3 destination instead?

Here at Gympass, we would really love it if we could do something like this:

  1. Use the AWS S3 destination to get the raw event data (i.e. "Identifies" and "Tracks" events, in JSON format) into our raw S3 bucket.

  2. Use Apache Hive to fit that data into an SQL schema (e.g. "identifies" and "tracks" tables); I've put a rough sketch of what I mean right after this list.

  3. Run an Airflow job to generate "profile_traits" and the required intermediate tables (e.g. "id_graph_updates").
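
For concreteness, here is roughly what I imagine for step 2. Everything in this sketch is hypothetical: the bucket path, the column set, and the flattened MAP<STRING, STRING> representation of traits are just placeholders for whatever the real payloads dictate.

```sql
-- Hypothetical step 2: expose the raw "Identify" JSON written by the
-- AWS S3 destination as a Hive table. The bucket path and column set
-- are placeholders, and the mapping from Segment's camelCase keys
-- (userId, anonymousId, ...) to these snake_case columns is elided.
CREATE EXTERNAL TABLE IF NOT EXISTS identifies (
  message_id   STRING,
  user_id      STRING,
  anonymous_id STRING,
  received_at  STRING,              -- ISO-8601 timestamp, kept as text
  traits       MAP<STRING, STRING>  -- flattened traits payload
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://our-raw-bucket/segment/identifies/';
```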

The big question here is really what SQL transformations we need to perform in step 3. I've been studying the DBT models in this repo to try to answer this myself, but I've run into a problem: the "id_graph_updates" table is always required (it is the source of "id_graph", which is always used to materialise "profile_traits", even in the older versions of this repo), yet it is only provided by Profile Sync, not by the AWS S3 destination.
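
The closest I've come to a substitute is a one-hop approximation over the hypothetical "identifies" table sketched above. I realise full identity resolution is a connected-components problem over all (anonymous_id, user_id) edges, which normally needs an iterative job; this single pass only collapses anonymous_ids that were directly tied to a user_id:

```sql
-- One-hop id_graph substitute (hypothetical; not the repo's model):
-- map each anonymous_id to the most recently observed user_id.
CREATE TABLE id_graph_lite AS
SELECT anonymous_id, user_id AS canonical_user_id
FROM (
  SELECT
    anonymous_id,
    user_id,
    ROW_NUMBER() OVER (
      PARTITION BY anonymous_id
      ORDER BY received_at DESC   -- ISO-8601 strings sort chronologically
    ) AS rn
  FROM identifies
  WHERE user_id IS NOT NULL
    AND anonymous_id IS NOT NULL
) ranked
WHERE rn = 1;
```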

How would you guys go about it? Have you faced a similar challenge while developing this project? Do you happen to already have some golden SQL queries at hand to reconstruct Profiles based on raw "Identify" data?
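
For reference, the naive rollup I'd otherwise reach for, built on the two hypothetical sketches above, keeps the most recent value of each trait key per canonical user (and, among other simplifications, drops events that carry a user_id but no anonymous_id):

```sql
-- Naive profile_traits substitute (hypothetical): latest value of each
-- trait key per canonical user, built on the sketches above.
CREATE TABLE profile_traits_lite AS
SELECT canonical_user_id, trait_key, trait_value
FROM (
  SELECT
    g.canonical_user_id,
    e.trait_key,
    e.trait_value,
    ROW_NUMBER() OVER (
      PARTITION BY g.canonical_user_id, e.trait_key
      ORDER BY e.received_at DESC
    ) AS rn
  FROM (
    -- flatten the traits map into one row per (event, trait)
    SELECT i.anonymous_id, i.received_at, t.trait_key, t.trait_value
    FROM identifies i
    LATERAL VIEW EXPLODE(i.traits) t AS trait_key, trait_value
  ) e
  JOIN id_graph_lite g
    ON e.anonymous_id = g.anonymous_id
) ranked
WHERE rn = 1;
```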

I'm a bit of a noob to data engineering and data pipelines, so apologies in advance for any obvious questions 😅

Thank you for your time, any input you have will be highly appreciated!

Cheers,
Rodrigo
