Apologies if this issue is a bit out of scope, but I feel like you guys might have some valuable input.
I've been testing out the Profile Sync feature and studying your DBT project (really cool btw, I had no idea this kind of templating/macroing tool existed for SQL!).
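For anyone else reading who hadn't seen dbt before, the part that surprised me is that you write Jinja directly inside the SQL. A trivial made-up model just to illustrate (the source and variable names here are placeholders, not anything from this repo):

```sql
-- models/identifies_recent.sql -- made-up example, only to show the templating
select *
from {{ source('segment', 'identifies') }}  -- resolves to the configured schema.table
where received_at >= dateadd(day, -{{ var('lookback_days', 7) }}, current_date)
```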
Although we really like the convenience of Profile Sync, it unfortunately presents a major issue for us: we have to pay for a dedicated Redshift cluster (or other Data Warehouse) just to host a full copy of "profile_traits" (plus the other tables), and then sync that materialised data to our S3 bucket.
Our ultimate goal is really to have our Profiles on our S3 Data Lake, regardless of which Destination or feature we use.
Given the above: would it be possible to generate the "profile_traits" table (or some sort of equivalent) based on the data we get from the AWS S3 destination instead?
Here at Gympass, we would really love it if we could do something like this:
1. Use the AWS S3 destination to get the raw event data (i.e. "Identifies" and "Tracks" events, in JSON format) into our raw S3 bucket.
2. Use Apache Hive to fit that data into an SQL schema (e.g. "identifies" and "tracks" tables); a rough sketch of the kind of DDL I have in mind follows this list.
3. Run an Airflow job to generate "profile_traits" and the required intermediate tables (e.g. "id_graph_updates").
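For step 2, this is roughly what I picture; the database name, bucket path and column list are placeholders, and I'm flattening the traits into a map rather than typed columns:

```sql
-- Rough sketch for step 2: expose the raw JSON "identifies" events written by
-- the S3 destination as a Hive table. Names and paths are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS segment_raw.identifies (
  message_id   STRING,
  anonymous_id STRING,
  user_id      STRING,
  received_at  STRING,              -- ISO-8601 string; kept as STRING to avoid SerDe parsing issues
  traits       MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://our-raw-bucket/segment/identifies/';
```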
The big question here is really which SQL transformations we need to perform in step 3. I've been studying the DBT models in this repo to try to answer this myself, but I've run into a problem: the "id_graph_updates" table is always required (it is the source of "id_graph", which is always used to materialize "profile_traits", even in the older versions of this repo), yet it is only provided by Profile Sync, not by the AWS S3 destination.
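To make the question concrete, the naive stand-in for "id_graph" that I can imagine is simply mapping each anonymous_id to the user_id it was most recently identified with; I assume the real identity resolution behind Profile Sync handles merged users and multiple external IDs, which this ignores:

```sql
-- Naive stand-in for "id_graph": latest user_id seen per anonymous_id.
-- Ignores merged users, multiple external IDs, etc.
CREATE TABLE id_graph AS
SELECT anonymous_id,
       user_id AS canonical_user_id
FROM (
  SELECT anonymous_id,
         user_id,
         ROW_NUMBER() OVER (PARTITION BY anonymous_id
                            ORDER BY received_at DESC) AS rn
  FROM identifies
  WHERE user_id IS NOT NULL
    AND anonymous_id IS NOT NULL
) latest
WHERE rn = 1;
```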
How would you guys go about it? Did you face a similar challenge while developing this project? Do you happen to already have some golden SQL queries at hand to reconstruct Profiles from raw "Identify" data?
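The equally naive "profile_traits" I can picture on top of that would just take the latest "Identify" per resolved user (rather than the last non-null value per individual trait, which I suspect is what the real models do):

```sql
-- Naive stand-in for "profile_traits": the most recent identify row per
-- resolved user. Individual trait columns would come from the traits map.
CREATE TABLE profile_traits AS
SELECT user_id,
       received_at AS last_updated_at,
       traits
FROM (
  SELECT COALESCE(g.canonical_user_id, i.user_id) AS user_id,
         i.received_at,
         i.traits,
         ROW_NUMBER() OVER (PARTITION BY COALESCE(g.canonical_user_id, i.user_id)
                            ORDER BY i.received_at DESC) AS rn
  FROM identifies i
  LEFT JOIN id_graph g
    ON i.anonymous_id = g.anonymous_id
) ranked
WHERE rn = 1;
```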
I'm a bit of a noob to data engineering and data pipelines, so apologies in advance for any obvious questions 😅
Thank you for your time, any input you have will be highly appreciated!
Cheers,
Rodrigo