Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Squash messages to one table for slack #349

Merged
merged 8 commits into from
Mar 1, 2024
Merged

Conversation

VioletM
Copy link
Contributor

@VioletM VioletM commented Feb 6, 2024

Tell us what you do here

  • implementing verified source (please link a relevant issue labeled as verified source)
  • fixing a bug (please link a relevant bug report)
  • improving, documenting, or customizing an existing source (please link an issue or describe below)
    • Now all messages are extracted into one table instead of a table per channel.
    • Add replies resource. Now replies are also extracted by slack source
  • anything else (please link an issue or describe below)

Now all messages are in one table
Add replies resource
sources/slack/__init__.py Outdated Show resolved Hide resolved
sources/slack/__init__.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@AstrakhantsevaAA AstrakhantsevaAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work!

@AstrakhantsevaAA AstrakhantsevaAA self-requested a review February 7, 2024 10:28
@AstrakhantsevaAA AstrakhantsevaAA dismissed their stale review February 7, 2024 10:28

I just wanted to comment

@VioletM VioletM force-pushed the vmishechk/slack_update branch from 7f4cd66 to 1af7894 Compare February 13, 2024 16:59
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really high quality code! also thanks for all the tests!

my comments:

  1. current code to get replies will work with append write disposition. but in practice you'll probably won't see most of the replies - replies can be added to messages after we acquire them. those replies we'll never see.
    our solution is to always get last 7 or 14 days of messages by setting the start and end dates, in that case incremental loading is disables and we get the full range.

our code however sets write disposition to append. maybe it is a good default but imo we should change it to merge when end date is set.

you could write a test for that. try to run the pipeline with an explicit range. current code will duplicate the messages

  1. this code will have a problem when running on Airflow - similar to the problem we had on sql_database. in the @dlt.source function we fetch channels list from slack. on Airflow this will happen every 30 seconds when running the DAG. maybe this is not a big problem? in sql_database it was solved for a case where table names are provided by the user (we are fetching table properties in the resource function - which happens at run time)

sources/slack/__init__.py Show resolved Hide resolved
yield messages_channel | dlt.transformer(
get_thread_replies,
name=channel_name + "_replies",
table_name=partial(table_name_func, channel_name + "_replies"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_name_func needs type and/or subtype to be present. maybe you should add them to the context of get_pages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These properties are already present in page_data.

For example, so far I've seen only thread_broadcast subtype in replies and current source generates separate table channel_name_replies_thread_broadcast.

I've added get("type", "") though, in case there is some unexpected data.

Screenshot 2024-02-19 at 12 37 18

Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above

@rudolfix rudolfix self-assigned this Feb 26, 2024
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks for all the tests!
we should deploy this with our slack to see how our moving merge window works. ping me with the deployment script (maybe we add it to the demo pipeline)

@VioletM VioletM merged commit 665b2aa into master Mar 1, 2024
14 checks passed
@rudolfix rudolfix deleted the vmishechk/slack_update branch April 10, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants