Added extra option to add readOnly thrift HMS uri #308

Merged
merged 6 commits into from
Feb 8, 2024

Conversation

patduin
Contributor

@patduin patduin commented Jan 23, 2024

TL;DR: Split traffic based on the HMS API method being called, e.g. getTable will go to a readOnly HMS and alterTable will go to the readWrite HMS.

The problem addressed here is running Waggle Dance (WD) at scale. Our company generally deploys Waggle Dance as part of an Apiary data lake: https://github.com/ExpediaGroup/apiary-data-lake.
This involves deploying separate ReadOnly and ReadWrite Metastores (HMS).
For the primary (local) metastore, Waggle Dance is configured to use the ReadWrite instance, which connects to a ReadWrite RDS backend. This means all traffic, both reads and writes, ends up on our ReadWrite RDS instance. This PR splits that traffic and moves read traffic to the ReadOnly instance.
The benefits would be:

  • ReadOnly RDS instances can easily be scaled to handle more load.
  • Traffic is redirected automatically, without user configuration changes. Lots of ETL jobs do both reads and writes as part of their workflow, and it has proven difficult for users to switch fully to a ReadOnly instance. This PR makes the decision for them.
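
For reviewers who want a picture of the idea, here is a minimal, self-contained sketch of method-based routing using a JDK dynamic proxy. It is not the code in this PR: the `MetastoreCalls` interface, the `split` and `named` helpers, and the prefix rule in `isReadOnly` are all hypothetical stand-ins, and the real Thrift interface has many more methods.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.List;

public class ReadWriteSplitSketch {

  /** Hypothetical stand-in for the Thrift metastore interface; not the real Hive API. */
  public interface MetastoreCalls {
    String get_table(String db, String table);

    String alter_table(String db, String table);
  }

  // Assumption for this sketch: calls with these prefixes never modify metadata.
  private static final List<String> READ_ONLY_PREFIXES = List.of("get_", "list_", "show_");

  static boolean isReadOnly(String methodName) {
    return READ_ONLY_PREFIXES.stream().anyMatch(methodName::startsWith);
  }

  /** Returns a client that sends each call to the read-only or read-write delegate. */
  static MetastoreCalls split(MetastoreCalls readWrite, MetastoreCalls readOnly) {
    InvocationHandler handler = (proxy, method, args) -> {
      // Anything not recognised as read-only falls back to the read-write delegate.
      MetastoreCalls target = isReadOnly(method.getName()) ? readOnly : readWrite;
      try {
        return method.invoke(target, args);
      } catch (InvocationTargetException e) {
        // Rethrow the delegate's own exception rather than the reflection wrapper.
        throw e.getCause();
      }
    };
    return (MetastoreCalls) Proxy.newProxyInstance(
        MetastoreCalls.class.getClassLoader(), new Class<?>[] {MetastoreCalls.class}, handler);
  }

  /** Fake delegate that answers every call with its own name, so routing is visible. */
  static MetastoreCalls named(String name) {
    return new MetastoreCalls() {
      @Override
      public String get_table(String db, String table) {
        return name;
      }

      @Override
      public String alter_table(String db, String table) {
        return name;
      }
    };
  }

  public static void main(String[] args) {
    MetastoreCalls client = split(named("read-write"), named("read-only"));
    System.out.println(client.get_table("db", "t"));   // prints "read-only"
    System.out.println(client.alter_table("db", "t")); // prints "read-write"
  }
}
```

Defaulting unknown methods to the read-write delegate is the conservative choice in this sketch: a misclassified read only costs some load on the writer, whereas a misclassified write sent to a read-only endpoint would fail.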

…d on read only calls for better spread of traffic
@patduin patduin marked this pull request as ready for review February 8, 2024 17:59
@patduin patduin requested a review from a team as a code owner February 8, 2024 17:59
README.md Outdated
@@ -167,6 +167,7 @@ The table below describes all the available configuration values for Waggle Danc
| `primary-meta-store.hive-metastore-filter-hook` | No | Name of the class which implements the `MetaStoreFilterHook` interface from Hive. This allows a metastore filter hook to be applied to the corresponding Hive metastore calls. Can be configured with the `configuration-properties` specified in the `waggle-dance-server.yml` configuration. They will be added in the HiveConf object that is given to the constructor of the `MetaStoreFilterHook` implementation you provide. |
| `primary-meta-store.database-name-mapping` | No | BiDirectional Map of database names and mapped name, where key=`<database name as known in the primary metastore>` and value=`<name that should be shown to a client>`. See the [Database Name Mapping](#database-name-mapping) section.|
| `primary-meta-store.glue-config` | No | Can be used instead of `remote-meta-store-uris` to federate to an AWS Glue Catalog ([AWS Glue](https://docs.aws.amazon.com/glue/index.html). See the [Federate to AWS Glue Catalog](#federate-to-aws-glue-catalog) section.|
| `primary-meta-store.read-only-remote-meta-store-uris` | No | Can be used to configure an extra read-only endpoint for the primary Metastore. This is an optimization if your environment runs separate Metastore endpoints and traffic needs to be differted efficiently. Waggle Dance will direct traffic to the read-write or read-only endpoints based on the call being done. For instance `get_table` will be a read-only call but `alter_table` will be forwarded to the read-write Metastore.|

Typo: differentiated or diverted?

I vote for diverted

Contributor Author

diverted 🤦

You gave us a mashup of both! Haha.

if (metaStore.getReadOnlyRemoteMetaStoreUris() != null) {
CloseableThriftHiveMetastoreIface readWrite = newHiveInstance(metaStore, name, metaStore.getRemoteMetaStoreUris(),
properties);
CloseableThriftHiveMetastoreIface readOnly = newHiveInstance(metaStore, name+"_ro",

Space missing before and after +.

Contributor Author

formatted

@mroark1m mroark1m self-requested a review February 8, 2024 19:19
@patduin patduin merged commit 19f57a5 into main Feb 8, 2024
4 checks passed
@patduin patduin deleted the feature/traffic_switch branch February 8, 2024 20:45
flaming-archer added a commit to flaming-archer/waggle-dance that referenced this pull request Jun 26, 2024
4 participants