docs: merge cli guide (datahub-project#10464)
Co-authored-by: Harshal Sheth <[email protected]>
1 parent 6307eec, commit 49d1233. Showing 5 changed files with 75 additions and 142 deletions.
docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md: 0 additions, 116 deletions (file deleted).

# CLI Ingestion

Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion](../docs/components.md#ingestion-framework) framework. The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata.

## Installing the CLI

Make sure you have installed the DataHub CLI before following this guide.

:::note Required Python Version
Installing the DataHub CLI requires Python 3.8+.
:::

Run the following commands in your terminal:

```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub

# validate that the install was successful
datahub version

# if you see "command not found", try running this instead:
python3 -m datahub version
```

Your command line should print the installed DataHub version upon executing these commands successfully.

Check out the [CLI Installation Guide](../docs/cli.md#installation) for more installation options and troubleshooting tips.

After that, install the required plugins for ingestion.

## Installing Connector Plugins

Our CLI follows a plugin architecture: connectors for different data sources must be installed individually.
For a list of all supported data sources, see [the open source docs](../docs/cli.md#sources).
Once you've found the connectors you care about, install them using `pip install`.
For example, to install the required `datahub-rest` sink plugin along with the `mysql` connector, run:

```shell
pip install 'acryl-datahub[datahub-rest]'    # install the required plugin
pip install --upgrade 'acryl-datahub[mysql]'
```

Check out the [alternative installation options](../docs/cli.md#alternate-installation-options) for more reference.
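
To confirm which source and sink plugins are active in your environment, the CLI can list them. A quick check (output varies by CLI version):

```shell
# list the plugins visible to the current datahub CLI install
datahub check plugins
```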

## Configuring a Recipe

Create a [Recipe](recipe_overview.md) YAML file that defines the source and sink for metadata. The complete example below reads from MySQL and writes into a DataHub instance:

```yaml
# example-recipe.yml

# MySQL source configuration
source:
  type: mysql
  config:
    username: root
    password: password
    host_port: localhost:3306

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: <Your API key>
```

The **source** configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access its metadata. To see the configurations required for each supported source, refer to the [Sources](source_overview.md) documentation.

The **sink** configuration block defines where to push metadata into. Each sink type requires specific configurations, which are detailed in the [Sinks](sink_overview.md) documentation. To configure your instance of DataHub as the destination for ingestion, set the `server` field of your recipe to your Acryl instance's domain suffixed by the path `/gms`, as shown above.

For more information and examples on configuring recipes, please refer to [Recipes](recipe_overview.md).
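
Recipes also support environment-variable expansion, which helps keep credentials such as database passwords and API tokens out of the file itself. A minimal sketch, assuming the hypothetical variable names `MYSQL_PASSWORD` and `DATAHUB_API_KEY` are exported in your shell:

```yaml
# example-recipe.yml, with secrets resolved from the environment
source:
  type: mysql
  config:
    username: root
    password: "${MYSQL_PASSWORD}"
    host_port: localhost:3306

sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: "${DATAHUB_API_KEY}"
```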

### Using Recipes with Authentication

In Acryl DataHub deployments, only the `datahub-rest` sink is supported, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are:

1. **server**: the location of the REST API exposed by your instance of DataHub
2. **token**: a unique API key used to authenticate requests to your instance's REST API

The token can be retrieved by logging in as an admin: go to the Settings page and generate a Personal Access Token with your desired expiration date.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/home-(1).png"/>
</p>

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/settings.png"/>
</p>

:::info Secure Your API Key
Please keep your API key secure and avoid sharing it.
If you are on Acryl Cloud and your key is compromised for any reason, please reach out to the Acryl team at [email protected].
:::
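
If your recipe reads secrets from the environment, as in the sketch above, export those variables before invoking the CLI so the `${...}` placeholders resolve at runtime (variable names are illustrative):

```shell
# set secrets for the current shell session only
export MYSQL_PASSWORD='<your database password>'
export DATAHUB_API_KEY='<Your API key>'
```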

## Ingesting Metadata

The final step is invoking the DataHub CLI to ingest metadata based on your recipe configuration file. To do so, run `datahub ingest` with a pointer to your YAML recipe file:

```shell
datahub ingest -c <path/to/recipe.yml>
```
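
Before a full run, you may want to validate the recipe or sample its output. Recent CLI versions expose flags for this; confirm the exact names with `datahub ingest --help` for your version:

```shell
# parse and validate the recipe without writing metadata to the sink
datahub ingest -c <path/to/recipe.yml> --dry-run

# ingest only a small sample of workunits to sanity-check the output
datahub ingest -c <path/to/recipe.yml> --preview
```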

## Scheduling Ingestion

Ingestion can either be run ad hoc by a system administrator or scheduled for repeated execution. Most commonly, ingestion runs on a daily cadence.
To schedule your ingestion job, we recommend using a job scheduler like [Apache Airflow](https://airflow.apache.org/). For simpler deployments, a cron job on an always-up machine can also work.
Note that each source system requires a separate recipe file. This allows you to schedule ingestion from different sources independently or together.
Learn more about scheduling ingestion in the [Scheduling Ingestion Guide](/metadata-ingestion/schedule_docs/intro.md).
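
As a sketch, a crontab entry that runs the MySQL recipe daily at 05:00 might look like the following (paths are illustrative; make sure `datahub` is on the cron user's PATH):

```shell
# run ingestion every day at 05:00 and append logs for later inspection
0 5 * * * datahub ingest -c /opt/datahub/recipes/example-recipe.yml >> /var/log/datahub-ingest.log 2>&1
```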

## Reference

Please refer to the following pages for advanced guides on CLI ingestion.

- [UI Ingestion Guide](../docs/ui-ingestion.md)

:::tip Compatibility
The DataHub server uses a 3-digit versioning scheme, while the CLI uses a 4-digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version. We do this because CLI releases happen at a much higher frequency than server releases, usually every few days versus twice a month.

For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.
:::
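
For instance, to match a 0.10.0 server you could pin the CLI to the latest 0.10.0.x patch release using a pip wildcard specifier (version numbers are illustrative):

```shell
# install the newest patch release within the 0.10.0 CLI series
python3 -m pip install --upgrade 'acryl-datahub==0.10.0.*'
datahub version
```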