Implement a basic generic API client #313

Merged
merged 124 commits into from
Mar 22, 2024
7d65701
Implement a basic generic API client; convert some sources to use thi…
burnash Dec 29, 2023
f7af81b
Move the paginaton loop into APIClient
burnash Jan 23, 2024
4bfdc13
Factor out common code
burnash Jan 23, 2024
3ce56c1
Refactor common code
burnash Jan 24, 2024
ce784d4
add generic rest source and an example pipeline
burnash Jan 31, 2024
8d7d0bd
Restructure the REST client add pagination detector
burnash Feb 5, 2024
8bde92a
fix the paginator detector
burnash Feb 5, 2024
4b686a0
Accept paginator instance
burnash Feb 5, 2024
2e297e2
Add Offset paginator
burnash Feb 5, 2024
f5db3aa
Add comments
burnash Feb 5, 2024
650c3ca
Factor out resources
burnash Feb 5, 2024
e61e304
Add Literal
burnash Feb 5, 2024
5e63d33
Remove the example
burnash Feb 5, 2024
d55ae29
Add logging
burnash Feb 6, 2024
01fe721
Handle depended resources
burnash Feb 7, 2024
8138403
Fix the bug with duplication of nested sources
burnash Feb 8, 2024
ec689fe
Add an alternative version that uses classes
burnash Feb 8, 2024
afce8d3
Rearrange config
burnash Feb 8, 2024
3fe3af7
REST API: support all authentication methods (#354)
willi-mueller Feb 15, 2024
b2e7cec
Generic API client: include parent fields in child resource (#355)
willi-mueller Feb 16, 2024
0ea0edb
Resource based config
burnash Feb 14, 2024
16c5a7a
Receive a custom Session instance
burnash Feb 16, 2024
c6015fe
Include data from parent resource in child resource: ported to a new …
burnash Feb 19, 2024
d4a3160
Rest API: Ends pagination if next page path is not in response.json()…
willi-mueller Feb 19, 2024
9f956ba
Allow specification of SinglePagePaginator and refactors redundancy (…
willi-mueller Feb 20, 2024
7063337
Use the resource name as an endpoint path if path is missing
burnash Feb 20, 2024
1e1e676
[REST Source] renames default_paginator argument to paginator (#367)
willi-mueller Feb 22, 2024
884120b
Remove the legacy version
burnash Feb 21, 2024
d5eaee1
Add `records_key` to `SinglePagePaginator`
burnash Feb 22, 2024
2f580a3
[REST Source] completes renaming of default_paginator to paginator (#…
willi-mueller Feb 22, 2024
168b11a
Add tests and pagination
burnash Feb 25, 2024
e13cff2
Temporary disable paginator type check
burnash Feb 26, 2024
f14f539
[REST source] test case for dependent resource (#371)
willi-mueller Feb 26, 2024
10bd716
Remove comments
burnash Feb 26, 2024
0b31301
Reuse MOCK_BASE_URL for all endpoints
burnash Feb 27, 2024
65c6617
Rename the config container
burnash Feb 27, 2024
a554efa
Add tests for valid source configurations
burnash Feb 27, 2024
e6e6927
Add Flask-style paginaton
burnash Feb 27, 2024
06a054e
[REST API source] adds function to check connection (#357)
willi-mueller Feb 27, 2024
726b204
[REST Source] allow skipping http errors (#365)
willi-mueller Feb 27, 2024
0b99ba5
added the possibility to pass HTTPBasicAuth objects (#377)
francescomucio Feb 27, 2024
0631c98
Factor out typings
burnash Feb 27, 2024
d1d25f3
Add response_actions to enable skipping responses by status code or c…
burnash Feb 28, 2024
76710ea
Move records extractor out of the paginator class
burnash Feb 28, 2024
f3ea829
[REST] Detailed error handler logging (#383)
willi-mueller Feb 29, 2024
16cb89a
Fixes records detection for header links paginator
burnash Feb 29, 2024
987fdfa
[REST source] header_links can extract from responses without a recor…
willi-mueller Feb 29, 2024
71b4682
[REST source] fixes deprecation warning (#380)
willi-mueller Mar 1, 2024
eeea3a8
Use update_dict_nested in place of deep_merge
burnash Mar 1, 2024
4d01f35
Update the lockfile
burnash Mar 1, 2024
d732976
Add requirements.txt
burnash Mar 1, 2024
ac39f62
Upgrade dlt version
burnash Mar 2, 2024
14748a4
Rename records_path to data_selector
burnash Mar 4, 2024
39af289
Mutate request objects in paginators
burnash Mar 4, 2024
3d9c87e
Merge branch 'master' into enh/api_helper
burnash Mar 4, 2024
12b3726
Regenerate lock
burnash Mar 4, 2024
69c1300
Remove `request_client` param from RESTClient; set `raise_for_status`…
burnash Mar 5, 2024
79030f8
Pass all incremental params from config
burnash Mar 5, 2024
46e3385
Refactor to argument unpacking
burnash Mar 5, 2024
4b58b7b
Add more auth classes
burnash Mar 5, 2024
13e21b8
Factor out records extractor logic
burnash Mar 6, 2024
efd0d80
Add tests for detectors
burnash Mar 6, 2024
dbe1b65
Remove UnspecifiedPaginator
burnash Mar 6, 2024
fefd704
[REST CLIENT] alt response extractor (#396)
rudolfix Mar 6, 2024
c535088
makes openapi friendly auth (#397)
rudolfix Mar 6, 2024
5a1f3b5
Bring detect_paginator back
burnash Mar 6, 2024
05e899e
Fix test case for nested key (next.url); format code
burnash Mar 6, 2024
0c2ddcf
Revert Notion source
burnash Mar 6, 2024
45be0cf
Revert Personio and Zendesk
burnash Mar 6, 2024
9ba18a7
Remove an unused file
burnash Mar 6, 2024
8734678
Restore personio settings
burnash Mar 6, 2024
739547c
Restore personio tests and
burnash Mar 6, 2024
4148a69
Add type annotations
burnash Mar 6, 2024
b15f2dd
bumps dlt to 0.4.6
rudolfix Mar 7, 2024
01a08cb
Type fixes and dlt session check
burnash Mar 7, 2024
e2aac86
[REST CLIENT] yields data pages with requests context (#399)
rudolfix Mar 8, 2024
3a67ac8
Fix paginator_config unpacking and RESTClient typing errors
burnash Mar 9, 2024
639a649
Fix more typing errors
burnash Mar 9, 2024
f29655e
Fix E741
burnash Mar 10, 2024
bd19ee2
Fix linting errors
burnash Mar 10, 2024
bdd6feb
Update the lock file
burnash Mar 10, 2024
6ce75e0
Extract build_resource_dependency_graph()
burnash Mar 10, 2024
4bc5340
Factor out create_resources()
burnash Mar 10, 2024
d0a22d9
Use requests hooks to handle response actions
burnash Mar 10, 2024
86eb1ed
Derive the response exception from DltException
burnash Mar 10, 2024
ed81495
Resolve poetry.lock conflict
burnash Mar 10, 2024
3ae0ba5
Fix black check
burnash Mar 10, 2024
2d336e9
Fix lint
burnash Mar 10, 2024
4a90485
Make default token expiration configurable
burnash Mar 10, 2024
55a5924
Add missing http headers
burnash Mar 10, 2024
dad1e96
Refactor paginator creation in RESTClient to use PaginatorFactory
burnash Mar 10, 2024
e39aaf9
Use frozensets
burnash Mar 10, 2024
b2715c9
Remove an unused import
burnash Mar 10, 2024
23b22a5
Update docstrings
burnash Mar 10, 2024
e8cfa32
Accept additional dlt source arguments in `rest_api_source()`
burnash Mar 10, 2024
24ec947
Add a workaround to pass test_dlt_init
burnash Mar 10, 2024
94beb5b
Extend config test with an auth class instance case
burnash Mar 12, 2024
ba46fda
Remove Any from PaginatorType
burnash Mar 12, 2024
82f3357
Upgrade dlt
burnash Mar 12, 2024
d278643
Update lock file
burnash Mar 12, 2024
3141ecc
Remove commented code
burnash Mar 12, 2024
94352ed
Refactor configuration setup into a dedicated module
burnash Mar 13, 2024
70496ec
Move response hooks setup and handling out of RESTClient
burnash Mar 13, 2024
2e3d50a
Remove unused imports
burnash Mar 13, 2024
b6d3794
Fix hooks typing
burnash Mar 14, 2024
3b7a0b6
Rename args of the OffsetPaginator
burnash Mar 20, 2024
b971f9b
Create a RESTClient per resource
burnash Mar 20, 2024
ebff65c
Handle both error statuses and response actions
burnash Mar 20, 2024
4fb4ce1
Initial version of the README.md (#389)
francescomucio Mar 21, 2024
59bba2e
Remove commented code
burnash Mar 20, 2024
390c233
Clean up docstrings
burnash Mar 20, 2024
5ecc426
Remove the useless conditional init for items
burnash Mar 20, 2024
6924233
Fix grammar in the README
burnash Mar 21, 2024
ff5ddd8
Remove UnspecifiedPaginator
burnash Mar 21, 2024
c24a5aa
Format README
burnash Mar 21, 2024
d63e30f
Add handling end_value and end_param
burnash Mar 21, 2024
5cbf863
Use NamedTuple for incremental params
burnash Mar 22, 2024
6b7a891
Move check_connection to utils
burnash Mar 22, 2024
6dab451
Instantiate auth based on type
burnash Mar 22, 2024
32e8327
Merge branch 'master' into enh/api_helper
burnash Mar 22, 2024
559f3a2
Update lock file
burnash Mar 22, 2024
cf8e266
Sthor/api helper updates (#400)
steinitzu Mar 22, 2024
b8e6b5d
Remove unused imports
burnash Mar 22, 2024
e6d927c
Use jsonpath for next_path; remove create_nested_accessor
burnash Mar 22, 2024
1 change: 1 addition & 0 deletions .github/workflows/init.yml
@@ -5,6 +5,7 @@ on:
pull_request:
branches:
- master
- enh/api_helper
workflow_dispatch:

jobs:
1 change: 1 addition & 0 deletions .github/workflows/lint.yml
@@ -5,6 +5,7 @@ on:
pull_request:
branches:
- master
- enh/api_helper
workflow_dispatch:

jobs:
1 change: 1 addition & 0 deletions .github/workflows/test_on_local_destinations.yml
@@ -6,6 +6,7 @@ on:
branches:
- master
- devel
- enh/api_helper
workflow_dispatch:

env:
2 changes: 1 addition & 1 deletion .github/workflows/test_on_local_destinations_forks.yml
@@ -5,7 +5,7 @@ on:
pull_request_target:
branches:
- master
- devel
- enh/api_helper
types:
- opened
- synchronize
17 changes: 1 addition & 16 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -13,6 +13,7 @@ packages = [{include = "sources"}]
[tool.poetry.dependencies]
python = ">=3.8.1,<3.13"
dlt = {version = "0.4.7", allow-prereleases = true, extras = ["redshift", "bigquery", "postgres", "duckdb", "s3", "gs"]}
graphlib-backport = {version = "*", python = "<3.9"}

[tool.poetry.group.dev.dependencies]
mypy = "1.6.1"
218 changes: 218 additions & 0 deletions sources/rest_api/README.md
@@ -0,0 +1,218 @@
# REST API Generic Source
A declarative way to define dlt sources for REST APIs.

## What is this?
> Happy APIs are all alike
>
> \- E. T. Lev Tolstoy, Senior Data Engineer

This is a generic source that you can use to create a dlt source from a REST API using a declarative configuration. The majority of REST APIs behave in a similar way; this dlt source attempts to provide a declarative way to define a dlt source for those APIs.

## How to use it
Let's see what a source for the [Pokemon API](https://pokeapi.co/) would look like:

```python
pokemon_config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
    },
    "resources": [
        "berry",
        "location",
        {
            "name": "pokemon_list",
            "endpoint": "pokemon",
        },
        {
            "name": "pokemon",
            "endpoint": {
                "path": "pokemon/{name}",
                "params": {
                    "name": {
                        "type": "resolve",
                        "resource": "pokemon_list",
                        "field": "name",
                    },
                },
            },
        },
    ],
}

pokemon_source = rest_api_source(pokemon_config)
```
Here's a short summary:
- The `client` node contains the base URL of the endpoints that we want to collect.
- The `resources` correspond to the API endpoints.

We have a couple of simple resources (`berry` and `location`). For them, the API endpoint is also the name of the dlt resource and the name of the destination table. They don't need additional configuration.

The next resource uses some additional configuration. The endpoint `pokemon/` returns a list of pokemon, but it can also be used as `pokemon/{id or name}` to return a single pokemon. In this case, we want the list, so we name the resource `pokemon_list` while the endpoint stays `pokemon/`. We do not specify the name of the destination table, so it will match the resource name.

Finally, the `pokemon` resource. It is a child of `pokemon_list`: for each pokemon in the list, we want to fetch further details. This resource therefore needs a bit more configuration: the endpoint `path` must be explicit, and we have to specify how the value of `name` is resolved from another resource. This tells the generic source that `pokemon` must be queried once for each pokemon in `pokemon_list`.
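To make the `resolve` mechanism concrete, here is a minimal sketch in plain Python. It does not use the source's actual implementation: `expand_child_paths` and the sample records are hypothetical and only illustrate how one request path per parent record could be derived from the config.

```python
# Hypothetical stand-in for the records yielded by `pokemon_list`;
# no network calls are made.
pokemon_list = [{"name": "bulbasaur"}, {"name": "ivysaur"}]

child_endpoint = {
    "path": "pokemon/{name}",
    "params": {
        "name": {"type": "resolve", "resource": "pokemon_list", "field": "name"},
    },
}


def expand_child_paths(endpoint, parent_records):
    """Return one request path per parent record (illustrative helper)."""
    # Map each "resolve" param to the parent field it reads from.
    resolved = {
        param: spec["field"]
        for param, spec in endpoint["params"].items()
        if isinstance(spec, dict) and spec.get("type") == "resolve"
    }
    paths = []
    for record in parent_records:
        values = {param: record[field] for param, field in resolved.items()}
        paths.append(endpoint["path"].format(**values))
    return paths


child_paths = expand_child_paths(child_endpoint, pokemon_list)
print(child_paths)
```

Each entry in `child_paths` corresponds to one request that the child resource would issue.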

## Anatomy of the config object

> **_TIP:_** Import `RESTAPIConfig` from the `rest_api` module to get type hints for the config object.

The config object passed to the REST API Generic Source has three main elements:

```python
my_config: RESTAPIConfig = {
    "client": {
        ...
    },
    "resource_defaults": {
        ...
    },
    "resources": [
        ...
    ],
}
```

`client` contains the configuration to connect to the API's endpoints (e.g., base URL, authentication method, default behavior for the paginator, and more).

`resource_defaults` contains the default values to configure the dlt resources returned by this source.

`resources` contains the configuration for each resource.

A configuration with a narrower scope overrides one with a wider scope:

Resource Configuration > Resource Defaults Configuration > Client Configuration
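The precedence order can be pictured as a nested dictionary merge. The sketch below is an assumption about the general idea, not the source's own code: `merge_nested` is a hypothetical helper, and which keys are actually accepted at each level is not specified here.

```python
def merge_nested(base, override):
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested dicts key by key instead of replacing them whole.
            merged[key] = merge_nested(merged[key], value)
        else:
            merged[key] = value
    return merged


# Illustrative settings only; the key placement is hypothetical.
client_cfg = {"paginator": "header_links"}
defaults_cfg = {"paginator": "json_links", "endpoint": {"params": {"limit": 100}}}
resource_cfg = {"endpoint": {"params": {"limit": 20}}}

# Widest scope first, narrowest scope last.
effective = merge_nested(merge_nested(client_cfg, defaults_cfg), resource_cfg)
print(effective)
```

Applying the widest scope first means each narrower scope only has to state what it changes.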

## Reference

### `client`

#### `auth` [optional]
Use the `auth` property to pass a token or an `HTTPBasicAuth` object for more complex authentication methods. Here are some practical examples:

1. Simple token (read from the `.dlt/secrets.toml` file):
```python
my_api_config: RESTAPIConfig = {
    "client": {
        "base_url": "https://my_api.com/api/v1/",
        "auth": {
            "token": dlt.secrets["sources.my_api.access_token"],
        },
    },
    ...
}
```

2. Basic authentication with an `HTTPBasicAuth` object:
```python
import dlt
from requests.auth import HTTPBasicAuth

basic_auth = HTTPBasicAuth(
    dlt.secrets["sources.my_api.api_key"],
    dlt.secrets["sources.my_api.api_secret"],
)

my_api_config: RESTAPIConfig = {
    "client": {
        "base_url": "https://my_api.com/api/v1/",
        "auth": basic_auth,
    },
    ...
}
```

#### `base_url`
The base URL that will be prepended to the endpoints specified in the `resources` objects. Example:

```python
"base_url": "https://my_api.com/api/v1/",
```
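As a rough illustration of "prepended", the sketch below joins a base URL and an endpoint path with exactly one slash between them. `build_url` is a hypothetical helper; the client's real joining logic may differ, for example in how it treats a missing trailing slash.

```python
def build_url(base_url, path):
    """Join a base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


url = build_url("https://my_api.com/api/v1/", "pokemon/bulbasaur")
url_without_slash = build_url("https://my_api.com/api/v1", "/pokemon/bulbasaur")
print(url)
```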

#### `paginator` [optional]
The paginator property specifies the default paginator to be used for the endpoint responses.

Possible paginators are:
| Paginator | String Alias | Note |
| --------- | ------------ | ---- |
| BasePaginator | | |
| HeaderLinkPaginator | `header_links` | |
| JSONResponsePaginator | `json_links` | The pagination metadata is in a node of the JSON response (see example below) |
| SinglePagePaginator | `single_page` | The response will be interpreted as a single-page response, ignoring possible pagination metadata |

Usage example of the `JSONResponsePaginator`, for a response with the URL of the next page located at `paging.next`:
```python
"paginator": JSONResponsePaginator(
    next_key=["paging", "next"]
)
```
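The idea behind this style of pagination can be sketched without any HTTP calls. Everything below is illustrative: `FAKE_PAGES`, `get_nested`, and `iterate_pages` are hypothetical names, and the loop assumes that pagination ends when the next-page key is absent from the response.

```python
# In-memory stand-in for an API; keys are URLs, values are JSON bodies.
FAKE_PAGES = {
    "https://example.com/items": {
        "data": [1, 2],
        "paging": {"next": "https://example.com/items?page=2"},
    },
    "https://example.com/items?page=2": {"data": [3], "paging": {}},
}


def get_nested(data, keys):
    """Follow a list of keys into nested dicts; return None if any is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


def iterate_pages(start_url, next_key):
    """Yield the records of each page until no next URL is found."""
    url = start_url
    while url:
        body = FAKE_PAGES[url]  # a real client would issue an HTTP GET here
        yield body["data"]
        url = get_nested(body, next_key)


pages = list(iterate_pages("https://example.com/items", ["paging", "next"]))
print(pages)
```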


#### `session` [optional]

This property allows you to pass a custom `Session` object.


### `resource_defaults`
This property allows you to pass default properties and behavior to the dlt resources created by the REST API Generic Source. Besides the properties mentioned in this documentation, a resource accepts all the arguments that are usually passed to a [dlt resource](https://dlthub.com/docs/general-usage/resource).

#### `endpoint`
A string indicating the endpoint or an `endpoint` object (see [below](#endpoint-1)).

#### `include_from_parent` [optional]
A list of fields, from the parent resource, which will be included in the resource output.
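The effect can be pictured as copying selected parent fields into each child record. This is only a sketch: `include_parent_fields` is a hypothetical helper, and the `_<parent>_<field>` key naming is an assumption made for illustration, not something this README specifies.

```python
def include_parent_fields(child, parent, parent_name, fields):
    """Return a copy of `child` with the selected parent fields added."""
    enriched = dict(child)
    for field in fields:
        # Assumed naming scheme, chosen here to avoid clashing with child keys.
        enriched[f"_{parent_name}_{field}"] = parent[field]
    return enriched


parent = {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"}
child = {"id": 1, "height": 7}

row = include_parent_fields(child, parent, "pokemon_list", ["name"])
print(row)
```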

#### `name`
The name of the dlt `resource` and the name of the associated table that will be created.

#### `params`
The query parameters for the endpoint URL.

For child resources, you can use values from the parent resource for params. The syntax is the following:

```python
"PARAM_NAME": {
    "type": "resolve",
    "resource": "PARENT_RESOURCE_NAME",
    "field": "PARENT_RESOURCE_FIELD",
},
```

An example of use:
```python
"endpoint": {
    "path": "pokemon/{name}",
    "params": {
        "name": {
            "type": "resolve",
            "resource": "pokemon_list",
            "field": "name",
        },
    },
},
```

#### `path`
The path of the endpoint, relative to `base_url`. To include path parameters, wrap their names in `{}`, for example:
```python
"path": "pokemon/{name}",
```
In case you need to include query parameters, use the [params](#params) property.
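The split between path parameters and query parameters can be sketched as follows. `render_request` is a hypothetical helper that assumes any param named in a `{}` placeholder fills the path and all remaining params become the query string; the source's actual handling may differ.

```python
from string import Formatter
from urllib.parse import urlencode


def render_request(path, params):
    """Fill `{}` placeholders from params; send the rest as a query string."""
    # Names of the placeholders that appear in the path template.
    path_fields = {name for _, name, _, _ in Formatter().parse(path) if name}
    path_values = {k: v for k, v in params.items() if k in path_fields}
    query = {k: v for k, v in params.items() if k not in path_fields}
    rendered = path.format(**path_values)
    return rendered + ("?" + urlencode(query) if query else "")


request_path = render_request("pokemon/{name}", {"name": "bulbasaur", "limit": 10})
print(request_path)
```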


### `resources`
An array of resources. Each resource is a string or a resource object.

Resources whose name matches the endpoint can be plain strings. For example:
```python
"resources": [
    "berry",
    "location",
]
```
Resources whose name differs from the endpoint are defined as objects:
```python
"resources": [
    {
        "name": "pokemon_list",
        "endpoint": "pokemon",
    },
]
```
If the resource name should differ from the name of the created table, pass the `table_name` property as well.

For the other properties, see the [resource_defaults](#resource_defaults) above.
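One way to think about the two forms is that the string shorthand expands to the object form. The sketch below is an assumption about that expansion: `normalize_resource` is a hypothetical helper, and it falls back to the resource name when no endpoint path is given.

```python
def normalize_resource(resource):
    """Expand a string or partial resource into a full resource object."""
    if isinstance(resource, str):
        return {"name": resource, "endpoint": {"path": resource}}
    normalized = dict(resource)
    endpoint = normalized.get("endpoint", normalized["name"])
    if isinstance(endpoint, str):
        endpoint = {"path": endpoint}
    else:
        endpoint = dict(endpoint)
        # Assumed fallback: use the resource name when no path is given.
        endpoint.setdefault("path", normalized["name"])
    normalized["endpoint"] = endpoint
    return normalized


resources = ["berry", {"name": "pokemon_list", "endpoint": "pokemon"}]
normalized = [normalize_resource(r) for r in resources]
print(normalized)
```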