feat: add download_decoder + download_extractor #50

artem1205 · 2024-11-14T14:04:28Z

What

resolving https://github.com/airbytehq/airbyte-internal-issues/issues/10429

Summary by CodeRabbit

New Features
- Introduced download_extractor and download_decoder properties in the asynchronous data retrieval process, enhancing flexibility for record extraction and decoding.
- Added a new ResponseToFileExtractor class for handling responses that require extraction and decompression.
- Implemented a custom CustomRecordExtractor class for improved record extraction from JSON responses.
- Enhanced testing capabilities with a fixture for handling large datasets in the ResponseToFileExtractor tests.
Bug Fixes
- Improved error handling for the download_extractor instantiation to ensure default functionality when not explicitly defined.

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai · 2024-11-14T14:12:52Z

📝 Walkthrough

Walkthrough

The changes in this pull request enhance the AsyncRetriever class and its related components in the airbyte_cdk library. Two new properties, download_extractor and download_decoder, are introduced to facilitate asynchronous data retrieval from URLs. These properties allow for greater flexibility in defining how records are extracted and decoded. The modifications also include updates to the ModelToComponentFactory class to utilize these new properties, ensuring robust error handling and customization in the extraction process.

Changes

File Path	Change Summary
`airbyte_cdk/sources/declarative/declarative_component_schema.yaml`	- Added properties: `download_extractor` and `download_decoder` in `AsyncRetriever` definition.
`airbyte_cdk/sources/declarative/models/declarative_component_schema.py`	- Updated `AsyncRetriever` class to include optional fields: `download_extractor` and `download_decoder`.
	- Invoked `update_forward_refs()` for several classes to resolve forward references.
`airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`	- Modified `create_async_retriever` method to use `download_extractor` and `download_decoder`.
	- Added error handling for `download_extractor` instantiation, defaulting to `DpathExtractor` if not provided.
`airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py`	- Updated `ResponseToFileExtractor` to use `@dataclass` and modified its initialization process.
`airbyte_cdk/sources/declarative/requesters/http_job_repository.py`	- Changed `record_extractor` initialization to use `ResponseToFileExtractor({})`.
`unit_tests/sources/declarative/parsers/test_model_to_component_factory.py`	- Added `CustomRecordExtractor` class and corresponding tests for its instantiation.
`unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py`	- Updated test initialization for `ResponseToFileExtractor` to include an empty dictionary.
`unit_tests/sources/declarative/requesters/test_http_job_repository.py`	- Modified test instantiation of `RecordSelector` to use `ResponseToFileExtractor({})`.

Would you like to explore any specific part of these changes further?

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant AsyncRetriever
    participant ModelToComponentFactory
    participant ResponseToFileExtractor

    User->>ModelToComponentFactory: create_async_retriever(model)
    ModelToComponentFactory->>AsyncRetriever: initialize(download_extractor, download_decoder)
    AsyncRetriever->>ResponseToFileExtractor: create_response_to_file_extractor(parameters)
    ResponseToFileExtractor->>AsyncRetriever: return extractor instance
    AsyncRetriever->>User: return initialized AsyncRetriever

Possibly related PRs

feat(airbyte-cdk): add gzipjson decoder #20: The changes in this PR involve adding the GzipJsonDecoder and updating the AsyncRetriever to include it as a possible decoder type, which directly relates to the modifications made in the main PR regarding the download_decoder property in the AsyncRetriever.

Suggested reviewers

maxi297

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
2031-2042: Consider adding validation for the field_path parameter in the default DpathExtractor.

The code handles both custom and default extractors well, but when creating the default DpathExtractor, we're passing an empty list for field_path. Should we validate or document why an empty field_path is acceptable here? wdyt?

2052-2053: Nice cleanup removing the hardcoded ResponseToFileExtractor!

The replacement of the hardcoded ResponseToFileExtractor with a configurable download_extractor improves flexibility. The commented-out old code could be removed for cleanliness, wdyt?
-                # extractor=ResponseToFileExtractor(),# old one
2043-2055: Consider adding error handling for download_retriever configuration.

The download_retriever setup looks good, but it might be worth adding some validation to ensure the download_requester and record_selector are properly configured for the specific download use case. For instance, we could validate that the extractor's configuration is compatible with the download operation. What are your thoughts on this?
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2575-2584: LGTM! The download_decoder property looks good.

The property supports a comprehensive set of decoder types. What do you think about enhancing the description to mention the supported formats explicitly? Something like:

"Component decoding the download response so records can be extracted. Supports JSON, JSONL, Iterable, XML, and Gzip compressed formats through their respective decoders."

wdyt? 🤔
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1679-1681: Consider adding a title to the download_extractor field for consistency.

I noticed that the download_decoder field includes a title parameter in its Field() definition. Perhaps adding a title to the download_extractor field would enhance readability and maintain consistency across your model definitions. What do you think?

Here's how you might update it:
 download_extractor: Optional[Union[CustomRecordExtractor, DpathExtractor]] = Field(
     None,
     description="Responsible for fetching the records from provided urls.",
+    title="Download Extractor",
 )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 39786d2 and 6304780.

📒 Files selected for processing (3)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1 hunks)

🔇 Additional comments (3)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2026-2030: LGTM! The download_decoder initialization looks good.

The code properly handles both cases where a custom decoder is provided or falls back to a default JsonDecoder. The parameters are correctly passed through.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2516-2520: LGTM! The download_extractor property is well-defined.

The property is properly structured and provides good flexibility by supporting both CustomRecordExtractor and DpathExtractor implementations.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1732-1745: The addition of download_decoder looks great!

You've correctly defined the download_decoder field with appropriate types, description, and title. This addition enhances the functionality of the AsyncRetriever. Well done!

maxi297

Should we add a test to validate that custom decoders are supported?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

…ver-split-decoders

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
2053-2054: Remove commented code

The old implementation reference can be safely removed since it's tracked in version control:
-                # extractor=ResponseToFileExtractor(),# old one
                 extractor=download_extractor,
2034-2038: Fix indentation for better readability

The indentation seems a bit off. Would you like to adjust it to match the surrounding code? wdyt?
-                model=model.download_extractor,
-                config=config,
-                decoder=download_decoder,
-                parameters=model.parameters,
+            model=model.download_extractor,
+            config=config,
+            decoder=download_decoder,
+            parameters=model.parameters,
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
2516-2520: The download_extractor implementation looks good! Would you consider enhancing the documentation?

The implementation correctly allows for both CustomRecordExtractor and DpathExtractor. Perhaps we could make the description more detailed by adding examples or common use cases? For instance, explaining when to use CustomRecordExtractor vs DpathExtractor for URL-based record extraction, wdyt?
       download_extractor:
-        description: Responsible for fetching the records from provided urls.
+        description: |
+          Responsible for fetching the records from provided urls. Use CustomRecordExtractor for complex extraction logic 
+          or DpathExtractor for simple path-based extraction.
+          
+          Example using DpathExtractor:
+            download_extractor:
+              type: DpathExtractor
+              field_path: ["data", "records"]
2575-2584: The download_decoder implementation looks good! Would you consider enhancing the documentation?

The implementation correctly supports all necessary decoder types. Perhaps we could make the description more detailed by explaining when to use each decoder type for downloaded content? For instance, when to use GzipJsonDecoder vs JsonDecoder, wdyt?
       download_decoder:
         title: Download Decoder
-        description: Component decoding the download response so records can be extracted.
+        description: |
+          Component decoding the download response so records can be extracted. Choose the decoder based on the response format:
+          - JsonDecoder: For standard JSON responses
+          - JsonlDecoder: For newline-delimited JSON
+          - GzipJsonDecoder: For gzip-compressed JSON
+          - XmlDecoder: For XML responses
+          - IterableDecoder: For line-by-line processing
+          - CustomDecoder: For custom decoding logic

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6304780 and 43e03fb.

📒 Files selected for processing (2)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1 hunks)

🔇 Additional comments (2)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

2027-2031: LGTM: Clean initialization of download_decoder with fallback

The code properly handles both cases where download_decoder is provided or needs to fall back to JsonDecoder.

2032-2043: Consider validating empty field path

The fallback case creates a DpathExtractor with an empty field path list []. Should we validate this or provide a default path? wdyt?

✅ Verification successful

Let me check the implementation of DpathExtractor to understand if empty paths are handled properly.

Let me try to find the DpathExtractor implementation using ripgrep instead.

Let me check the actual implementation details of the DpathExtractor to see how it handles empty paths.

Let me check the extract method to see how it handles the empty path.

Empty field path is handled correctly, no validation needed

Looking at the implementation, empty field path is a valid use case and is explicitly handled in the extract_records method. When field_path is empty, it returns the entire response body as-is. This behavior is also documented in the class docstring with an example showing empty field path usage. The test cases we found earlier confirm this is an intended behavior.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if DpathExtractor is used with empty paths elsewhere
ast-grep --pattern 'DpathExtractor($$$, field_path=[], $$$)'

Length of output: 1531

Script:

#!/bin/bash
# Check DpathExtractor implementation
ast-grep --pattern 'class DpathExtractor'

Length of output: 43

Script:

#!/bin/bash
# Search for DpathExtractor class definition and its usage
rg "class DpathExtractor" -A 10

Length of output: 2583

Script:

#!/bin/bash
# Get the full implementation of DpathExtractor
rg "class DpathExtractor" -A 50 airbyte_cdk/sources/declarative/extractors/dpath_extractor.py

Length of output: 1909

Script:

#!/bin/bash
# Get the extract method implementation
rg "def extract_records" -A 20 airbyte_cdk/sources/declarative/extractors/dpath_extractor.py

Length of output: 861

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1334-1350: Documentation needs improvement, wdyt?

The ResponseToFileExtractor component's documentation could be enhanced:

The description is incomplete: "Record extractor that downloads a headless"

The title mentions "CSV" but the description doesn't specify file formats

Consider adding:

Supported file formats

When to use this extractor vs others

Example usage

Purpose of needs_decompression

Line range hint 2533-2602: Consider enhancing the AsyncRetriever documentation, what do you think?

The new properties look good, but their documentation could be more helpful:

For download_extractor, consider explaining:

The relationship with urls_extractor

When to use each extractor type

Example configurations

For download_decoder, consider clarifying:

How it differs from the main decoder property

When a separate download decoder is needed

Typical use cases for each decoder type

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 43e03fb and 9c0c45e.

📒 Files selected for processing (4)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (4 hunks)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)

👮 Files not reviewed due to content moderation or server errors (3)

airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (2)

3280-3286: Consider adding error handling for JSON decoding, wdyt?

The extract_records method directly yields from response.json() which could raise a JSONDecodeError if the response contains invalid JSON. Consider adding try-except to handle potential errors gracefully.

def extract_records(
    self,
    response: requests.Response,
) -> Iterable[Mapping[str, Any]]:
-    yield from response.json()
+    try:
+        yield from response.json()
+    except requests.exceptions.JSONDecodeError as e:
+        raise ValueError(f"Failed to decode response as JSON: {str(e)}")

3288-3294: Consider adding more test cases, wdyt?

The test verifies the basic instantiation but could be enhanced to test the actual functionality:

Test with valid JSON response
Test with invalid JSON response
Test with empty response

def test_create_custom_record_extractor():
    # Test instantiation
    definition = {
        "type": "CustomRecordExtractor",
        "class_name": "unit_tests.sources.declarative.parsers.test_model_to_component_factory.CustomRecordExtractor",
    }
    component = factory.create_component(CustomRecordExtractorModel, definition, {})
    assert isinstance(component, CustomRecordExtractor)

    # Test with valid JSON
    mock_response = MagicMock(spec=requests.Response)
    mock_response.json.return_value = [{"id": 1}, {"id": 2}]
    records = list(component.extract_records(mock_response))
    assert records == [{"id": 1}, {"id": 2}]

    # Test with invalid JSON
    mock_response.json.side_effect = requests.exceptions.JSONDecodeError("msg", "doc", 0)
    with pytest.raises(ValueError):
        list(component.extract_records(mock_response))

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9c0c45e and a3d95bb.

📒 Files selected for processing (1)

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (4 hunks)

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

airbyte_cdk/sources/declarative/requesters/http_job_repository.py (1)

45-45: Consider adding a docstring for the config parameter?

Since we're introducing a config parameter to ResponseToFileExtractor, would it be helpful to add documentation about its purpose and expected values? This could help future developers understand the configuration options, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a3d95bb and 7aa10a0.

📒 Files selected for processing (3)

airbyte_cdk/sources/declarative/requesters/http_job_repository.py (1 hunks)
unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py (1 hunks)
unit_tests/sources/declarative/requesters/test_http_job_repository.py (1 hunks)

🔇 Additional comments (4)

unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py (2)

17-17: LGTM! Constructor update is consistent.

The change aligns with the updated ResponseToFileExtractor implementation.

Line range hint 98-98: LGTM! Constructor update is consistent.

The change maintains consistency with the updated implementation.

airbyte_cdk/sources/declarative/requesters/http_job_repository.py (1)

45-45: LGTM! Constructor update is consistent.

The change maintains consistency with the updated ResponseToFileExtractor implementation across the codebase.

unit_tests/sources/declarative/requesters/test_http_job_repository.py (1)

98-98: LGTM! Constructor update is consistent.

The change maintains consistency with the updated implementation.

Signed-off-by: Artem Inzhyyants <[email protected]>

maxi297

My main concern is regarding backward compatibility with source-sendgrid. Once we figure this out, I'll approve

airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py

airbyte_cdk/sources/declarative/declarative_component_schema.yaml

…ver-split-decoders

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1334-1346: Consider enhancing the component description for better clarity.

The description could be more detailed about:

When to use this extractor (e.g., size thresholds)
File lifecycle management (creation, cleanup)
Error handling for disk operations
Performance implications

What do you think about expanding the description like this?

-    description: A record extractor designed for handling large responses that may exceed memory limits (to prevent OOM issues). It downloads a CSV file to disk, reads the data from disk, and deletes the file once it has been fully processed.
+    description: |
+      A record extractor designed for handling large responses that may exceed memory limits (to prevent OOM issues). 
+      
+      Key features:
+      - Downloads response content to a temporary file on disk
+      - Streams records from disk to minimize memory usage
+      - Automatically cleans up temporary files after processing
+      - Suitable for responses larger than available memory
+      
+      Usage considerations:
+      - Recommended for responses > 500MB
+      - Requires sufficient disk space
+      - May have slower processing due to disk I/O

2529-2534: The download_extractor property looks good but could use an example.

The property is well-defined with proper references to available extractors. Consider adding an example to illustrate common usage patterns?

What do you think about adding usage examples like this?

      download_extractor:
        description: Responsible for fetching the records from provided urls.
        anyOf:
          - "$ref": "#/definitions/CustomRecordExtractor"
          - "$ref": "#/definitions/DpathExtractor"
          - "$ref": "#/definitions/ResponseToFileExtractor"
+        examples:
+          - type: ResponseToFileExtractor  # For large file downloads
+          - type: DpathExtractor          # For JSON responses
+            field_path: ["data", "records"]

2589-2598: The download_decoder property looks good but could benefit from usage guidance.

The property supports all standard decoders but could use guidance on when to use each type.

What do you think about adding usage guidance like this?

      download_decoder:
        title: Download Decoder
-        description: Component decoding the download response so records can be extracted.
+        description: |
+          Component decoding the download response so records can be extracted.
+          
+          Common decoder choices:
+          - JsonDecoder: For JSON responses
+          - JsonlDecoder: For newline-delimited JSON files
+          - GzipJsonDecoder: For compressed JSON responses
+          - XmlDecoder: For XML responses
+          - IterableDecoder: For line-by-line processing
        anyOf:
          - "$ref": "#/definitions/CustomDecoder"
          - "$ref": "#/definitions/JsonDecoder"
          - "$ref": "#/definitions/JsonlDecoder"
          - "$ref": "#/definitions/IterableDecoder"
          - "$ref": "#/definitions/XmlDecoder"
          - "$ref": "#/definitions/GzipJsonDecoder"

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ce89a2f and b36933a.

📒 Files selected for processing (4)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (3 hunks)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)

👮 Files not reviewed due to content moderation or server errors (3)

airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

…ver-split-decoders # Conflicts: # airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1454-1460: Consider adding type hints for better code clarity?

The method looks good functionally, but would adding type hints for the return type make it more consistent with other factory methods, wdyt?
    def create_response_to_file_extractor(
        self,
        model: ResponseToFileExtractorModel,
        **kwargs: Any,
-    ) -> ResponseToFileExtractor:
+    ) -> ResponseToFileExtractor[Any]:
        return ResponseToFileExtractor(parameters=model.parameters or {})
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)
3281-3287: Consider adding docstring and type hints for better code clarity?

The implementation looks good, but would adding a docstring and return type hint make it more maintainable, wdyt?
class CustomRecordExtractor(RecordExtractor):
+    """Test implementation of RecordExtractor that yields JSON response directly."""
    def extract_records(
        self,
        response: requests.Response,
-    ) -> Iterable[Mapping[str, Any]]:
+    ) -> Iterable[Mapping[str, Any]]:
        yield from response.json()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b36933a and 2510d97.

📒 Files selected for processing (2)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (4 hunks)

🔇 Additional comments (3)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

273-275: LGTM!

Clean import addition for the new ResponseToFileExtractorModel.

433-433: LGTM!

Clean addition of the new model-to-constructor mapping.

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

3289-3295: LGTM!

Clean test implementation that verifies the factory's ability to create custom record extractors.

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
1454-1460: Should we enhance the ResponseToFileExtractor factory method?

The current implementation is simpler than other factory methods. Consider adding:

Config parameter usage (it's passed to most other extractors)

Error handling for invalid parameters

Decoder handling (similar to other extractors)

What do you think about making it consistent with other extractor factory methods?
 def create_response_to_file_extractor(
     self,
     model: ResponseToFileExtractorModel,
+    config: Config,
+    decoder: Optional[Decoder] = None,
     **kwargs: Any,
 ) -> ResponseToFileExtractor:
-    return ResponseToFileExtractor(parameters=model.parameters or {})
+    return ResponseToFileExtractor(
+        config=config,
+        parameters=model.parameters or {},
+        decoder=decoder or JsonDecoder(parameters={})
+    )
2043-2059: Should we improve parameter consistency and naming?

A few suggestions to enhance readability and consistency:

The parameters are passed to download_extractor but not to download_decoder - should we make this consistent?

Consider renaming download_decoder to download_response_decoder to be more descriptive about its purpose

The empty list [] in the DpathExtractor fallback could be more explicit about what it represents (field_path)

WDYT about these improvements?
 download_decoder = (
     self._create_component_from_model(model=model.download_decoder, config=config)
     if model.download_decoder
-    else JsonDecoder(parameters={})
+    else JsonDecoder(parameters=model.parameters or {})
 )
 download_extractor = (
     self._create_component_from_model(
         model=model.download_extractor,
         config=config,
         decoder=download_decoder,
         parameters=model.parameters,
     )
     if model.download_extractor
     else DpathExtractor(
-        [], 
+        field_path=[],
         config=config,
         decoder=download_decoder,
         parameters=model.parameters
     )
 )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2510d97 and 118d1f3.

📒 Files selected for processing (1)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)

🔇 Additional comments (1)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

273-275: LGTM! Clean addition of ResponseToFileExtractor to the component mappings

The new component is properly integrated into the factory's type system.

Also applies to: 433-433

artem1205 · 2024-12-02T17:13:11Z

@maxi297
tested on sendgrid with this change

Index: airbyte-integrations/connectors/source-sendgrid/manifest.yaml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/airbyte-integrations/connectors/source-sendgrid/manifest.yaml b/airbyte-integrations/connectors/source-sendgrid/manifest.yaml
--- a/airbyte-integrations/connectors/source-sendgrid/manifest.yaml	(revision e47ad300dee26c2ff964e856d90a751cbc76969d)
+++ b/airbyte-integrations/connectors/source-sendgrid/manifest.yaml	(date 1733159536552)
@@ -640,6 +640,8 @@
           path: "{{stream_slice['url']}}"
           url_base: ""
           http_method: GET
+        download_extractor:
+          type: ResponseToFileExtractor
       schema_loader:
         type: InlineSchemaLoader
         schema:

maxi297

LGTM! Thanks Artem. I'm eager to see this applied to amazon-ads

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1454-1460: Consider adding a docstring for consistency?

The implementation looks good, but other create methods in this class have docstrings. Would you consider adding one here as well? wdyt?

 def create_response_to_file_extractor(
     self,
     model: ResponseToFileExtractorModel,
     **kwargs: Any,
 ) -> ResponseToFileExtractor:
+    """Creates a ResponseToFileExtractor component.
+
+    Args:
+        model: The model containing the component configuration
+        **kwargs: Additional keyword arguments
+
+    Returns:
+        ResponseToFileExtractor: The created component
+    """
     return ResponseToFileExtractor(parameters=model.parameters or {})

2043-2062: LGTM! Consider extracting decoder/extractor creation?

The implementation looks good with proper null checks and fallbacks. However, the download_decoder and download_extractor creation logic is quite similar to other parts of the code. Would you consider extracting these into helper methods to improve readability? wdyt?

Something like:

+    def _create_download_decoder(
+        self,
+        model: AsyncRetrieverModel,
+        config: Config,
+    ) -> Decoder:
+        return (
+            self._create_component_from_model(model=model.download_decoder, config=config)
+            if model.download_decoder
+            else JsonDecoder(parameters={})
+        )
+
+    def _create_download_extractor(
+        self,
+        model: AsyncRetrieverModel,
+        config: Config,
+        decoder: Decoder,
+        parameters: Dict[str, Any],
+    ) -> Union[ResponseToFileExtractor, DpathExtractor]:
+        return (
+            self._create_component_from_model(
+                model=model.download_extractor,
+                config=config,
+                decoder=decoder,
+                parameters=parameters,
+            )
+            if model.download_extractor
+            else DpathExtractor(
+                [],
+                config=config,
+                decoder=decoder,
+                parameters=parameters or {},
+            )
+        )

 def create_async_retriever(
     # ... existing parameters ...
 ) -> AsyncRetriever:
     # ... existing code ...
-    download_decoder = (
-        self._create_component_from_model(model=model.download_decoder, config=config)
-        if model.download_decoder
-        else JsonDecoder(parameters={})
-    )
-    download_extractor = (
-        self._create_component_from_model(
-            model=model.download_extractor,
-            config=config,
-            decoder=download_decoder,
-            parameters=model.parameters,
-        )
-        if model.download_extractor
-        else DpathExtractor(
-            [],
-            config=config,
-            decoder=download_decoder,
-            parameters=model.parameters or {},
-        )
-    )
+    download_decoder = self._create_download_decoder(model, config)
+    download_extractor = self._create_download_extractor(
+        model, config, download_decoder, model.parameters
+    )
     # ... rest of the code ...

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 118d1f3 and c1ef338.

📒 Files selected for processing (1)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)

🔇 Additional comments (2)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

273-275: LGTM!

The import follows the existing pattern and is correctly grouped with other model imports.

433-433: LGTM!

The mapping is correctly added and maintains alphabetical ordering.

Airbyte CDK: add download_decoder + extractor

6304780

Signed-off-by: Artem Inzhyyants <[email protected]>

artem1205 self-assigned this Nov 14, 2024

github-actions bot added the enhancement New feature or request label Nov 14, 2024

coderabbitai bot reviewed Nov 14, 2024

View reviewed changes

coderabbitai bot approved these changes Nov 14, 2024

View reviewed changes

artem1205 requested a review from maxi297 November 14, 2024 15:28

maxi297 reviewed Nov 26, 2024

View reviewed changes

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py Outdated Show resolved Hide resolved

Merge remote-tracking branch 'origin/main' into artem1205/asyncretrie…

43e03fb

…ver-split-decoders

coderabbitai bot reviewed Nov 29, 2024

View reviewed changes

CDK: add ResponseToFileExtractor to async retriever extractors

9c0c45e

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

CDK: add unittest

a3d95bb

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

CDK: fix unittest

7aa10a0

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

artem1205 added 2 commits December 2, 2024 13:55

CDK: fix unittest

6ecb0a7

Signed-off-by: Artem Inzhyyants <[email protected]>

CDK: fix unittest

ce89a2f

Signed-off-by: Artem Inzhyyants <[email protected]>

artem1205 requested a review from maxi297 December 2, 2024 13:30

maxi297 reviewed Dec 2, 2024

View reviewed changes

airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py Outdated Show resolved Hide resolved

airbyte_cdk/sources/declarative/declarative_component_schema.yaml Outdated Show resolved Hide resolved

artem1205 added 2 commits December 2, 2024 16:13

Merge remote-tracking branch 'origin/main' into artem1205/asyncretrie…

077ae15

…ver-split-decoders

CDK: rev decompression logic

b36933a

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

Merge remote-tracking branch 'origin/main' into artem1205/asyncretrie…

2510d97

…ver-split-decoders # Conflicts: # airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

CDK: fix parser for async retriever

118d1f3

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

artem1205 requested a review from maxi297 December 2, 2024 17:57

maxi297 approved these changes Dec 2, 2024

View reviewed changes

artem1205 added 2 commits December 2, 2024 22:16

CDK: fix mypy

92c7b54

Signed-off-by: Artem Inzhyyants <[email protected]>

CDK: fmt

c1ef338

Signed-off-by: Artem Inzhyyants <[email protected]>

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

artem1205 merged commit c90eea1 into main Dec 3, 2024
20 of 21 checks passed

artem1205 deleted the artem1205/asyncretriever-split-decoders branch December 3, 2024 09:16

This was referenced Dec 6, 2024

feat(low-code cdk): add config component resolver #149

Merged

feat: Adds ZipfileDecoder component #169

Open

coderabbitai bot mentioned this pull request Dec 13, 2024

fix(Low-Code Concurrent CDK): Refactor the low-code AsyncRetriever to use an underlying StreamSlicer #170

Merged

This was referenced Dec 23, 2024

feat: add Composite Raw Decoder #179

Merged

fix: fix sorting & __init__ imports #189

Merged

This was referenced Dec 31, 2024

feat: Publish the response to process in the context of the record's transformation and filtering steps #193

Open

fix: Async Retriever change url path for download retriever #192

Open

fix(airbyte-cdk): unable to create custom retriever #198

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add download_decoder + download_extractor #50

feat: add download_decoder + download_extractor #50

artem1205 commented Nov 14, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 14, 2024 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

maxi297 left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

maxi297 left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

artem1205 commented Dec 2, 2024

maxi297 left a comment

coderabbitai bot left a comment

feat: add download_decoder + download_extractor #50

feat: add download_decoder + download_extractor #50

Conversation

artem1205 commented Nov 14, 2024 • edited by coderabbitai bot Loading

What

Summary by CodeRabbit

Summary by CodeRabbit

coderabbitai bot commented Nov 14, 2024 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

maxi297 left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

maxi297 left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

artem1205 commented Dec 2, 2024

maxi297 left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

artem1205 commented Nov 14, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 14, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)