Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add download_decoder + download_extractor #50

Merged
merged 13 commits into from
Dec 3, 2024

Conversation

artem1205
Copy link
Contributor

@artem1205 artem1205 commented Nov 14, 2024

What

resolving https://github.com/airbytehq/airbyte-internal-issues/issues/10429

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Introduced download_extractor and download_decoder properties in the asynchronous data retrieval process, enhancing flexibility for record extraction and decoding.
    • Added a new ResponseToFileExtractor class for handling responses that require extraction and decompression.
    • Implemented a custom CustomRecordExtractor class for improved record extraction from JSON responses.
    • Enhanced testing capabilities with a fixture for handling large datasets in the ResponseToFileExtractor tests.
  • Bug Fixes

    • Improved error handling for the download_extractor instantiation to ensure default functionality when not explicitly defined.

@artem1205 artem1205 self-assigned this Nov 14, 2024
@github-actions github-actions bot added the enhancement New feature or request label Nov 14, 2024
Copy link
Contributor

coderabbitai bot commented Nov 14, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The changes in this pull request enhance the AsyncRetriever class and its related components in the airbyte_cdk library. Two new properties, download_extractor and download_decoder, are introduced to facilitate asynchronous data retrieval from URLs. These properties allow for greater flexibility in defining how records are extracted and decoded. The modifications also include updates to the ModelToComponentFactory class to utilize these new properties, ensuring robust error handling and customization in the extraction process.

Changes

File Path Change Summary
airbyte_cdk/sources/declarative/declarative_component_schema.yaml - Added properties: download_extractor and download_decoder in AsyncRetriever definition.
airbyte_cdk/sources/declarative/models/declarative_component_schema.py - Updated AsyncRetriever class to include optional fields: download_extractor and download_decoder.
- Invoked update_forward_refs() for several classes to resolve forward references.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py - Modified create_async_retriever method to use download_extractor and download_decoder.
- Added error handling for download_extractor instantiation, defaulting to DpathExtractor if not provided.
airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py - Updated ResponseToFileExtractor to use @dataclass and modified its initialization process.
airbyte_cdk/sources/declarative/requesters/http_job_repository.py - Changed record_extractor initialization to use ResponseToFileExtractor({}).
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py - Added CustomRecordExtractor class and corresponding tests for its instantiation.
unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py - Updated test initialization for ResponseToFileExtractor to include an empty dictionary.
unit_tests/sources/declarative/requesters/test_http_job_repository.py - Modified test instantiation of RecordSelector to use ResponseToFileExtractor({}).

Would you like to explore any specific part of these changes further?

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant AsyncRetriever
    participant ModelToComponentFactory
    participant ResponseToFileExtractor

    User->>ModelToComponentFactory: create_async_retriever(model)
    ModelToComponentFactory->>AsyncRetriever: initialize(download_extractor, download_decoder)
    AsyncRetriever->>ResponseToFileExtractor: create_response_to_file_extractor(parameters)
    ResponseToFileExtractor->>AsyncRetriever: return extractor instance
    AsyncRetriever->>User: return initialized AsyncRetriever
Loading

Possibly related PRs

  • feat(airbyte-cdk): add gzipjson decoder #20: The changes in this PR involve adding the GzipJsonDecoder and updating the AsyncRetriever to include it as a possible decoder type, which directly relates to the modifications made in the main PR regarding the download_decoder property in the AsyncRetriever.

Suggested reviewers

  • maxi297

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

2031-2042: Consider adding validation for the field_path parameter in the default DpathExtractor.

The code handles both custom and default extractors well, but when creating the default DpathExtractor, we're passing an empty list for field_path. Should we validate or document why an empty field_path is acceptable here? wdyt?


2052-2053: Nice cleanup removing the hardcoded ResponseToFileExtractor!

The replacement of the hardcoded ResponseToFileExtractor with a configurable download_extractor improves flexibility. The commented-out old code could be removed for cleanliness, wdyt?

-                # extractor=ResponseToFileExtractor(),# old one

2043-2055: Consider adding error handling for download_retriever configuration.

The download_retriever setup looks good, but it might be worth adding some validation to ensure the download_requester and record_selector are properly configured for the specific download use case. For instance, we could validate that the extractor's configuration is compatible with the download operation. What are your thoughts on this?

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2575-2584: LGTM! The download_decoder property looks good.

The property supports a comprehensive set of decoder types. What do you think about enhancing the description to mention the supported formats explicitly? Something like:

"Component decoding the download response so records can be extracted. Supports JSON, JSONL, Iterable, XML, and Gzip compressed formats through their respective decoders."

wdyt? 🤔

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1679-1681: Consider adding a title to the download_extractor field for consistency.

I noticed that the download_decoder field includes a title parameter in its Field() definition. Perhaps adding a title to the download_extractor field would enhance readability and maintain consistency across your model definitions. What do you think?

Here's how you might update it:

 download_extractor: Optional[Union[CustomRecordExtractor, DpathExtractor]] = Field(
     None,
     description="Responsible for fetching the records from provided urls.",
+    title="Download Extractor",
 )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 39786d2 and 6304780.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1 hunks)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2026-2030: LGTM! The download_decoder initialization looks good.

The code properly handles both cases where a custom decoder is provided or falls back to a default JsonDecoder. The parameters are correctly passed through.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2516-2520: LGTM! The download_extractor property is well-defined.

The property is properly structured and provides good flexibility by supporting both CustomRecordExtractor and DpathExtractor implementations.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1732-1745: The addition of download_decoder looks great!

You've correctly defined the download_decoder field with appropriate types, description, and title. This addition enhances the functionality of the AsyncRetriever. Well done!

@artem1205 artem1205 requested a review from maxi297 November 14, 2024 15:28
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test to validate that custom decoders are supported?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

2053-2054: Remove commented code

The old implementation reference can be safely removed since it's tracked in version control:

-                # extractor=ResponseToFileExtractor(),# old one
                 extractor=download_extractor,

2034-2038: Fix indentation for better readability

The indentation seems a bit off. Would you like to adjust it to match the surrounding code? wdyt?

-                model=model.download_extractor,
-                config=config,
-                decoder=download_decoder,
-                parameters=model.parameters,
+            model=model.download_extractor,
+            config=config,
+            decoder=download_decoder,
+            parameters=model.parameters,
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2516-2520: The download_extractor implementation looks good! Would you consider enhancing the documentation?

The implementation correctly allows for both CustomRecordExtractor and DpathExtractor. Perhaps we could make the description more detailed by adding examples or common use cases? For instance, explaining when to use CustomRecordExtractor vs DpathExtractor for URL-based record extraction, wdyt?

       download_extractor:
-        description: Responsible for fetching the records from provided urls.
+        description: |
+          Responsible for fetching the records from provided urls. Use CustomRecordExtractor for complex extraction logic 
+          or DpathExtractor for simple path-based extraction.
+          
+          Example using DpathExtractor:
+            download_extractor:
+              type: DpathExtractor
+              field_path: ["data", "records"]

2575-2584: The download_decoder implementation looks good! Would you consider enhancing the documentation?

The implementation correctly supports all necessary decoder types. Perhaps we could make the description more detailed by explaining when to use each decoder type for downloaded content? For instance, when to use GzipJsonDecoder vs JsonDecoder, wdyt?

       download_decoder:
         title: Download Decoder
-        description: Component decoding the download response so records can be extracted.
+        description: |
+          Component decoding the download response so records can be extracted. Choose the decoder based on the response format:
+          - JsonDecoder: For standard JSON responses
+          - JsonlDecoder: For newline-delimited JSON
+          - GzipJsonDecoder: For gzip-compressed JSON
+          - XmlDecoder: For XML responses
+          - IterableDecoder: For line-by-line processing
+          - CustomDecoder: For custom decoding logic
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6304780 and 43e03fb.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1 hunks)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

2027-2031: LGTM: Clean initialization of download_decoder with fallback

The code properly handles both cases where download_decoder is provided or needs to fall back to JsonDecoder.


2032-2043: Consider validating empty field path

The fallback case creates a DpathExtractor with an empty field path list []. Should we validate this or provide a default path? wdyt?

✅ Verification successful

Let me check the implementation of DpathExtractor to understand if empty paths are handled properly.


Let me try to find the DpathExtractor implementation using ripgrep instead.


Let me check the actual implementation details of the DpathExtractor to see how it handles empty paths.


Let me check the extract method to see how it handles the empty path.


Empty field path is handled correctly, no validation needed

Looking at the implementation, empty field path is a valid use case and is explicitly handled in the extract_records method. When field_path is empty, it returns the entire response body as-is. This behavior is also documented in the class docstring with an example showing empty field path usage. The test cases we found earlier confirm this is an intended behavior.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if DpathExtractor is used with empty paths elsewhere
ast-grep --pattern 'DpathExtractor($$$, field_path=[], $$$)'

Length of output: 1531


Script:

#!/bin/bash
# Check DpathExtractor implementation
ast-grep --pattern 'class DpathExtractor'

Length of output: 43


Script:

#!/bin/bash
# Search for DpathExtractor class definition and its usage
rg "class DpathExtractor" -A 10

Length of output: 2583


Script:

#!/bin/bash
# Get the full implementation of DpathExtractor
rg "class DpathExtractor" -A 50 airbyte_cdk/sources/declarative/extractors/dpath_extractor.py

Length of output: 1909


Script:

#!/bin/bash
# Get the extract method implementation
rg "def extract_records" -A 20 airbyte_cdk/sources/declarative/extractors/dpath_extractor.py

Length of output: 861

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1334-1350: Documentation needs improvement, wdyt?

The ResponseToFileExtractor component's documentation could be enhanced:

  1. The description is incomplete: "Record extractor that downloads a headless"
  2. The title mentions "CSV" but the description doesn't specify file formats
  3. Consider adding:
    • Supported file formats
    • When to use this extractor vs others
    • Example usage
    • Purpose of needs_decompression

Line range hint 2533-2602: Consider enhancing the AsyncRetriever documentation, what do you think?

The new properties look good, but their documentation could be more helpful:

  1. For download_extractor, consider explaining:

    • The relationship with urls_extractor
    • When to use each extractor type
    • Example configurations
  2. For download_decoder, consider clarifying:

    • How it differs from the main decoder property
    • When a separate download decoder is needed
    • Typical use cases for each decoder type
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 43e03fb and 9c0c45e.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (4 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
👮 Files not reviewed due to content moderation or server errors (3)
  • airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

Signed-off-by: Artem Inzhyyants <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (2)

3280-3286: Consider adding error handling for JSON decoding, wdyt?

The extract_records method directly yields from response.json() which could raise a JSONDecodeError if the response contains invalid JSON. Consider adding try-except to handle potential errors gracefully.

def extract_records(
    self,
    response: requests.Response,
) -> Iterable[Mapping[str, Any]]:
-    yield from response.json()
+    try:
+        yield from response.json()
+    except requests.exceptions.JSONDecodeError as e:
+        raise ValueError(f"Failed to decode response as JSON: {str(e)}")

3288-3294: Consider adding more test cases, wdyt?

The test verifies the basic instantiation but could be enhanced to test the actual functionality:

  1. Test with valid JSON response
  2. Test with invalid JSON response
  3. Test with empty response
def test_create_custom_record_extractor():
    # Test instantiation
    definition = {
        "type": "CustomRecordExtractor",
        "class_name": "unit_tests.sources.declarative.parsers.test_model_to_component_factory.CustomRecordExtractor",
    }
    component = factory.create_component(CustomRecordExtractorModel, definition, {})
    assert isinstance(component, CustomRecordExtractor)

    # Test with valid JSON
    mock_response = MagicMock(spec=requests.Response)
    mock_response.json.return_value = [{"id": 1}, {"id": 2}]
    records = list(component.extract_records(mock_response))
    assert records == [{"id": 1}, {"id": 2}]

    # Test with invalid JSON
    mock_response.json.side_effect = requests.exceptions.JSONDecodeError("msg", "doc", 0)
    with pytest.raises(ValueError):
        list(component.extract_records(mock_response))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9c0c45e and a3d95bb.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (4 hunks)

Signed-off-by: Artem Inzhyyants <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/requesters/http_job_repository.py (1)

45-45: Consider adding a docstring for the config parameter?

Since we're introducing a config parameter to ResponseToFileExtractor, would it be helpful to add documentation about its purpose and expected values? This could help future developers understand the configuration options, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a3d95bb and 7aa10a0.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/requesters/http_job_repository.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py (1 hunks)
  • unit_tests/sources/declarative/requesters/test_http_job_repository.py (1 hunks)
🔇 Additional comments (4)
unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py (2)

17-17: LGTM! Constructor update is consistent.

The change aligns with the updated ResponseToFileExtractor implementation.


Line range hint 98-98: LGTM! Constructor update is consistent.

The change maintains consistency with the updated implementation.

airbyte_cdk/sources/declarative/requesters/http_job_repository.py (1)

45-45: LGTM! Constructor update is consistent.

The change maintains consistency with the updated ResponseToFileExtractor implementation across the codebase.

unit_tests/sources/declarative/requesters/test_http_job_repository.py (1)

98-98: LGTM! Constructor update is consistent.

The change maintains consistency with the updated implementation.

Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
@artem1205 artem1205 requested a review from maxi297 December 2, 2024 13:30
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is regarding backward compatibility with source-sendgrid. Once we figure this out, I'll approve

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1334-1346: Consider enhancing the component description for better clarity.

The description could be more detailed about:

  1. When to use this extractor (e.g., size thresholds)
  2. File lifecycle management (creation, cleanup)
  3. Error handling for disk operations
  4. Performance implications

What do you think about expanding the description like this?

-    description: A record extractor designed for handling large responses that may exceed memory limits (to prevent OOM issues). It downloads a CSV file to disk, reads the data from disk, and deletes the file once it has been fully processed.
+    description: |
+      A record extractor designed for handling large responses that may exceed memory limits (to prevent OOM issues). 
+      
+      Key features:
+      - Downloads response content to a temporary file on disk
+      - Streams records from disk to minimize memory usage
+      - Automatically cleans up temporary files after processing
+      - Suitable for responses larger than available memory
+      
+      Usage considerations:
+      - Recommended for responses > 500MB
+      - Requires sufficient disk space
+      - May have slower processing due to disk I/O

2529-2534: The download_extractor property looks good but could use an example.

The property is well-defined with proper references to available extractors. Consider adding an example to illustrate common usage patterns?

What do you think about adding usage examples like this?

      download_extractor:
        description: Responsible for fetching the records from provided urls.
        anyOf:
          - "$ref": "#/definitions/CustomRecordExtractor"
          - "$ref": "#/definitions/DpathExtractor"
          - "$ref": "#/definitions/ResponseToFileExtractor"
+        examples:
+          - type: ResponseToFileExtractor  # For large file downloads
+          - type: DpathExtractor          # For JSON responses
+            field_path: ["data", "records"]

2589-2598: The download_decoder property looks good but could benefit from usage guidance.

The property supports all standard decoders but could use guidance on when to use each type.

What do you think about adding usage guidance like this?

      download_decoder:
        title: Download Decoder
-        description: Component decoding the download response so records can be extracted.
+        description: |
+          Component decoding the download response so records can be extracted.
+          
+          Common decoder choices:
+          - JsonDecoder: For JSON responses
+          - JsonlDecoder: For newline-delimited JSON files
+          - GzipJsonDecoder: For compressed JSON responses
+          - XmlDecoder: For XML responses
+          - IterableDecoder: For line-by-line processing
        anyOf:
          - "$ref": "#/definitions/CustomDecoder"
          - "$ref": "#/definitions/JsonDecoder"
          - "$ref": "#/definitions/JsonlDecoder"
          - "$ref": "#/definitions/IterableDecoder"
          - "$ref": "#/definitions/XmlDecoder"
          - "$ref": "#/definitions/GzipJsonDecoder"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ce89a2f and b36933a.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (3 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
👮 Files not reviewed due to content moderation or server errors (3)
  • airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

…ver-split-decoders

# Conflicts:
#	airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1454-1460: Consider adding type hints for better code clarity?

The method looks good functionally, but would adding type hints for the return type make it more consistent with other factory methods, wdyt?

    def create_response_to_file_extractor(
        self,
        model: ResponseToFileExtractorModel,
        **kwargs: Any,
-    ) -> ResponseToFileExtractor:
+    ) -> ResponseToFileExtractor[Any]:
        return ResponseToFileExtractor(parameters=model.parameters or {})
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

3281-3287: Consider adding docstring and type hints for better code clarity?

The implementation looks good, but would adding a docstring and return type hint make it more maintainable, wdyt?

class CustomRecordExtractor(RecordExtractor):
+    """Test implementation of RecordExtractor that yields JSON response directly."""
    def extract_records(
        self,
        response: requests.Response,
-    ) -> Iterable[Mapping[str, Any]]:
+    ) -> Iterable[Mapping[str, Any]]:
        yield from response.json()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b36933a and 2510d97.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (4 hunks)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

273-275: LGTM!

Clean import addition for the new ResponseToFileExtractorModel.


433-433: LGTM!

Clean addition of the new model-to-constructor mapping.

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

3289-3295: LGTM!

Clean test implementation that verifies the factory's ability to create custom record extractors.

Signed-off-by: Artem Inzhyyants <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1454-1460: Should we enhance the ResponseToFileExtractor factory method?

The current implementation is simpler than other factory methods. Consider adding:

  1. Config parameter usage (it's passed to most other extractors)
  2. Error handling for invalid parameters
  3. Decoder handling (similar to other extractors)

What do you think about making it consistent with other extractor factory methods?

 def create_response_to_file_extractor(
     self,
     model: ResponseToFileExtractorModel,
+    config: Config,
+    decoder: Optional[Decoder] = None,
     **kwargs: Any,
 ) -> ResponseToFileExtractor:
-    return ResponseToFileExtractor(parameters=model.parameters or {})
+    return ResponseToFileExtractor(
+        config=config,
+        parameters=model.parameters or {},
+        decoder=decoder or JsonDecoder(parameters={})
+    )

2043-2059: Should we improve parameter consistency and naming?

A few suggestions to enhance readability and consistency:

  1. The parameters are passed to download_extractor but not to download_decoder - should we make this consistent?
  2. Consider renaming download_decoder to download_response_decoder to be more descriptive about its purpose
  3. The empty list [] in the DpathExtractor fallback could be more explicit about what it represents (field_path)

WDYT about these improvements?

 download_decoder = (
     self._create_component_from_model(model=model.download_decoder, config=config)
     if model.download_decoder
-    else JsonDecoder(parameters={})
+    else JsonDecoder(parameters=model.parameters or {})
 )
 download_extractor = (
     self._create_component_from_model(
         model=model.download_extractor,
         config=config,
         decoder=download_decoder,
         parameters=model.parameters,
     )
     if model.download_extractor
     else DpathExtractor(
-        [], 
+        field_path=[],
         config=config,
         decoder=download_decoder,
         parameters=model.parameters
     )
 )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2510d97 and 118d1f3.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

273-275: LGTM! Clean addition of ResponseToFileExtractor to the component mappings

The new component is properly integrated into the factory's type system.

Also applies to: 433-433

@artem1205
Copy link
Contributor Author

@maxi297
tested on sendgrid with this change

Index: airbyte-integrations/connectors/source-sendgrid/manifest.yaml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/airbyte-integrations/connectors/source-sendgrid/manifest.yaml b/airbyte-integrations/connectors/source-sendgrid/manifest.yaml
--- a/airbyte-integrations/connectors/source-sendgrid/manifest.yaml	(revision e47ad300dee26c2ff964e856d90a751cbc76969d)
+++ b/airbyte-integrations/connectors/source-sendgrid/manifest.yaml	(date 1733159536552)
@@ -640,6 +640,8 @@
           path: "{{stream_slice['url']}}"
           url_base: ""
           http_method: GET
+        download_extractor:
+          type: ResponseToFileExtractor
       schema_loader:
         type: InlineSchemaLoader
         schema:

@artem1205 artem1205 requested a review from maxi297 December 2, 2024 17:57
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks Artem. I'm eager to see this applied to amazon-ads

Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1454-1460: Consider adding a docstring for consistency?

The implementation looks good, but other create methods in this class have docstrings. Would you consider adding one here as well? wdyt?

 def create_response_to_file_extractor(
     self,
     model: ResponseToFileExtractorModel,
     **kwargs: Any,
 ) -> ResponseToFileExtractor:
+    """Creates a ResponseToFileExtractor component.
+
+    Args:
+        model: The model containing the component configuration
+        **kwargs: Additional keyword arguments
+
+    Returns:
+        ResponseToFileExtractor: The created component
+    """
     return ResponseToFileExtractor(parameters=model.parameters or {})

2043-2062: LGTM! Consider extracting decoder/extractor creation?

The implementation looks good with proper null checks and fallbacks. However, the download_decoder and download_extractor creation logic is quite similar to other parts of the code. Would you consider extracting these into helper methods to improve readability? wdyt?

Something like:

+    def _create_download_decoder(
+        self,
+        model: AsyncRetrieverModel,
+        config: Config,
+    ) -> Decoder:
+        return (
+            self._create_component_from_model(model=model.download_decoder, config=config)
+            if model.download_decoder
+            else JsonDecoder(parameters={})
+        )
+
+    def _create_download_extractor(
+        self,
+        model: AsyncRetrieverModel,
+        config: Config,
+        decoder: Decoder,
+        parameters: Dict[str, Any],
+    ) -> Union[ResponseToFileExtractor, DpathExtractor]:
+        return (
+            self._create_component_from_model(
+                model=model.download_extractor,
+                config=config,
+                decoder=decoder,
+                parameters=parameters,
+            )
+            if model.download_extractor
+            else DpathExtractor(
+                [],
+                config=config,
+                decoder=decoder,
+                parameters=parameters or {},
+            )
+        )

 def create_async_retriever(
     # ... existing parameters ...
 ) -> AsyncRetriever:
     # ... existing code ...
-    download_decoder = (
-        self._create_component_from_model(model=model.download_decoder, config=config)
-        if model.download_decoder
-        else JsonDecoder(parameters={})
-    )
-    download_extractor = (
-        self._create_component_from_model(
-            model=model.download_extractor,
-            config=config,
-            decoder=download_decoder,
-            parameters=model.parameters,
-        )
-        if model.download_extractor
-        else DpathExtractor(
-            [],
-            config=config,
-            decoder=download_decoder,
-            parameters=model.parameters or {},
-        )
-    )
+    download_decoder = self._create_download_decoder(model, config)
+    download_extractor = self._create_download_extractor(
+        model, config, download_decoder, model.parameters
+    )
     # ... rest of the code ...
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 118d1f3 and c1ef338.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

273-275: LGTM!

The import follows the existing pattern and is correctly grouped with other model imports.


433-433: LGTM!

The mapping is correctly added and maintains alphabetical ordering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants