High memory usage with incremental #1710

Closed
Deninc opened this issue Aug 21, 2024 · 7 comments


@Deninc

Deninc commented Aug 21, 2024

dlt version

0.5.3

Describe the problem

I've found that the extract phase hogs memory when I enable dlt.sources.incremental together with primary_key=().

Expected behavior

I'm not sure if this is a bug. Is there a way I can limit the memory usage?

Steps to reproduce

My test is with a test.jsonl file of 2.76 million rows, around 3.66GB in size.

config.toml:

[data_writer]
buffer_max_items=100000
file_max_items=100000

Pipeline code:

import json
from datetime import date

import dlt


@dlt.resource(
    standalone=True,
    table_name="event",
    write_disposition="merge",
    merge_key="event_date",
    max_table_nesting=0,
    # primary_key=(),
)
def event_resource(
    event_time=dlt.sources.incremental(
        "event_date",
        initial_value=date.fromisoformat("2023-04-01"),
    ),
):
    with open('test.jsonl', 'r') as file:
        for line in file:
            yield json.loads(line)


pipeline = dlt.pipeline(
    pipeline_name="benchmark",
    destination="filesystem",
    dataset_name="benchmark",
    progress="log",
)
resource = event_resource()
pipeline.extract(resource)

In the first case the memory usage is low (179.00 MB), but it takes forever to run (rate: 33.07/s).

------------------------------ Extract benchmark -------------------------------
Resources: 0/1 (0.0%) | Time: 398.74s | Rate: 0.00/s
event: 13188  | Time: 398.73s | Rate: 33.07/s
Memory usage: 179.00 MB (37.30%) | CPU usage: 0.00%

After that I added primary_key=() to disable deduplication. It runs much faster (rate: 20345.09/s), but now the memory usage is very high (12208.89 MB).

------------------------------ Extract benchmark -------------------------------
Resources: 1/1 (100.0%) | Time: 135.88s | Rate: 0.01/s
event: 2764522  | Time: 135.88s | Rate: 20345.09/s
Memory usage: 12208.89 MB (64.80%) | CPU usage: 0.00%

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

No response

@Deninc
Author

Deninc commented Aug 21, 2024

I've also tried commenting these settings out, but the benchmark stays about the same.

#[data_writer]
#buffer_max_items=100000
#file_max_items=100000

@rudolfix
Collaborator

rudolfix commented Aug 21, 2024

@Deninc are you able to post one item from your JSON file? Disabling deduplication should make runs faster and decrease the memory usage. Is it possible that "event_date" is not very granular, i.e. you have millions of records with the same date?

Btw, do batching for better performance: https://dlthub.com/docs/reference/performance#yield-pages-instead-of-rows
If your JSON is well formed and not nested, you may try to parse it with pyarrow or duckdb and yield arrow batches instead; then you get maybe a 30x or 100x speedup.
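A minimal sketch of the list-based batching suggested above (not from the original thread; the function name, file name, and batch size are illustrative): instead of yielding one parsed row per iteration, accumulate rows into pages and yield each page as a list, which cuts the per-row overhead in the extract step.

import json

import dlt


@dlt.resource(table_name="event", write_disposition="merge", merge_key="event_date")
def event_resource_batched(batch_size: int = 10_000):
    # yield pages (lists of dicts) instead of single rows
    batch = []
    with open('test.jsonl', 'r') as file:
        for line in file:
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # flush the last partial page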

@Deninc
Author

Deninc commented Aug 22, 2024

Hi @rudolfix, yes, for this dataset all event_date values are basically the same. The API I'm loading from only accepts from_date and to_date, so I'm doing a daily batch merge (delete-insert).

disabling deduplication should make runs faster and decrease the memory usage

Here it actually increases the memory usage significantly; I'm not sure why.

@Deninc
Author

Deninc commented Aug 22, 2024

@rudolfix I can confirm that using datetime instead of date solves the issue.

event_time=dlt.sources.incremental(
    "time",
    initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"),
    primary_key=(),
),

------------------------------ Extract benchmark -------------------------------
Resources: 1/1 (100.0%) | Time: 65.46s | Rate: 0.02/s
Memory usage: 89.05 MB (30.40%) | CPU usage: 0.00%

@Deninc
Author

Deninc commented Aug 22, 2024

Update: the benchmark above was wrong. I used initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"), which is a future date, so the rows were filtered out by the incremental cursor.

The correct benchmark is below.

------------------------------ Extract benchmark -------------------------------
Resources: 0/1 (0.0%) | Time: 121.90s | Rate: 0.00/s
event: 2745596  | Time: 121.90s | Rate: 22524.05/s
Memory usage: 4486.80 MB (52.70%) | CPU usage: 0.00%
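For reference, a minimal sketch that combines the fixes from this thread: a granular datetime cursor ("time") instead of the coarse event_date, plus primary_key=() on the incremental to disable boundary deduplication. This is illustrative only; the field name, file name, and initial value are assumptions based on the snippets above.

import json
from datetime import datetime

import dlt


@dlt.resource(
    standalone=True,
    table_name="event",
    write_disposition="merge",
    merge_key="event_date",
    max_table_nesting=0,
)
def event_resource(
    event_time=dlt.sources.incremental(
        "time",  # granular datetime cursor instead of the coarse event_date
        initial_value=datetime.fromisoformat("2023-04-01T00:00:00+00:00"),  # illustrative start value
        primary_key=(),  # disable boundary deduplication
    ),
):
    with open('test.jsonl', 'r') as file:
        for line in file:
            yield json.loads(line)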

@rudolfix
Collaborator

@Deninc I think we'll disable boundary deduplication by default in the next major release.

@rudolfix
Collaborator

#1131

@github-project-automation github-project-automation bot moved this from Todo to Done in dlt core library Sep 11, 2024