High memory usage with incremental #1710
Comments
I've also changed these settings, but the benchmark stays around the same.
@Deninc are you able to post one item from your JSON file? Disabling deduplication should make runs faster and decrease the memory usage. Is it possible that "event_date" is not very granular, i.e. you have millions of records with the same date? Btw, do batching for better performance: https://dlthub.com/docs/reference/performance#yield-pages-instead-of-rows
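For context, the batching advice linked above amounts to yielding lists of records instead of single rows. A minimal sketch, assuming a plain jsonl reader; the file name and page size are illustrative, not taken from this issue:

```python
import json
from itertools import islice

import dlt


# Illustrative "yield pages instead of rows" pattern: read the jsonl file in
# chunks and hand dlt a list of records per yield instead of one dict at a time.
@dlt.resource(name="events")
def events_paged():
    with open("test.jsonl", "r", encoding="utf-8") as f:
        while True:
            page = [json.loads(line) for line in islice(f, 10_000)]
            if not page:
                break
            yield page  # dlt accepts lists of records, which cuts per-item overhead
```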
Hi @rudolfix, yes, basically for this dataset all the records share the same event_date.
Here it actually increases the memory usage significantly, and I'm not sure why.
@rudolfix I can confirm using
Updated: the above benchmark was wrong (I used …). The correct benchmark is here.
@Deninc I think we'll disable boundary deduplication by default in the next major release.
dlt version
0.5.3
Describe the problem
I've found that the extraction phase is hogging the memory if I enable `dlt.sources.incremental` and `primary_key=()`.
Expected behavior
I'm not sure if this is a bug. Is there a way I can limit the memory usage?
Steps to reproduce
My test is with a `test.jsonl` file of 2.76 million rows, around 3.66 GB in size.

In the first case the memory usage is low (179.00 MB), but it takes forever to run (rate: 33.07/s).
After that I add `primary_key=()` to disable deduplication. It runs much faster (rate: 20345.09/s), but now the memory usage is too high (12208.89 MB).
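The repro script itself did not survive the copy above, so the following is only a sketch of the shape it likely had, assuming `event_date` is the incremental cursor (as discussed in the comments) and duckdb as a stand-in destination:

```python
import json

import dlt


# Hypothetical reconstruction of the fast-but-memory-hungry case:
# incremental extraction over test.jsonl with primary_key=() to disable
# boundary deduplication.
@dlt.resource(name="events", primary_key=())
def events(event_date=dlt.sources.incremental("event_date")):
    with open("test.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


pipeline = dlt.pipeline(
    pipeline_name="incremental_memory_repro",
    destination="duckdb",
    dataset_name="events",
)
print(pipeline.run(events()))
```

Removing `primary_key=()` restores deduplication, i.e. the slow but low-memory first case.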
Operating system
macOS
Runtime environment
Local
Python version
3.11
dlt data source
No response
dlt destination
No response
Other deployment details
No response
Additional information
No response