Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files #348

MalekSolta · 2024-11-11T12:52:26Z

**Bug Description
The NeMo Curator DataCurator class does not seem to work as expected when running a data curation script in the NVIDIA NeMo Docker container (nvcr.io/nvidia/nemo:24.07). The script loads a JSONL file, applies filters, and writes curated results to a new JSONL file, but fails during the curation process.

Steps to Reproduce the Bug
Pull the NVIDIA NeMo container:

bash

sudo docker pull nvcr.io/nvidia/nemo:24.07
Run the container with volume mounts:

bash

sudo docker run -it --rm
-v /home/aghadghadi/script:/workspace
-v /home/aghadghadi/test:/workspace/data
nvcr.io/nvidia/nemo:24.07
Inside the container, execute the Python script using the following code:

python
Copier le code
from nemo.collections.nlp.data.data_curator import DataCurator
from nemo.collections.nlp.data.data_curator.filters import (
DeduplicationFilter,
QualityFilter,
ContentFilter,
ToxicityFilter,
)
from nemo.collections.nlp.data.data_curator.processors import TextNormalizationProcessor

curator = DataCurator()
deduplication_filter = DeduplicationFilter()
quality_filter = QualityFilter(min_quality_score=0.8)
content_filter = ContentFilter(keywords=["relevant_topic"], exclude_keywords=["unwanted_topic"])
toxicity_filter = ToxicityFilter(max_toxicity_score=0.2)

curator.add_filter(deduplication_filter)
curator.add_filter(quality_filter)
curator.add_filter(content_filter)
curator.add_filter(toxicity_filter)

normalization_processor = TextNormalizationProcessor(lowercase=True, remove_punctuation=True)
curator.add_processor(normalization_processor)

input_path = '/workspace/data/falcon.jsonl'
output_path = '/workspace/data/falcon_sampled.jsonl'

curated_data = []
with open(input_path, 'r', encoding='utf-8') as f:
for line in f:
data_entry = json.loads(line) # Parse each line as JSON
data = curator.curate_data(data_entry)
if data is not None:
curated_data.append(json.dumps(data))

with open(output_path, 'w', encoding='utf-8') as f:
for entry in curated_data:
f.write(entry + "\n")

print("Data curation complete. Curated data saved to", output_path)
Expected Behavior
The script should:

Load each line of the JSONL file, apply defined filters, and normalize text.
Write the curated data back to a new JSONL file (falcon_sampled.jsonl) without issues.
Actual Behavior
The script fails during the curation process, with errors related to loading or filtering the JSON data lines.

Environment Overview
Environment location: Docker
Docker run command:
bash
Copier le code
sudo docker run -it --rm
-v /home/aghadghadi/script:/workspace
-v /home/aghadghadi/test:/workspace/data
nvcr.io/nvidia/nemo:24.07
NeMo Curator Installation: Pre-installed in Docker image nvcr.io/nvidia/nemo:24.07
Environment Details
Since the NVIDIA Docker image is used, the base OS version, Dask, and Python versions are not modified from the container.

Additional Context
The JSONL file is large and contains structured JSON lines.
Mounted volumes are set correctly, and file paths are accessible within the container.
Questions or Suggestions:

Please confirm compatibility with the DataCurator API in this container image.
Could there be an issue with JSON parsing or filter application within curate_data?

ryantwolf · 2024-11-12T17:23:21Z

Thanks for opening the issue! Quick question though, I'm looking at your code and I see these import statements:

from nemo.collections.nlp.data.data_curator import DataCurator
from nemo.collections.nlp.data.data_curator.filters import (
DeduplicationFilter,
QualityFilter,
ContentFilter,
ToxicityFilter,
)
from nemo.collections.nlp.data.data_curator.processors import TextNormalizationProcessor

Why are you doing these import statements?nemo.collections.nlp.data.data_curator is not a module. That points to NeMo, but all of the data curation pieces in the container are in the nemo_curator package. Are you following a tutorial/documentation from somewhere? I'm wondering if there is some out of date documentation. The user guide should show how to properly run NeMo Curator, and you can check out the tutorials for some examples.

MalekSolta · 2024-11-13T10:44:43Z

Hello Ryan,

I have a JSONL file with the following structure:

json

{
"id": "48e29ec3-570b-4349-bbb9-5f83e35dfac1",
"text": "In today’s ever-shrinking world, how do you create the culture of your family? As the Jewish High Holidays are being celebrated, I think about the traditions created and cherished during this time. Although I couldn’t participate this year, I enjoy joining local temples that welcome visitors. It's incredible to witness history being honored and children being loved and celebrated through these traditions. Even if they don’t fully understand the meanings, they appreciate how special these moments are. Celebrating year after year, children come to value these traditions deeply.\n\nI want that for my children, maybe not Yom Kippur specifically, though I will celebrate it with them when possible, but traditions and celebrations that they can look forward to. No matter what happens, these traditions stand the test of time. Through loss, change, and challenges, these constants can provide reassurance and strength for children.\n\nI’m curious to hear from others who are blending and creating their own family cultures. What traditions do you uphold? How are you creating your culture? Reflect on these questions to understand how you may already be creating a culture of purpose and love:\n\n1) What traditions from your parents do you continue with your family? For example, do you open presents on Christmas Eve with friends and family or have Saturday morning cleaning rituals?\n\n2) Are there spiritual or cultural holidays you celebrate? How do you celebrate them? For example, Christmas, Yom Kippur, Kwanzaa, or Three Kings Day—do you follow traditional customs, or have you created your own blend of celebrations?\n\n3) Does your family include individuals from different cultural, religious, ethnic, or racial backgrounds? How do you blend the various traditions and celebrations? For example, do you have a Christmas tree next to a Hanukkah bush or celebrate both Christmas and Kwanzaa followed by Three Kings Day?\n\nI look forward to hearing your insights. Thanks in advance.\n\nPosted on Monday, September 28, 2009",
"src": "http://createyourculture.blogspot.com/2009/09/how-do-you-create-your-culture.html",
"type": "Eng",
"title": "How Do You Create Your Culture"
}

I’d like to integrate NVIDIA NeMo into my code to improve data quality. Could you provide guidance on the best way to accomplish this?

ryantwolf · 2024-11-13T16:49:01Z

I’d like to integrate NVIDIA NeMo into my code to improve data quality. Could you provide guidance on the best way to accomplish this?

The best procedure depends a lot on what you are trying to do and where you got the data from. Are you trying to pretrain, fine-tune, build a RAG system, or do something else entirely? Is this data scraped from the web, or does it come from a source that has already processed the data in some way?

In general, you might want to try performing heuristic filtering or classifier filtering with a fastText model to filter out low-quality data. Then, you can run fuzzy deduplication to identify similar documents (Note: you'll need an NVIDIA GPU for this step). However, if your dataset is small (<= 100,000 documents) you may not benefit from filtering at all.

I also would appreciate if you could tell me where you got these import statements from and if they work at all:

from nemo.collections.nlp.data.data_curator import DataCurator
from nemo.collections.nlp.data.data_curator.filters import (
DeduplicationFilter,
QualityFilter,
ContentFilter,
ToxicityFilter,
)
from nemo.collections.nlp.data.data_curator.processors import TextNormalizationProcessor

ryantwolf · 2024-11-13T16:57:26Z

I can also recommend some tutorials that show how to use features of NeMo Curator. The tutorials folder has them all, but here are a few highlights that are focused around pre-trainining.

pretraining-data-curation focuses on filtering the raw common crawl snapshots used to create the Red Pajama V2 Dataset.
zyda2-tutorial allows you to directly reproduce the Zyda 2 dataset for LLM pretraining from Zyphra.
single_node_tutorial gives a general overview of a lot of the features in NeMo Curator by curating Thai Wikipedia.

MalekSolta added the bug Something isn't working label Nov 11, 2024

ryantwolf self-assigned this Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files #348

Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files #348

MalekSolta commented Nov 11, 2024 •

edited

Loading

ryantwolf commented Nov 12, 2024 •

edited

Loading

MalekSolta commented Nov 13, 2024

ryantwolf commented Nov 13, 2024 •

edited

Loading

ryantwolf commented Nov 13, 2024

Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files #348

Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files #348

Comments

MalekSolta commented Nov 11, 2024 • edited Loading

ryantwolf commented Nov 12, 2024 • edited Loading

MalekSolta commented Nov 13, 2024

ryantwolf commented Nov 13, 2024 • edited Loading

ryantwolf commented Nov 13, 2024

MalekSolta commented Nov 11, 2024 •

edited

Loading

ryantwolf commented Nov 12, 2024 •

edited

Loading

ryantwolf commented Nov 13, 2024 •

edited

Loading