Add correct text ordering and column merging to CL #83

Open: wants to merge 4 commits into main

Conversation

@conceptofmind (Contributor) commented Jun 4, 2024

The columns in CL need to be coalesced in the order:

"html_with_citations"
"html_columbia"
"html_lawbox"
"xml_harvard"
"html_anon_2020"
"html"
"plain_text"

The plain_text column contains only a smaller, low-quality subset of the total data, so using it alone misses much of the information available in the other columns. Not all rows contain information in every column. To get the largest quantity of proper CL data possible without additionally scraping the .txt files, the columns must be merged in the order above.
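
For illustration, a minimal pandas sketch of coalescing in that order (assuming the dump has already been read into a DataFrame df; not the exact code in this PR):

    from functools import reduce

    # Priority order for coalescing, highest quality first.
    PRIORITY = [
        "html_with_citations",
        "html_columbia",
        "html_lawbox",
        "xml_harvard",
        "html_anon_2020",
        "html",
        "plain_text",
    ]

    # For each row, keep the first non-null value across the prioritized columns.
    df["text"] = reduce(
        lambda acc, col: acc.combine_first(df[col]), PRIORITY[1:], df[PRIORITY[0]]
    )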

Harvard, in cold cases and CAP, chose to use regex to do the initial text extraction. This can be replaced with something like Trafilatura in one line.

I tried to keep this similar to what zhenlinx had. I did not remove their script.

The script:

  • Reads the decompressed CL data dump
  • Adds metadata and source information
  • Drops unnecessary columns based on Harvard CAP input
  • Combines columns in the correct order based on CL documentation
  • Drops repeated XML/HTML columns
  • Extracts HTML/XML following Harvard CAP
  • Merges extracted text and plain text properly
  • Handles null values based on text row
  • Writes out to dolma format

I iterated through the jsonl.gz file to make sure it looked ~normal.

Additional filtering and boilerplate removal will need to be added in another PR.

This leaves you with the "correct" initial CL data following guidelines by Harvard CAP and CL/Mike Lissner.

@blester125

@conceptofmind (Contributor, Author) commented Jun 4, 2024

A side note. If I am to add Spark processing I will need to make additional adjustments to other parts of the scripts.

The combine steps are separated into what needs to be extracted and the plain text. This was in case a different text extractor was chosen; Trafilatura seemed to over-filter the plain text if it was not handled independently.

I have run this semi-processed version and uploaded parquets.

@craffel requested a review from blester125 on June 4, 2024 16:13
@craffel (Collaborator) commented Jun 4, 2024

Thanks for this. I don't see any reason to have both versions of the processing pipelines, can you merge this into the existing one?

@craffel requested a review from wildphoton on June 4, 2024 16:13
@conceptofmind (Contributor, Author)

Thanks for this. I don't see any reason to have both versions of the processing pipelines, can you merge this into the existing one?

Sounds good. I will merge it into the existing one.

@blester125 (Collaborator) left a comment

I listed out several changes that should be made, currently the output format is not correct.

I can't see needing something like Spark for this dataset. If there are OOM or slowness issues due to pandas reading the whole CSV at once, it should be easy to update to 1) read the CSV line-by-line via a generator that uses the csv module from the stdlib, 2) parallelize the conversion from csv record to dolma record with multiprocessing.Pool.imap, and 3) pass the resulting iterator to the to_dolma function.

That should avoid reading the whole file at once and leverage the embarrassingly parallel nature of the task.
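
A rough sketch of that streaming approach, with hypothetical helper names (read_csv_rows, convert_record_to_dolma) standing in for whatever the script actually defines:

    import csv
    import multiprocessing as mp
    import sys

    from licensed_pile.write import to_dolma


    def read_csv_rows(path):
        # Opinion rows carry large HTML fields, so raise the stdlib csv field limit.
        csv.field_size_limit(sys.maxsize)
        with open(path, newline="") as f:
            yield from csv.DictReader(f)


    def convert_record_to_dolma(row):
        # Placeholder conversion from one CSV record to a dolma-format dict.
        return {"id": row["id"], "text": row.get("plain_text", "")}


    def run(input_file, output_dir, output_file_base_name, shard_size, processors=8):
        with mp.Pool(processors) as pool:
            dolma = pool.imap(convert_record_to_dolma, read_csv_rows(input_file))
            to_dolma(dolma, output_dir, output_file_base_name, shard_size)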

Collaborator

This should be a modification of the csv_to_dolma.py instead of a new file

Contributor Author

Yes will merge.

parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
parser.add_argument(
    "--output_dir",
    default=f"data/courtlistener/v0",
Collaborator

This should be "data/courtlistener/v0/documents"

Contributor Author

Got it.

Contributor Author

    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )

parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
parser.add_argument(
    "--output_dir",
    default=f"data/courtlistener/v0",
Collaborator

This doesn't need to be an f-string

Contributor Author

Got it.

Contributor Author

    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )

)
parser.add_argument(
    "--input_file",
    default="./data/courtlistener/raw/opinions-2022-08-02.csv",
Collaborator

We shouldn't have a default for this, especially as the date doesn't match the latest date. Let's use required=True and remove the default.

Contributor Author

Will do.

Contributor Author

parser.add_argument(
    "--input_file",
    required=True,
    help="The path to the csv file to convert.",
)


def main(args):
    example = process_court_listener(args.input_file)
    output_file_base_name = os.path.basename(args.input_file).replace(
Collaborator

Now that we don't have multiple dated files, we should make the base filename a CLI argument with a default like courtlistener.jsonl.gz instead of basing it on the input file.

Contributor Author

Will do.

Contributor Author

to_dolma(example, args.output_dir, args.output_file_base_name, args.shard_size)

...

parser.add_argument(
  "--output_file_base_name",
  default="courtlistener.jsonl.gz",
  help="The base name of the output file.",
)


# extract text from html and xml following Harvard CAP
# They used r"<.+?>", ""
df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))
Collaborator

This can get vectorized via built in pandas instead of using apply, i.e.,:

df["text"] = df["text"].str.replace(r"<.+?>", "", regex=True)

Contributor Author

This is dependent on the extractor we choose.


# extract text from html and xml following Harvard CAP
# They used r"<.+?>", ""
df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))
Collaborator

Do we have any auditing tools for the output? It can be OK if you have a strict, well-defined subset of things you are looking for, but generally you shouldn't parse HTML/XML with a regex: https://stackoverflow.com/a/1732454
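
There is no audit tool in this PR, but a rough sketch of the kind of spot check being asked about might look like this (thresholds and heuristics are arbitrary):

    import re

    MARKUP_HINT = re.compile(r"<\w|&[a-z]+;")


    def audit_extracted_text(texts, sample=5):
        """Rough check: how many extracted docs are empty or still look like markup."""
        suspicious = [t for t in texts if isinstance(t, str) and MARKUP_HINT.search(t)]
        empty = sum(1 for t in texts if not isinstance(t, str) or not t.strip())
        print(f"{len(suspicious)} docs with markup leftovers, {empty} empty docs")
        for t in suspicious[:sample]:
            print(repr(t[:200]))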

Contributor Author

Trafilatura is the extractor that is commonly used.

I left what Harvard CAP did with regex.

Contributor Author

This is what it would be with trafilatura:

    # extract text from html and xml using trafilatura
    df["text"] = df["text"].apply(lambda x: extract(x))

# They used r"<.+?>", ""
df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))

# combine merge plain text and extracted text
Collaborator

Again, merge is a bad name for this operation.

Contributor Author (Jun 4, 2024)

I can list it as coalesce instead.

Collaborator

I'm not sure coalesce is a good name either, that also implies you are creating one value from multiple column values. I hate that SQL picked such a bad name lol

Contributor Author

Yes LOL. It is a bad name. I will comment after thinking about it a bit.

Contributor Author (Jun 4, 2024)

Maybe something like this:

    # Combine DataFrame objects by filling null values 
    # with non-null values from the other selected DataFrames.

Adapted from docs.

Contributor Author

    # Combine DataFrame objects by filling null values 
    # with non-null values from the other selected DataFrames.
    df["text"] = df["text"].combine_first(df["plain_text"])

df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))

# combine merge plain text and extracted text
df["text"] = df["text"].combine_first(df["plain_text"])
Collaborator

This line doesn't do anything, there is already a .dropna(subset=["text"]) call so there won't be any line where the text column is None and the plain_text column is non-null.

Additionally, .combine_first doesn't consider an empty string as null so even if there was a text column that was totally removed based on the regex above, it wouldn't get replaced by the plain_text column.

The plain_text column should either be part of the initial .combine_first chain, or the .dropna above should be removed and rows with an empty string in the text column should be converted to None so the plain_text column will actually replace the text column.
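
A sketch of the second option, treating empty strings as missing so the plain_text fallback actually fires (assumes pandas is imported as pd, as elsewhere in the script):

    import pandas as pd

    # Turn whitespace-only extraction results into missing values so that
    # plain_text can actually act as the fallback in combine_first.
    df["text"] = df["text"].str.strip().replace("", pd.NA)
    df["text"] = df["text"].combine_first(df["plain_text"])
    df = df.dropna(subset=["text"])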

Contributor Author

This is an artifact of using a different extractor.

Contributor Author

Trafilatura will over-filter and cause null if not handled explicitly.

Contributor Author

This is not the case with standard regex.

Collaborator

If Trafilatura is so overzealous that it turns real data into null, that sounds like a red flag. Also, if the plain_text column is good enough for those cases, is there a reason it isn't good enough to always use?

Contributor Author

I agree that it is a red flag. I am writing my own explicit extractor for this dataset but that will not be out until later.

Trafilatura ranks as the highest-quality extractor given a general task.

The extractor should be run on every column that is not plain text. Plain text is considered the lowest-quality data for coalescing purposes. It should only be used if the other columns do not exist or cannot be extracted.
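
A hedged sketch of that ordering: run the extractor on each markup column and only then fall back to plain_text (trafilatura's extract is assumed to return None when it cannot pull out text, which is the over-filtering case mentioned above):

    import pandas as pd
    from trafilatura import extract

    MARKUP_COLUMNS = [
        "html_with_citations",
        "html_columbia",
        "html_lawbox",
        "xml_harvard",
        "html_anon_2020",
        "html",
    ]


    def extract_or_none(value):
        # Skip missing cells; extract() returns None when it finds no usable text.
        if not isinstance(value, str):
            return None
        return extract(value, favor_precision=True)


    # Extract each markup column independently, take the first successful result
    # per row, and use plain_text only when every extraction came back empty.
    extracted = df[MARKUP_COLUMNS].apply(lambda col: col.map(extract_or_none))
    df["text"] = extracted.bfill(axis=1).iloc[:, 0].combine_first(df["plain_text"])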

df["text"] = df["text"].combine_first(df["plain_text"])

# drop plain text column and text null values
df = df.drop(columns=["plain_text"]).dropna(subset=["text"])
Collaborator

Again, these are super different operations so we should explicitly separate them.

Contributor Author

Can split and add comments.

Contributor Author

    # drop the plain text column
    df = df.drop(columns=["plain_text"])
    
    # drop null values in the text column
    df = df.dropna(subset=["text"])

@conceptofmind (Contributor, Author)

I listed out several changes that should be made, currently the output format is not correct.

I can't see needing something like Spark for this dataset. If there are OOM or slowness issues due to pandas reading the whole CSV at once, it should be easy to update to 1) read the CSV line-by-line via a generator that uses the csv module from the stdlib, 2) parallelize the conversion from csv record to dolma record with multiprocessing.Pool.imap, and 3) pass the resulting iterator to the to_dolma function.

That should avoid reading the whole file at once and leverage the embarrassingly parallel nature of the task.

I tried to keep it as close to the original as possible. There are numerous different changes I would have personally made. I can integrate those instead as well.

@blester125 (Collaborator)

Also, if you are going to keep working on the same branch after a merge (which you should avoid), you should prefer rebasing onto main instead of merging it into your branch (which happened here 9df6741)

@conceptofmind (Contributor, Author) commented Jun 4, 2024

I will resolve each of the comments. However, the text extractor should be decided upon before I do so. This is a bare minimal reproduction of what was done previously in Spark but following what was already there in the original PR for this project/dolma format.

@conceptofmind (Contributor, Author)

Also, if you are going to keep working on the same branch after a merge (which you should avoid), you should prefer rebasing onto main instead of merging it into your branch (which happened here 9df6741)

That was a "fat finger" haha. Noticed after it and hadn't squashed.

@conceptofmind (Contributor, Author)

@blester125 Do you have an opinion on trafilatura?

@conceptofmind (Contributor, Author) commented Jun 4, 2024

@blester125

Given the comments on parallelization.

The csv can be iterated through in chunks of 20k rows with something like the below:

csv_file = f'{opinions_string}.csv'

chunk_size = 20000

# Read the CSV file in chunks
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    ...  # processing code

I am not sure if this is what you had in mind.

Harvard CAP handled loading the entire file initially and then repartitioning.

There are a few ways to handle the csv and I am not opinionated.

We can also preprocess the large csv into smaller files as a separate initial step.

There also should not be OOM or slowness issues, as the decompressed csv is only around 240 GB.

@blester125 (Collaborator)

The most straightforward way of parallelism would probably be something like this:

data = read_csv_generator(csv_file)
with mp.Pool(args.processors) as pool:
  dolma = pool.imap(convert_record_to_dolma, data)
  to_dolma(dolma, ...)

You could either have the read_csv_generator output individual rows (and then pandas probably isn't the best fit) or you could have it output chunks like you mentioned (as long as passing the pandas dataframe to the new process isn't extremely expensive) and would just need to add something like dolma = itertools.chain(*dolma) before the to_dolma call.

Assuming there isn't too much contention with respect to the disk, I would think this could saturate a multicore machine. TBH I don't think this dataset is big enough that we need to be 100 percent efficient.

@conceptofmind (Contributor, Author) commented Jun 4, 2024

The most straightforward way of parallelism would probably be something like this:

data = read_csv_generator(csv_file)
with mp.Pool(args.processors) as pool:
  dolma = pool.imap(convert_record_to_dolma, data)
  to_dolma(dolma, ...)

You could either have the read_csv_generator output individual rows (and then pandas probably isn't the best fit) or you could have it output chunks like you mentioned (as long as passing the pandas dataframe to the new process isn't extremely expensive) and would just need to add something like dolma = itertools.chain(*dolma) before the to_dolma call.

Assuming there isn't too much contention with respect to the disk, I would think this could saturate a multicore machine. TBH I don't think this dataset is big enough that we need to be 100 percent efficient.

One dump is around 240 GB csv decompressed. Many mid-sized machines could likely load it and process in one go.

I just spoke to luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.

There is the option to pre-chunk the entire large dataset into smaller ones as well.

Very basic implementation:

import os
import subprocess

import pandas as pd

opinions_string = "opinions-2023-12-04"

# Decompress the bulk dump in place.
command = f"bunzip2 {opinions_string}.csv.bz2"
subprocess.run(command, capture_output=True, text=True, shell=True)

csv_file = f'{opinions_string}.csv'

chunk_size = 20000

output_dir = opinions_string
os.makedirs(output_dir, exist_ok=True)

# Split the large csv into smaller per-chunk csv files.
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    chunk_file = os.path.join(output_dir, f'chunk_{i+1}.csv')
    chunk.to_csv(chunk_file)
    print(f"Processed chunk {i+1}")

os.remove(f"{opinions_string}.csv")

This would speed up the mp at the cost of a preprocessing step.

Before I add something to the PR, I am going to comment here.

As an aside, I am trying to convince them not to upload one giant csv.bz2, which can't be read in parallel by major frameworks like Spark.

@wildphoton (Collaborator) commented Jun 4, 2024

Using pandas, even reading in chunks, won't avoid having all rows in memory IMO. If we expect people to run our code, shall we make the code more memory friendly instead of speed efficient? Previously, I read and processed the file into dolma line by line on the fly, and you can do whatever you want in process_row(). Although this is not parallelized, it is not too slow to iterate the whole file. What do you think @conceptofmind @blester125?

with open(file_path, "r") as csvfile:
    reader = csv.DictReader(csvfile)

    # Yield a dictionary for each row
    for row in reader:
        dolma_dict = process_row(row)
        yield dolma_dict

@conceptofmind (Contributor, Author) commented Jun 4, 2024

Using pandas, even reading in chunks, won't avoid having all rows in memory IMO. If we expect people to run our code, shall we make the code more memory friendly instead of speed efficient? Previously, I read and processed the file into dolma line by line on the fly, and you can do whatever you want in process_row(). Although this is not parallelized, it is not too slow to iterate the whole file. What do you think @conceptofmind @blester125?

with open(file_path, "r") as csvfile:
    reader = csv.DictReader(csvfile)

    # Yield a dictionary for each row
    for row in reader:
        dolma_dict = process_row(row)
        yield dolma_dict

The above pandas read-in-chunks approach runs on my local desktop, which does not have close to enough RAM to load the entire 250 GB into memory. I would not think that pandas read_csv would handle memory mapping well enough to fit such a large dataset into memory, but I could be wrong.

Reading in chunks should function similarly to an iterator.

I agree that iterating across the entire file is fast enough. I also have no strong opinion on this.

I am perfectly fine with what works best. If we really want people to be able to process this locally very easily, maybe adding that additional preprocessing step to split the single large csv file into multiple files is worth it.

We could probably even "ensure" that each smaller file is around 128MB - 256MB.
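
A hedged sketch of that pre-splitting step, using the stdlib csv module (which handles quoted multi-line HTML fields) and rolling to a new file once roughly a target size has been written; the 200 MB target and file naming are arbitrary:

    import csv
    import os
    import sys

    TARGET_BYTES = 200 * 1024 * 1024  # roughly the 128 MB - 256 MB range above


    def split_csv(path, output_dir, target_bytes=TARGET_BYTES):
        csv.field_size_limit(sys.maxsize)
        os.makedirs(output_dir, exist_ok=True)
        with open(path, newline="") as src:
            reader = csv.reader(src)
            header = next(reader)
            part, out, writer, written = 0, None, None, 0
            for row in reader:
                if out is None or written >= target_bytes:
                    if out is not None:
                        out.close()
                    part += 1
                    out = open(os.path.join(output_dir, f"part_{part:05d}.csv"), "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    written = 0
                writer.writerow(row)
                # Approximate size tracking: field lengths only, ignoring quoting overhead.
                written += sum(len(field) for field in row) + len(row)
            if out is not None:
                out.close()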

@blester125 (Collaborator)

I just spoke to luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.

We don't really need to worry about this, we just need to output dolma files with the to_dolma function. Huggingface supports the format well enough that it can automatically make and cache parquet files. We don't need to actually publish parquet files.

@conceptofmind (Contributor, Author) commented Jun 4, 2024

I just spoke to luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.

We don't really need to worry about this, we just need to output dolma files with the to_dolma function. Huggingface supports the format well enough that it can automatically make and cache parquet files. We don't need to actually publish parquet files.

Ok. I am going to comment some changes above before making changes to the PR. So that a few things can be ironed out first.

Docs about pandas csv chunksize:

iterator : bool, default False
    Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize : int, optional
    Number of lines to read from the file per chunk. Passing a value will cause the function to return a TextFileReader object for iteration. See the [IO Tools docs](https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking) for more information on iterator and chunksize.

@conceptofmind (Contributor, Author) commented Jun 4, 2024

@blester125 Responded to each of the comments.

An updated example script for loading the entire file would look like:

import argparse
import logging
import re
from datetime import datetime

import pandas as pd

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.logs import configure_logging
from licensed_pile.write import to_dolma


def process_court_listener(file_path):
    """
    This function performs several operations on a given file:

    1. Reads the decompressed Court Listener (CL) data dump from the provided file path.
    2. Adds metadata and source information to the data.
    3. Drops unnecessary columns based on Harvard CAP input.
    4. Selects the first non-null value when iterating over columns based on CL documentation.
    5. Drops repeated XML/HTML columns.
    6. Extracts HTML/XML following Harvard CAP guidelines.
    7. Selects the first non-null value when iterating over extracted text and plain text columns.
    8. Handles null values based on text row.
    9. Writes out the processed data to dolma format.

    Parameters:
    file_path (str): The path to the file to be processed.
    """

    logger = logging.getLogger("court-listener-opinion")

    df = pd.read_csv(file_path)

    df["source"] = args.source_name
    df["added"] = datetime.utcnow().isoformat()

    # rename the column date_created -> created
    df = df.rename(columns={"date_created": "created"})

    # add the metadata column including the author_str, license, and url
    df["metadata"] = df.apply(
        lambda x: {
            "license": str(PermissiveLicenses.PD),
            "url": x["download_url"],
            "author": x["author_str"],
        },
        axis=1,
    )

    df = df.drop(
        columns=[
            "date_modified",
            "author_str",
            "per_curiam",
            "joined_by_str",
            "type",
            "sha1",
            "page_count",
            "local_path",
            "extracted_by_ocr",
            "author_id",
            "cluster_id",
            "download_url",
        ]
    )

    # For each row, select the first non-null value when iterating over columns in this order:
    # html_with_citations
    # html_columbia
    # html_lawbox
    # xml_harvard
    # html_anon_2020
    # html
    # plain_text
    # This follows the procedure in the Court Listener documentation.

    df["text"] = (
        df["html_with_citations"]
        .combine_first(df["html_columbia"])
        .combine_first(df["html_lawbox"])
        .combine_first(df["xml_harvard"])
        .combine_first(df["html_anon_2020"])
        .combine_first(df["html"])
    )

    # drop the columns from combine_first to avoid redundancy
    df = df.drop(
        columns=[
            "html",
            "html_anon_2020",
            "html_lawbox",
            "html_columbia",
            "xml_harvard",
            "html_with_citations",
        ]
    )

    # drop null values in the text column
    df = df.dropna(subset=["text"])

    # extract text from html and xml following Harvard CAP
    # They used r"<.+?>", ""
    df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))

    # Combine DataFrame objects by filling null values
    # with non-null values from the other selected DataFrames.
    df["text"] = df["text"].combine_first(df["plain_text"])

    # drop the plain text column
    df = df.drop(columns=["plain_text"])

    # drop null values in the text column
    df = df.dropna(subset=["text"])

    # return a dictionary for each row - dolma format
    return df.to_dict(orient="records")


def main(args):
    example = process_court_listener(args.input_file)
    to_dolma(example, args.output_dir, args.output_file_base_name, args.shard_size)
    logger.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")


if __name__ == "__main__":
    logger = configure_logging("court-listener-opinion")

    parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )
    parser.add_argument(
        "--shard_size",
        default=1000,
        help="The number of documents to store in each shard.",
    )
    parser.add_argument(
        "--input_file",
        required=True,
        help="The path to the csv file to convert.",
    )
    parser.add_argument(
        "--output_file_base_name",
        default="courtlistener.jsonl.gz",
        help="The base name of the output file.",
    )
    parser.add_argument(
        "--source_name",
        default="SOURCE_NAME",
        help="The name of the source.",
    )
    args = parser.parse_args()
    main(args)

The text extraction question related to trafilatura still needs to be decided. If we are not using trafilatura, the other combine_first can be removed. This is what trafilatura text extraction looks like: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

Example output:

{"id": 6881, "created": "2010-04-25 05:22:21+00", "source": "SOURCE_NAME", "added": "2024-06-04T22:16:15.833924", "metadata": {"license": "Public Domain", "url": "http://www.ca5.uscourts.gov/opinions\\\\pub\\\\94/94-60337.CV0.wpd.pdf", "author": NaN}, "text": "United States Court of Appeals,\\n\\n ......."}

@conceptofmind (Contributor, Author) commented Jun 4, 2024

If you want to iterate then something like this should work I believe:

    for chunk in pd.read_csv(file_path, chunksize=args.shard_size):
        df = chunk
        ...
        yield df.to_dict(orient="records")

@blester125 (Collaborator)

Yeah, I think reading in chunks and passing the chunks to a multiprocessing pool would be the way to go

@conceptofmind (Contributor, Author)

Yeah, I think reading in chunks and passing the chunks to a multiprocessing pool would be the way to go

Ok I will add that to the main function.

@conceptofmind (Contributor, Author) commented Jun 5, 2024

I tested something like this given your comment:

def iterate_over_chunks(file_path):
    for chunk in pd.read_csv(file_path, chunksize=args.shard_size):
        yield chunk

def main(args):
    data = iterate_over_chunks(args.input_file)
    with mp.Pool(args.processors) as pool:
        dolma_chunks = pool.imap(process_court_listener, data)
        dolma = itertools.chain(*dolma_chunks)
        to_dolma(dolma, args.output_dir, args.output_file_base_name, args.shard_size)

The speedup is not massively significant. Limitations of pandas may be shown here.

As noted by Zhenlin Xu, single-process is still reasonable and can be done quickly as well, even with just iterating alone.

Tried chaining from the iterable as well but this was even slower:

dolma_chunks = pool.imap(process_court_listener, ((chunk) for chunk in data))
dolma = itertools.chain.from_iterable(dolma_chunks)

We can always load into something like DuckDB and coalesce instead of using Spark, though that also requires a larger machine.
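
For reference, a hedged sketch of the DuckDB route (DuckDB streams the CSV and can spill to disk, so the full 240 GB does not need to fit in memory); file and column names follow the thread, everything else is an assumption:

    import duckdb

    con = duckdb.connect()
    con.execute(
        """
        COPY (
            SELECT
                id,
                date_created AS created,
                COALESCE(html_with_citations, html_columbia, html_lawbox,
                         xml_harvard, html_anon_2020, html, plain_text) AS text
            -- read_csv_auto may need options (e.g. a larger max_line_size)
            -- for very long opinion rows.
            FROM read_csv_auto('opinions-2023-12-04.csv')
        ) TO 'opinions_coalesced.parquet' (FORMAT PARQUET)
        """
    )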

@blester125 (Collaborator)

I see, I think this version makes sense, even if the speedup isn't the best. Let's get all the changes into the PR and get it merged!

@conceptofmind (Contributor, Author)

I see, I think this version makes sense, even if the speedup isn't the best. Let's get all the changes into the PR and get it merged!

Will do

@wildphoton (Collaborator)

I have tried using BeautifulSoup for html/xml parsing and it seems to work well. I will share the processed data soon. @conceptofmind @blester125

@conceptofmind (Contributor, Author) commented Jun 11, 2024

I have tried using BeautifulSoup for html/xml parsing and it seems to work well. I will share the processed data soon. @conceptofmind @blester125

Hi,

Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP.

BS4 uses its own LXML backend extraction which is typically worse quality than Trafilatura extraction when precision is set True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.

@wildphoton (Collaborator) commented Jun 12, 2024

Hi,

Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP.
BS4 uses its own LXML backend extraction which is typically worse quality than Trafilatura extraction when precision is set True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.

What kind of preprocessing steps are needed? What upstream content from CAP does it require? The current PR does not show these. I just applied BS4 text extraction on the raw data and it works great. If Trafilatura works better, could you include it and all the preprocessing you mentioned in the CR please? A stand-alone dataset won't be very helpful. Thanks.

@conceptofmind (Contributor, Author) commented Jun 12, 2024

Hi,
Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP.
BS4 uses its own LXML backend extraction which is typically worse quality than Trafilatura extraction when precision is set True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.

What kind of preprocessing steps are needed? What upstream content from CAP does it require? The current PR does not show these.

In the issue, I stated a separate PR is required to handle additional processing. This requires the new CAP data dumps to have the opinionated fixes by CL applied and streamlined into the bulk CL data dump. Jack Cushman, Kevin Ramirez, Mike Lissner, Bill Palin, and I are working on the most efficient way to provide this set of fixes to the community.

I just applied BS4 text extraction on the raw data and it works great. If Trafilatura works better, could you include it and all the preprocessing you mentioned in the CR please?

You can review the precision and F1 of different text extractors here:

| Model         | Mean Precision | Mean F1 | Median Precision | Median F1 |
|---------------|----------------|---------|------------------|-----------|
| Trafilatura   | 0.913          | 0.883   | 0.989            | 0.957     |
| DOM Distiller | 0.894          | 0.858   | 0.983            | 0.959     |
| Web2Text      | 0.797          | 0.841   | 0.885            | 0.917     |
| Boilerpipe    | 0.908          | 0.834   | 0.973            | 0.946     |
| Dragnet       | 0.901          | 0.823   | 0.980            | 0.943     |
| BTE           | 0.796          | 0.817   | 0.927            | 0.936     |
| Newspaper3k   | 0.896          | 0.816   | 0.994            | 0.958     |
| news-please   | 0.895          | 0.815   | 0.994            | 0.958     |
| Goose3        | 0.899          | 0.810   | 0.999            | 0.940     |
| BoilerNet     | 0.840          | 0.798   | 0.944            | 0.895     |
| ExtractNet    | 0.858          | 0.791   | 0.963            | 0.911     |
| jusText       | 0.794          | 0.759   | 0.949            | 0.904     |
| lxml Cleaner  | 0.615          | 0.717   | 0.670            | 0.798     |
| html_text     | 0.567          | 0.683   | 0.506            | 0.667     |
| BS4           | 0.563          | 0.680   | 0.506            | 0.669     |
| inscriptis    | 0.557          | 0.673   | 0.483            | 0.649     |
| XPath Text    | 0.550          | 0.664   | 0.510            | 0.674     |

Trafilatura with precision set to True will have even better results than the above. As you can see, BS4 ranks quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the above code: trafilatura.extract(filecontent, favor_precision=True).

A stand-alone dataset won't be very helpful. Thanks.

For clarification:

  • CAP has its own group chat and fixes
  • CL has its own group chat and fixes
  • There are parts in which both are not clearly communicated due to "differences", I would call it
  • There are pieces of processing in which I am providing input directly to both of them for improving model training
  • Mike and I are improving the documentation for CL so there are parts that are more clear to the community
  • There are upstreams that need to be made independently from CAP to CL
  • There are numerous instances of boilerplate and malformed text which we have already post-processed out

There is a bit of jumping around when trying to communicate all of this.

@blester125 (Collaborator)

How variable is the markup in each example for this dataset? Part of the point of splitting work by datasets was to have someone work with the data and figure out patterns in the way a particular source formats their data so we can create a better text extractor/clean up for that data source instead of just dumping it into an off the shelf tool that tries to handle everything

@conceptofmind (Contributor, Author) commented Jun 12, 2024

How variable is the markup in each example for this dataset? Part of the point of splitting work by datasets was to have someone work with the data and figure out patterns in the way a particular source formats their data so we can create a better text extractor/clean up for that data source instead of just dumping it into an off the shelf tool that tries to handle everything

The most variability between CAP and CL is currently in "xml_harvard" which contains a lot of useful information. These documents require opinionated fixes from places like here: https://github.com/freelawproject/courtlistener/blob/a3bc7b472a1736bc0cb0e92cfa096dec12960e17/cl/corpus_importer/management/commands/harvard_opinions.py#L286

Which can be seen in a merged set of json here: https://github.com/freelawproject/opinionated/tree/main/data/harvard

This needs to be upstreamed in the new current CAP dumps to CL. We are working on this.

Malformed HTML/XML can be seen here: <span class="star-pagination" number="848" pagescheme="<span class="citation" data-id="4498419">.

There are numerous different instances of these cases. Once the upstreamed dumps are merged, CAP said this issue should not occur. I have written individual regexes to cover as many of these edge cases as I could find. Child nodes related to the edge case above are additionally being removed as well.

There are also places where I am writing a regex to standardize tags like '<jurissystem' -> '<div' for extraction. These need to be handled independently, because if you remove the incorrect tags first, applying the upstream fixes between CAP and CL will have issues.
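
To make the tag standardization concrete, an illustrative sketch only (the real fix covers more tags and edge cases than this):

    import re

    # Normalize non-standard tags such as <jurissystem ...> to <div ...> so a
    # generic HTML extractor can handle them.
    NONSTANDARD_TAGS = ("jurissystem",)


    def normalize_tags(html: str) -> str:
        for tag in NONSTANDARD_TAGS:
            html = re.sub(rf"<(/?){tag}\b", r"<\1div", html)
        return html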

Other than that it is cleaning out standard things like boilerplate and OCR errors which we have been annotating.

There are other things that have been fixed and additional input I am waiting on from CAP/CL before adding to the next PR.

I messaged with Colin and will work on documenting this as clearly as possible.

@conceptofmind (Contributor, Author) commented Jul 9, 2024

Additional changes will need to be made to drop the Harvard XML column and replace it with the new fixed HTML column from the updated dumps here: https://huggingface.co/datasets/conceptofmind/CAP

This will require the two datasets to be merged after appropriate column ordering from CAP.

This fixes the star pagination issue, where malformed HTML/XML can be seen, e.g.: <span class="star-pagination" number="848" pagescheme="<span class="citation" data-id="4498419">.

It would be preferred to get this merged into the actual CL dumps, but this is pending.
