Add correct text ordering and column merging to CL #83
Review: We should add a simple docstring to the module.
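A minimal sketch of such a module docstring, inferred from the script's argparse description ("Convert csv data to dolma"); the wording is illustrative, not from the PR:

```python
"""Convert a Court Listener opinion CSV data dump into dolma-formatted shards."""
```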
@@ -0,0 +1,132 @@
import argparse
import csv
import logging
Review: Unneeded import.
Review: I do not think much of this is needed.
Reply: Removed import.
import os
import re
import sys

import pandas as pd

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.logs import configure_logging
from licensed_pile.write import to_dolma

SOURCE_NAME = "CourtListenerOpinion"

csv.field_size_limit(sys.maxsize)
Review: Is this needed now that we aren't using the csv stdlib? If it isn't, there are several unused imports to remove.
Review: Same as above.
Reply: Deleted this and removed imports.
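For context on what the deleted line did: the stdlib csv reader caps individual field sizes and errors out on oversized fields, which long court opinions easily exceed. A small sketch of the failure mode it guarded against (the 131072-byte default is from the csv module docs):

```python
import csv
import io
import sys

huge_field = "x" * 200_000  # larger than the default 131072-byte cap
reader = csv.reader(io.StringIO(huge_field + ",other"))
# Without raising the limit, iterating the reader raises:
#   _csv.Error: field larger than field size limit (131072)
csv.field_size_limit(sys.maxsize)  # lift the cap before parsing
rows = list(reader)
```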

logger = configure_logging("court-listener-opinion")
Review: Prefer moving the configure_logging call under `if __name__ == "__main__":`.
Reply: Will do.
Reply: I think this is what you meant:

```python
def process_court_listener(file_path):
    logger = logging.getLogger("court-listener-opinion")
    ...


if __name__ == "__main__":
    logger = configure_logging("court-listener-opinion")
    ...
```


def process_court_listener(file_path):
Review: Missing docstring.
Reply: I will add a docstring following something like:
Reply: How about something like:

```python
def process_court_listener(file_path):
    """
    This function performs several operations on a given file:

    1. Reads the decompressed Court Listener (CL) data dump from the provided file path.
    2. Adds metadata and source information to the data.
    3. Drops unnecessary columns based on Harvard CAP input.
    4. Selects the first non-null value when iterating over columns based on CL documentation.
    5. Drops repeated XML/HTML columns.
    6. Extracts HTML/XML following Harvard CAP guidelines.
    7. Selects the first non-null value when iterating over extracted text and plain text columns.
    8. Handles null values based on text row.
    9. Writes out the processed data to dolma format.

    Parameters:
        file_path (str): The path to the file to be processed.
    """
```
    df = pd.read_csv(file_path)

    # add metadata column
    df["metadata"] = str(PermissiveLicenses.PD)
Review: The metadata value should be a dict in the dolma format, not a string.
Reply: Will create a metadata dict as noted above.
Reply: Added to metadata dict: `"license": str(PermissiveLicenses.PD),`

    # add source column
    df["source"] = SOURCE_NAME
Review: Let's add a `--source_name` CLI argument for this.
Reply: Will do.
Reply:

```python
def process_court_listener(file_path):
    ...
    df["source"] = args.source_name


parser.add_argument(
    "--source_name",
    default="SOURCE_NAME",
    help="The name of the source.",
)
```

    """
Review: This comment doesn't add anything, the list of columns to remove is in the code below.
Reply: Removed comment.
    Remove columns:
    date_modified
    author_str
Review: We should include the `author_str` in the metadata.
Reply: Will do.
Reply: Added this: `"author": x["author_str"],`
    per_curiam
    joined_by_str
    type
    sha1
    page_count
    local_path
    extracted_by_ocr
    author_id
    cluster_id
    """
    df = df.drop(
        columns=[
Review: Not something to be addressed, but I hate that pandas would do something totally different if this was a tuple instead of a list (and it would be better as a tuple as it is read only lol). (The tuple-vs-list behavior is illustrated just after this drop call.)
Reply: LOL
            "date_modified",
            "author_str",
            "per_curiam",
            "joined_by_str",
            "type",
            "sha1",
            "page_count",
            "local_path",
            "extracted_by_ocr",
            "author_id",
            "cluster_id",
        ]
    )
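To illustrate the tuple-vs-list point from the thread above: the pandas `DataFrame.drop` docs state that a tuple is used as a single label rather than treated as list-like, so the same call with a tuple fails on a flat index. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
df.drop(columns=["a", "b"])  # drops both columns
df.drop(columns=("a", "b"))  # KeyError: ('a', 'b') is treated as one column label
```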
""" | ||
Review: Prefer multiple single-line comments over multiline strings as comments (syntax highlighting doesn't handle the string as a comment).
Reply: Will change.
Reply:

```python
# For each row, select the first non-null value when iterating over columns in this order:
#   html_with_citations
#   html_columbia
#   html_lawbox
#   xml_harvard
#   html_anon_2020
#   html
#   plain_text
# This follows the procedure in the Court Listener documentation.
df["text"] = (
    df["html_with_citations"]
    .combine_first(df["html_columbia"])
    .combine_first(df["html_lawbox"])
    .combine_first(df["xml_harvard"])
    .combine_first(df["html_anon_2020"])
    .combine_first(df["html"])
)
```
    Merge columns based on Court Listener documentation:
Review: We should re-write this comment. "Merge" isn't the name for this operation. Something like: …
Reply: Coalesce is what is used in Spark.
Review: I have no opinion on this. We can add that comment.
Reply:

```python
# For each row, select the first non-null value when iterating over columns in this order:
#   html_with_citations
#   html_columbia
#   html_lawbox
#   xml_harvard
#   html_anon_2020
#   html
#   plain_text
# This follows the procedure in the Court Listener documentation.
df["text"] = (
    df["html_with_citations"]
    .combine_first(df["html_columbia"])
    .combine_first(df["html_lawbox"])
    .combine_first(df["xml_harvard"])
    .combine_first(df["html_anon_2020"])
    .combine_first(df["html"])
)
```
    html_with_citations
    html_columbia
    html_lawbox
    xml_harvard
    html_anon_2020
    html
    plain_text
    """
    df["text"] = (
        df["html_with_citations"]
        .combine_first(df["html_columbia"])
        .combine_first(df["html_lawbox"])
        .combine_first(df["xml_harvard"])
        .combine_first(df["html_anon_2020"])
        .combine_first(df["html"])
    )

    # keep only the text columns and drop null values
    df = df.drop(
        columns=[
            "html",
            "html_anon_2020",
            "html_lawbox",
            "html_columbia",
            "xml_harvard",
            "html_with_citations",
        ]
    ).dropna(subset=["text"])
Review: Drop NA is a very different operation than dropping columns. We should separate them into two different lines so the dropna doesn't get lost in a refactor.
Reply: Will do and add comments.
Reply: How about something like this:

```python
# drop the columns used in combine_first to avoid redundancy
df = df.drop(
    columns=[
        "html",
        "html_anon_2020",
        "html_lawbox",
        "html_columbia",
        "xml_harvard",
        "html_with_citations",
    ]
)
# drop null values in the text column
df = df.dropna(subset=["text"])
```

    # extract text from html and xml following Harvard CAP
    # They used r"<.+?>", ""
    df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))
Review: This can get vectorized via built-in pandas instead of using apply, i.e.:

```python
df["text"] = df["text"].str.replace(r"<.+?>", "", regex=True)
```

Reply: This is dependent on the extractor we choose.
Review: Do we have any auditing tools for the output? It can be ok if you have a strict, well-defined subset of things you are looking for, but generally you shouldn't parse HTML/XML with a regex: https://stackoverflow.com/a/1732454
Reply: Trafilatura is the extractor that is commonly used. I left what Harvard CAP did with regex.
Reply: This is what it would be with trafilatura:

```python
from trafilatura import extract

# extract text from html and xml using trafilatura
df["text"] = df["text"].apply(lambda x: extract(x))
```

    # combine merge plain text and extracted text
Review: Again, merge is a bad name for this operation.
Reply: I can list it as coalesce instead.
Review: I'm not sure coalesce is a good name either, that also implies you are creating one value from multiple column values. I hate that SQL picked such a bad name lol.
Reply: Yes LOL. It is a bad name. I will comment after thinking about it a bit. Maybe something like this, adapted from the docs:

```python
# Combine DataFrame objects by filling null values
# with non-null values from the other selected DataFrames.
df["text"] = df["text"].combine_first(df["plain_text"])
```
    df["text"] = df["text"].combine_first(df["plain_text"])
Review: This line doesn't do anything, there is already a … Additionally, the …
Reply: This is an artifact of using a different extractor.
Reply: Trafilatura will over-filter and cause null if not handled explicitly.
Reply: This is not the case with standard regex.
Review: If Trafilatura is so overzealous that it turns real data into null, that sounds like a red flag. Also, if the …
Reply: I agree that it is a red flag. I am writing my own explicit extractor for this dataset, but that will not be out until later. Trafilatura ranks as the highest-quality extractor given a general task. The extractor should be run on every column which is not plain text. Plain text is considered the lowest-quality data in terms of coalescing. It should only be used if the other columns do not exist and cannot be extracted.
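A minimal sketch of the per-column flow the last reply describes: run the extractor on every markup column, guard against trafilatura's extract returning None when it over-filters, and fall back to plain_text only as a last resort. The helper name and structure are illustrative, not code from this PR:

```python
import pandas as pd
from trafilatura import extract

# markup columns ordered from highest to lowest quality per the CL documentation
MARKUP_COLUMNS = [
    "html_with_citations",
    "html_columbia",
    "html_lawbox",
    "xml_harvard",
    "html_anon_2020",
    "html",
]


def coalesce_extracted_text(row: pd.Series):
    """Return the first successful extraction, falling back to plain_text."""
    for column in MARKUP_COLUMNS:
        value = row.get(column)
        if pd.notna(value):
            extracted = extract(value)  # returns None when trafilatura over-filters
            if extracted:
                return extracted
    # plain_text is the lowest-quality source, used only when nothing else extracts
    plain = row.get("plain_text")
    return plain if pd.notna(plain) else None


df["text"] = df.apply(coalesce_extracted_text, axis=1)
df = df.dropna(subset=["text"])
```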

    # drop plain text column and text null values
    df = df.drop(columns=["plain_text"]).dropna(subset=["text"])
Review: Again, these are super different operations so we should explicitly separate them.
Reply: Can split and add comments.
Reply:

```python
# drop the plain text column
df = df.drop(columns=["plain_text"])
# drop null values in the text column
df = df.dropna(subset=["text"])
```

    # return a dictionary for each row - dolma format
    return df.to_dict(orient="records")
Review: This doesn't result in the dolma formatted data. See https://github.com/allenai/dolma/blob/main/docs/data-format.md#dolma-document-format. Issues include: …
Reply: I will create a metadata dict for the recommended changes.
Reply: Let me know if this is what you were looking for:

```python
df = pd.read_csv(file_path)
df["source"] = args.source_name
df["added"] = datetime.utcnow().isoformat()
# rename the column date_created -> created
df = df.rename(columns={"date_created": "created"})
# add the metadata column including the author_str, license, and url
df["metadata"] = df.apply(
    lambda x: {
        "license": str(PermissiveLicenses.PD),
        "url": x["download_url"],
        "author": x["author_str"],
    },
    axis=1,
)
```
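For reference, a single record in the dolma document format the review links to would then look roughly like this; the field values below are invented placeholders, not real data from the dump:

```python
{
    "id": "123456",
    "text": "Opinion of the court ...",
    "source": "CourtListenerOpinion",
    "added": "2024-01-01T00:00:00",
    "created": "1995-06-05 00:00:00",
    "metadata": {
        "license": "Public Domain",
        "url": "https://www.courtlistener.com/...",
        "author": "Example Author",
    },
}
```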


def main(args):
    example = process_court_listener(args.input_file)
    output_file_base_name = os.path.basename(args.input_file).replace(
Review: Now that we don't have multiple dated files, we should have the base filename as a CLI argument with a default like `courtlistener.jsonl.gz`.
Reply: Will do.
Reply:

```python
to_dolma(example, args.output_dir, args.output_file_base_name, args.shard_size)
...
parser.add_argument(
    "--output_file_base_name",
    default="courtlistener.jsonl.gz",
    help="The base name of the output file.",
)
```
        ".csv", ".jsonl.gz"
    )
    to_dolma(example, args.output_dir, output_file_base_name, args.shard_size)
    logger.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
    parser.add_argument(
        "--output_dir",
        default=f"data/courtlistener/v0",
Review: This should be `data/courtlistener/v0/documents`.
Reply: Got it.
Review: This doesn't need to be an f-string.
Reply: Got it.
Reply:

```python
parser.add_argument(
    "--output_dir",
    default="data/courtlistener/v0/documents",
    help="Where the dolma formatted data goes.",
)
```
        help="Where the dolma formatted data goes.",
    )
    parser.add_argument(
        "--shard_size",
        default=1000,
        help="The number of documents to store in each shard.",
    )
    parser.add_argument(
        "--input_file",
        default="./data/courtlistener/raw/opinions-2022-08-02.csv",
Review: We shouldn't have a default for this, especially as the date doesn't match the latest date. Let's use `required=True` instead.
Reply: Will do.
Reply:

```python
parser.add_argument(
    "--input_file",
    required=True,
    help="The path to the csv file to convert.",
)
```
        help="The path to the csv file to convert.",
    )
    args = parser.parse_args()
    main(args)
Review: This should be a modification of the csv_to_dolma.py instead of a new file.
Reply: Yes, will merge.