
USGPO #70

Open · wants to merge 10 commits into main
Conversation

nkandpa2 (Collaborator)

This PR uses the USGPO developer API to collect documents published by the USGPO. This will close #64.

nkandpa2 marked this pull request as draft May 13, 2024 15:30
package_queue = queue.Queue()
metadata_queue = queue.Queue()

with ThreadPoolExecutor(max_workers=args.workers + 2) as executor:
This parallelism is pretty complicated. Is it possible to fetch all the packages in parallel, collect the results, and then fetch the metadata from each package in parallel afterwards?
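
A rough sketch of that two-phase structure (the fetch_package and fetch_metadata helpers are placeholders for illustration, not the PR's actual functions):

from concurrent.futures import ThreadPoolExecutor

def collect(args, package_ids):
    # Phase 1: fetch every package in parallel and gather the results.
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        packages = list(executor.map(fetch_package, package_ids))  # placeholder helper

    # Phase 2: fetch the metadata for each collected package in parallel.
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        return list(executor.map(fetch_metadata, packages))  # placeholder helper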

@StellaAthena (Collaborator)

What is still "WIP" about this? What do we need to add so it's ready to go?

nkandpa2 marked this pull request as ready for review June 10, 2024 15:52
@nkandpa2 (Collaborator, Author)

This PR should be ready to go. I've started a run going back to 1990. So far about 70K document links have been collected, with at least a handful of documents from each year and no failures.

@blester125 (Collaborator) left a comment:

Some small changes and the data format is currently wrong afaict

@@ -0,0 +1,115 @@
import argparse
Missing docstring.

# Most documents are primarily pre-formatted text inside of a <pre> tag
# If so, just take the contents of that tag instead of the whole document
soup = BeautifulSoup(text, "html.parser")
pre_tag = soup.find("pre")
parsed_text = pre_tag.get_text() if (pre_tag := soup.find("pre")) else text

raw_html = download_file(api_key, file_url)
parsed_text = parse_html(raw_html)

return {
The data is formatted incorrectly:

`title`, `author`, `publisher`, and `category` should all be in the `metadata` field, and `date` -> `created`.

Do we need the html field? If we want to give people access to it, it would probably be better to change the code so this outputs a dolma file where the html is in the text field and the path is .../raw/documents. Then have a preprocess script that uses the dolma parallel processors to convert the html to text and save it to .../v0/documents.
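
For reference, a sketch of the record shape being suggested, with the raw HTML kept in text and the bibliographic fields under metadata (the variable names and the exact set of dolma fields here are illustrative assumptions, not the PR's code):

record = {
    "id": package_id,      # placeholder identifier
    "text": raw_html,      # raw HTML as the text for a .../raw/documents output
    "source": "usgpo",
    "created": date,       # previously emitted as `date`
    "metadata": {
        "title": title,
        "author": author,
        "publisher": publisher,
        "category": category,
    },
}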



def generate_records(args):
    with jsonlines.open(args.links_file, mode="r") as reader:
Is there a speed boost from using jsonlines? It seems simple enough to just call json.loads on each line.
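
Roughly what the plain-json version might look like (just a sketch):

import json

def generate_records(args):
    with open(args.links_file, "r") as reader:
        for line in reader:
            file = json.loads(line)  # one link record per line
            ...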


def generate_records(args):
    with jsonlines.open(args.links_file, mode="r") as reader:
        with ThreadPoolExecutor(max_workers=args.workers) as executor:
Is there a reason to use the ThreadPoolExecutor over multiprocessing.dummy.Pool?
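
For comparison, the thread-backed multiprocessing.dummy.Pool version would look roughly like this (a sketch; args.api_key and the exact construct_record call are assumptions about the surrounding code):

import json
from functools import partial
from multiprocessing.dummy import Pool  # threads, but with the multiprocessing.Pool API

def generate_records(args):
    with open(args.links_file, "r") as reader:
        files = [json.loads(line) for line in reader]
    with Pool(args.workers) as pool:
        # Fan the per-file work out across the thread pool.
        return pool.map(partial(construct_record, args.api_key), files)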

    return args


def api_query(endpoint, headers, params):
This should probably move to a utils file to be shared with the download-files.py script.

pbar.update(1)

url = output["nextPage"]
offset_mark = None
Can you document this offset_mark a bit? Being "*" the first time and None the rest of the time is a bit confusing.
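
Something like this inline comment might help (a sketch of the documentation being asked for, based on how the loop appears to work):

# Pagination: the first request passes offsetMark="*" to ask for the first page.
# Every response's "nextPage" URL already embeds the next offset mark, so after
# the first call we just follow nextPage and clear offset_mark.
url = output["nextPage"]
offset_mark = None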



def construct_record(api_key, file):
    file_url = file["links"].get("txtLink")
The naming is confusing here: `links` makes it sound like a list of links, but `.get` makes it seem like a dict representing one link?

)

# One thread for writing out the package metadata to disk
executor.submit(write_metadata, args.output_dir, metadata_queue)
I think you have a race condition in this parallelism. Suppose the package_queue has 2 packages followed by the None sentinel, and there are 2 workers. Worker 1 gets its links quickly while worker 2 gets a 429 and sleeps for an hour. Worker 1 writes its package metadata to the metadata_queue and gets the next item from the package queue, the None, and writes it to the metadata queue. This means the writer will see the None before the metadata for package 2, the writer will close, and the package 2 metadata will be lost.

I think the way people generally do this is that each worker adds its own sentinel to the writer queue, and the writer only stops once it has seen #workers sentinels.
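
A minimal sketch of that pattern (the worker/writer names, extra num_workers argument, and helper functions are illustrative, not the PR's code):

def worker(package_queue, metadata_queue):
    while (package := package_queue.get()) is not None:
        metadata_queue.put(fetch_package_metadata(package))  # placeholder helper
    # Each worker signals that it is done with its own sentinel.
    metadata_queue.put(None)

def write_metadata(output_dir, metadata_queue, num_workers):
    finished = 0
    while finished < num_workers:
        item = metadata_queue.get()
        if item is None:
            finished += 1  # one worker finished; keep writing until all have
            continue
        write_record(output_dir, item)  # placeholder helper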

parser.add_argument(
    "--collections",
    nargs="+",
    default=[
Prefer a tuple for read-only things like this.

nkandpa2 added 2 commits June 20, 2024 14:05
- Changed from producer/consumer parallelism to (1) collect all packages
and then (2) collect all metadata with thread pool
- Changed from BeautifulSoup hardcoded html parsing of <pre> tag to more
flexible `trafilatura.extract` since some documents are more complex
html
- wrap download-files construct_record in try/catch
- if <pre> tag is available, use pre-formatted text. else use
trafilatura.extract to convert to markdown
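
Based on that description, the parsing step presumably ends up looking something like this (a sketch, assuming a trafilatura version that supports markdown output):

from bs4 import BeautifulSoup
import trafilatura

def parse_html(html):
    # Prefer the pre-formatted text when the document is wrapped in a <pre> tag.
    soup = BeautifulSoup(html, "html.parser")
    if (pre_tag := soup.find("pre")) is not None:
        return pre_tag.get_text()
    # Otherwise fall back to trafilatura to extract the main content as markdown.
    return trafilatura.extract(html, output_format="markdown")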
Successfully merging this pull request may close these issues: US Government Publishing Office