USGPO #70
base: main
Conversation
usgpo/get-links.py
Outdated
package_queue = queue.Queue()
metadata_queue = queue.Queue()

with ThreadPoolExecutor(max_workers=args.workers + 2) as executor:
This parallelism is pretty complicated. Is it possible to fetch all the packages in parallel, collect the results, and then fetch the metadata for each package in parallel afterwards?
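A minimal sketch of that two-phase approach, assuming hypothetical `fetch_packages` and `fetch_package_metadata` helpers that wrap the API calls:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_links(api_key, collections, workers):
    # Phase 1: list the packages for every collection in parallel.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # fetch_packages(api_key, collection) is a hypothetical helper.
        per_collection = list(executor.map(
            lambda c: fetch_packages(api_key, c), collections))
    packages = [pkg for batch in per_collection for pkg in batch]

    # Phase 2: fetch the metadata for every package in parallel.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # fetch_package_metadata(api_key, package) is also hypothetical.
        metadata = list(executor.map(
            lambda p: fetch_package_metadata(api_key, p), packages))
    return metadata
```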
What is still "WIP" about this? What do we need to add so it's ready to go?
- fix missing imports
This PR should be ready to go. I've started a run going back to 1990. So far about 70K document links have been collected, with at least a handful of documents coming from each year and no failures.
Some small changes, and the data format is currently wrong afaict.
@@ -0,0 +1,115 @@
import argparse
Missing docstring.
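For example, something along these lines (a sketch of a module docstring, based on what this PR says the script does):

```python
"""Collect USGPO document links.

Queries the USGPO developer API for packages in the requested collections and
writes out the package metadata, including document download links, to disk.
"""
import argparse
```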
# Most documents are primarily pre-formatted text inside of a <pre> tag
# If so, just take the contents of that tag instead of the whole document
soup = BeautifulSoup(text, "html.parser")
pre_tag = soup.find("pre")
parsed_text = pre_tag.get_text() if (pre_tag := soup.find("pre")) else text
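In context, that suggestion would read something like this (a sketch; the surrounding function is paraphrased from the diff above, not copied from the PR):

```python
from bs4 import BeautifulSoup

def parse_html(text):
    # Most documents are primarily pre-formatted text inside a <pre> tag;
    # if one is present, keep only its contents, otherwise keep the whole document.
    soup = BeautifulSoup(text, "html.parser")
    parsed_text = pre_tag.get_text() if (pre_tag := soup.find("pre")) else text
    return parsed_text
```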
usgpo/download-files.py
Outdated
raw_html = download_file(api_key, file_url)
parsed_text = parse_html(raw_html)

return {
The data is formatted incorrectly: `title`, `author`, `publisher`, and `category` should all be in the `metadata` field, and `date` -> `created`.
Do we need the `html` field? If we want to give people access to it, it would probably be better to change the code so this outputs a dolma file where the html is in the `text` field and the path is .../raw/documents. Then have a preprocess script that uses the dolma parallel processors to convert the html to text and save it to .../v0/documents.
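A sketch of the record shape being asked for, assuming the standard dolma document fields (`to_dolma_record` and the `package` dict are illustrative, not code from the PR):

```python
import datetime

def to_dolma_record(package, parsed_text):
    return {
        "id": package["package_id"],
        "text": parsed_text,  # or the raw html, if writing to .../raw/documents
        "source": "usgpo",
        "created": package["date"],  # "date" -> "created"
        "added": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "metadata": {
            "title": package["title"],
            "author": package["author"],
            "publisher": package["publisher"],
            "category": package["category"],
        },
    }
```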
def generate_records(args):
    with jsonlines.open(args.links_file, mode="r") as reader:
Is there a speed boost from using jsonlines? It seems simple enough to just call `json.loads` on each line.
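For comparison, a sketch of the plain-`json` version:

```python
import json

def read_links(links_file):
    # Read a JSON-lines file without the jsonlines dependency.
    with open(links_file, mode="r") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```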
def generate_records(args):
    with jsonlines.open(args.links_file, mode="r") as reader:
        with ThreadPoolExecutor(max_workers=args.workers) as executor:
Is there a reason to use `ThreadPoolExecutor` over `multiprocessing.dummy.Pool`?
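They behave very similarly for I/O-bound work; a sketch of the same mapping with each, assuming the `construct_record(api_key, file)` signature that appears elsewhere in this PR:

```python
from functools import partial
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.dummy import Pool  # threads behind the multiprocessing.Pool interface

def records_with_executor(api_key, files, workers):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(partial(construct_record, api_key), files))

def records_with_dummy_pool(api_key, files, workers):
    with Pool(workers) as pool:
        return list(pool.imap(partial(construct_record, api_key), files))
```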
usgpo/get-links.py
Outdated
    return args


def api_query(endpoint, headers, params):
This should probably move to a utils file to be shared with the `download-files.py` script.
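For example (a sketch of a shared helper; the request body here is an assumption, not copied from the PR):

```python
# usgpo/utils.py -- hypothetical shared module
import requests

def api_query(endpoint, headers, params):
    """Issue a GET against the USGPO developer API and return the response."""
    return requests.get(endpoint, headers=headers, params=params)
```

Both `get-links.py` and `download-files.py` could then `from utils import api_query`.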
pbar.update(1)

url = output["nextPage"]
offset_mark = None
Can you document this `offset_mark` a bit? Being `"*"` the first time and `None` the rest of the time is a bit confusing.
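Something like this would make the intent explicit (a sketch; `api_query` and the `offsetMark` parameter name are assumptions about the API, not copied from the PR):

```python
def iterate_pages(first_page_url, headers):
    """Yield every page of results from a paginated endpoint.

    The API expects offset_mark to be "*" on the very first request (meaning
    "start from the beginning"); every response then includes a "nextPage" URL
    that already encodes the next offset, so later requests pass None and
    simply follow that URL.
    """
    url, offset_mark = first_page_url, "*"
    while url is not None:
        output = api_query(url, headers, {"offsetMark": offset_mark}).json()
        yield output
        url = output.get("nextPage")
        offset_mark = None
```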
usgpo/download-files.py
Outdated
def construct_record(api_key, file):
    file_url = file["links"].get("txtLink")
The naming is confusing here: `links` makes it sound like a list of links, but `.get` makes it seem like a dict representing one link?
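If `links` really is a dict keyed by link type (which `.get("txtLink")` suggests), a rename would make that explicit, e.g.:

```python
def get_text_url(file):
    # Assumption: file["links"] maps link type -> URL,
    # e.g. {"txtLink": "...", "pdfLink": "..."}.
    links_by_type = file["links"]
    return links_by_type.get("txtLink")
```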
usgpo/get-links.py
Outdated
)

# One thread for writing out the package metadata to disk
executor.submit(write_metadata, args.output_dir, metadata_queue)
I think you have a race condition in this parallelism. Say the package_queue has 2 packages followed by the `None` sentinel, and there are 2 workers. Worker 1 gets its links quickly while worker 2 gets a `429` and sleeps for an hour. Worker 1 writes its package metadata to the metadata_queue, gets the next item from the package queue (the `None`), and writes it to the metadata queue. This means the writer will see the `None` before the metadata for package 2, the writer will close, and the package 2 metadata will be lost.
I think the way people generally do this is that each worker adds a sentinel to the writer queue, and the writer only stops when it has seen #workers sentinels.
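A sketch of that pattern (the queue wiring and the `fetch_package_metadata` helper are illustrative, not the PR's code):

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor

SENTINEL = None

def worker(package_queue, metadata_queue):
    while (package := package_queue.get()) is not SENTINEL:
        metadata_queue.put(fetch_package_metadata(package))  # hypothetical fetch
    # Each worker announces that it, specifically, is finished.
    metadata_queue.put(SENTINEL)

def writer(metadata_queue, num_workers, output_path):
    finished = 0
    with open(output_path, "w") as f:
        # Stop only after a sentinel has been seen from every worker.
        while finished < num_workers:
            item = metadata_queue.get()
            if item is SENTINEL:
                finished += 1
            else:
                f.write(json.dumps(item) + "\n")

def run(packages, num_workers, output_path):
    package_queue, metadata_queue = queue.Queue(), queue.Queue()
    for package in packages:
        package_queue.put(package)
    for _ in range(num_workers):
        package_queue.put(SENTINEL)  # one stop signal per worker
    with ThreadPoolExecutor(max_workers=num_workers + 1) as executor:
        executor.submit(writer, metadata_queue, num_workers, output_path)
        for _ in range(num_workers):
            executor.submit(worker, package_queue, metadata_queue)
```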
parser.add_argument(
    "--collections",
    nargs="+",
    default=[
Prefer a tuple for read-only things like this.
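i.e. something like this (the collection codes shown are placeholders):

```python
parser.add_argument(
    "--collections",
    nargs="+",
    # A tuple makes it clear the default is read-only.
    default=("BILLS", "CREC"),  # placeholder collection codes
)
```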
- Changed from producer/consumer parallelism to (1) collect all packages and then (2) collect all metadata with a thread pool
- Changed from BeautifulSoup hardcoded html parsing of the <pre> tag to the more flexible `trafilatura.extract`, since some documents are more complex html
- Wrap download-files `construct_record` in try/catch
- If a <pre> tag is available, use the pre-formatted text; else use `trafilatura.extract` to convert to markdown
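A sketch of that updated parsing logic, assuming a trafilatura version whose `extract` supports `output_format="markdown"`:

```python
import trafilatura
from bs4 import BeautifulSoup

def parse_html(raw_html):
    # Simple documents are pre-formatted text in a <pre> tag: keep that as-is.
    soup = BeautifulSoup(raw_html, "html.parser")
    if (pre_tag := soup.find("pre")) is not None:
        return pre_tag.get_text()
    # More complex html: let trafilatura convert it to markdown.
    return trafilatura.extract(raw_html, output_format="markdown")
```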
This PR uses the USGPO developer API to collect documents published by the USGPO. This will close #64.