
USGPO #70

Open · wants to merge 10 commits into main
Conversation

nkandpa2 (Collaborator)

This PR uses the USGPO developer API to collect documents published by the USGPO. This will close #64.

nkandpa2 marked this pull request as draft May 13, 2024 15:30
package_queue = queue.Queue()
metadata_queue = queue.Queue()

with ThreadPoolExecutor(max_workers=args.workers + 2) as executor:
This parallelism is pretty complicated. Is it possible to fetch all the packages in parallel, collect the results, and then fetch the metadata from each package in parallel afterwards?
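
A rough sketch of that two-phase structure (the fetch_package and fetch_metadata helpers are placeholders for illustration, not the PR's actual functions):

from concurrent.futures import ThreadPoolExecutor

def collect(args, package_ids):
    # Phase 1: fetch every package in parallel and gather the results.
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        packages = list(executor.map(fetch_package, package_ids))  # placeholder helper

    # Phase 2: fetch the metadata for each collected package in parallel.
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        return list(executor.map(fetch_metadata, packages))  # placeholder helper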

@StellaAthena (Collaborator)

What is still "WIP" about this? What do we need to add so it's ready to go?

nkandpa2 marked this pull request as ready for review June 10, 2024 15:52
@nkandpa2 (Collaborator, Author)

This PR should be ready to go. I've started a run going back to 1990. So far about 70K document links have been collected, with at least a handful of documents from each year and no failures.

@blester125 (Collaborator) left a comment:

Some small changes and the data format is currently wrong afaict

@@ -0,0 +1,115 @@
import argparse
Missing docstring.

# Most documents are primarily pre-formatted text inside of a <pre> tag
# If so, just take the contents of that tag instead of the whole document
soup = BeautifulSoup(text, "html.parser")
pre_tag = soup.find("pre")
parsed_text = pre_tag.get_text() if (pre_tag := soup.find("pre")) else text

raw_html = download_file(api_key, file_url)
parsed_text = parse_html(raw_html)

return {
The data is formatted incorrectly:

`title`, `author`, `publisher`, and `category` should all be in the `metadata` field, and `date` -> `created`.

Do we need the html field? If we want to give people access to it, it would probably be better to change the code so this outputs a dolma file where the html is in the text field and the path is .../raw/documents. Then have a preprocess script that uses the dolma parallel processors to convert the html to text and save it to .../v0/documents.
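
For reference, a sketch of the record shape being suggested, with the raw HTML kept in text and the bibliographic fields under metadata (the variable names and the exact set of dolma fields here are illustrative assumptions, not the PR's code):

record = {
    "id": package_id,      # placeholder identifier
    "text": raw_html,      # raw HTML as the text for a .../raw/documents output
    "source": "usgpo",
    "created": date,       # previously emitted as `date`
    "metadata": {
        "title": title,
        "author": author,
        "publisher": publisher,
        "category": category,
    },
}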



def generate_records(args):
    with jsonlines.open(args.links_file, mode="r") as reader:
Is there a speed boost from using jsonlines? It seems simple enough to just call json.loads on each line.
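
Roughly what the plain-json version might look like (just a sketch):

import json

def generate_records(args):
    with open(args.links_file, "r") as reader:
        for line in reader:
            file = json.loads(line)  # one link record per line
            ...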


def generate_records(args):
    with jsonlines.open(args.links_file, mode="r") as reader:
        with ThreadPoolExecutor(max_workers=args.workers) as executor:
Is there a reason to use the ThreadPoolExecutor over multiprocessing.dummy.Pool?
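
For comparison, the thread-backed multiprocessing.dummy.Pool version would look roughly like this (a sketch; args.api_key and the exact construct_record call are assumptions about the surrounding code):

import json
from functools import partial
from multiprocessing.dummy import Pool  # threads, but with the multiprocessing.Pool API

def generate_records(args):
    with open(args.links_file, "r") as reader:
        files = [json.loads(line) for line in reader]
    with Pool(args.workers) as pool:
        # Fan the per-file work out across the thread pool.
        return pool.map(partial(construct_record, args.api_key), files)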

    return args


def api_query(endpoint, headers, params):
This should probably move to a utils file to be shared with the download-files.py script.

pbar.update(1)

url = output["nextPage"]
offset_mark = None
Can you document this offset_mark a bit? Being "*" the first time and None the rest of the time is a bit confusing.
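
Something like this inline comment might help (a sketch of the documentation being asked for, based on how the loop appears to work):

# Pagination: the first request passes offsetMark="*" to ask for the first page.
# Every response's "nextPage" URL already embeds the next offset mark, so after
# the first call we just follow nextPage and clear offset_mark.
url = output["nextPage"]
offset_mark = None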



def construct_record(api_key, file):
    file_url = file["links"].get("txtLink")
The naming is confusing here: `links` makes it sound like a list of links, but `.get` makes it seem like a dict representing one link?

)

# One thread for writing out the package metadata to disk
executor.submit(write_metadata, args.output_dir, metadata_queue)
I think you have a race condition in this parallelism. Suppose the package_queue has 2 packages followed by the None sentinel, and there are 2 workers. Worker 1 gets its links quickly while worker 2 gets a 429 and sleeps for an hour. Worker 1 writes its package metadata to the metadata_queue and gets the next item from the package queue, the None, and writes it to the metadata queue. This means the writer will see the None before the metadata for package 2, the writer will close, and the package 2 metadata will be lost.

I think the way people generally do this is that each worker adds its own sentinel to the writer queue, and the writer only stops once it has seen #workers sentinels.
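
A minimal sketch of that pattern (the worker/writer names, extra num_workers argument, and helper functions are illustrative, not the PR's code):

def worker(package_queue, metadata_queue):
    while (package := package_queue.get()) is not None:
        metadata_queue.put(fetch_package_metadata(package))  # placeholder helper
    # Each worker signals that it is done with its own sentinel.
    metadata_queue.put(None)

def write_metadata(output_dir, metadata_queue, num_workers):
    finished = 0
    while finished < num_workers:
        item = metadata_queue.get()
        if item is None:
            finished += 1  # one worker finished; keep writing until all have
            continue
        write_record(output_dir, item)  # placeholder helper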

parser.add_argument(
    "--collections",
    nargs="+",
    default=[
Prefer a tuple for read-only things like this.

nkandpa2 added 2 commits June 20, 2024 14:05
- Changed from producer/consumer parallelism to (1) collect all packages
and then (2) collect all metadata with thread pool
- Changed from BeautifulSoup hardcoded html parsing of <pre> tag to more
flexible `trafilatura.extract` since some documents are more complex
html
- wrap download-files construct_record in try/catch
- if <pre> tag is available, use pre-formatted text. else use
trafilatura.extract to convert to markdown
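
Based on that description, the parsing step presumably ends up looking something like this (a sketch, assuming a trafilatura version that supports markdown output):

from bs4 import BeautifulSoup
import trafilatura

def parse_html(html):
    # Prefer the pre-formatted text when the document is wrapped in a <pre> tag.
    soup = BeautifulSoup(html, "html.parser")
    if (pre_tag := soup.find("pre")) is not None:
        return pre_tag.get_text()
    # Otherwise fall back to trafilatura to extract the main content as markdown.
    return trafilatura.extract(html, output_format="markdown")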
Successfully merging this pull request may close these issues: US Government Publishing Office