process updates in batches of 500
Batch processing:
* updates just the declared license in the DB documents using `collection.bulk_write()`
* updates definitions using service API `POST /definitions?force=true`

_NOTE: Updating the DB makes the fix of the declared license immediately available.  When the `POST /definitions` request completes, the full DB document will be updated to be in sync with the blob definition._
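As a rough illustration of the two-step batch flow described above, the following is a minimal sketch, not the script's actual code. The names `batched`, `repair_in_batches`, `update_db`, and `post_definitions` are hypothetical; the two callbacks stand in for a wrapper around `collection.bulk_write()` and a wrapper around `POST /definitions?force=true`.

```python
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 500  # per the commit note: max coordinates per POST /definitions call


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def repair_in_batches(
    coordinates: Sequence[str],
    update_db: Callable[[Sequence[str]], None],
    post_definitions: Callable[[Sequence[str]], None],
    dry_run: bool = False,
) -> int:
    """Run both repair steps for each batch; return the number of batches seen."""
    batches = 0
    for batch in batched(coordinates):
        batches += 1
        if dry_run:
            continue  # only count what would run, change nothing
        update_db(batch)         # hypothetical wrapper around collection.bulk_write()
        post_definitions(batch)  # hypothetical wrapper around POST /definitions?force=true
    return batches
```

With 1,201 coordinates this would make three passes (500, 500, and 201 items), with the DB update for each batch issued before the corresponding service API request.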

Additional changes:
* moves global variable definitions based on .env to the initialize() function
* adds DRYRUN flag to check what would run and how many records would be evaluated
* adds an estimated time to complete
* adds script and function level documentation
* includes timestamps to make it easier to estimate how long it will take to complete a run
* generates the output filename from the date range and offset to avoid overwriting output files
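Two of the changes above (the generated filename and the estimated time to complete) can be sketched as follows. This is only an illustration under stated assumptions: the function names and the exact filename pattern are hypothetical, and the estimate simply assumes the remaining records are processed at the average rate observed so far.

```python
def output_filename(base: str, start: str, end: str, offset: int) -> str:
    """Build a filename that is unique per date range and offset (assumed pattern)."""
    return f"{base}_{start}_{end}_offset-{offset}.json"


def estimated_seconds_remaining(processed: int, total: int, elapsed_seconds: float) -> float:
    """Extrapolate the remaining time from the average time per processed record."""
    if processed <= 0:
        raise ValueError("at least one record must be processed before estimating")
    remaining = max(total - processed, 0)
    return elapsed_seconds * remaining / processed
```

For example, if 500 of 2,000 records took 60 seconds, the estimate for the remaining 1,500 is 180 seconds.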

_NOTE: Azure only supports fetching one blob at a time, so that part of the code could not be optimized._

_NOTE: Batch size of 500 was selected because that is the max number of coordinates supported in calls to service API `POST /definitions`._
elrayle committed Jul 24, 2024
1 parent 615c940 commit d364503
Showing 2 changed files with 168 additions and 83 deletions.
3 changes: 2 additions & 1 deletion tools/analyze_data_synchronization/.env_example
@@ -2,11 +2,12 @@ MONGO_CONNECTION_STRING="mongodb://localhost:27017/"
BASE_AZURE_BLOB_URL = "https://clearlydefineddev.blob.core.windows.net"
AZURE_CONTAINER_NAME = "develop-definition"
SERVICE_API_URL = "http://dev-api.clearlydefined.io/"
-OUTPUT_FILE = "invalid-data.json"
+BASE_OUTPUT_FILENAME = "invalid-data"
# START_DATE = "2024-06-21"
# END_DATE = "2024-06-28"
START_MONTH = str(os.environ.get("START_MONTH", "2024-06"))
END_MONTH = str(os.environ.get("END_MONTH", "2024-06"))
INITIAL_SKIP = 0
PAGE_SIZE = 1000
REPAIR = false
+DRYRUN = false
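
The settings in this example file might be read inside an `initialize()` function along the following lines. This is a hedged sketch rather than the commit's actual code: it assumes the `.env` values have already been loaded into the process environment, and the `env_flag` helper and the returned dictionary keys are hypothetical.

```python
import os
from typing import Mapping, Optional


def env_flag(name: str, default: bool = False,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret an environment value such as "true" or "false" as a boolean."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


def initialize(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect settings in one place instead of module-level globals."""
    env = os.environ if environ is None else environ
    return {
        "base_output_filename": env.get("BASE_OUTPUT_FILENAME", "invalid-data"),
        "initial_skip": int(env.get("INITIAL_SKIP", "0")),
        "page_size": int(env.get("PAGE_SIZE", "1000")),
        "repair": env_flag("REPAIR", environ=env),
        "dryrun": env_flag("DRYRUN", environ=env),
    }
```

Passing a plain dictionary as `environ` makes it possible to check the parsing without touching the real environment.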