DEV-458 end-to-end automated test for metadata workflow #46

Draft: wants to merge 114 commits into base: main

114 commits
890b4ae
DEV-458 end-to-end automated test for metadata workflow
moseshll Nov 13, 2024
066806b
Minor README adjustments.
moseshll Nov 14, 2024
a6093f3
Add log level and DATA_ROOT to docker-compose.
moseshll Nov 14, 2024
88cafaa
Add Canister and Climate Control gems; add logger service.
moseshll Nov 14, 2024
270f428
Add Journal class and write it from post_zephir.rb.
moseshll Nov 14, 2024
cccf97a
Reconcile README and updated default TMPDIR.
moseshll Nov 14, 2024
3f50009
Prototype Verifier.
moseshll Nov 14, 2024
9bdc2ee
Flesh out enough details for others to evaluate.
moseshll Nov 26, 2024
e0107ce
DEV-1421 (WIP): verify delete files
aelkiss Dec 2, 2024
787d056
DEV-1421: verify delete files
aelkiss Dec 2, 2024
8429ed7
DEV-1421: rubocop fixes
aelkiss Dec 2, 2024
93ddd2e
- spec for `Verifier` class.
moseshll Dec 2, 2024
48c1b58
added verify_rights_file_format, some tests, and test helpers
mwarin Dec 2, 2024
48e41b3
more tests for verify_rights_file_format
mwarin Dec 2, 2024
8ae1a90
Use verifier.errors and with_test_environment
aelkiss Dec 3, 2024
eadc0cd
Use extracted methods for checking delete file
aelkiss Dec 3, 2024
2e1c7d0
move with_temp_file to spec_helper and correct
aelkiss Dec 3, 2024
78d572d
DEV-1414: stub out verification & tests for hathifiles
aelkiss Dec 4, 2024
4a6b693
DEV-1414: hathifiles field verification
aelkiss Dec 4, 2024
d85c1df
fixup: run standardrb
aelkiss Dec 4, 2024
2c231d6
fixup: add hathifile fixture
aelkiss Dec 5, 2024
abcb0e4
- Add Dockerfile/Gemfile support for Sequel
moseshll Dec 5, 2024
5565629
DEV-1414: Test HathifilesVerifier methods directly
aelkiss Dec 5, 2024
a2623bc
DEV-1414: compute catalog file for given hathifile name
aelkiss Dec 5, 2024
62c7b63
DEV-1414: Test hathifile line count
aelkiss Dec 5, 2024
4b16ab5
fix regex to disallow uppercase in volid
mwarin Dec 5, 2024
c59adda
dev-1420 dry out some tests
mwarin Dec 5, 2024
4c2bf05
DEV-1414: end-to-end hathifile test
aelkiss Dec 6, 2024
669c7b2
DEV-1414 - clean up hathifiles verification
aelkiss Dec 6, 2024
1b161f4
initial DEV-1415 commit, todo: check json file
mwarin Dec 6, 2024
7f69164
DEV-1418: WIP - verify catalog indexing
aelkiss Dec 6, 2024
97dabc9
DEV-1415: method, fixture & test for json listing
mwarin Dec 9, 2024
db38f03
standardrb fix
mwarin Dec 9, 2024
fa69fff
using spec helper method fixture()
mwarin Dec 9, 2024
a70d8c6
Initial implementation and unit tests for HathifilesDatabaseVerifier
moseshll Dec 9, 2024
6942c57
env vars for DEV-1417
mwarin Dec 10, 2024
f061925
Finish up DEV-1416 hathifiles database tests
moseshll Dec 10, 2024
889ed91
DEV-1417, hathifiles redirects
mwarin Dec 10, 2024
74adfe7
standardrb
mwarin Dec 10, 2024
2820297
[DEV-1417] more tests, setting date in initialize
mwarin Dec 11, 2024
12d12e9
Finish spec for DEV-1413 PopulateRightsVerifier
moseshll Dec 11, 2024
d4cbf41
moved gzip_linecount from verifier/hathifiles_database_verifier.rb to…
mwarin Dec 11, 2024
ccb4abc
[DEV-1422] added input/output line count check + test
mwarin Dec 11, 2024
eb8a28d
DEV-1418: verify update files for catalog indexing
aelkiss Dec 11, 2024
7fa70ca
De-constantize gzip_linecount tests relying on too-wide constant scope
moseshll Dec 12, 2024
7c4d3a6
[DEV-1422] made tests for PostZephirVerifier.verify_catalog_archive m…
mwarin Dec 12, 2024
ea0d5da
[DEV-1422] verify_catalog_archive: use dated_derivative, make more re…
mwarin Dec 12, 2024
9fbb65d
[DEV-1422] added Verifier.verify_parseable_ndj, use in post_zephir_ve…
mwarin Dec 12, 2024
1e61dec
test for verify_parseable_ndj
mwarin Dec 12, 2024
88990fa
unbreak test
mwarin Dec 12, 2024
2d9701b
More PZP verifier tests (not done yet)
moseshll Dec 12, 2024
c061899
Fix expectation string in PZP verifier spec
moseshll Dec 12, 2024
a04d90f
Add missing deletes fixture
moseshll Dec 12, 2024
fc92477
DEV-1418: test catalog indexing for full file
aelkiss Dec 12, 2024
54ab234
verify_ingest_bibrecords and verify_rights tests
moseshll Dec 12, 2024
fec6aa4
Finish PostZephirVerifier test coverage
moseshll Dec 13, 2024
77eb638
- Use zlib instead of zinzout in verify_parseable_ndj
moseshll Dec 13, 2024
4c187a5
DEV-1418: Tests for catalog verifier run_for_date
aelkiss Dec 16, 2024
b35c4dd
Update bundler & dependencies
aelkiss Dec 16, 2024
1731db8
Pin to bundler 2.5.23
aelkiss Dec 16, 2024
c6bb369
started implementing Derivative (sing.) class
mwarin Dec 17, 2024
08afbf6
Misc cleanup / address feedback
aelkiss Dec 17, 2024
a5076ab
Merge branch 'main' into DEV-458_E2E_workflow_test
aelkiss Dec 17, 2024
fe2236a
Include field names for hathifile verifier output
aelkiss Dec 17, 2024
2e3bc0c
started subclassing Derivative, added HathifileDerivative & implement…
mwarin Dec 17, 2024
dba6ce4
Refactor rights verifier
aelkiss Dec 17, 2024
d1119d0
Add Derivative::Catalog
aelkiss Dec 17, 2024
b289269
updated naming convention for subclasses
mwarin Dec 17, 2024
f081015
standardrbrrbrb
mwarin Dec 17, 2024
1c99b43
Use Derivative::Catalog for CatalogIndexVerifier
aelkiss Dec 17, 2024
09434a3
Use Derivative::Catalog in post_zephir_verifier
aelkiss Dec 17, 2024
f424f53
Use Derivative::Rights instead of :RIGHTS_ARCHIVE
aelkiss Dec 17, 2024
6386ead
using Derivative::HathifileWWW
mwarin Dec 17, 2024
d994698
Use Derivative in hathifiles_verifier
aelkiss Dec 18, 2024
5c22657
Don't return line count from verify_hathifile_contents
aelkiss Dec 18, 2024
5da30f6
fixes to hathifiles_listing_verifier & its spec, new hathifile_www de…
mwarin Dec 18, 2024
eec748b
gitignore got me again
mwarin Dec 18, 2024
d39507d
Add Derivative::Delete class
moseshll Dec 18, 2024
33ec293
Appease standardrb
moseshll Dec 18, 2024
9f5beb3
Remove DIR_DATA in favor of Derivative classes
moseshll Dec 18, 2024
f10458a
Derivative subclass for dollar dup
aelkiss Dec 18, 2024
7906dc1
Remove verification for rights report
aelkiss Dec 18, 2024
d227b62
PostZephirVerifier.verify_rights_file_format now checks individual cols
mwarin Dec 18, 2024
0cce69e
Respond to issues in HathifilesDatabaseVerifier spec
moseshll Dec 18, 2024
3dccc26
climatecontrol for hathifiles_redirects_verifier_spec
mwarin Dec 18, 2024
f8f1693
redirects_verifier now displays line number in errors
mwarin Dec 18, 2024
b728e9c
Catalog indexing verification time range improvements
aelkiss Dec 18, 2024
62b92e0
Prepend class name to error messages
aelkiss Dec 18, 2024
912ffe4
Appease standardrb
moseshll Dec 18, 2024
fdeadcc
Batch rights_current checks in PopulateRightsVerifier; refactor tests
moseshll Dec 18, 2024
6094687
nested conditions
mwarin Dec 19, 2024
2970aa5
PostZephirVerifier.verify_catalog_prep now using Derivative::Delete
mwarin Dec 19, 2024
5498a33
added, implemented and tested Derivative::IngestBibrecord
mwarin Dec 19, 2024
c1aee80
made and implemented Derivative::HTBibExport
mwarin Dec 19, 2024
46a86d5
changed require_relative to require where possible
mwarin Dec 19, 2024
aa73c64
Clean up some uses of ClimateControl
aelkiss Dec 19, 2024
6deabe3
Remove directory_for; datestamped_derivative
aelkiss Dec 19, 2024
48d1a64
dropped zinzout, uzing zlib everywhere for consistency
mwarin Dec 19, 2024
1435ea6
Refactor derivatives integration spec
aelkiss Dec 19, 2024
7732354
Rename Derivatives -> PostZephirDerivatives
aelkiss Dec 19, 2024
75fe457
Use guard clauses for verify_file, etc
aelkiss Dec 19, 2024
77ce92d
Remove extra with_test_environment in hathifiles contents verifier
aelkiss Dec 19, 2024
3bdde11
Add verifier classes to verifier script
aelkiss Dec 19, 2024
d3e6135
Happy-path integration test
aelkiss Dec 20, 2024
bd141dd
- Change the semantics of `run_for_date` parameter to always mean "ru…
moseshll Dec 20, 2024
6fa3bcd
- Move `derivatives_integration_spec.rb` to `post_zephir_derivatives_…
moseshll Dec 20, 2024
829d4fe
run_verifiers: use class names instead of lambdas
aelkiss Dec 20, 2024
0bbde91
- Extend CATALOG_ARCHIVE line count check to update files (in additio…
moseshll Dec 23, 2024
a74a806
- Database connection uses ENV instead of `database.yml`
moseshll Dec 31, 2024
b71f6ff
Address #55 Move WhateverVerifier to Verifier::Whatever
moseshll Jan 2, 2025
58cf99c
Allow hathifiles `digitization_agent_code` to match `yale2` by allowi…
moseshll Jan 2, 2025
c8f19b6
Add exception handler around Solr results to diagnose testing issue
moseshll Jan 2, 2025
2f7be27
DEV-1418: Handle solr auth params correctly
aelkiss Jan 6, 2025
6499b4f
standardrb fixes
aelkiss Jan 6, 2025
4 changes: 3 additions & 1 deletion .gitignore
@@ -6,7 +6,7 @@ config/config.pl
config/.netrc
coverage/
zephir_full_daily_a*
*.gz
data/
local
local*
*_stderr
@@ -22,3 +22,5 @@ compare_*
*.jsonl
t/fixtures/rights_dbm
cover_db
**/.*.sw?
*~
3 changes: 2 additions & 1 deletion Dockerfile
@@ -21,6 +21,7 @@ RUN apt-get update && apt-get install -y \
libmarc-perl \
libmarc-record-perl \
libmarc-xml-perl \
libmariadb-dev \
libnet-ssleay-perl \
libtest-output-perl \
libwww-perl \
@@ -39,7 +40,7 @@ RUN cpanm --notest \
# Ruby setup
ENV BUNDLE_PATH /gems
ENV RUBYLIB /usr/src/app/lib
RUN gem install bundler
RUN gem install bundler --version "~> 2.5.23"
RUN bundle config --global silence_root_warning 1
RUN bundle install

8 changes: 8 additions & 0 deletions Gemfile
@@ -2,10 +2,18 @@

source "https://rubygems.org"

gem "canister"
gem "dotenv"
gem "faraday"
gem "mysql2"
gem "sequel"

group :development, :test do
gem "climate_control"
gem "pry"
gem "rspec"
gem "simplecov"
gem "simplecov-lcov"
gem "standardrb"
gem "webmock"
end
80 changes: 58 additions & 22 deletions Gemfile.lock
@@ -1,88 +1,124 @@
GEM
remote: https://rubygems.org/
specs:
addressable (2.8.7)
public_suffix (>= 2.0.2, < 7.0)
ast (2.4.2)
bigdecimal (3.1.8)
canister (0.9.2)
climate_control (1.2.0)
coderay (1.1.3)
crack (1.0.0)
bigdecimal
rexml
diff-lcs (1.5.1)
docile (1.4.1)
json (2.7.2)
dotenv (3.1.6)
faraday (2.12.2)
faraday-net_http (>= 2.0, < 3.5)
json
logger
faraday-net_http (3.4.0)
net-http (>= 0.5.0)
hashdiff (1.1.2)
json (2.9.0)
language_server-protocol (3.17.0.3)
lint_roller (1.1.0)
logger (1.6.3)
method_source (1.1.0)
mysql2 (0.5.6)
net-http (0.6.0)
uri
parallel (1.26.3)
parser (3.3.5.0)
parser (3.3.6.0)
ast (~> 2.4.1)
racc
pry (0.14.2)
pry (0.15.0)
coderay (~> 1.1)
method_source (~> 1.0)
public_suffix (6.0.1)
racc (1.8.1)
rainbow (3.1.1)
regexp_parser (2.9.2)
rexml (3.3.9)
regexp_parser (2.9.3)
rexml (3.4.0)
rspec (3.13.0)
rspec-core (~> 3.13.0)
rspec-expectations (~> 3.13.0)
rspec-mocks (~> 3.13.0)
rspec-core (3.13.1)
rspec-core (3.13.2)
rspec-support (~> 3.13.0)
rspec-expectations (3.13.3)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-mocks (3.13.1)
rspec-mocks (3.13.2)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-support (3.13.1)
rubocop (1.65.1)
rspec-support (3.13.2)
rubocop (1.69.2)
json (~> 2.3)
language_server-protocol (>= 3.17.0)
parallel (~> 1.10)
parser (>= 3.3.0.2)
rainbow (>= 2.2.2, < 4.0)
regexp_parser (>= 2.4, < 3.0)
rexml (>= 3.2.5, < 4.0)
rubocop-ast (>= 1.31.1, < 2.0)
regexp_parser (>= 2.9.3, < 3.0)
rubocop-ast (>= 1.36.2, < 2.0)
ruby-progressbar (~> 1.7)
unicode-display_width (>= 2.4.0, < 3.0)
rubocop-ast (1.32.3)
unicode-display_width (>= 2.4.0, < 4.0)
rubocop-ast (1.37.0)
parser (>= 3.3.1.0)
rubocop-performance (1.21.1)
rubocop-performance (1.23.0)
rubocop (>= 1.48.1, < 2.0)
rubocop-ast (>= 1.31.1, < 2.0)
ruby-progressbar (1.13.0)
sequel (5.87.0)
bigdecimal
simplecov (0.22.0)
docile (~> 1.1)
simplecov-html (~> 0.11)
simplecov_json_formatter (~> 0.1)
simplecov-html (0.13.1)
simplecov-lcov (0.8.0)
simplecov_json_formatter (0.1.4)
standard (1.40.0)
standard (1.43.0)
language_server-protocol (~> 3.17.0.2)
lint_roller (~> 1.0)
rubocop (~> 1.65.0)
rubocop (~> 1.69.1)
standard-custom (~> 1.0.0)
standard-performance (~> 1.4)
standard-performance (~> 1.6)
standard-custom (1.0.2)
lint_roller (~> 1.0)
rubocop (~> 1.50)
standard-performance (1.4.0)
standard-performance (1.6.0)
lint_roller (~> 1.1)
rubocop-performance (~> 1.21.0)
rubocop-performance (~> 1.23.0)
standardrb (1.0.1)
standard
unicode-display_width (2.6.0)
unicode-display_width (3.1.2)
unicode-emoji (~> 4.0, >= 4.0.4)
unicode-emoji (4.0.4)
uri (1.0.2)
webmock (3.24.0)
addressable (>= 2.8.0)
crack (>= 0.3.2)
hashdiff (>= 0.4.0, < 2.0.0)

PLATFORMS
aarch64-linux
ruby

DEPENDENCIES
canister
climate_control
dotenv
faraday
mysql2
pry
rspec
sequel
simplecov
simplecov-lcov
standardrb
webmock

BUNDLED WITH
2.5.19
2.5.23
105 changes: 80 additions & 25 deletions README.md
@@ -29,15 +29,47 @@ docker compose run --rm test bundle exec standardrb
docker compose run --rm test bundle exec rspec
```

run_process_zephir_incremental.sh (daily)
=========================================
## Standard Locations

Post-Zephir can read and write files in a number of locations, and it can become bewildering.
Many of the locations (all of them directories) show up again and again. Under Argo these
all come from the `ENV` provided to the workflow. Under Docker the locations are not so scattered,
and all orient themselves to `ENV[ROOTDIR]`. The shell scripts rely on `config/defaults` to fill
in many of these variables; the Ruby scripts expect that the environment variables set by `config/defaults` are present.

TODO: can we use `dotenv` and `.env` in both the shell scripts and the Ruby code, and get rid of
`config/defaults`? Or can we translate `config/defaults` into Ruby and invoke it from the driver?

| `ENV` | Standard Location | Docker/Default Location |
| -------- | ------- | ----- |
| `CATALOG_ARCHIVE` | `/htapps/archive/catalog` | `DATA_ROOT/catalog_archive` |
| `CATALOG_PREP` | `/htsolr/catalog/prep` | `DATA_ROOT/catalog_prep` |
| `DATA_ROOT` | `/htprep/zephir` | `ROOTDIR/data` |
| `FEDDOCS_HOME` | `/htprep/govdocs` | `DATA_ROOT/govdocs` |
| `INGEST_BIBRECORDS` | `/htapps/babel/feed/var/bibrecords` | `DATA_ROOT/ingest_bibrecords` |
| `RIGHTS_DIR` | `/htapps/babel/feed/var/rights` | `DATA_ROOT/rights` |
| `ROOTDIR` | (not used) | `/usr/src/app` |

Additional derivative paths are set by `config/defaults`, typically from the daily or monthly shell script.

| `ENV` | Standard/Default/Docker Location | Note |
| -------- | ------- | ---- |
| `REPORTS` | `DATA_ROOT/reports` | *unused* |
| `RIGHTS_DBM` | `DATA_ROOT/rights_dbm` | *this is a file* |
| `TMPDIR` | `DATA_ROOT/work` | |
| `ZEPHIR_DATA` | `DATA_ROOT/zephir` | |
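
The fallback rules in the two tables above can be sketched in Ruby. This is only an illustration under stated assumptions: `LocationSketch` and `resolve` are hypothetical names that do not appear in this PR, and the fallbacks are the Docker/default column only (under Argo every variable is expected to be set explicitly).

```ruby
# Hypothetical helper, not part of this repository: resolves a location from
# the environment, falling back to the Docker/default values in the tables.
module LocationSketch
  # Variable name => entry under DATA_ROOT (Docker/default column).
  DATA_ROOT_ENTRIES = {
    "CATALOG_ARCHIVE" => "catalog_archive",
    "CATALOG_PREP" => "catalog_prep",
    "FEDDOCS_HOME" => "govdocs",
    "INGEST_BIBRECORDS" => "ingest_bibrecords",
    "RIGHTS_DIR" => "rights",
    "REPORTS" => "reports",
    "RIGHTS_DBM" => "rights_dbm", # a file, not a directory
    "TMPDIR" => "work",
    "ZEPHIR_DATA" => "zephir"
  }.freeze

  # Returns the path for `name`, preferring an explicit, non-empty value in `env`.
  def self.resolve(name, env = ENV)
    explicit = env[name]
    return explicit unless explicit.nil? || explicit.empty?

    case name
    when "ROOTDIR" then "/usr/src/app"
    when "DATA_ROOT" then File.join(resolve("ROOTDIR", env), "data")
    else
      # Raises KeyError for a variable that is not in the tables.
      File.join(resolve("DATA_ROOT", env), DATA_ROOT_ENTRIES.fetch(name))
    end
  end
end

puts LocationSketch.resolve("CATALOG_PREP", {})
# => /usr/src/app/data/catalog_prep
puts LocationSketch.resolve("RIGHTS_DIR", {"DATA_ROOT" => "/htprep/zephir"})
# => /htprep/zephir/rights
```

Passing the environment as a parameter keeps the sketch testable without touching the real `ENV`; whether this belongs in Ruby, in `.env`, or in `config/defaults` is the open question raised in the TODO above.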



## `run_process_zephir_incremental.sh` (daily)

* Process daily file of new/updated/deleted metadata provided by Zephir
* Send deleted bib record IDs (provided by Zephir) to Bill
* "Clean up" zephir records
* Send deleted bib record IDs (provided by Zephir) to catalog indexer
* "Clean up" zephir records (what does this mean?)
* (re)determine bibliographic rights
+ Write new/updated bib rights to file for Aaron's process to pick up and update the rights db (Why: possibly because of limited permissions on the rights database)
* File of processed new/updated records is copied to an HT server for Bill to index in the catalog
* Retrieves full bib metadata file from zephir and runs run_zephir_full_monthly.sh. (Why?)
+ Write new/updated bib rights to file for `populate_rights_data.pl` to pick up and update the rights db
* File of processed new/updated records is copied to a location for the catalog indexer to find it
* Retrieves full bib metadata file from zephir and runs `run_zephir_full_monthly.sh`. (It does?? I don't think so.)

Why?
----
@@ -47,18 +79,33 @@ Data In
-------
* `ht_bib_export_incr_YYYY-MM-DD.json.gz` (incremental updates from Zephir, `ftps_zephir_get`)
* `vufind_removed_cids_YYYY-MM-DD.txt.gz` (CIDs that have gone away, `ftps_zephir_get`)
* `/tmp/rights_dbm` (taken from `ht_rights.rights_current` table in the rights database)
* `us_cities.db` (dependency for `bib_rights.pm`)
* `us_fed_pub_exception_file` (dependency for `bib_rights.pm`, `/htdata/govdocs/feddocs_oclc_filter/`)
* `DATA_ROOT/rights_dbm` (local copy of Rights DB `ht_rights.rights_current`)
* `ROOTDIR/data/us_cities.db` (dependency for `bib_rights.pm`)
* `ENV[us_fed_pub_exception_file]` (optional dependency for `bib_rights.pm`)

Data Out
--------
* `debug_current.txt` (what and why for this?)
* `zephir_upd_YYYYMMDD.rights` - picked up hourly by https://github.com/hathitrust/feed_internal/blob/master/feed.hourly/populate_rights_data.pl and loaded into the `rights_current` table. Will be placed directly in /htapps/babel/feed/var/rights and will remove the scp logic from populate_rights_data.pl
* `zephir_upd_YYYYMMDD_delete.txt.gz` will be moved to /htsolr/catalog/prep. Used by the catalog to process deletes.
* `zephir_upd_YYYYMMDD_dollar_dup.txt `(generated by post_zephir_cleanup.pl, gets sent to Zephir, ftps_zephir_send, Zephir uningests these duplicate records)
* `zephir_upd_YYYYMMDD.json.gz` will be sent to /htsolr/catalog/prep for [catalog indexing](https://github.com/hathitrust/hathitrust_catalog_indexer)
* `zephir_full_monthly_rpt.txt` Does anyone need this?

Many files are named based on the `BASENAME` variable which is "zephir_upd_YYYYMMDD." Files are typically created in
`TMPDIR` and moved/renamed from there.

AFAICT, Verifier should only be interested in files outside `TMPDIR`, with the possible exception of
`TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz`.

| File | Notes |
| -------- | ----- |
| `CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz` | From `postZephir.pm`: gzipped and copied (not moved) by shell script |
| `CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz` | Same file as above, removed from `TMPDIR` after being copied to the two destinations |
| `CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz` | Created as `TMPDIR/BASENAME_all_delete.txt.gz` combining two files (see below) |
| `RIGHTS_DIR/zephir_upd_YYYYMMDD.rights` | From `postZephir.pm`: moved from `TMPDIR` |
| `ROOTDIR/data/zephir/debug_current.txt` | _Commented out at end of monthly script. Should be removed._ |
| `TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz` | Created as `TMPDIR/BASENAME_dollar_dup.txt`, renamed and sent to Zephir |
| `TMPDIR/zephir_upd_YYYYMMDD_delete.txt` | From `postZephir.pm`: usually empty list of 974-less CIDs, merged with `vufind_removed_cids` |
| `TMPDIR/zephir_upd_YYYYMMDD.rights.debug` | From `postZephir.pm`, _if no one is using this it should be removed_ |
| `TMPDIR/zephir_upd_YYYYMMDD_rpt.txt` | Log data from `postZephir.pm` |
| `TMPDIR/zephir_upd_YYYYMMDD_stderr` | `STDERR` from `postZephir.pm`, _if no one is using this it should be removed_ |
| `TMPDIR/zephir_upd_YYYYMMDD_zephir_delete.txt` | Intermediate file from `vufind_removed_cids_...` before merge with our deletes, _remove?_ |
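
As a rough illustration of the naming scheme in the table above, the daily outputs that live outside `TMPDIR` can be derived from a date and the standard locations. The method names `daily_outputs` and `missing_outputs` are hypothetical, `dirs` is an assumed Hash of location names to paths, and `date` is assumed to be the date already embedded in the file names.

```ruby
require "date"

# Hypothetical helper: lists the daily outputs outside TMPDIR for one date.
# `dirs` maps location names (as in the Standard Locations table) to paths.
def daily_outputs(date, dirs)
  basename = "zephir_upd_#{date.strftime("%Y%m%d")}"
  [
    File.join(dirs.fetch("CATALOG_ARCHIVE"), "#{basename}.json.gz"),
    File.join(dirs.fetch("CATALOG_PREP"), "#{basename}.json.gz"),
    File.join(dirs.fetch("CATALOG_PREP"), "#{basename}_delete.txt.gz"),
    File.join(dirs.fetch("RIGHTS_DIR"), "#{basename}.rights")
  ]
end

# Returns the expected daily outputs that do not exist on disk.
def missing_outputs(date, dirs)
  daily_outputs(date, dirs).reject { |path| File.exist?(path) }
end
```

A verifier could start from a list like this and then apply content checks to each file; this sketch only covers existence.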


Perl script dependencies
------------------------
@@ -87,20 +134,28 @@ Previously generated the HTRC datasets. All that remains is the zephir_ingested_

Data In
-------
* `ht_bib_export_full_YYYY-MM-DD.json.gz` (monthly updates from Zephir, `ftps_zephir_get`)
Note: this file is deleted by the `unpigz` command that splits it into smaller files to process in parallel.
* Note: there is no monthly "removed CIDs" or "deletes" files, these are only in the daily updates.
* US Fed Doc exception list `/htdata/govdocs/feddocs_oclc_filter/oclcs_removed_from_registry.txt`
* `/tmp/rights_dbm`
* `DATA_ROOT/rights_dbm` (local copy of Rights DB `ht_rights.rights_current`)
* `groove_export_YYYY-MM-DD.tsv.gz` (ftps from cdlib)
* `ht_bib_export_full_YYYY-MM-DD.json.gz`


Data Out
--------
* `groove_export_YYYY-MM-DD.tsv.gz` will be moved to /htapps/babel/feed/var/bibrecords/groove_full.tsv.gz
* `zephir_full_${YESTERDAY}_vufind.json.gz` catalog archive. Indexed into catalog via the same process as for `run_process_zephir_incremental.sh`
* `zephir_full_${YESTERDAY}.rights` moved to /htapps/babel/feed/var/rights/
* `zephir_full_${YESTERDAY}.rights.debug`, doesn't appear to be used
* `zephir_full_monthly_rpt.txt`moved to ../data/full/
* `zephir_full_${YESTERDAY}.rights_rpt.tsv moved to ./data/full/
* `zephir_ingested_items.txt.gz` - copied to `/htapps/babel/feed/var/bibrecords`. Used by https://github.com/hathitrust/feed_internal/blob/master/feed.monthly/zephir_diff.pl to refresh the full `feed_zephir_items` table on a monthly basis.
| File | Notes |
| -------- | ----- |
| `INGEST_BIBRECORDS/groove_full.tsv.gz` | Downloaded as `groove_export_YYYY-MM-DD.tsv.gz` and moved, contents are not modified |
| `INGEST_BIBRECORDS/zephir_ingested_items.txt.gz` | From `postZephir.pm`, TSV of {htid, source, collection, digitization_source, ia_id} |
| `CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz` | Concatenated from parallel-processed files, gzipped and moved by shell script |
| `CATALOG_PREP/zephir_full_YYYYMMDD_vufind.json.gz` | Same file as above, copied to `CATALOG_PREP` before being moved to `CATALOG_ARCHIVE` |
| `RIGHTS_DIR/zephir_full_YYYYMMDD.rights` | From `postZephir.pm`: moved from `TMPDIR` |
| `TMPDIR/stderr.tmp.txt` | Concatenated from subfiles' STDERR |
| `TMPDIR/zephir_full_YYYYMMDD.rights.debug` | From `postZephir.pm`, _if no one is using this it should be removed_ |
| `ZEPHIR_DATA/full/zephir_full_monthly_rpt.txt` | Concatenated from subfiles and moved from `TMPDIR` |
| `ZEPHIR_DATA/full/zephir_full_YYYYMMDD.rights_rpt.tsv` | Concatenated from subfiles and moved from `TMPDIR` |
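
The monthly outputs follow the same pattern with a `zephir_full_YYYYMMDD` base name. The sketch below lists the files outside `TMPDIR` and includes a month-end check, since `bin/post_zephir.rb` in this PR only runs the full script when `date.last_of_month?` is true. Both `last_of_month?` (written here as a plain method, since the real implementation in `lib/dates` is not shown) and `monthly_outputs` are assumptions for illustration.

```ruby
require "date"

# Assumed equivalent of the `last_of_month?` check used in bin/post_zephir.rb:
# a date is the last of its month when the following day is in another month.
def last_of_month?(date)
  (date + 1).month != date.month
end

# Hypothetical helper: lists the monthly outputs outside TMPDIR for one date.
def monthly_outputs(date, dirs)
  basename = "zephir_full_#{date.strftime("%Y%m%d")}"
  [
    File.join(dirs.fetch("INGEST_BIBRECORDS"), "groove_full.tsv.gz"),
    File.join(dirs.fetch("INGEST_BIBRECORDS"), "zephir_ingested_items.txt.gz"),
    File.join(dirs.fetch("CATALOG_ARCHIVE"), "#{basename}_vufind.json.gz"),
    File.join(dirs.fetch("CATALOG_PREP"), "#{basename}_vufind.json.gz"),
    File.join(dirs.fetch("RIGHTS_DIR"), "#{basename}.rights"),
    File.join(dirs.fetch("ZEPHIR_DATA"), "full", "zephir_full_monthly_rpt.txt"),
    File.join(dirs.fetch("ZEPHIR_DATA"), "full", "#{basename}.rights_rpt.tsv")
  ]
end
```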


Perl script dependencies
------------------------
21 changes: 14 additions & 7 deletions bin/post_zephir.rb
@@ -8,7 +8,8 @@
require "logger"

require_relative "../lib/dates"
require_relative "../lib/derivatives"
require_relative "../lib/post_zephir_derivatives"
require_relative "../lib/journal"

def run_system_command(command)
LOGGER.info command
@@ -21,19 +22,25 @@ def run_system_command(command)
INCREMENTAL_SCRIPT = File.join(HOME, "run_process_zephir_incremental.sh")
YESTERDAY = Date.today - 1

inventory = PostZephirProcessing::Derivatives.new(date: YESTERDAY)

if inventory.earliest_missing_date.nil?
LOGGER.info "no Zephir files to process, exiting"
exit 0
derivatives = PostZephirProcessing::PostZephirDerivatives.new
dates = []
# If any dates are missing, build the range of dates to process.
if !derivatives.earliest_missing_date.nil?
dates = ((derivatives.earliest_missing_date - 1)..YESTERDAY)
end

dates = (inventory.earliest_missing_date..YESTERDAY)
LOGGER.info "Processing Zephir files from #{dates}"
dates.each do |date|
date_str = date.strftime("%Y%m%d")
LOGGER.info "Processing Zephir file for #{date_str}"
if date.last_of_month?
run_system_command "#{FULL_SCRIPT} #{date_str}"
end
run_system_command "#{INCREMENTAL_SCRIPT} #{date_str}"
end

# Record our work for the verifier
LOGGER.info "Writing journal for #{dates}"
# TODO: consider moving the `to_a` to the Journal initializer so it can take
# Ranges as well as Arrays
PostZephirProcessing::Journal.new(dates: dates.to_a).write!
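
The TODO above suggests letting the initializer accept a Range as well as an Array. A minimal sketch of that idea follows. `JournalSketch` is a hypothetical stand-in: the real `PostZephirProcessing::Journal` is not shown in this diff, so the YAML format, the `YYYYMMDD` strings, and the `path:` argument are all assumptions.

```ruby
require "date"
require "yaml"

# Hypothetical stand-in for PostZephirProcessing::Journal. The real class's
# file name, location, and on-disk format are not shown in this diff.
class JournalSketch
  attr_reader :dates

  # Accepts a Range or an Array of Date objects; `to_a` works for both,
  # so callers no longer need to convert before constructing the journal.
  def initialize(dates:)
    @dates = dates.to_a.sort
  end

  # Writes the dates to `path` as a YAML list of YYYYMMDD strings (assumed format).
  def write!(path:)
    File.write(path, YAML.dump(dates.map { |date| date.strftime("%Y%m%d") }))
  end
end
```

With this shape the driver could pass `dates` directly, including the empty Array it uses when no dates are missing.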