An HTTP API for querying and updating PURLs. See the API section below for docs. Purl-fetcher is a cache which enables the access portfolio to efficiently index and query data such as release tags and collection memberships. It is not the canonical source for any information.
https://docs.google.com/drawings/d/1--7pQQlzD-_g2AyPCtNkeODTs4CTrXKKyUyxofFWc1U/edit
- Ruby (3.2 or greater)
- bundler gem
- Apache Kafka (0.10 or greater), or Docker
Clone the repository:
git clone https://github.com/sul-dlss/purl-fetcher.git
cd purl-fetcher
Install dependencies:
bundle install
Set up the database:
rake db:migrate
The API communicates with a Kafka broker to dispatch and process updates asynchronously. You can run a Kafka broker locally, or use the provided docker-compose configuration:
docker-compose up
Then, in a separate terminal, start a development API server:
bin/rails server
You can make requests to the API using curl or a similar tool. To add an object to the database, you can first download its public Cocina JSON from production PURL:
curl https://purl.stanford.edu/bb112zx3193.json > bb112zx3193.json
Then, you can use the POST /purls/:druid endpoint to add the object to the database:
curl -X POST -H "Content-Type: application/json" -d @bb112zx3193.json http://localhost:3000/purls/bb112zx3193
After the object has been added to the database, a message for it is published to the Kafka topic for indexing.
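The same request can be made from Ruby. The following is a minimal sketch using only the standard library; the helper names `build_purl_update_request` and `post_purl` are hypothetical, and the default base URL assumes the development server started above.

```ruby
require "json"
require "net/http"
require "uri"

# Build (but do not send) a POST request for /purls/:druid.
# base_url is an assumption: the local development server from the steps above.
def build_purl_update_request(druid, cocina_json, base_url: "http://localhost:3000")
  uri = URI.join(base_url, "/purls/#{druid}")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = cocina_json
  request
end

# Send the request and return the Net::HTTPResponse.
def post_purl(druid, cocina_json, base_url: "http://localhost:3000")
  request = build_purl_update_request(druid, cocina_json, base_url: base_url)
  Net::HTTP.start(request.uri.hostname, request.uri.port) do |http|
    http.request(request)
  end
end
```

With the JSON file downloaded as shown earlier, `post_purl("bb112zx3193", File.read("bb112zx3193.json"))` would issue the equivalent of the curl command.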
The full test suite (with RuboCop style enforcement) can be run with the default rake task:
rake
The tests can be run without RuboCop style enforcement:
rake spec
The RuboCop style enforcement can be run without running the tests:
rake rubocop
GET /purls/:druid
Display a single purl
The GET /purls/:druid endpoint returns a single PURL document. The PURL application uses this endpoint to determine whether an item should be in the sitemap.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | Druid of a specific PURL | Yes | string eg(druid:cc1111dd2222) | null |
version | header | Version of the API request eg(version=1) | No | integer | 1 |
{
"druid": "druid:dd111ee2222",
"latest_change": "2014-01-01T00:00:00Z",
"true_targets": ["PURL sitemap"],
"collections": ["druid:oo000oo0001"]
}
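As an illustration of how a client might consume this response, the sketch below parses the example document and checks its release targets. The helper `released_to?` is hypothetical, and treating `true_targets` as the list to check for sitemap inclusion is an assumption based on the example above.

```ruby
require "json"

# Hypothetical helper: true when the document lists the given release target.
def released_to?(purl_document, target)
  Array(purl_document["true_targets"]).include?(target)
end

# The example response body from above.
response_body = <<~JSON
  {
    "druid": "druid:dd111ee2222",
    "latest_change": "2014-01-01T00:00:00Z",
    "true_targets": ["PURL sitemap"],
    "collections": ["druid:oo000oo0001"]
  }
JSON

purl = JSON.parse(response_body)
puts released_to?(purl, "PURL sitemap") # prints "true"
puts released_to?(purl, "SearchWorks")  # prints "false"
```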
POST /purls/:druid
Purl Document Update
The POST /purls/:druid endpoint provides the ability to create or update a PURL document from public Cocina JSON. This endpoint is used by dor-services-app as part of SDR workflows.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | Druid of a specific PURL | Yes | string eg(druid:cc1111dd2222) | null |
version | header | Version of the API request eg(version=1) | No | integer | 1 |
true
GET /collections/:druid/purls
Collection Purls route
The GET /collections/:druid/purls endpoint provides a paginated listing of PURLs for a specific collection. This endpoint is used by the Exhibits application.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | Druid of a specific collection | Yes | string eg(druid:cc1111dd2222) | null |
page | query | Request a specific page of results | No | integer | 1 |
per_page | query | Limit the number of results per page | No | integer (1 - 10000) | 100 |
version | header | Version of the API request eg(version=1) | No | integer | 1 |
{
"purls": [
{
"druid": "druid:ee111ff2222",
"published_at": "2013-01-01T00:00:00.000Z",
"deleted_at": "2016-01-03T00:00:00.000Z",
"object_type": "set",
"catkey": "",
"title": "Some test object number 4",
"collections": [
"druid:ff111gg2222"
],
"true_targets": [
"SearchWorksPreview"
]
},
...
{
"druid": "druid:cc111dd2222",
"published_at": "2016-01-01T00:00:00.000Z",
"deleted_at": "2016-01-02T00:00:00.000Z",
"object_type": "item",
"catkey": "567",
"title": "Some test object number 2",
"collections": [
"druid:ff111gg2222"
],
"true_targets": [
"SearchWorksPreview"
],
"false_targets": [
"SearchWorks"
]
}
],
"pages": {
"current_page": 1,
"next_page": null,
"prev_page": null,
"total_pages": 1,
"per_page": 100,
"offset_value": 0,
"first_page?": true,
"last_page?": true
}
}
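Because results are paginated, a client typically follows `pages.next_page` until it is null. The sketch below shows one way to do that; `each_collection_purl` is a hypothetical helper, and `fetch_page` stands for any callable that performs the HTTP request for a given page number and returns the parsed JSON.

```ruby
# Yield every PURL in a collection, following "next_page" until it is nil.
# fetch_page: a callable taking a page number and returning the parsed
# response hash for that page (for example, a lambda wrapping an HTTP GET).
def each_collection_purl(fetch_page)
  return enum_for(:each_collection_purl, fetch_page) unless block_given?

  page = 1
  while page
    body = fetch_page.call(page)
    body.fetch("purls", []).each { |purl| yield purl }
    page = body.dig("pages", "next_page")
  end
end
```

Returning an enumerator when no block is given lets callers chain calls such as `each_collection_purl(fetcher).map { |purl| purl["druid"] }` without loading every page up front.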
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
tag | url | Tag to search for | Yes | string eg(PURL%20sitemap) | null |
List the PURLs that should display on the sitemap.
This is used by the PURL application to generate a sitemap
[
{
"druid": "druid:ee111ff2222",
"updated_at": "2016-01-03T00:00:00.000Z"
},
...
{
"druid": "druid:cc111dd2222",
"updated_at": "2016-01-02T00:00:00.000Z",
}
]
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | object identifier | Yes | string eg(druid:bc123df4567) | null |
actions | body | list of actions to take on the object. This object should contain two keys, "index" and "delete", each value is an array of properties to release to. | Yes | object | null |
Set the release tags for an item
This tells purl-fetcher to update the cache of release tags and puts messages on the appropriate Kafka streams.
202 Accepted
true
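The shape of the actions object described above can be sketched as follows. The helper `release_actions` is hypothetical, the overlap check is an added sanity check rather than documented behavior, and the target names are taken from the example responses elsewhere in this document. Whether the object is sent as the whole request body or nested under an "actions" key is not specified here, so the sketch only builds the object itself.

```ruby
require "json"

# Build the actions object: targets to release to go under "index",
# targets to remove the release from go under "delete".
def release_actions(index: [], delete: [])
  overlap = index & delete
  # Assumption: a target should not appear in both lists at once.
  raise ArgumentError, "targets in both lists: #{overlap.join(', ')}" unless overlap.empty?

  { "index" => index, "delete" => delete }
end

actions = release_actions(index: ["SearchWorks"], delete: ["SearchWorksPreview"])
puts JSON.generate(actions)
# prints {"index":["SearchWorks"],"delete":["SearchWorksPreview"]}
```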
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | object identifier | Yes | string eg(druid:bc123df4567) | null |
version | url | version to be changed | Yes | object | null |
Withdraw / restore a version of an item
204 No Content
You can create Kafka messages that will cause all the Purls to be reindexed by doing:
Purl.unscoped.find_in_batches.with_index do |group, batch|
  puts "Processing group ##{batch}"
  group.each(&:produce_indexer_log_message)
end
Or only for PURLs released to SearchWorks:
Purl.target('Searchworks').find_in_batches.with_index do |group, batch|
  puts "Processing group ##{batch}"
  Racecar.wait_for_delivery do
    group.each { |purl| purl.produce_indexer_log_message(async: true) }
  end
end
The API's internals use an ActiveRecord data model to manage various information about published PURLs. This model consists of Purl, Collection, and ReleaseTag active records. See app/models/ and db/schema.rb for details.
This approach gives administrators a couple of ways to explore the data outside of the API.
With Rails' runner, you can query the database using ActiveRecord. For example, running the Ruby in script/reports/summary.rb using:
RAILS_ENV=environment bundle exec rails runner script/reports/summary.rb
produces output like this:
Summary report as of 2016-08-24 09:52:49 -0700 on purl-fetcher-dev.stanford.edu
PURLs: 193960
Deleted PURLs: 1
Published PURLs: 193959
Published PURLs in last week: 0
Released to SearchWorks: 5
With Rails' dbconsole, you can query the database using SQL. For example, running the SQL in script/reports/summary.sql using:
RAILS_ENV=environment bundle exec rails dbconsole -p < script/reports/summary.sql
produces output like this:
PURLs 193960
Deleted PURLs 1
Published PURLs 193959
Published this year 9
Released to SearchWorks 5
To generate an authentication token run RAILS_ENV=production bin/rails generate_token on the prod server. This will use the HMAC secret to sign the token. It will ask you to submit a value for "Account". This should be the name of the calling service, or a username if this is to be used by a specific individual. This value is used for traceability of errors and can be seen in the "Context" section of a Honeybadger error. For example:
{"invoked_by" => "workflow-service"}
Objects created prior to August 2024 use the legacy, unversioned layout. Objects created thereafter have the new versioned layout.
In a future workcycle, access applications (image server, streaming server, etc.) will be changed to utilize the versioned layout, at which point all objects can be migrated to the versioned layout.
In PURL file system:
/purl/document_cache/bc/123/df/4567/
| cocina.json
| public <-- public xml
\-- meta.json
In Stacks file system:
/stacks/bc/123/df/4567/
| file1.txt <-- content file
\-- more_files/
    \-- files2.txt
In Stacks file system:
/stacks/bc/123/df/4567/
| file1.txt <-- Hardlinked with content/3e25960a79dbc69b674cd4ec67a72c62. For consistency with unversioned layout.
| more_files/
    \-- files2.txt <-- Hardlinked with content/5997de4d5abb55f21f652aa61b8f3aaf. For consistency with unversioned layout.
\-- bc123df4567/
    | versions/
        | versions.json <-- Metadata about versions.
        | meta.json
        | cocina.1.json <-- cocina.json for version 1.
        | cocina.2.json <-- cocina.json for version 2.
        | cocina.json <-- cocina.json for head version. Hardlinked with cocina.2.json.
        | public.1.xml <-- public xml for version 1.
        | public.2.xml <-- public xml for version 2.
        \-- public.xml <-- public xml for head version. Hardlinked with public.2.xml.
    \-- content/
        | 3e25960a79dbc69b674cd4ec67a72c62 <-- content file named by md5. Hardlinked with file1.txt.
        | fb46af9b56999fc63eeda4da3d6bc1de <-- content file in version 1, but not version 2.
        \-- 5997de4d5abb55f21f652aa61b8f3aaf <-- Hardlinked with more_files/files2.txt.
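To make the path conventions in the trees above concrete, the sketch below derives the directory for a druid and the location of a content file in the versioned layout. The helper names, the simplified druid pattern, and the default `/stacks` root are assumptions for illustration; this is not the application's own code.

```ruby
# Simplified druid pattern (assumption): two letters, three digits,
# two letters, four digits, with an optional "druid:" prefix.
DRUID_PATTERN = /\A(?:druid:)?([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/

# "druid:bc123df4567" -> "bc/123/df/4567"
def druid_tree(druid)
  match = DRUID_PATTERN.match(druid)
  raise ArgumentError, "invalid druid: #{druid}" unless match

  match.captures.join("/")
end

# Path of a content file in the versioned layout, where files are named
# by the MD5 hex digest of their contents.
def versioned_content_path(druid, md5, stacks_root: "/stacks")
  bare_druid = druid.delete_prefix("druid:")
  File.join(stacks_root, druid_tree(druid), bare_druid, "content", md5)
end

puts druid_tree("druid:bc123df4567")
# prints bc/123/df/4567
puts versioned_content_path("bc123df4567", "3e25960a79dbc69b674cd4ec67a72c62")
# prints /stacks/bc/123/df/4567/bc123df4567/content/3e25960a79dbc69b674cd4ec67a72c62
```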