This proof of concept project is not currently actively maintained! The repository represents a potential workflow to enhance a CollectionBuilder collection with an Elasticsearch instance; however, dependencies have not been updated. Please consider this a basic outline model only, and update all dependencies before use!
CollectionBuilder-Elasticsearch is a web application generator and toolset for configuring, administering, and searching collection data using Elasticsearch.
You can either build the Docker image that has all of the software dependencies preinstalled, or you can install these dependencies yourself on your own machine.
If you have `make` installed:
make build-docker-image
Otherwise, you can run the `docker-compose` command directly:
docker-compose build \
--build-arg "DOCKER_USER=`id -un`" \
--build-arg "DOCKER_UID=`id -u`" \
--build-arg "DOCKER_GID=`id -g`" \
default
Running the container will give you a bash prompt within the container at which you can execute the steps in Building The Project. Note that `docker-compose` will automatically create a local Elasticsearch instance, so you can skip step 2, Start Elasticsearch.
The `docker-compose` configuration will mirror your local `collectionbuilder-elasticsearch` directory inside the container, so any changes you make to the files in that directory on your local filesystem will be reflected within the container.
If using `make`:
make run-docker-image
otherwise:
docker-compose run default
See: https://collectionbuilder.github.io/docs/software.html#ruby
The code in this repo has been verified to work with the following versions:
name | version |
---|---|
ruby | 2.7.0 |
bundler | 2.1.4 |
jekyll | 4.1.0 |
After the `bundler` gem is installed, run the following command to install the remaining dependencies specified in the `Gemfile`:
bundle install
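To confirm that compatible versions were installed, you can compare the reported versions against the table above:

ruby -v
bundle -v
bundle exec jekyll -v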
*Note for Mac users:* Several dependencies can be installed using Homebrew. Homebrew makes the installation simple via basic command line instructions like `brew install xpdf`.
The `pdftotext` utility in the Xpdf package is used by `extract-pdf-text` to extract text from `.pdf` collection object files.
Download the appropriate executable for your operating system under the "Download the Xpdf command line tools:" section here: http://www.xpdfreader.com/download.html
The scripts expect this to be executable via the command `pdftotext`.
Windows users will need to extract the files from the downloaded .zip folder and then move the extracted directory to their program files folder.
Mac users can use Homebrew and type `brew install xpdf` into the command line.
Here's an example of installation under Ubuntu:
curl https://xpdfreader-dl.s3.amazonaws.com/xpdf-tools-linux-4.02.tar.gz -O
tar xf xpdf-tools-linux-4.02.tar.gz
sudo mv xpdf-tools-linux-4.02/bin64/pdftotext /usr/local/bin/
rm -rf xpdf-tools-linux-4.02*
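Whichever way you install it, you can verify that the utility is available on your PATH by printing its version information:

pdftotext -v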
Download the appropriate executable for your operating system here: https://www.elastic.co/downloads/elasticsearch
Windows users will need to extract the files from the downloaded .zip folder and then move the extracted directory to their program files folder.
Mac users can use Homebrew, following [these instructions](https://www.elastic.co/guide/en/elasticsearch/reference/current/brew.html): type `brew tap elastic/tap` into your terminal "to tap the Elastic Homebrew repository," then type `brew install elastic/tap/elasticsearch-full` to install the full version.
Here's an example of installation under Ubuntu:
curl https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.0-amd64.deb -O
curl https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.0-amd64.deb.sha512 -O
sha512sum -c elasticsearch-7.7.0-amd64.deb.sha512
sudo dpkg -i elasticsearch-7.7.0-amd64.deb
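If you'd like Elasticsearch to start automatically at boot after the Debian package install, you can optionally enable the service with the standard systemd commands:

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch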
For Mac and Linux Users
Add the following lines to your `elasticsearch.yml` configuration file:
network.host: 0.0.0.0
discovery.type: single-node
http.cors.enabled: true
http.cors.allow-origin: "*"
Following the above installation for Ubuntu, `elasticsearch.yml` can be found in the directory `/etc/elasticsearch`. Mac users can find `elasticsearch.yml` in the directory `/usr/local/etc/elasticsearch/`.
Update _config.yml to reflect your Elasticsearch server configuration. E.g.:
elasticsearch-protocol: http
elasticsearch-host: 0.0.0.0
elasticsearch-port: 9200
elasticsearch-index: moscon_programs_collection
For Windows Users
Add the following lines to your `elasticsearch.yml` configuration file:
network.host: localhost
discovery.type: single-node
http.cors.enabled: true
http.cors.allow-origin: "*"
If you installed Elasticsearch on Windows from the downloaded `.zip` archive, `elasticsearch.yml` can be found in the `config` directory of the extracted Elasticsearch folder.
Update _config.yml to reflect your Elasticsearch server configuration. E.g.:
elasticsearch-protocol: http
elasticsearch-host: localhost
elasticsearch-port: 9200
elasticsearch-index: moscon_programs_collection
Add the collections you want to include in the build to the config-collections.csv configuration file. Each row must specify at least a homepage_url
value. Any unspecified fields will be addressed during the build process, either automatically or via a manual input prompt.
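For illustration, a minimal config-collections.csv row might look like the sketch below; homepage_url is the only column shown that is required, objects_metadata_url is optional (see the metadata download step later in this document), and the URL itself is a hypothetical example:

homepage_url,objects_metadata_url
https://www.lib.uidaho.edu/digital/example/,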
Though this step is platform dependent, you might accomplish this by executing `elasticsearch` in a terminal.
For example, if you installed Elasticsearch under Ubuntu, you can start Elasticsearch with the command:
sudo service elasticsearch start
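Once Elasticsearch is running, you can confirm that it's reachable (a quick sanity check, assuming the default port of 9200 used throughout this documentation) with:

curl http://localhost:9200/_cluster/health?pretty

The es:ready rake task described below performs a similar check.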
Use the cb:build rake task to automatically execute the following rake tasks:
cb:generate_collections_metadata
cb:download_collections_objects_metadata
cb:analyze_collections_objects_metadata
cb:generate_search_config
cb:download_collections_pdfs
cb:extract_pdf_text
cb:generate_collections_search_index_data
cb:generate_collections_search_index_settings
es:create_directory_index
cb:create_collections_search_indices
cb:load_collections_search_index_data
Usage:
rake cb:build
See Manually building the project for information on how to customize these build steps.
`_data/config-search.csv` defines the settings for the fields that you want indexed and displayed in search. This configuration file is automatically generated during the build process via analysis of the collection object metadata by the generate_search_config rake task. While the auto-generated config is a good starting point, we recommend that you audit and edit this file to refine the search experience.
rake cb:serve
rake tasks are used to automate project build steps and administer the Elasticsearch instance.
All of the defined rake tasks, as reported by `rake --tasks`:
rake cb:analyze_collections_objects_metadata # Analyze the downloaded collection object metadata files
rake cb:build[env,test] # Execute all build steps required to go from a config-collection file to fully-populated Elasticsearch index
rake cb:create_collections_search_indices[env,es_profile] # Create Elasticsearch indices for all configured collections
rake cb:deploy # Build site with production env
rake cb:download_collections_objects_metadata # Download the object metadata files for each collection
rake cb:download_collections_pdfs[test] # Download collections PDFs for text extraction
rake cb:enable_daily_search_index_snapshots[profile] # Enable daily Elasticsearch snapshots to be written to the "_elasticsearch_snapshots" directory of your Digital Ocean Space
rake cb:extract_pdf_text # Extract the text from PDF collection objects
rake cb:generate_collection_search_index_data[env,collection_url] # Generate the file that we'll use to populate the Elasticsearch index via the Bulk API
rake cb:generate_collection_search_index_settings[collection_url] # Generate the settings file that we'll use to create the Elasticsearch index
rake cb:generate_collections_metadata # Generate metadata for each collection from local config and remote JSON-LD
rake cb:generate_collections_search_index_data[env] # Generate the file that we'll use to populate the Elasticsearch index via the Bulk API for all configured collections
rake cb:generate_collections_search_index_settings[env] # Generate the Elasticsearch index settings files for all configured collections
rake cb:generate_search_config # Create an initial search config from the superset of all object fields
rake cb:load_collections_search_index_data[env,es_profile] # Load data into Elasticsearch indices for all configured collections
rake cb:serve[env] # Run the local web server
rake es:create_directory_index[profile] # Create the Elasticsearch directory index
rake es:create_index[profile,index,settings_path] # Create the Elasticsearch index
rake es:create_snapshot[profile,repository,wait] # Create a new Elasticsearch snapshot
rake es:create_snapshot_policy[profile,policy,repository,schedule] # Create a policy to enable automatic Elasticsearch snapshots
rake es:create_snapshot_s3_repository[profile,bucket,base_path,repository_name] # Create an Elasticsearch snapshot repository that uses S3-compatible storage
rake es:delete_directory_index[profile] # Delete the Elasticsearch directory index
rake es:delete_index[profile,index] # Delete the Elasticsearch index
rake es:delete_snapshot[profile,snapshot,repository] # Delete an Elasticsearch snapshot
rake es:delete_snapshot_policy[profile,policy] # Delete an Elasticsearch snapshot policy
rake es:delete_snapshot_repository[profile,repository] # Delete an Elasticsearch snapshot repository
rake es:execute_snapshot_policy[profile,policy] # Manually execute an existing Elasticsearch snapshot policy
rake es:list_indices[profile] # Pretty-print the list of existing indices to the console
rake es:list_snapshot_policies[profile] # List the currently-defined Elasticsearch snapshot policies
rake es:list_snapshot_repositories[profile] # List the existing Elasticsearch snapshot repositories
rake es:list_snapshots[profile,repository_name] # List available Elasticsearch snapshots
rake es:load_bulk_data[profile,datafile_path] # Load index data using the Bulk API
rake es:minimize_disk_watermark[profile] # Minimize the disk watermark to allow write operations on a near-full disk
rake es:ready[profile] # Display whether the Elasticsearch instance is up and running
rake es:restore_snapshot[profile,snapshot_name,wait,repository_name] # Restore an Elasticsearch snapshot
rake es:update_directory_index[profile,raise_on_missing] # Update the Elasticsearch directory index to reflect the current indices
You can find detailed information about many of these tasks in the section: Manually building the project
All rake tasks are defined by the `.rake` files in the rakelib/ directory. Note that the `Rakefile` in the project root, which is empty aside from a comment justifying its existence, exists only to signal to rake that it should look for tasks in `rakelib/`.
The currently defined `.rake` files are as follows:
file | description |
---|---|
collectionbuilder.rake | Single-operation project build tasks |
elasticsearch.rake | Elasticsearch administration tasks |
You can customize many of the default task configuration options by modifying the values in `rakelib/lib/constants.rb`.
Some tasks have external dependencies as indicated below:
task name | software dependencies | service dependencies |
---|---|---|
cb:extract_pdf_text | xpdf | |
es:* | | Elasticsearch |
This section will describe how to get Elasticsearch up and running on a Digital Ocean Droplet using our preconfigured, custom disk image.
-   Import our custom Elasticsearch image via the Digital Ocean web console by navigating to `Images -> Custom Images -> Import via URL` and entering the URL: https://collectionbuilder-sa-demo.s3.amazonaws.com/collectionbuilder-elasticsearch-1-0.vmdk
    - You will need to select a "Distribution" -- choose `Ubuntu`.
    - You will need to select a datacenter location. Choose the location closest to your physical location.
    - Once the image is available within your account, click on `More -> Start a droplet`.
    - You can simply leave the default settings and scroll to the bottom of the page to start the droplet.
-   Once the Droplet is running, navigate to `Networking -> Firewalls -> Create Firewall`. Give the firewall a name and add the rules as depicted in the below screenshot:
    - The `HTTP TCP 80` rule allows the `certbot` SSL certificate application that we'll soon run to verify that we own this machine.
    - The `Custom TCP 9200` rule enables external access to the Elasticsearch instance.

    In the `Apply to Droplets` section, specify the name of the previously-created Elasticsearch Droplet and click `Create Firewall`. This can be found at the top of the page for the firewall. There is a `droplets` menu option (it's a little hard to see). Click that and then specify the name of the droplet you created.
-   Generate your SSL certificate

    The Elasticsearch server is configured to use secure communication over HTTPS, which requires an SSL certificate. In order to request a free SSL certificate from Let's Encrypt, you first need to ensure that your Elasticsearch server is accessible via some registered web domain. To do this, you'll need to create an `A`-type DNS record that points some root/sub-domain to the IP address of your Droplet.

    1. Create a DNS record for your Droplet
        1. In the Digital Ocean UI, navigate to `Droplets -> <the-droplet>`
        2. Take note of the `ipv4` IP address displayed at the top
        3. However you do this, create an `A` DNS record to associate a root/sub-domain with your Droplet IP address
You will need to have a domain to create an A record. If you have one hosted somewhere, such as a personal website, you can go to the area where your host manages the DNS records (A and CNAME, etc.) and add an A record for a new subdomain, such as `digitalocean.johndoe.com`, pointing it to the ipv4 IP address of your Droplet.
Once that is set up, you will enter that full domain (e.g. `digitalocean.johndoe.com`) in step 9 below to generate the certificate.
2. Generate the certificate
1. In the Digital Ocean UI, navigate to `Droplets -> <the-droplet>`
2. Click the `Console []` link on the right side (it's a blue link at the top right)
3. At the `elastic login:` prompt, type `ubuntu` and hit `ENTER`
4. At the `Password:` prompt, type `password` and hit `ENTER`
5. Type `sudo ./get-ssl-certificate` and hit `ENTER`, type `password` and hit `ENTER`
6. Enter an email address to associate with your certificate
7. Type `A` then `ENTER` to agree to the terms of service
8. Specify whether you want to share your email address with the EFF
9. Enter the name of the root/sub-domain for which you created the `A` record associated with your Droplet IP address
10. Restart Elasticsearch so that it will use the new certificate by executing `sudo systemctl restart elasticsearch`
-   Check that Elasticsearch is accessible via HTTPS

    In a web browser, surf on over to `https://<the-root/sub-domain-you-created>:9200` and you should see something like this:

    It's reporting a `security_exception` because the server is configured to prevent anonymous, public users from accessing things they shouldn't. You'll see a more friendly response at: `https://<the-root/sub-domain-you-created>:9200/_search`
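    The same check can be run from the command line (substitute your own domain):

    curl https://<the-root/sub-domain-you-created>:9200/_search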
-   Generate your Elasticsearch passwords

    In order to securely administer your Elasticsearch server, you'll need to generate passwords for the built-in Elasticsearch users.

    If necessary, open a console window:

    1. In the Digital Ocean UI, navigate to `Droplets -> <the-droplet>`
    2. Click the `Console []` link on the right side
    3. At the `elastic login:` prompt, type `ubuntu` and hit `ENTER`
    4. At the `Password:` prompt, type `password` and hit `ENTER`
Execute the command:
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords auto
The script will display the name and newly-generated password for each of the built-in Elasticsearch users - copy these down and save them in a safe place. You will be using the `elastic` user credentials to later administer the server. See: Creating Your Local Elasticsearch Credentials File
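As a quick check that the generated credentials work, you can issue an authenticated request (curl will prompt for the elastic user's password; substitute your own domain):

curl -u elastic https://<the-root/sub-domain-you-created>:9200/_cluster/health?pretty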
-   Change the `ubuntu` user password

    Every droplet that someone creates from the provided custom disk image is going to have the same default `ubuntu` user password of `password`. For better security, you should change this to your own, unique password.

    If necessary, open a console window:

    1. In the Digital Ocean UI, navigate to `Droplets -> <the-droplet>`
    2. Click the `Console []` link on the right side
    3. At the `elastic login:` prompt, type `ubuntu` and hit `ENTER`
    4. At the `Password:` prompt, type `password` and hit `ENTER`
Execute the command:
sudo passwd ubuntu
The flow looks like this:
[sudo] password for ubuntu: <enter-the-current-password-ie-"password">
New password: <enter-your-new-password>
Retype new password: <enter-your-new-password>
passwd: password updated successfully
Elasticsearch provides a snapshot feature that allows you to save the current state of your indices. These snapshots can then be used to restore an instance to a previous state, or to initialize a new instance.
Though there are several options for where/how to store your snapshots, we'll describe doing so using a Digital Ocean Space and the Elasticsearch repository-s3 plugin. Note that since we're leveraging the Digital Ocean Spaces S3-compatible API, these same basic steps can be used to alternately configure an AWS S3 bucket for snapshot storage.
-   Choose or create a Digital Ocean Space

    The easiest thing is to use the same DO Space that you're already using to store your collection objects to also store your Elasticsearch snapshots. In fact, the `cb:enable_daily_search_index_snapshots` rake task that we detail below assumes this and parses the Space name from the `digital-objects` value of your production config. By default, the snapshot files will be saved as non-public objects to a `_elasticsearch_snapshots/` subdirectory of the configured Space, which shouldn't interfere with any existing collections.

    If you don't want to use an existing DO Space to store your snapshots, you should create a new one for this purpose.
-   Create a Digital Ocean Space access key

    Elasticsearch will need to specify credentials when reading and writing snapshot objects on the Digital Ocean Space.

    You can generate your Digital Ocean access key by going to your DO account page and clicking on:

    `API -> Spaces access keys -> Generate New Key`

    A good name for this key is something like: `elasticsearch-snapshot-writer`
-   Configure Elasticsearch to access the Space

    This step needs to be completed on the Elasticsearch server instance itself.

    Open a console window:

    - In the Digital Ocean UI, navigate to `Droplets -> <the-droplet>`
    - Click the `Console []` link on the right side
    - At the `elastic login:` prompt, type `ubuntu` and hit `ENTER`
    - At the `Password:` prompt, type `password` (or your updated password) and hit `ENTER`
    Run the `configure-s3-snapshots` shell script.
Usage:
sudo ./configure-s3-snapshots
This script will:
- Check whether an S3-compatible endpoint has already been configured
- Install the `repository-s3` plugin if necessary
- Prompt you for your S3-compatible endpoint (see note)
- Prompt you for the DO Space access key
- Prompt you for the DO Space secret key
Notes:
-   This script assumes the default S3 repository name of `"default"`. If you plan on executing the `es:create_snapshot_s3_repository` rake task manually (as opposed to the automated `enable_daily_search_index_snapshots` that we detail below) and specifying a non-default repository name, you should specify that name as the first argument to `configure-s3-snapshots`, i.e. `sudo ./configure-s3-snapshots <repository-name>`
-   You can find your DO Space endpoint value by navigating to `Spaces -> <the-space> -> Settings -> Endpoint` in the Digital Ocean UI. Alternatively, if you know which region your Space is in, the endpoint value is in the format `<REGION>.digitaloceanspaces.com`, e.g. `sfo2.digitaloceanspaces.com`
-   Configure a snapshot repository and enable daily snapshots

    The `cb:enable_daily_search_index_snapshots` rake task takes care of creating the Elasticsearch S3 snapshot repository and the automated snapshot policy, and tests the snapshot policy to make sure everything's working.

    Usage:
rake cb:enable_daily_search_index_snapshots[<profile-name>]
Notes:
- This task only targets remote production (not local development) Elasticsearch instances, so you must specify an Elasticsearch credentials profile name.
- This task assumes that you want to use all of the default snapshot configuration values, which includes using the same Digital Ocean Space that you've configured in the `digital-objects` value of your production config to store your snapshot files. If you want to use a different repository name, a different DO Space, or a snapshot schedule other than daily, you'll have to run the `es:create_snapshot_s3_repository`, `es:create_snapshot_policy`, and `es:execute_snapshot_policy` rake tasks manually.
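For reference, a fully manual setup might look roughly like the following sketch; the bracketed arguments follow the task signatures listed earlier, and the Space name, policy name, and schedule values are hypothetical placeholders:

rake es:create_snapshot_s3_repository[PRODUCTION,<your-space-name>,_elasticsearch_snapshots,default]
rake es:create_snapshot_policy[PRODUCTION,<policy-name>,default,<schedule>]
rake es:execute_snapshot_policy[PRODUCTION,<policy-name>]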
After generating passwords for your built-in Elasticsearch users, the ES-related rake tasks will need access to these usernames / passwords (namely that of the `elastic` user) in order to communicate with the server. This is done by creating a local Elasticsearch credentials file.
By default, the tasks will look for this file at: `<user-home>/.elasticsearch/credentials`. If you want to change this location, you can do so here.
This credentials file must be formatted as YAML as follows:
users:
<profile-name>:
username: <elasticsearch-username>
password: <elasticsearch-password>
Here's a template that works with the other examples in this documentation, requiring only that you fill in the `elastic` user password:
users:
PRODUCTION:
username: elastic
password: <password>
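For example, here's a minimal sketch of creating that file from the command line, assuming the default location and the PRODUCTION profile shown above (the password value is a placeholder):

mkdir -p ~/.elasticsearch
cat > ~/.elasticsearch/credentials <<'EOF'
users:
  PRODUCTION:
    username: elastic
    password: <password>
EOF
chmod 600 ~/.elasticsearch/credentials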
The search configuration in `config-search.csv` (which is generated by the `cb:generate_search_config` rake task) is used by the `cb:generate_collection_search_index_settings` rake task to generate an Elasticsearch index settings file, which the `es:create_index` rake task then uses to create a new Elasticsearch index. If you need to make changes to `config-search.csv` after the index has already been created, you will need to synchronize these changes to Elasticsearch in order for the new configuration to take effect.
While there are a number of ways to achieve this (see: Index Aliases and Zero Downtime), the easiest is to:
-   Delete the existing index by executing the `es:delete_index` rake task. See `es:create_index` for how to specify a user profile name if you need to target your production Elasticsearch instance. Note that `es:delete_index` automatically invokes `es:update_directory_index` to remove the deleted index from any existing directory.

-   Execute the `cb:generate_collection_search_index_settings` and `es:create_index` rake tasks to create a new index using the updated `config-search.csv` configuration.

-   Execute the `es:load_bulk_data` rake task to load the documents into the new index.
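Put together, the sequence might look like the following sketch; the PRODUCTION profile, index name, and file paths are examples to be replaced with your own values (the index name here matches the elasticsearch-index example from the configuration section above):

rake es:delete_index[PRODUCTION,moscon_programs_collection]
rake cb:generate_collection_search_index_settings[<collection_url>]
rake es:create_index[PRODUCTION,moscon_programs_collection,_data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/elasticsearch/index_settings.json]
rake es:load_bulk_data[PRODUCTION,_data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/elasticsearch/bulk_data.jsonl]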
Cross-collection search is made possible by the addition of a special `directory_` index on the Elasticsearch instance that stores information about the available collection indices.
The documents in `directory_` comprise the fields: `index`, `doc_count`, `title`, `description`.
Here's an example Elasticsearch query that returns two documents from a `directory_` index:
curl --silent https://<elasticsearch-host>:9200/directory_/_search?size=2 | jq
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "directory_",
"_type": "_doc",
"_id": "pg1",
"_score": 1,
"_source": {
"index": "pg1",
"doc_count": "342",
"title": "The University of Idaho Campus Photograph Collection",
"description": "The University of Idaho Campus Photograph Collection contains over 3000 historical photographs of the UI Campus from 1889 to the present."
}
},
{
"_index": "directory_",
"_type": "_doc",
"_id": "uiext",
"_score": 1,
"_source": {
"index": "uiext",
"doc_count": "253",
"title": "Agricultural Experiment & UI Extension Publications",
"description": "A collaboration between the Library and University of Idaho Extension, the University of Idaho Extension and Idaho Agricultural Experiment Station Publications collection features over 2000 publications that serve as the primary source for practical, research-based information on Idaho agriculture, forestry, gardening, family and consumer sciences, and other to links."
}
}
]
}
}
The site-specific search page queries this index to collect information about whether there are additional collections available to search. The cross-collection search page queries this index in order to populate its list of available indices to search against.
Use the es:create_directory_index rake task to create the `directory_` index on your Elasticsearch instance.
Note that the es:create_directory_index
task operates directly on the Elasticsearch instance and has no dependency on the collection-specific codebase in which you execute it.
Local development usage:
rake es:create_directory_index
To target your production Elasticsearch instance, you must specify a user profile name argument:
rake es:create_directory_index[<profile-name>]
For example, to specify the user profile name "PRODUCTION":
rake es:create_directory_index[PRODUCTION]
Use the es:update_directory_index rake task to update the `directory_` index to reflect the current state of collection indices on the Elasticsearch instance. Note that the es:create_index and es:delete_index tasks automatically invoke `es:update_directory_index`.
The `es:update_directory_index` task works by querying Elasticsearch for a list of all available indices, which it uses to update the `directory_` index documents, either generating new documents for unrepresented collection indices or removing documents that represent collection indices that no longer exist.
Note that the es:update_directory_index
task operates directly on the Elasticsearch instance and has no dependency on the collection-specific codebase in which you execute it.
Local development usage:
rake es:update_directory_index
To target your production Elasticsearch instance, you must specify a user profile name argument:
rake es:update_directory_index[<profile-name>]
For example, to specify the user profile name "PRODUCTION":
rake es:update_directory_index[PRODUCTION]
The following section provides details on how to manually execute and customize each step of the project build process.
During the build process, all generated and downloaded collection-specific files will be stored in collection-specific subdirectories of _data/collections
.
This tree has the structure:
└── _data
    └── collections
        ├── <COLLECTION_URL_FORMATTED_AS_FILENAME>
        │   ├── collection-metadata.json
        │   ├── elasticsearch
        │   │   ├── bulk_data.jsonl
        │   │   └── index_settings.json
        │   ├── extracted_pdfs_text
        │   │   ├── <PDF_URL_FORMATTED_AS_FILENAME>.txt
        │   │   └── ...
        │   ├── objects-metadata.json
        │   └── pdfs
        │       ├── <PDF_URL_FORMATTED_AS_FILENAME>
        │       └── ...
        └── ...
Use the `cb:generate_collections_metadata` rake task to generate a final metadata file for each configured collection in `_data/config-collections.csv`. If there are any required fields unspecified in `config-collections.csv`, an attempt will be made to retrieve these values by reading the JSON-LD data embedded in the response from the `homepage_url`. If any required values remain unsatisfied, you will be prompted for manual input of these values.
This step will generate the `_data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/collection-metadata.json` files.
Usage:
rake cb:generate_collections_metadata
Use the `cb:download_collections_objects_metadata` rake task to download each collection's object metadata JSON file from either the `objects_metadata_url` specified in `config-collections.csv` or from the default website path of `/assets/data/metadata.json` as defined by the `$COLLECTIONBUILDER_JSON_METADATA_PATH` variable in `rakelib/lib/constants.rb`.
This step will generate the `_data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/objects-metadata.json` files.
Usage:
rake cb:download_collections_objects_metadata
Use the `cb:analyze_collections_objects_metadata` rake task to analyze the downloaded objects metadata files and display any warnings regarding missing or invalid values.
This step will not generate any files.
Usage:
rake cb:analyze_collections_objects_metadata
An error condition will be indicated by the collection-specific output:
**** Analyzing objects metadata for collection: <COLLECTION_URL>
...
Found missing or invalid values for the following REQUIRED fields:
{
"<FIELD_NAME>": 1
}
Please correct these values on the remote collection site, or edit the local copy at the below location, and try again:
_data/collections/<COLLECTION_URL_ESCAPED_AS_FILENAME>/objects-metadata.json
and the final output line:
**** Aborting due to 1 collections with missing or invalid REQUIRED object metadata fields
The following help text will also be displayed:
**** Some optional and/or required fields that we normally include in the search index documents were found to be missing or invalid.
If your metadata uses non-standard field names, the $OBJECT_METADATA_KEY_ALIASES_MAP configuration variable in rakelib/lib/constants.rb provides a means of mapping our names to yours. Please see the documentation in constants.rb for more information on how to do this.
Any missing or invalid required fields must be corrected before continuing on to the next step.
Use the cb:generate_search_config
rake task to automatically generate a default search configuration by analyzing all of the downloaded object metadata files.
This step will generate the _data/config-search.csv
file.
Usage:
rake cb:generate_search_config
Use the cb:download_collections_pdfs
rake task to download all PDFs specified in the object metadata files to the local filesystem for text extraction.
This step will download PDFs to the _data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/pdfs/
directories.
Usage:
rake cb:download_collections_pdfs
Use the cb:extract_pdf_text
rake task to extract text from the downloaded PDFs.
This step will write the extracted text to the _data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/extracted_pdfs_text/ directories.
Usage:
rake cb:extract_pdf_text
Use the cb:generate_collections_search_index_data
rake task to generate a search index data file for each collection which includes the object metadata and extracted PDF text.
This step will generate the _data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/elasticsearch/bulk_data.jsonl
files.
Local development usage:
rake cb:generate_collections_search_index_data
To target your production Elasticsearch instance, you must specify a user profile name argument:
rake cb:generate_collections_search_index_data[<profile-name>]
For example, to specify the user profile name "PRODUCTION":
rake cb:generate_collections_search_index_data[PRODUCTION]
When you specify a user profile name, the task assumes that you want to target the production Elasticsearch instance and will read the connection information from _config.production.yml
and the username / password for the specified profile from your Elasticsearch credentials file.
See: Creating Your Local Elasticsearch Credentials File
Use the cb:generate_collections_search_index_settings
rake task to generate an Elasticsearch index settings file for each collection based on the previously-generated search configuration.
This step will generate the _data/collections/<COLLECTION_URL_FORMATTED_AS_FILENAME>/elasticsearch/index_settings.json
files.
Usage:
rake cb:generate_collections_search_index_settings
Use the es:create_directory_index rake task to create the `directory_` index that is used to store information about which collection-specific indices exist on the server.
Usage:
rake es:create_directory_index
See 7. Generate the Search Index Data Files for information on specifying a profile to target non-development environments.
Use the cb:create_collections_search_indices
rake task to create a search index for each collection using the previously-generated index settings.
Usage:
rake cb:create_collections_search_indices
See 7. Generate the Search Index Data Files for information on specifying a profile to target non-development environments.
Use the cb:load_collections_search_index_data rake task to load the previously-generated search index data files into their corresponding indices.
Usage:
rake cb:load_collections_search_index_data
See 7. Generate the Search Index Data Files for information on specifying a profile to target non-development environments.
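For reference, a complete manual build for local development runs the same tasks that rake cb:build automates, in this order:

rake cb:generate_collections_metadata
rake cb:download_collections_objects_metadata
rake cb:analyze_collections_objects_metadata
rake cb:generate_search_config
rake cb:download_collections_pdfs
rake cb:extract_pdf_text
rake cb:generate_collections_search_index_data
rake cb:generate_collections_search_index_settings
rake es:create_directory_index
rake cb:create_collections_search_indices
rake cb:load_collections_search_index_data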