Release tasks and task descriptions
-
Triage All Issues
- Manual review and correction of all issues
- Automated triage of all issues
- Review and validation of updates encoding for update records
- Add labels to issues
- Approve, deny, hold for processing, or request curator reviews
- Assign issues to the correct project column
-
Release Candidates Identified
- Review all issues for project status and label assignments
- Extract release metadata for all issues in Ready for Sign-off/Metadata QA project column
- Run pre-release tests on extracted metadata
- Validate New Updates CSVs
- Check for in-release duplicates
- Address validation
- Check for duplicate external IDs
- Check for duplicate URLs
- Conduct secondary review of all issues
- Update issues based on second review and pre-release tests
- Correct issue metadata
- Add any necessary comments
- Update column assignment in project
- Close duplicate record issues
- Perform bulk update in validation mode
-
Create Records
- Create records with extracted metadata
- Run post-creation tests on records
- New records
- Check JSON integrity
- Check for duplicate values
- Check for leading/trailing spaces
- Check for unprintable characters
- Check for production duplicates
- Check for IDs on production
- Update records
- Check JSON integrity
- Check for duplicate values
- Check for leading/trailing spaces
- Check for unprintable characters
- Update issues based on post-creation tests
- Correct issue metadata
- Add any necessary comments
- Update column assignment in project
- Close duplicate record issues
- Re-extract metadata once all issues have been updated/corrected
- Create records with re-extracted/updated metadata
-
Pre-release prep
- Create milestone and add all release issues to milestone
- Commit records to ror-updates release branch
- Create and validate relationships file
- Commit relationships file to ror-updates release branch
-
Pre-Release Actions
- Validate without relationships
- Create relationships
- Remove relationships to inactive records
- Update labels in related records
- Update addresses
- Update last modified (set to release date)
- Validate files with relationships
-
Release Notes
- Create release notes
- Add as draft pre-publication
-
Publish Release
- Follow release instructions in ror-records
- Merge ror-updates release branch to main
- Publish release notes
-
Post-Release
- Update release issues with ROR IDs and close
- Close milestone
- Archive release issues
The following are the steps required to process issues and prepare them for release, including triage, metadata extraction, tests, record creation, and pre- and post-release activities.
Metadata in each issue should be reviewed and corrected to conform with ROR’s metadata policies and issue formatting. All metadata in issues is user submitted and may be partial, incorrect, or otherwise in need of refinement.
Metadata from issues is extracted programmatically to create new and updated records. Although the metadata undergoes additional review at time of extraction, having it correctly represented in the issues in initial triage is important to reduce the time and overall work needed to prepare the release.
Remove any extraneous text from the issue that is populated as a result of the form submission. Separate the values of any repeating fields with semicolons rather than other forms of punctuation. For example, if a new record request has three aliases submitted, they should be represented in the aliases field as follows:
alias_1; alias_2; alias_3
All name values should additionally be appended with an asterisk and the name of their language or the ISO 639-1 language code for each instance of a name. Refer to the LOC standard for identifying language codes. For example, if a record had both a Spanish and Japanese label, it would be represented in the labels field as follows:
Spanish_label*es; Japanese_label*Japanese
Prefer use of the language code over the language name. Language names have to be mapped to their language codes programmatically on the basis of an exact name match, and so can create errors in release prep if tagged incorrectly (e.g. where a typo is made in the language name, such as Ukranian vs. the correct Ukrainian).
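The tagging convention above can be checked mechanically. The following is a minimal sketch, not ROR's extraction code: `parse_name_values` and the small `LANGUAGE_NAMES` lookup are hypothetical names introduced for illustration. It splits a repeating field on semicolons, separates each name from its language tag, and shows why an exact-match lookup fails on a misspelled language name.

```python
# Illustrative only: a tiny subset of language names mapped to ISO 639-1 codes.
LANGUAGE_NAMES = {"english": "en", "spanish": "es", "japanese": "ja", "ukrainian": "uk"}


def parse_name_values(field):
    """Split 'name*lang; name*lang' into (name, code) pairs.

    code is None when the tag is missing or cannot be resolved.
    """
    results = []
    for raw in field.split(";"):
        value = raw.strip()
        if not value:
            continue
        name, sep, tag = value.rpartition("*")
        if not sep:
            # No asterisk at all: the whole value is the name, with no language.
            results.append((value, None))
            continue
        tag = tag.strip()
        if len(tag) == 2 and tag.isalpha():
            # Two letters are assumed to be a language code (not validated here).
            code = tag.lower()
        else:
            # Language names resolve only on an exact (case-insensitive) match.
            code = LANGUAGE_NAMES.get(tag.lower())
        results.append((name.strip(), code))
    return results
```

A value tagged `*Ukranian` comes back with `None` as its code, which is the kind of silent failure that preferring language codes avoids.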
Verify that the organization name provided in the name field corresponds to that for the organization and tag the corresponding language on the name. Refer to the ISO 639-1 standard for identifying language codes. The name can be determined by checking the organization's site for title information or the name included in the copyright statement. Except where otherwise indicated in the request, assign to the name value the name in the language used by the organization on their site. If the name used by the organization is rendered in a non-Latin character script, add the non-Latin character value as a label in the request. If any other names appear on the organization's site outside the title or copyright sections, add these to the aliases field.
Company names are additionally appended with their country name in parentheses to disambiguate national-level manifestations, e.g.:
Company Name (United States)
If the record is for the headquarters, additionally include the form without the country in the labels.
All issues/records have the status active, unless otherwise indicated in the request. If a request indicates that it is for an inactive organization, or you are otherwise creating a record for one, add an additional status field to the issue with the corresponding inactive status, e.g.:
Status: inactive
ROR does not create new withdrawn records. This status is reserved for records created in error and only used in update requests.
Verify that the organization's site resolves and the value provided is for the organization. Do not include links where the provided value is for another organization, but in some way references the one submitted in the request.
Verify that any domain values provided occur on the organization's site. If no domain value was provided, determine if one can be inferred from the link or the organization's site. The domain should generally be referenced in the email address for the organization, but note that if the organization is hosted on a sub-domain page, the email address domain may be that for the parent-level page. Do not assign this value to the organization if their site is hosted on a sub-domain page.
Verify that the links provided resolve and demonstrate use of the organization's name in affiliation usage or funding acknowledgements.
Verify that the type provided is correct for the organization. Refer to the types section of ROR's metadata policies and the Guidance for evaluating common organization types if any clarification is needed.
Verify that the Wikipedia page is that for the organization. Prefer the page in the language used by the organization, unless another page is more detailed or complete. Use the standard site vs. mobile page. Remove the Wikipedia link if the provided page is a sub-heading or section of another page.
Verify that all external identifiers provided belong to the organization by checking their corresponding site or API entries:
- Wikidata: https://wikidata.org
- ISNI: https://isni.oclc.org
- Funder ID: https://api.crossref.org/funders/{funder_id}
Remove any URL formatting from the identifiers. Format ISNI identifiers as four groups of four digits/characters, separated by single spaces, e.g. 0000 0005 1090 3649
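The ISNI formatting rule can be applied programmatically. This is a minimal sketch under stated assumptions: `format_isni` is a hypothetical helper, it assumes any URL form ends with the identifier after the final slash, and it checks only the length and character set, not the ISNI check character.

```python
import re


def format_isni(value):
    """Return an ISNI as four space-separated groups of four characters."""
    # Drop any URL prefix by keeping only what follows the final slash.
    bare = value.strip().rstrip("/").rsplit("/", 1)[-1]
    # Remove spaces and hyphens used as separators.
    compact = re.sub(r"[\s-]", "", bare).upper()
    # An ISNI is 16 characters: 15 digits plus a final digit or X.
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        raise ValueError(f"not a 16-character ISNI: {value!r}")
    return " ".join(compact[i:i + 4] for i in range(0, 16, 4))
```

Values that cannot be reduced to 16 characters raise an error rather than being silently passed through, so malformed identifiers surface at triage time.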
Verify that the aliases, labels, and acronyms provided occur on the organization's site or are included in affiliation usage. Pay attention to the assignment of name values in the request and reassign to the correct fields as needed. Add languages for all. Refer to the LOC standard for identifying language codes.
Relationships should be represented in the relationships field using the following pattern: ror_id (relationship_type). There is no need to separate repeating instances of the relationships with semicolons, but each must be followed by the relationship type value in order to be extracted. For example, a record for which three relationships need to be added would be coded in the relationships field as follows:
https://ror.org/000000001 (parent) https://ror.org/000000002 (child) https://ror.org/000000003 (related)
For assigning the correct relationship type, refer to the relationship types section of ROR's metadata policies.
Where names of organizations are used instead of their ROR IDs, search and determine whether the organization exists in ROR. If the organization does not exist in ROR and appears to be otherwise in scope, create a new record request for the missing organization. Tag the relationship with the issue number in either the new record request or original issue referencing the relationships, e.g.:
#12345 (child)
Note the relationship between the new records in your personal notes document for the release.
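The relationship pattern described above lends itself to a regular expression. The sketch below is illustrative rather than the extraction script's actual logic; `parse_relationships` and `unresolved_references` are hypothetical helpers. It pulls out each target and its relationship type, and lists the issue-number placeholders that still need a ROR ID.

```python
import re

# A ROR ID URL or an issue reference such as #12345, followed by "(type)".
RELATIONSHIP_PATTERN = re.compile(r"(https://ror\.org/\w+|#\d+)\s*\((\w+)\)")


def parse_relationships(field):
    """Return (target, relationship_type) pairs found in a relationships field."""
    return [
        (target, rel_type.lower())
        for target, rel_type in RELATIONSHIP_PATTERN.findall(field)
    ]


def unresolved_references(pairs):
    """Return issue-number references that still need to be replaced by a ROR ID."""
    return [target for target, _ in pairs if target.startswith("#")]
```

Entries that lack a parenthesized type do not match the pattern and are dropped, which mirrors the requirement that each relationship be followed by its type in order to be extracted.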
Verify that the city and country indicated in the request are correct for the organization. Check that the locations indicated are used on the organization's site or in other authoritative sources.
The Geonames ID will typically not be provided in the request and needs to be returned using the automated triage process or by searching the Geonames site. If a value is provided, confirm that it corresponds to the location indicated in the request.
Confirm via the organization's site.
Review these fields for any comments that may impact the metadata or curation of the request.
Remove any extraneous text from the issue that is populated as a result of the form submission.
Verify that the ROR ID provided is for the organization indicated in the request. The name value may differ from the ror_display value, but so long as it is otherwise correct for the organization, this does not need to be changed.
Review for a general framing in triaging the update. These values do not need to be changed, unless so inaccurate or unclear that they impede understanding of the request.
Begin by carefully reviewing the request and comparing it to the existing record to identify the specific changes requested. Next, consider whether any additional metadata might need updating as a consequence of the request. For example, changing an organization's name might also require updating its acronyms, aliases, or labels. Similarly, a request to update only the URL might reveal that the organization name is also out of date. Throughout this process, refer to ROR's metadata policies to ensure all changes align with them.
Identify whether any of the requested changes are inconsistent with ROR's curation policies and flag them in a comment. The most common example of this for updates is a request to remove "unofficial names" from a record. ROR facilitates the matching of variant names to their primary or official forms by inclusion of aliases on its records, so these should not be removed. If requested, explain this to the requestor in a comment and link to our blog post on name metadata.
Additionally assess whether the proposed changes could impact other records. If other records are affected, file new issues to reflect these changes.
Updates to records are encoded with a special syntax that describes the changes. This encoding is generated by the automated triaging, but must be validated by the curation lead and can alternately be supplied by them alone. This encoding begins with an "Update:" field, followed by changes to specified fields separated by vertical bars (|), and terminated with a "$". Each change follows this structure:
field.operation==value
Where:
- field is the name of the field to be updated
- operation is one of: add, delete, replace, or delete_field
- value is the new or modified data (omitted for delete_field)
- "add" - Adds the specified value to a repeating field
- "delete" - Removes the specified value from a repeating field
- "replace" - Replaces all existing data in a field with the supplied value (used for non-repeating fields or to completely overwrite repeating fields)
- "delete_field" - Removes all data in the field, rendering it empty (no value is specified)
- Non-repeating fields (can only use replace operation): 'status', 'established', 'grid.preferred', 'isni.preferred', 'wikidata.preferred'
- Repeating fields (can use add, delete, replace, or delete_field operations): 'acronym', 'alias', 'label', 'ror_display', 'types', 'domains', 'geonames', 'fundref.all', 'fundref.preferred', 'grid.all', 'isni.all', 'wikidata.all', 'website', 'wikipedia'
Updates to fields not included in the above lists will be ignored on extraction.
Example: To change an organization's name, move the old name from the labels to the aliases, add the new name as a label, assert a new preferred ISNI, and remove the Wikipedia URL:
Update: ror_display.replace==New Name*en | label.delete==Old Name*en | alias.add==Old Name*en | label.add==New Name*en | isni.preferred.replace==ISNI_ID | isni.all.add==ISNI_ID | wikipedia.delete_field$
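The encoding is regular enough to split mechanically. The following is a minimal sketch of such a parser, assuming values never contain a vertical bar; `parse_update` is a hypothetical name and this is not the extraction script itself.

```python
def parse_update(encoded):
    """Parse 'Update: field.op==value | ... $' into (field, operation, value) tuples."""
    text = encoded.strip()
    if not text.startswith("Update:") or not text.endswith("$"):
        raise ValueError("encoding must start with 'Update:' and end with '$'")
    body = text[len("Update:"):-1]
    changes = []
    for part in body.split("|"):
        part = part.strip()
        if not part:
            continue
        target, sep, value = part.partition("==")
        # The operation is the last dot-separated token; the field name may
        # itself contain a dot (e.g. isni.preferred).
        field, _, operation = target.strip().rpartition(".")
        # delete_field carries no '==value', so its value is None.
        changes.append((field, operation, value.strip() if sep else None))
    return changes
```

Splitting the operation off the right-hand side of the target keeps dotted field names such as `isni.preferred` and `isni.all` intact.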
- Name values: When adding or deleting name values, append an asterisk and the ISO 639-1 language code for each instance. Example: label.add==New Name*en. If the language associated with a name value is not included, the update to this value will not be processed correctly.
- ror_display: If changed, update both the ror_display and the record's labels. Add the original name as an alias, as appropriate.
- locations: Encode location changes using geonames. Example:
geonames.replace==GeonamesID
- External IDs:
- If not existing, add to both preferred and all fields
- If existing, add to all field
- To assert a new preferred value, replace preferred and add to all
Except for issues where updates are manually encoded or where only relationships are being updated, all requests should be triaged with the automated triage script.
For each new organization request, the script generates a comment on the issue with the following information:
- Wikidata: Name, ID, and similarity score for the matched name (if found)
- ISNI: Matched ID(s) and name(s) retrieved from the ISNI API
- Funder ID: Matched Crossref Funder Registry ID returned from the Crossref API (if found)
- Publication affiliation usage: DOIs where the affiliation string contains the organization names provided in the request. Retrieved from the OpenAlex API.
- ORCID affiliation usage: ORCID IDs where the organization name is listed as the affiliation
- Possible ROR matches: Existing ROR IDs and names that are potential matches. Used to identify records that already exist in ROR
- Previous requests: Links to GitHub issues where the same organization is named
- Geonames match: Name and Geonames ID of matched location returned from the Geonames API
The results of the automated triage should then be verified for correctness and reconciled back with the main issue body. Individual fields that fail to return anything from their corresponding API queries will be absent from the comment.
The publication and ORCID affiliation usage should be used to help assess whether the record is in scope for ROR. However, do not rely exclusively on what is returned from the script to determine evidence of affiliation usage. If no affiliation usage is returned by the script, check additional sources like Google Scholar to identify whether affiliation usage exists that is not otherwise indexed in the DOI metadata or in OpenAlex.
For update requests, the script generates an encoded update string using the record identified in the request and the description of change, created through an OpenAPI request using the updates encoding prompt, with some additional procedural validation. This results in an update string like the following:
Update: ror_display.replace==New Organization Name | label.add==New Organization Name | alias.add==Old Organization Name | isni.add==0000 0001 2345 6789$
This encoded update is added as a comment on the GitHub issue for review.
Although the automated triaging can generally handle update requests of simple to moderate complexity, it can make mistakes, skip over data, or introduce other forms of errors, and it often fails for complex or ambiguous requests. The updates encoding from the automated triage should thus not be used without additional review.
Review the encoding relative to the record and description of change to verify its correctness and completeness. Verify that the update will not result in any unnecessary data loss (e.g. where a field's values are being errantly replaced, vs. added or deleted). Check for any issues relative to the special considerations in the encoding updates section. Make sure that the correct languages are assigned for name values. Make sure any additional updates required for the record, but not identified in the request, are included in the encoding as well.
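Some of these checks can be run before the manual review. The sketch below is a hypothetical linter, not part of the triage script: it uses the field lists given earlier, and it assumes that acronym, alias, label, and ror_display are the fields whose values need a language code. It does not replace comparing the encoding against the record.

```python
import re

NON_REPEATING = {"status", "established", "grid.preferred", "isni.preferred",
                 "wikidata.preferred"}
REPEATING = {"acronym", "alias", "label", "ror_display", "types", "domains",
             "geonames", "fundref.all", "fundref.preferred", "grid.all",
             "isni.all", "wikidata.all", "website", "wikipedia"}
OPERATIONS = {"add", "delete", "replace", "delete_field"}
# Assumption: these are the fields whose values require a '*xx' language suffix.
NAME_FIELDS = {"acronym", "alias", "label", "ror_display"}
LANGUAGE_TAG = re.compile(r"\*[a-z]{2}$")


def lint_update(encoded):
    """Return a list of human-readable problems found in an encoded update."""
    text = encoded.strip()
    if not (text.startswith("Update:") and text.endswith("$")):
        return ["encoding must start with 'Update:' and end with '$'"]
    problems = []
    for part in text[len("Update:"):-1].split("|"):
        part = part.strip()
        if not part:
            continue
        target, _, value = part.partition("==")
        field, _, operation = target.strip().rpartition(".")
        value = value.strip()
        if field not in NON_REPEATING | REPEATING:
            problems.append(f"{field}: not a listed field, change will be ignored")
            continue
        if field in NON_REPEATING and operation != "replace":
            problems.append(f"{field}: non-repeating fields only support replace")
        elif operation not in OPERATIONS:
            problems.append(f"{field}: unknown operation {operation!r}")
        if operation == "delete_field":
            continue  # no value expected
        if not value:
            problems.append(f"{field}.{operation}: missing value")
        elif field in NAME_FIELDS and not LANGUAGE_TAG.search(value):
            problems.append(f"{field}.{operation}: {value!r} has no language code")
    return problems
```

Run against the sample triage output shown earlier, this sketch reports the three name values without language codes and flags `isni.add`, since `isni` on its own is not among the listed fields.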
All requests should have the appropriate labels assigned to indicate their type, character, and complexity.
Label | Description |
---|---|
lion | High-complexity issue |
jaguar | Medium-complexity issue |
kitten | Low-complexity issue |
level 1 | Higher priority (primarily new record requests) |
level 2 | Medium priority (primarily metadata changes to principal fields for discovery and disambiguation) |
level 3 | Lower priority (all other metadata changes) |
already in ror | ROR ID already exists |
duplicate | This issue or pull request already exists |
hold for later | To be processed at a later point |
merge records | Two or more records need to be merged |
split record | Split record into one or more records |
new record | Add a new ROR record |
update record | Update an existing ROR record |
cleanup | Cleanup work to fix data issues involving a high volume of records |
needs discussion | Issue requires a policy-related discussion or decision |
non-request | A general question or comment as opposed to a specific request |
out of scope | Not in scope for ROR |
org-requested | This request came from the organization in question |
project | A longer-term and larger-scale curation task, typically involving bulk updates to a set of records |
training | Issue useful for training |
triage needed | Request needs to be triaged by curation lead |
Use the curator evaluations workflows for new record requests and updates to existing records to assess all requests. Once reviewed, provide a comment approving, denying, requesting an additional review by a curator, or indicating that the request will be put on hold for additional review, contact with the requestor, or until there is further evidence that it meets ROR's criteria for inclusion.
Once each request has been triaged, reviewed, and labeled, assign to the appropriate project column.
Column | Description |
---|---|
To do (ready for review) | Issues here are a holding pen for work in progress |
In Progress | Issues here are in progress. Primarily bulk submissions and other project issues. |
Second Review | Issues here require additional review by a curation team member |
Needs discussion | Issues here require further team discussion and potential consultation with requestors |
Ready for sign-off / metadata QA | Requests here are approved and ready for metadata prep |
Approved | QA complete on metadata and approved, but not yet moved into a release |
Ready for production release | Ready to be included in the next release |
Done (Released on Production) | Issue has been released on production |
Declined requests | Requests declined because they are (1) out of scope, (2) duplicate an existing request, or (3) duplicate information already in ROR |
Hold for later | These requests cannot yet be processed due to insufficient information or incomplete functionality |
Projects | These are projects involving bulk analysis/bulk processing of sets of records |
Cleanup | Cleanup work that is needed to fix data issues involving a high volume of records across the registry |
Verify that all records that have been triaged are part of the project and are assigned to the correct project column. This can be accomplished with an issue search for issues that do not have a triage label and are not assigned to the project:
is:issue is:open -project:ror-community/19 -label:"triage needed"
- is:issue: Filters for issues (not pull requests).
- is:open: Filters for issues that are still open.
- -project:ror-community/19: Excludes issues assigned to project 19 of the "ror-community" organization.
- -label:"triage needed": Excludes issues with the "triage needed" label.
Similarly, check for any missing or mixed-up labels for new and update record requests. These are used to identify and extract the corresponding metadata, so they need to be correctly assigned. This can be accomplished by searching for misaligned issues and title text:
is:issue is:open label:"new record" "Modify the"
- is:issue: Filters for issues.
- is:open: Filters for open issues.
- label:"new record": Filters for issues with the "new record" label.
- "Modify the": Filters for issues where the title contains the text "Modify the."
is:issue is:open label:"update record" "Add a new"
- is:issue: Filters for issues.
- is:open: Filters for open issues.
- label:"update record": Filters for issues with the "update record" label.
- "Add a new": Filters for issues where the title contains the text "Add a new."
Records without new or update record tags can be seen in the ROR updates project by switching to the table view and applying the following filter:
status:"Ready for sign-off / metadata QA" -label:"new record" -label:"update record"
- status:"Ready for sign-off / metadata QA": Filters issues with the status "Ready for sign-off / metadata QA."
- -label:"new record": Excludes issues with the "new record" label.
- -label:"update record": Excludes issues with the "update record" label.
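The issue searches above can be kept as a small set of saved links so they are run the same way each release. This is a sketch only: it assumes the issues live in the ror-community/ror-updates repository, and `CHECKS` and `search_url` are hypothetical names. The project table filter is left out because it is applied in the project view rather than in issue search.

```python
from urllib.parse import quote_plus

# Assumption: release issues are tracked in the ror-community/ror-updates repository.
ISSUES_URL = "https://github.com/ror-community/ror-updates/issues"

# Search strings copied from the checks described above.
CHECKS = {
    "untriaged_outside_project": 'is:issue is:open -project:ror-community/19 -label:"triage needed"',
    "new_label_on_update_title": 'is:issue is:open label:"new record" "Modify the"',
    "update_label_on_new_title": 'is:issue is:open label:"update record" "Add a new"',
}


def search_url(query):
    """Return an issue-search URL with the query percent-encoded."""
    return f"{ISSUES_URL}?q={quote_plus(query)}"


if __name__ == "__main__":
    for name, query in CHECKS.items():
        print(f"{name}: {search_url(query)}")
```

Each check should ideally return no results; any issue that appears needs its labels or project assignment corrected.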
Use the script for extracting record metadata from issues to create the new and update records files from the issues in the Ready for Sign-off/Metadata QA column.
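Run the following tests on the extracted metadata, using the instructions in their corresponding READMEs:
- Validate New Updates CSVs
- Check for in-release duplicates
- Address validation
- Check for duplicate external IDs
- Check for duplicate URLs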
Correct any errors in the issue, repeating extraction and tests until all problems have been addressed.
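The in-release duplicate checks named in the task outline (duplicate URLs, duplicate external IDs) follow the same shape: group rows by a column and report values that occur more than once. The sketch below illustrates that idea only. It is not one of the release test scripts, `find_duplicates` is a hypothetical helper, and the column names in the usage example are assumptions rather than the real CSV headers.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(csv_text, column):
    """Map each repeated non-empty value in `column` to the row numbers where it appears."""
    seen = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        # Compare case-insensitively and ignore surrounding whitespace.
        value = (row.get(column) or "").strip().lower()
        if value:
            seen[value].append(row_number)
    return {value: rows for value, rows in seen.items() if len(rows) > 1}


if __name__ == "__main__":
    sample = "name,url\nOrg A,https://a.example\nOrg B,https://b.example\nOrg C,https://A.example \n"
    print(find_duplicates(sample, "url"))
```

Lowercasing the whole value is a deliberately loose comparison that may over-report for URLs with case-sensitive paths; flagged rows still need a manual look.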
Once all tests are passing, perform a manual review of all issues in the extracted metadata, scanning for errors. Refer to the new records and update records processing section for guidance.
After completing the pre-release tests and secondary review, update the issues to reflect any changes:
- Update issues with corrections based on test results and secondary review
- Document any changes or reasons for corrections as comments in the issues where additional explanation is needed
- Reassign issues to appropriate columns based on their updated status
- Close any issues identified as duplicates and remove them from the project
It is generally better to create the new and update records in separate batches. Update records generally require fewer tests and are less complex, so prioritize creating those first, followed by new records, if possible.
Using the extracted metadata files, create the release records via the API using the create_records script.
For new records, you will need to reconcile the ROR IDs back into the input CSV file. This can be done by copying the ROR ID values in the report.csv file that is returned as part of the API response zip into input.csv.
Rename the input files with the date and type, using the pattern {date}_{record_type}_records_metadata.csv, e.g. 20241017_new_records_metadata.csv.
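A small helper can keep the file names consistent across releases. This is a sketch under assumptions: `metadata_filename` is a hypothetical function, the date is taken to be formatted as YYYYMMDD (as in the example above), and "update" is assumed to be the record type value used for update records.

```python
from datetime import date


def metadata_filename(record_type, on=None):
    """Build '{date}_{record_type}_records_metadata.csv' with the date as YYYYMMDD."""
    # Assumption: only these two record types are used in file names.
    if record_type not in {"new", "update"}:
        raise ValueError("record_type must be 'new' or 'update'")
    on = on or date.today()
    return f"{on:%Y%m%d}_{record_type}_records_metadata.csv"
```

For example, `metadata_filename("new", date(2024, 10, 17))` yields the file name used in the example above.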
Create a release branch off the main branch in ror-updates using the pattern rc-v{release_version}{release_number}, e.g. rc-v1.54. In this branch create three directories: new, updates, and input_files. Add the records to the directory corresponding to their type and the CSV file used to generate them to the input_files directory. Commit to the branch with a basic commit message summarizing the action, e.g. "Adding all new records through 2024/10/17 for release v1.54."
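The branch name and directory layout can be generated rather than typed by hand. The following sketch assumes the release is identified by a single version string such as 1.54 and that the three directories are created under a release directory you choose. Both helper names are hypothetical, and the git steps (creating the branch, committing) are left to the usual tooling.

```python
from pathlib import Path

RELEASE_DIRS = ("new", "updates", "input_files")


def release_branch_name(version):
    """Return the release branch name, e.g. '1.54' -> 'rc-v1.54'."""
    # Tolerate a version passed with a leading 'v'.
    return f"rc-v{version.lstrip('v')}"


def create_release_dirs(root):
    """Create the three release directories under `root` and return their paths."""
    paths = [Path(root) / name for name in RELEASE_DIRS]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Because `mkdir` is called with `exist_ok=True`, the function can be re-run safely if a release is prepared in several sittings.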
Run the following tests on records created using the instructions in their corresponding READMEs:
- JSON integrity check
- Duplicate values across fields
- Leading/trailing whitespace/punctuation
- Unprintable characters
- Production duplicates (new records only)
- IDs already in use on production
For each test, review output, make corrections either directly in files or through re-extraction/creation, delete test results, and commit changes.
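To give a sense of what the first few checks look for, here is a minimal sketch covering JSON integrity, leading/trailing whitespace, and unprintable characters for a single record. It is not the release test suite, which should be run per its READMEs; `iter_strings` and `check_record` are hypothetical helpers, and the field names in any sample record are illustrative.

```python
import json
import unicodedata


def iter_strings(node, path="$"):
    """Yield (path, string) for every string value in a parsed JSON document."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")


def check_record(text):
    """Return a list of (path, problem) tuples for one record's JSON text."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError as error:
        return [("$", f"invalid JSON: {error.msg}")]
    problems = []
    for path, value in iter_strings(record):
        if value != value.strip():
            problems.append((path, "leading/trailing whitespace"))
        # Unicode categories starting with 'C' cover control, format,
        # surrogate, private-use, and unassigned characters.
        if any(unicodedata.category(ch).startswith("C") for ch in value):
            problems.append((path, "unprintable character"))
    return problems
```

Reporting the path to each offending value makes it easier to decide whether to fix the record file directly or to correct the issue and re-extract.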
After completing the tests, update the issues to reflect any changes by:
- Updating issues and CSV files with test-based corrections, such that they are consistent with any changes made in files.
- Documenting any necessary context for any changes in issue comments
- Reassigning issues to the appropriate columns
- Closing issues and removing from the project
- Deleting rows from the input CSV files for any records that have been dropped from the release.
Once all tests have been completed, use the move issues script to move the release issues to the Ready for Production Release column.
Create a milestone in ror-updates, naming it with the release number (e.g. v1.55). No due date or description is required. Note or copy the milestone number and use it in the script to add all issues to the milestone.
For requests that reference new records in their relationships field, add the corresponding ROR IDs to the requests that reference them from the new records that were generated, replacing the issue number references. Use the new records relationship label to identify these requests, or additionally search and filter with the scripts for finding text in issues.
Once all requests have been updated with the new records' ROR IDs, generate a CSV file of all names and ROR IDs in the release directory using the script for obtaining all of these values. Then use this file as input to the create relationships script to generate a relationships CSV from the issues in the Ready for Production Release column (which the script references by default). Review the resulting relationships CSV file for any errors (identified by an error in the relationship type column) and spot check several relationship entries against their corresponding issues. Once reviewed, commit the relationships CSV to the release directory in the release branch.
Once all files are committed to the release branch, run the following GitHub actions in this order:
- Validate Without Relationships
- Create Relationships
- Remove relationships to inactive records
- Update Labels in Related Records
- Update Addresses
- Update Last Modified (Set date to the scheduled release date)
- Validate Files with Relationships
If any errors are encountered when running each action, review and update files to correct for these issues, with a corresponding commit, and re-run until successful.
Once all actions have run successfully, pull down the updated and additional record files to your local branch and run the create release notes script on the release directory. Add the resulting release notes to ror-updates as a draft for publication after the release has been deployed.
Once all release tests have passed, but prior to deploying the data dump to Zenodo, merge the release branch to main and publish the draft release notes.
Once the release is fully deployed and published to Zenodo, use the close issues script with the release input files as input to add comments to all release issues and close them.
Verify that all release issues have been closed (listed in the milestone description), update and close any missed by the close issues script, then close the milestone.
Once all issues have been updated with the release comments and closed, archive the release issues in the project.