diff --git a/configuration_settings_on_the_command_line/index.html b/configuration_settings_on_the_command_line/index.html index bcbb024e..4f5de479 100644 --- a/configuration_settings_on_the_command_line/index.html +++ b/configuration_settings_on_the_command_line/index.html @@ -970,6 +970,8 @@

Specifying configuration settings on the command line

  • log_file_path
  • rollback_csv_file_path
  • rollback_config_file_path
  • csv_start_row
  • csv_stop_row
  • In all cases, you need to prefix the setting name with -- to conform with Python's command-line argument syntax.
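    For instance, to override one of the settings listed above, such as csv_start_row, you would prefix it with -- when invoking Workbench (a minimal sketch; the starting row number 10 is only an illustration):

    ./workbench --config config.yml --csv_start_row 10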

    For example, if you want to specify an input CSV file different from the one registered in your configuration file, include --input_csv as a command-line argument to Workbench:
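    A minimal illustration (the CSV filename my_other_metadata.csv is hypothetical):

    ./workbench --config config.yml --input_csv my_other_metadata.csv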

March 13, 2023 (commit 3eb9c19): Resolved issue 584 . March 10, 2023 (commit a39bd8f): Resolved issue 580 . March 7, 2023 (commit 591dac1): Resolved issue 405 ; (commit bd5ee60) resolved issue 579 . March 6, 2023 (commit 3c19cf6): Resolved issue 576 . March 5, 2023: Fixed URL to the \" Entity Reference Views fields \" docs; resolved issue 566 (commit 19b1c2e). March 2, 2023: Created drupal_8.5_and_lower tag. Users of Drupal 8.5 and earlier must use this version of Workbench. February 28, 2023 (commit 542325f): Resolved issue 569 . February 24, 2023: Added clean_csv_values_skip config setting (commit e659616e, issue 567 ). February 22, 2023: Resolved issue 563 ; Added csv_value_templates config setting (commit ae1fcd2b, issue 566 ). February 20, 2023 (commit 96cc6ef): Resolved issue 554 ; (commit a143bab): resolved issue 556 ). February 18, 2023 (commit ffa03de): Added csv_headers config option (issue 559 ). February 16, 2023 (commit 9a8828b): Removed sample config files from workbench directory (issue 552 ). Added new config option log_term_creation (commit 51348d0, issue 558 ). February 15, 2023 (commit 309c311): Added temp_dir config option (issue 551 ). February 14, 2023 (commit d200db6): Resolved issue (issue 553 ). February 11, 2023 (commit 869bd5b): Resolved issue (issue 547 ). Added rollback_dir config option (commit 1abad16, pull request 550 ). Updated PR template (commit a32e88f). February 5, 2023 (commit 65db118): Resolved issue (issue 538 ). January 31, 2023 (commit b452450): Resolved issue (issue 536 ). January 29, 2023 (commit cff6008): Added ability to generate a contact sheet (issue 515 ). January 26, 2023 (commit 6b0c16b): Added validation in --check of parent/child position in CSV file (issue 529 ); resolved issue 531 (commit 3150b4b). January 19, 2023 (commit b97b563): Fixed bug 522 and (commit 76d8c44) bug 523 ; changed log level from ERROR to WARNING when there are missing files and the allow_missing_files config option is set to true. January 18, 2023 (commit 727145f): Added validate_parent_node_exists config option (issue 521 ). January 17, 2023 (commit a4a5008): Added better trimming of trailing slash in the host config option (issue 519 ); (commit 1763fe6) fixed bug when \"field_member_of\" contained multiple values 520 . January 15, 2023 (commit ba149d6d): Added validation of extensions for files named in the CSV file column (issue 126 ); (commit 82dd02c) added validation of CSV values for \"List (text)\" type fields 509 . January 9, 2023 (commit a3931df): Added ability to create media track files (issue 373 ); fixed some integration tests. January 6, 2023 (commit f4e4c8d): Fixed issue 502 . December 31, 2022: Better cleanup when using remote files - @ajstanley's fix for issue 497 (commit a0412af), resolved issue 499 (commit b8f74c8). December 28, 2022 (commit e4e6e49): Fixed bug where running Workbench using a Google Sheet or Excel file as input without first running --check caused a \"file not found\" error (issue 496 ). Thanks to @ruebot for discovering this bug. December 11, 2022 (commit 24b70fd): Added ability to export files along with CSV data (issue 492 ). December 5, 2022 (commit 0dbd459): Fixed bug in file closing when running --check during \"get_data_from_view\" tasks on Windows (issue 490 ). November 28, 2022 (commit 46cfc34): Added quick delete option for nodes and media (issue 488 ). 
November 24, 2022 (commit 3fe5c28): Extracted text media now have their \"field_edited_text\" field automatically populated with the contents of the specified text file (issue 407 ). November 22, 2022 (commit 74a83cf): Added more detailed logging on node, media, and file creation (issue 480 ). November 22, 2022 (commit f2a8a65): Added @DonRichards Dockerfile (PR 233 ). November 16, 2022 (commit 07a74b2): Added new config options path_to_python and path_to_workbench_script (issue 483 ). November 9, 2022 (commit 7c3e072): Fixed misspelling of \"preprocessed\" in code and temporary filenames (issue 482 ). November 1, 2022 (commit 7c3e072): Workbench now exits when run without --check and there are no records in the input CSV (issue 481 ). September 19, 2022 (commit 51c0f79): Replaced exit_on_first_missing_file_during_check configuration option with strict_check (issue 470 ). exit_on_first_missing_file_during_check will be available until Nov. 1, 2022, at which time strict_check will be the only option allowed. September 18, 2022 (commit 00f50d6): Added ability to tell Workbench to only process a subset of CSV records (issue 468 ). September 1, 2022 (commit 6aad517): All hook scripts now log their exit codes (issue 464 ). August 16, 2022 (commit 4270d13): Fixed bug that would not delete media with no files (issue 460 ). August 13, 2022 (commit 1b7b801): Added ability to run shutdown scripts (issue 459 ). August 12, 2022 (commit b821533): Provided configuration option standalone_media_url: true for sites who have Drupal's \"Standalone media URL\" option enabled (issue 466 ). August 11, 2022 (commit df0a609): Fixed bug where items in secondary task CSV were created even if they didn't have a parent in the primary CSV, or if their parent was not created (issue 458 ). They are now skipped. July 28, 2022 (commit 3d1753a): Added option to prompt user for password (issue 449 ; fixed 'version' in setup.py). July 27, 2022 (commit 029cb6d): Shifted to using Drupal's default media URIs (issue 446 ). July 26, 2022 (commit 8dcf85a): Fixed setup.py on macOS/Homebrew (issue 448 ). July 26, 2022 (commit 09e9f53): Changed license in setup.py to \"MIT\".","title":"main branch (no tag/release)"},{"location":"changelog/#documentation","text":"December 3, 2024: Update docs on \" Applying CSV value templates to rows in your input CSV \"; updated docs on \" Rolling back nodes and media .\" December 1, 2024: Updated docs on \" Rolling back nodes and media \" to include new settings added in issue 855 . Updated docs on \" Field data applied to pages/children \" to include link to \" Applying CSV value templates to paged content .\" November 26, 2024: Resolved issue 853 ; added docs on new paged_content_ignore_files config setting. November 19, 2024: Updated docs on \" Creating media track files \"; updated \" Troubleshooting \" to include the new --print_config argument. November 12, 2024: Updated docs on \" Creating taxonomy terms \" and \" Updating taxonomy terms \" to include use of the published CSV column. November 11, 2024: Updated docs on \" Configuring media types \".newzealandpaul November 10, 2024: Updated docs on \" Rolling back nodes and media \" to include new settings, and added a dedicated section for rollbacks to the \" Configuration \" page. November 1, 2024: Updated docs on \" Adding alt text to images \" and \" Known limitations \". October 20, 2024: Updated docs on \" Creating redirects \". October 14, 2024: Added docs on populating the \" field_domain_access \" CSV column. 
Thanks for the docs @dara2! September 23, 2024: Updated docs on \" Checking configuration and input data \" to include new config setting check_lock_file_path . September 2, 2024: Added \" Prompting the user \". August 25, 2024: Updated \" Creating paged, compound, and collection content \" to document the new page_files_source_dir_field config setting. August 20, 2024: Added \" Processing or ignoring rows based on field values \". August 12, 2024: Added \" Overriding Workbench's default file extension to MIME type mappings \"; updated \" CSV value templates ; updated \" Taxonomy reference fields \". August 6, 2024: Updated \" Ingesting OCR (and other) files with page images \". July 16, 2024: Added \" Cross-environment deployment / Continuous Integration \". July 14, 2024: Added docs on the new protected_vocabularies config setting to \" Taxonomy reference fields \" and removed some cruft; added docs on \" Encoding of text files \". July 9, 2024: Added \" Using numbers as term names \". July 7, 2024: Added \" Using a local or remote .zip archive as input data \". July 4, 2024: Added \" Ingesting pages, their parents, and their 'grandparents' using a single CSV file \". July 3, 2024: Added \" Sharing configuration files with other applications \". June 30, 2024: Added docs on using values other than node IDs in field_member_of . June 20, 2024: Started docs on \" Creating redirects \". June 7, 2024: Updated docs on \" Exporting Islandora 7 content \". May 31, 2024: Some edits to \" Using subdirectories \" section of the docs on creating paged/compound content. May 21, 2024: Minor edit to \" Updating taxonomy terms \". May 20, 2024: Minor edit to \" Updating nodes ; updated docs to indicate that media_type configuration setting is now required for add_media and update_media tasks. April 22, 2024: Updated \" Hooks \" to document scripts/generate_iiif_manifests.py (from issue 771) and add some clarifications. April 17, 2024: Updated \" Hooks \" to be explicit about what Workbench configuration settings are available within external scripts. April 16, 2024: Updated \" Field data (Drupal and CSV) \" to clarify warning about Entity Reference Views fields; merged Rosie's changes to the docs on using Paragraphs . April 15, 2024: Resolved issue 748 ; updated docs to include new log_file_name_and_line_number config setting. April 14, 2024: Added \" Checking if nodes already exist \". Added documentation to \" Field data (Drupal and CSV) \" on configuring Views to allow using term names in Entity Reference Views fields. April 12, 2024: Added \"metadata maintenance\" section to the \" Workflows \" docs using Rosie Le Faive's excellent demonstration of round tripping metadata. April 8, 2024: Updated \" Field data (Drupal and CSV) \" to add documentation on Entity Reference Revisions fields (paragraphs). April 5, 2024: Updated \" Ignoring CSV rows and columns \" to describe using the new csv_rows_to_process config setting. April 2, 2024: Updated \" Choosing a task \"; updated \" Adding media to nodes \" to describe using DGI's Image Discovery module . February 21, 2024: Updated \" Development guide \". February 20, 2024: Updated \" Updating media \" to indicate that media_type is now a required configuration setting in update_media tasks. January 30, 2024: Added new docs on \" Using a Drupal View to generate a media report as CSV \". Also updated these docs to be clearer on the difference between Contextual Filters and Filter Criteria. 
January 24, 2024: Added new docs on \" Ingesting OCR (and other) files with page images \" and updated the \" Configuration \" page with the newly introduced config settings. January 17, 2023: Updated the \" Updating media \" docs to mention the update_mode config setting. January 14, 2024: Added promote to the \" Base fields \" docs; updated \" Updating media \". January 2, 2024: Updated the docs on allow_missing_files and perform_soft_checks . December 1, 2023: Updated the \" Updating media \" docs. November 28, 2023: Addressed issue 713 ; merged in @rosiel's https://github.com/mjordan/islandora_workbench_docs/pull/12. November 2, 2023: Updated the \" Troubleshooting \" page to include how to narrow down errors involving SSL certificates, and some additonal minor changes. October 29, 2023: Updated the docs on \" Assigning URL aliases \". September 13, 2023: Updated the docs on \" CSV preprocessor scripts \". September 1, 2023: Updated the \" Development guide \" page. August 21, 2023: Updated the \" Troubleshooting \" page to include how to eliminate Python \"InsecureRequestWarning\"s. August 16, 2023: Merged in @ysuarez's spelling fixes (issue 674 ). August 14, 2023: Update published entry in \" Base fields \" to allow media types to set their default published values. August 4, 2023: Removed published as a standalone configuration setting, and updated its entry in \" Base fields \". August 3, 2023: Documented the config settings query_csv_id_to_node_id_map_for_parents , ignore_duplicate_parent_ids , field_for_media_title , use_nid_in_media_title , use_node_title_for_media_title , use_node_title_for_remote_filename , use_nid_in_remote_filename , and field_for_remote_filename . Updated \" Using the CSV ID to node ID map \". August 2, 2023: Added mention of, and a screenshot showing, the DB Browser for SQLite to \" Using the CSV ID to node ID map \". Thanks for the tip @ajstanley! July 21, 2023: Updated \"Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file\" entry in \" Troubleshooting \". July 20, 2023: Updated \" Checking configuration and input data \" to include the new perform_soft_checks config setting. July 19, 2023: Clarified that update tasks require the content_type setting in their config files if the target Drupal content type is not islandora_object . July 18, 2023: Updates to the published entry in the \" Base fields \" documentation; added entry for perform_soft_checks to \" Configuration \" docs (note: this new setting replaces strict_check ). July 10, 2023: Updates to \" Creating paged, compound, and collection content \" to reflect changes in the CSV ID to node ID map, specifically the new ignore_existing_parent_ids config setting. June 30, 2023: Corrected entry in \" Configuration docs \" for the strict_check setting. June 28, 2023: Updated \" Configuration docs \" and \" Using a secondary task \" to include new query_csv_id_to_node_id_map_for_parents configuration setting. Also added a note to the \"id\" reserved column entry in the \" Field data docs \" about importance of using unique ID values. June 26, 2023: Updated the \" Configuration docs \" to include the new HTTP cache settings introduced in issue 608. June 4, 2023: Updated the \" Requirements and installation \" and \" Checking configuration and input data \" docs with instructions on calling Python explicitly on Macs using Homebrew. June 1, 2023: Updated the \" With page/child-level metadata \" section to clarify use of parent_id as per issue 595 . 
May 30, 2023: Updated the \" Using the CSV ID to node ID map \" section. May 29, 2023: Added the \" Using the CSV ID to node ID map \" section and a few associated updates elsewhere. May 22, 2023: Added \" Updating media \" and a few associated updates elsewhere. May 11, 2023: Updated \" Post-action hooks .\" May 8, 2023: Added \" When Workbench skips invalid CSV data .\" May 7, 2023: Added note to \" Text fields \" that Workbench will truncate CSV values for fields configured in Drupal as \"text\" data type and that have a maximum allowed length. May 5, 2023: Add \" Text fields with markup .\" May 1, 2023: Updated \" Updating nodes .\" April 26, 2023: Updated \" Exporting Islandora 7 content \"; added docs for the new mimetype_extensions config option. April 14, 2023: Updated \" Troubleshooting .\" March 28, 2023: Added \" Choosing a task .\" March 23, 2023: Updated \" Configuration \" and \" Base fields .\" March 23, 2023: Updated \" Assigning URL aliases .\" March 13, 2023: Updated \" Exporting Islandora 7 content .\" March 7, 2023: Updated \" How Workbench cleans your input data \"; updated \" Checking configuration and input data \". March 6, 2023: Added an entry for the require_entity_reference_views config setting to the \" Drupal settings \"; minor corrections and updates to \" Workbench's relationship to Drupal and Islandora \". March 4, 2023: Updated the \" Exporting Islandora 7 content \" page. March 2, 2023: Added mention of drupal_8.5_and_lower tag to \" Requirements and installation . February 28, 2023: Removed references to the iteration-utilities Python library; add new page \" Workbench's relationship to Drupal and Islandora \". February 27, 2023: Replaced \"Islandora 8\" with \"Islandora 2\". February 24, 2023: Added \" How Workbench cleans your input data \". February 22, 2023: Updated \" Preparing your data \"; added \" CSV value templates \". February 20, 2023: Updated \" Rolling back nodes and media \". February 18, 2023: Updated \" Field data (Drupal and CSV) \" to include new csv_headers setting. February 16, 2023: Updated \" Configuration \" to include new log_term_creation setting. February 15, 2023: Updated \" Configuration \" to include new temp_dir setting. February 11, 2023: Updated \" Troubleshooting \" and \" Rolling back nodes and media .\" February 5, 2023: Updated the \" Using subdirectories \" method of creating compound/paged content to explain using the new page_title_template config option. January 31, 2023: Updated \" Generating a contact sheet \"; updated \" Configuring media types \". January 30, 2023: Edits to the \" Using subdirectories \" method of creating compound/paged content to clarify the absence of the \"file\" CSV column. January 29, 2023: Added \" Generating a contact sheet \". January 23, 2023: Added example CSVs for primary and secondary tasks in the \" Case study \" section of the Workflows documentation. January 22, 2023: Several clarifications and corrections, including @rosiel's correction of how to use allow_missing_files and additional_files together; added some examples of planning large compound/paged content ingests . 
January 16, 2023: Updated \" Exporting Islandora 7 content .\" January 15, 2023: Updated \" Checking configuration and input data .\" January 9, 2023: Added docs for creating \" Media track files .\" January 8, 2023: Updated \" Known limitations \" with a work around for unsupported \"Filter by an entity reference View\" fields; added examples of valid Windows paths to \" Values in the 'file' column .\" December 29, 2022: Minor corrections to \" Known limitations \", \" Workflows \", \" Creating paged, compound, and collection content ,\" and \" Preparing your data .\" December 28, 2022: Added cross reference between \" CSV field templates \" and \" Ignoring CSV rows and columns \". December 17, 2022: Corrected URI for http://pcdm.org/use#OriginalFile on \" Generating CSV files \" and \" Configuration .\" December 11, 2022: Updated \" Generating CSV files \" and \" Output CSV settings \" Configuration docs to include new ability to export files along with CSV data. Note: the data_from_view_file_path setting in \"get_data_from_view\" tasks has been replaced with export_csv_file_path . November 28, 2022: Added \" Quick delete \" docs; added clarification to \" Configuring Drupal's media URLs \" that standalone_media_url: true must be in all config files for tasks that interact with media; added note to \" Adding media to nodes \" and \" Values in the 'file' column \" clarifying that it is not possible to override the filesystem a media's file field is configured to use. November 26, 2022: Changed documentation theme from readthedocs to material; some edits for clarity to the docs for \"file\" field values ; some edits for clarity to the docs for \" adaptive pause .\" November 24, 2022: Added note to \" Adding media to nodes \" and \" Adding multiple media \" about extracted text media; added a note about using absolute file paths in scheduled jobs to the \" Workflows \" and \" Troubleshooting \"; removed the \"required\" \u2714\ufe0f from the password configuration setting entry in the table in \" Configuration \". November 17, 2022: Added new config options path_to_python and path_to_workbench_script to \" Configuration \" docs. October 28, 2022: Updated \" Configuration \" docs to provide details on YAML (configuration file) syntax. September 19, 2022: Updated references to exit_on_first_missing_file_during_check to use strict_check . Configuration settings entry advises exit_on_first_missing_file_during_check will be removed Nov. 1, 2022. September 18, 2022: Added entry \" Ignoring CSV rows and columns .\" September 15, 2022: Added entry to \" Limitations \" page about lack of support for HTML markup. Also added a section on \"Password management\" to \" Requirements and installation \". September 8, 2022: Added documentation on \" Reducing Workbench's impact on Drupal .\" August 30, 2022: Updated \" Hooks \" docs to clarify that the HTTP response code passed to post-entity-create scripts is a string, not an integer. August 18, 2022: Updated standalone_media_url entry in the \" Configuration \" docs, and added brief entry to the \" Troubleshooting \" page about clearing Drupal's cache. August 13, 2022: Updated \" Configuration \" and \" Hooks \" page to describe shutdown scripts. August 11, 2022: Added text to \" Creating paged, compound, and collection content \" page to clarify what happens when a row in the secondary CSV does not have a matching row in the primary CSV. August 8, 2022: Added entry to \" Limitations \" page about support for \"Filter by an entity reference View\" fields. 
August 3, 2022: Added entry to \" Troubleshooting \" page about missing Microsoft Visual C++ error when installing Workbench on Windows. August 3, 2022: Updated the \" Limitations \" page with entry about Paragraphs. August 2, 2022: Added note about ownership requirements on files to \" Deleting nodes \"; was previously only on \"Deleting media\". July 28, 2022: Updated password entry in the \" Configuration \" docs to mention the new password prompt feature.","title":"Documentation"},{"location":"check/","text":"Overview You should always check your configuration and input prior to creating, updating, or deleting content. You can do this by running Workbench with the --check option, e.g.: ./workbench --config config.yml --check Note If you're on Windows, you will likely need to run Workbench by explicitly invoking Python, e.g. python workbench --config config.yml --check instead of using ./workbench as illustrated above. Similarly, if you are on a Mac that has the Homebrew version of Python, you may need to run Workbench by providing the full path to the Homebrew Python interpreter, e.g., /opt/homebrew/bin/python3 workbench --config config.yml --check . If you do this, Workbench will check the following conditions and report any errors that require your attention before proceeding: Configuration file Whether your configuration file is valid YAML (i.e., no YAML syntax errors). Whether your configuration file contains all required values. Connection to Drupal Whether your Drupal has the required Workbench Integration module enabled, and that the module is up to date. Whether the host you provided will accept the username and password you provided. Input directory Whether the directory named in the input_dir configuration setting exists. Rollback files Whether the rollback config file and CSV file can be written. CSV file Whether the CSV file is encoded in either ASCII or UTF-8. Whether each row contains the same number of columns as there are column headers. Whether there are any duplicate column headers. Whether your CSV file contains required columns headers, including the field defined as the unique ID for each record (defaults to \"id\" if the id_field key is not in your config file) Whether your CSV column headers correspond to existing Drupal field labels or machine names. Whether all Drupal fields that are configured to be required are present in the CSV file. Whether required fields in your CSV contain values (i.e., they are not blank). Whether the columns required to create paged content are present (see \"Creating paged content\" below). If creating compound/paged content using the \"With page/child-level metadata\" method, --check will tell you if any child item rows in your CSV precede their parent rows. If your config file includes csv_headers: labels , --check will tell you if it detects any duplicate field labels. Media files Whether the files named in the CSV file are present, or in the case of remote files, are accessible (but this check is skipped if allow_missing_files: true is present in your config file for \"create\" tasks). If nodes_only is true, this check is skipped. Whether files in the file CSV column have extensions that are registered with the media's file field in Drupal. Note that validation of file extensions does not yet apply to files named using the additional_files configuration or for remote files (see this issue for more info). Whether the media types configured for specific file extensions are configured on the target Drupal. 
Islandora Workbench will default to the 'file' media type if it can't find another more specific media type for a file, so the most likely cause for this check to fail is that the assigned media type does not exist on the target Drupal. If creating media track files , --check will tell you if your media_use_tid value (either in the media_use_tid configuration setting or in row-level values in your CSV) does not include the \"Service File\" taxonomy term. Field values Base fields If the langcode field is present in your CSV, whether values in it are valid Drupal language codes. Whether your CSV file contains a title field ( create task only) Whether values in the title field exceed Drupal's maximum length for titles of 255 characters, or whatever the value of the max_node_title_length configuration setting is. If the created field is present in your CSV file, whether the values in it are formatted correctly (like \"2020-11-15T23:49:22+00:00\") and whether the date is in the past (both of which are Drupal requirements). If the uid field is present in your CSV file, whether the user IDs in that field exist in the target Drupal. Note that this check does not inspect permissions or roles, only that the user ID exists. Whether aliases in the url_alias field in your CSV already exist, and whether they start with a leading slash ( / ). Taxonomy Whether term ID and term URIs used in CSV fields correspond to existing terms. Whether the length of new terms exceeds 255 characters, which is the maximum length for a term name. Whether the term ID (or term URI) provided for media_use_tid is a member of the \"Islandora Media Use\" vocabulary. Whether term names in your CSV require a vocabulary namespace. Typed Relation fields Whether values used in typed relation fields are in the required format Whether values need to be namespaced Whether the term IDs/term names/term URIs used in the values exist in the vocabularies configured for the field. \"List\" text fields Whether values in CSV fields of this Drupal field type are in the field's configured \"Allowed values list\". If using the pages from directories configuration ( paged_content_from_directories: true ): Whether page filenames contain an occurrence of the sequence separator. Whether any page directories are empty. Whether the content type identified in the content_type configuration option exists. Whether multivalued fields exceed their allowed number of values. Whether values in text-type fields exceed their configured maximum length. Whether the nodes referenced in field_member_of (if that field is present in the CSV) exist. Whether values used in geolocation fields are valid lat,long coordinates. Whether values used in EDTF fields are valid EDTF date/time values (subset of date/time values only; see documentation for more detail). Also validates whether dates are valid Gregorian calendar dates. Hook scripts Whether registered bootstrap, preprocessor, and post-action scripts exist and are executable. If Workbench detects a configuration or input data violation, it will either stop and tell you why it stopped, or (if the violation will not cause Workbench's interaction with Drupal to fail), tell you that it found an anomaly and to check the log file for more detail. 
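For example, after running --check you can quickly scan the Workbench log for anything that needs your attention. This is only a minimal sketch, assuming your log file is named workbench.log (the actual name and location are controlled by the log_file_path configuration setting): grep -Ei 'warning|error' workbench.log If this command prints nothing, the last --check run logged no warnings or errors.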
Note Adding perform_soft_checks: true to your configuration file will tell --check to not stop when it encounters an error with 1) parent/child order in your input CSV, 2) file extensions in your CSV that are not registered with the Drupal configuration of their media type's file fields, or 3) invalid EDTF dates. Workbench will continue to log all of these errors, but will exit after it has checked every row in your CSV file. A successful outcome of running --check confirms that all of the conditions listed above are in place, but it does not guarantee a successful job. There are a lot of factors in play during ingest/update/delete interactions with Drupal that can't be checked in advance, most notably network stability, load on the Drupal server, or failure of an Islandora microservice. But in general --check will tell you if there's a problem that you can investigate and resolve before proceeding with your task. Typical (and recommended) Islandora Workbench usage You will probably need to run Workbench using --check a few times before you will be ready to run it without --check and commit your data to Islandora. For example, you may need to correct errors in taxonomy term IDs or names, fix errors in media filenames, or wrap values in your CSV files in quotation marks. It's also a good idea to check the Workbench log file after running --check . All warnings and errors are printed to the console, but the log file may contain additional information or detail that will help you resolve issues. Once you have used --check to detect all of the problems with your CSV data, committing it to Islandora will work very reliably. Also, it is good practice to check your log after each time you run Islandora Workbench, since it may contain information that is not printed to the console. Prompting the user to run --check As described elsewhere , you can configure Workbench to prompt the user to remind them to run --check . To do so, include remind_user_to_run_check: true in your config file. If this setting is present, the user will be prompted \"Have you run --check? (y/n)\". Responding \"y\" resumes normal operation, \"n\" (or any other response) causes Workbench to exit. Note that this setting does not force the user to run --check , it merely asks them if they have run it. Requiring --check You can require a successful --check to have been run by including check_lock_file_path with the name (or path) of a file as its value, for example check_lock_file_path: checklock.txt . If this setting is present in your config file, and it has a file name or path as its value, when Workbench is run without --check using the same configuration file, it will look for the \"lock\" file. It compares the data in this lock file with an expected value, and if they are the same, Workbench executes normally. If they differ, Workbench logs this difference and exits. If --check has detected any errors, Workbench will not execute. This can be useful if you want to force the user to run --check, or if you are running Workbench in a scheduled or scripted environment and you want to only execute Workbench if --check has been successful.","title":"Checking configuration and input data"},{"location":"check/#overview","text":"You should always check your configuration and input prior to creating, updating, or deleting content.
You can do this by running Workbench with the --check option, e.g.: ./workbench --config config.yml --check Note If you're on Windows, you will likely need to run Workbench by explicitly invoking Python, e.g. python workbench --config config.yml --check instead of using ./workbench as illustrated above. Similarly, if you are on a Mac that has the Homebrew version of Python, you may need to run Workbench by providing the full path to the Homebrew Python interpreter, e.g., /opt/homebrew/bin/python3 workbench --config config.yml --check . If you do this, Workbench will check the following conditions and report any errors that require your attention before proceeding: Configuration file Whether your configuration file is valid YAML (i.e., no YAML syntax errors). Whether your configuration file contains all required values. Connection to Drupal Whether your Drupal has the required Workbench Integration module enabled, and that the module is up to date. Whether the host you provided will accept the username and password you provided. Input directory Whether the directory named in the input_dir configuration setting exists. Rollback files Whether the rollback config file and CSV file can be written. CSV file Whether the CSV file is encoded in either ASCII or UTF-8. Whether each row contains the same number of columns as there are column headers. Whether there are any duplicate column headers. Whether your CSV file contains required columns headers, including the field defined as the unique ID for each record (defaults to \"id\" if the id_field key is not in your config file) Whether your CSV column headers correspond to existing Drupal field labels or machine names. Whether all Drupal fields that are configured to be required are present in the CSV file. Whether required fields in your CSV contain values (i.e., they are not blank). Whether the columns required to create paged content are present (see \"Creating paged content\" below). If creating compound/paged content using the \"With page/child-level metadata\" method, --check will tell you if any child item rows in your CSV precede their parent rows. If your config file includes csv_headers: labels , --check will tell you if it detects any duplicate field labels. Media files Whether the files named in the CSV file are present, or in the case of remote files, are accessible (but this check is skipped if allow_missing_files: true is present in your config file for \"create\" tasks). If nodes_only is true, this check is skipped. Whether files in the file CSV column have extensions that are registered with the media's file field in Drupal. Note that validation of file extensions does not yet apply to files named using the additional_files configuration or for remote files (see this issue for more info). Whether the media types configured for specific file extensions are configured on the target Drupal. Islandora Workbench will default to the 'file' media type if it can't find another more specific media type for a file, so the most likely cause for this check to fail is that the assigned media type does not exist on the target Drupal. If creating media track files , --check will tell you if your media_use_tid value (either in the media_use_tid configuration setting or in row-level values in your CSV) does not include the \"Service File\" taxonomy term. Field values Base fields If the langcode field is present in your CSV, whether values in it are valid Drupal language codes. 
Whether your CSV file contains a title field ( create task only) Whether values in the title field exceed Drupal's maximum length for titles of 255 characters, or whatever the value of the max_node_title_length configuration setting is. If the created field is present in your CSV file, whether the values in it are formatted correctly (like \"2020-11-15T23:49:22+00:00\") and whether the date is in the past (both of which are Drupal requirements). If the uid field is present in your CSV file, whether the user IDs in that field exist in the target Drupal. Note that this check does not inspect permissions or roles, only that the user ID exists. Whether aliases in the url_alias field in your CSV already exist, and whether they start with a leading slash ( / ). Taxonomy Whether term ID and term URIs used in CSV fields correspond to existing terms. Whether the length of new terms exceeds 255 characters, which is the maximum length for a term name. Whether the term ID (or term URI) provided for media_use_tid is a member of the \"Islandora Media Use\" vocabulary. Whether term names in your CSV require a vocabulary namespace. Typed Relation fields Whether values used in typed relation fields are in the required format Whether values need to be namespaced Whether the term IDs/term names/term URIs used in the values exist in the vocabularies configured for the field. \"List\" text fields Whether values in CSV fields of this Drupal field type are in the field's configured \"Allowed values list\". If using the pages from directories configuration ( paged_content_from_directories: true ): Whether page filenames contain an occurrence of the sequence separator. Whether any page directories are empty. Whether the content type identified in the content_type configuration option exists. Whether multivalued fields exceed their allowed number of values. Whether values in text-type fields exceed their configured maximum length. Whether the nodes referenced in field_member_of (if that field is present in the CSV) exist. Whether values used in geolocation fields are valid lat,long coordinates. Whether values used in EDTF fields are valid EDTF date/time values (subset of date/time values only; see documentation for more detail). Also validates whether dates are valid Gregorian calendar dates. Hook scripts Whether registered bootstrap, preprocessor, and post-action scripts exist and are executable. If Workbench detects a configuration or input data violation, it will either stop and tell you why it stopped, or (if the violation will not cause Workbench's interaction with Drupal to fail), tell you that it found an anomaly and to check the log file for more detail. Note Adding perform_soft_checks: true to you configuration file will tell --check to not stop when it encounters an error with 1) parent/child order in your input CSV, 2) file extensions in your CSV that are not registered with the Drupal configuration of their media type's file fields, or 3) invalid EDTF dates. Workbench will continue to log all of these errors, but will exit after it has checked every row in your CSV file. A successful outcome of running --check confirms that all of the conditions listed above are in place, but it does not guarantee a successful job. There are a lot of factors in play during ingest/update/delete interactions with Drupal that can't be checked in advance, most notably network stability, load on the Drupal server, or failure of an Islandora microservice. 
But in general --check will tell you if there's a problem that you can investigate and resolve before proceeding with your task.","title":"Overview"},{"location":"check/#typical-and-recommended-islandora-workbench-usage","text":"You will probably need to run Workbench using --check a few times before you will be ready to run it without --check and commit your data to Islandora. For example, you may need to correct errors in taxonomy term IDs or names, fix errors in media filenames, or wrap values in your CSV files in quotation marks. It's also a good idea to check the Workbench log file after running --check . All warnings and errors are printed to the console, but the log file may contain additional information or detail that will help you resolve issues. Once you have used --check to detect all of the problems with your CSV data, committing it to Islandora will work very reliably. Also, it is good practice to check your log after each time you run Islandora Workbench, since it may contain information that is not printed to the console.","title":"Typical (and recommended) Islandora Workbench usage"},{"location":"check/#prompting-the-user-to-run-check","text":"As described elsewhere , you can configure Workbench to prompt the user to remind them to run --check . To do so, include remind_user_to_run_check: true in your config file. If this setting is present, the user will be prompted \"Have you run --check? (y/n)\". Responding \"y\" resumes normal operation, \"n\" (or any other response) causes Workbench to exit. Note that this setting does not force the user to run --check , it merely asks them if they have run it.","title":"Prompting the user to run --check"},{"location":"check/#requiring-check","text":"You can require a successful --check to have been run by including check_lock_file_path with the name (or path) of a file as its value, for example check_lock_file_path: checklock.txt . If this setting is present in your config file, and it has a file name or path as its value, when Workbench is run without --check using the same configuration file, it will look for the \"lock\" file. It compares the data in this lock file with an expected value, and if they are the same, Workbench executes normally. If they differ, Workbench logs this difference and exits. If --check has detected any errors, Workbench will not execute. This can be useful if you want to force the user to run --check, or if you are running Workbench in a scheduled or scripted environment and you want to only execute Workbench if --check has been successful.","title":"Requiring --check"},{"location":"checking_if_nodes_exist/","text":"In create tasks, you can configure Workbench to query Drupal to determine if a node exists, using the values in a specified field (referred to as the \"lookup field\" below) in your input CSV, such as field_local_identifier . If Workbench finds a node with a matching value in that field, it will skip to the next CSV row and not create the duplicate node. This feature is useful if you create a subset of items as a test or quality assurance step before loading all items in your CSV, possibly using the csv_rows_to_process configuration setting. Another situation in which this might be useful is if some of the nodes represented in your CSV failed to be created; you can fix the problems with those rows and simply rerun the entire batch without having to worry about removing the successful rows. Warning For this to work, the values in the lookup field need to be unique to each node.
In other words, two or more nodes should not have the same value in this field. Another assumption Workbench makes is that the values do not contain any spaces. They can, however, contain underscores, hyphens, and colons. Creating the required View To use this feature, you first need to create a View that Workbench will query. This might seem like a lot of setup, but you can probably use the same View (and corresponding Workbench configuration, as illustrated below) over time for many create tasks. Create a new \"Content\" View that has a REST Export display (no other display types are needed). In the Format configuration, choose \"Serializer\" and under Settings, check the \"Force using fields\" box and the \"json\" format. In the Fields configuration, choose \"Content: ID\" (i.e., the node ID). Go to the Other configuration (on the right side of the View configuration) and in the Use Aggregation configuration, choose \"Yes\". If asked for \"Aggregation type\", select \"Group results together\". In the Filter criteria configuration, add the field in your content type that you want to use as the lookup field, e.g. Local Identifier. Check \"Expose this filter\". Choose \"Single filter\". In the Operator configuration, select \"Contains any word\" and in the Value field, enter the machine name of your field (e.g. field_local_identifier ) In the Filter identifier section, enter the name of the URL parameter requests to this View will use to identify the CSV values. You should use the same string that you used in Operator configuration (which is also the same as your field's machine name), e.g. field_local_identifier . In the Path settings, provide a path, e.g. local_identifier_lookup (do not add the leading / ) Assign this path Basic authentication. Access should be by Permission > View published content. In the Pager settings, choose \"Display all items\". Save your View. Here is a screenshot of an example View's configuration: If you have curl installed on your system, you can test your new REST export View: curl -v -uadmin:password -H \"Content-Type: application/json\" \"https://islandora.traefik.me/local_identifier_lookup?field_local_identifier=sfu_test_001\" (In this example, the REST export's \"Path\" is local_identifier_lookup . Immediately after the ? comes the filter identifier string configured above, with a value after the = from a node's Local Identifier field.) If testing with curl, change the example above to incorporate your local configuration and add a known value from the lookup field. If your test is successful, the query will return a single node ID in a structure like [{\"nid\":\"126\"}] . If the query can't find any nodes, the returned data will look like [] . If it finds more than one node, the structure will look like [{\"nid\":\"125\"},{\"nid\":\"247\"}] . If you don't have curl, don't worry. --check will confirm that the configured View REST export URL is accessible and that the configured CSV column is in your input CSV. Configuring Workbench With the View configured and tested, you can now use this feature in Workbench by adding the following configuration block: node_exists_verification_view_endpoint: - field_local_identifier: /local_identifier_lookup In this sample configuration we tell Workbench to query a View at the path /local_identifier_lookup using the filter identifier/field name field_local_identifier . Note that you can only have a single CSV field-to-View-path mapping here. If you include more than one, only the last one is used.
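For reference, a minimal create task configuration file using this feature might look something like the following sketch (the host, username, and password values are placeholders borrowed from other examples in this documentation and should be replaced with your own): task: create host: https://islandora.traefik.me username: admin password: islandora node_exists_verification_view_endpoint: - field_local_identifier: /local_identifier_lookup The only addition to an ordinary create configuration is the node_exists_verification_view_endpoint block.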
Nothing special is required in the input CSV; the Workbench configuration block above is all you need to do. However, note that: the field you choose as your lookup field should be a regular metadata field (e.g. \"Local identifier\"), and not the id field required by Workbench. However, there is nothing preventing you from configuring Workbench (through the id_field setting) to use as its ID column the same field you have configured as your lookup field. the data in the CSV field can be multivalued. Workbench will represent the multiple values in a way that works with the \"Contains any word\" option in your View's Filter configuration. as noted above, for this feature to work, the CSV values in the lookup field cannot be used by more than one node and they cannot contain spaces. Your CSV can look like this: file,id,title,field_model,field_local_identifier IMG_1410.tif,01,Small boats in Havana Harbour,Image,sfu_id_001 IMG_2549.jp2,02,Manhatten Island,Image,sfu_id_002 IMG_2940.JPG,03,Looking across Burrard Inlet,Image,sfu_id_003|special_collections:9362 IMG_2958.JPG,04,Amsterdam waterfront,Image,sfu_id_004 IMG_5083.JPG,05,Alcatraz Island,Image,sfu_id_005 Before it creates a node, Workbench will use data from the CSV column specified on the left-hand side of the node_exists_verification_view_endpoint configuration to query the corresponding View endpoint. If it finds no match, it will create the node as usual; if it finds a single match, it will skip that CSV row and log that it has done so; if it finds more than one match, it will also skip the CSV row and not create a node, and it will log that it has done this.","title":"Checking if nodes already exist"},{"location":"checking_if_nodes_exist/#creating-the-required-view","text":"To use this feature, you first need to create a View that Workbench will query. This might seem like a lot of setup, but you can probably use the same View (and corresponding Workbench configuration, as illustrated below) over time for many create tasks. Create a new \"Content\" View that has a REST Export display (no other display types are needed). In the Format configuration, choose \"Serializer\" and under Settings, check the \"Force using fields\" box and the \"json\" format. In the Fields configuration, choose \"Content: ID\" (i.e., the node ID). Go to the Other configuration (on the right side of the View configuration) and in the Use Aggregation configuration, choose \"Yes\". If asked for \"Aggregation type\", select \"Group results together\". In the Filter criteria configuration, add the field in your content type that you want to use as the lookup field, e.g. Local Identifier. Check \"Expose this filter\". Choose \"Single filter\". In the Operator configuration, select \"Contains any word\" and in the Value field, enter the machine name of your field (e.g. field_local_identifier ) In the Filter identifier section, enter the name of the URL parameter requests to this View will use to identify the CSV values. You should use the same string that you used in Operator configuration (which is also the same as your field's machine name), e.g. field_local_identifier . In the Path settings, provide a path, e.g. local_identifier_lookup (do not add the leading / ) Assign this path Basic authentication. Access should be by Permission > View published content. In the Pager settings, choose \"Display all items\". Save your View. 
Here is a screenshot of an example View's configuration: If you have curl installed on your system, you can test your new REST export View: curl -v -uadmin:password -H \"Content-Type: application/json\" \"https://islandora.traefik.me/local_identifier_lookup?field_local_identifier=sfu_test_001\" (In this example, the REST export's \"Path\" is local_identifier_lookup . Immediately after the ? comes the filter identifier string configured above, with a value after the = from a node's Local Identifier field.) If testing with curl, change the example above to incorporate your local configuration and add a known value from the lookup field. If your test is successful, the query will return a single node ID in a structure like [{\"nid\":\"126\"}] . If the query can't find any nodes, the returned data will look like [] . If it finds more than one node, the structure will look like [{\"nid\":\"125\"},{\"nid\":\"247\"}] . If you don't have curl, don't worry. --check will confirm that the configured View REST export URL is accessible and that the configured CSV column is in your input CSV.","title":"Creating the required View"},{"location":"checking_if_nodes_exist/#configuring-workbench","text":"With the View configured and tested, you can now use this feature in Workbench by adding the following configuration block: node_exists_verification_view_endpoint: - field_local_identifier: /local_identifier_lookup In this sample configuration we tell Workbench to query a View at the path /local_identifier_lookup using the filter identifier/field name field_local_identifier . Note that you can only have a single CSV field-to-View-path mapping here. If you include more than one, only the last one is used. Nothing special is required in the input CSV; the Workbench configuration block above is all you need to do. However, note that: the field you choose as your lookup field should be a regular metadata field (e.g. \"Local identifier\"), and not the id field required by Workbench. However, there is nothing preventing you from configuring Workbench (through the id_field setting) to use as its ID column the same field you have configured as your lookup field. the data in the CSV field can be multivalued. Workbench will represent the multiple values in a way that works with the \"Contains any word\" option in your View's Filter configuration. as noted above, for this feature to work, the CSV values in the lookup field cannot be used by more than one node and they cannot contain spaces. Your CSV can look like this: file,id,title,field_model,field_local_identifier IMG_1410.tif,01,Small boats in Havana Harbour,Image,sfu_id_001 IMG_2549.jp2,02,Manhatten Island,Image,sfu_id_002 IMG_2940.JPG,03,Looking across Burrard Inlet,Image,sfu_id_003|special_collections:9362 IMG_2958.JPG,04,Amsterdam waterfront,Image,sfu_id_004 IMG_5083.JPG,05,Alcatraz Island,Image,sfu_id_005 Before it creates a node, Workbench will use data from the CSV column specified on the left-hand side of the node_exists_verification_view_endpoint configuration to query the corresponding View endpoint. If it finds no match, it will create the node as usual; if it finds a single match, it will skip that CSV row and log that it has done so; if it finds more than one match, it will also skip the CSV row and not create a node, and it will log that it has done this.","title":"Configuring Workbench"},{"location":"choosing_a_task/","text":"The task configuration setting defines the specific work you want Workbench to perform.
This table may help you choose when to use a specific task: If you want to Then use this task Start with this documentation Create nodes from CSV and, optionally, attached media create Preparing your data , Field data (Drupal and CSV) Create basic nodes without using CSV, and attach media create_from_files Creating nodes from files Update node field data update Updating nodes Delete nodes and, optionally, their attached media delete Deleting nodes Add media to existing nodes using a list of node IDs add_media Adding media to nodes Update media field data update_media Updating media Replace files, including media track files, attached to media update_media Updating media Delete media using a list of media IDs delete_media Deleting media Delete media using a list of node IDs delete_media_by_node Deleting media Export node field data using a list of node IDs export_csv Generating CSV files Export node field data using a Drupal View get_data_from_view Generating CSV files Export a media report using a Drupal View get_media_report_from_view Generating CSV files Populate a vocabulary from CSV create_terms Creating taxonomy terms Update terms in a vocabulary from CSV update_terms Updating taxonomy terms Create URL redirects create_redirects Creating redirects","title":"Choosing a task"},{"location":"configuration/","text":"The configuration file Workbench uses a YAML configuration whose location is indicated in the --config argument. This file defines the various options it will use to create, update, or delete Islandora content. The simplest configuration file needs only the following four options: task: create host: \"http://localhost:8000\" username: admin password: islandora In this example, the task being performed is creating nodes (and optionally media). Other tasks are create_from_files , update , delete , add_media , update_media , and delete_media . Some of the configuration settings documented below are used in all tasks, while others are only used in specific tasks. Configuration settings The settings defined in a configuration file are documented in the tables below, grouped into broad functional categories for easier reference. The order of the options in the configuration file doesn't matter, and settings do not need to be grouped together in any specific way in the configuration file. Use of quotation marks Generally speaking, you do not need to use quotation marks around values in your configuration file. You may wrap values in quotation marks if you wish, and many examples in this documentation do that (especially the host setting), but the only values that should not be wrapped in quotation marks are those that take true or false as values because in YAML, and many other computer/markup languages, \"true\" is a string (in this case, an English word that can mean many things) and true is a reserved symbol that can mean one thing and one thing only, the boolean opposite of false (I'm sorry for this explanation, I can't describe the distinction in any other way without writing a primer on symbolic logic). 
For example, the following is a valid configuration file: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: true csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 But the following version is not valid, since there are quotes around \"true\" in the nodes_only setting: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: \"true\" csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 Use of spaces and other syntactical features Configuration setting names should start a new line and not have any leading spaces. The exception is illustrated in the values of the csv_field_templates setting in the above examples, where the setting's value is a list of other values. In this case the members of the list start with a dash and a space ( - ). The trailing space in these values is significant. (However, the leading space before the dash is insignificant, and is used for appearance only.) For example, this snippet is valid: csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 whereas this one is not: csv_field_templates: -field_linked_agent: relators:aut:person:Jordan, Mark -field_model: 25 Some setting values are represented in Workbench documentation using square brackets, like this one: export_csv_field_list: ['field_description', 'field_extent'] Strictly speaking, YAML lists can be represented as either a series of entries on their own lines that start with - or as entries enclosed in [ and ] . It's best to follow the examples provided throughout the Workbench documentation. Required configuration settings Setting Required Default value Description task \u2714\ufe0f One of 'create', 'create_from_files', 'update', delete', 'add_media', 'delete_media', 'update_media', 'export_csv', 'get_data_from_view', 'create_terms', or 'delete_media_by_node'. See \" Choosing a task \" for more information. host \u2714\ufe0f The hostname, including http:// or https:// of your Islandora repository, and port number if not the default 80. username \u2714\ufe0f The username used to authenticate the requests. This Drupal user should be a member of the \"Administrator\" role. If you want to create nodes that are owned by a specific Drupal user, include their numeric user ID in the uid column in your CSV. password The user's password. You can also set the password in your ISLANDORA_WORKBENCH_PASSWORD environment variable. If you do this, omit the password option in your configuration file. If a password is not available in either your configuration file or in the environment variable, Workbench will prompt for a password. Drupal settings Setting Required Default value Description content_type islandora_object The machine name of the Drupal node content type you are creating or updating. Required in \"create\" and \"update\" tasks. drupal_filesystem fedora:// One of 'fedora://', 'public://', or 'private://' (the wrapping quotation marks are required). Only used with Drupal 8.x - 9.1; starting with Drupal 9.2, the filesystem is automatically detected from the media's configuration. Will eventually be deprecated. allow_adding_terms false In create , update , add_media , update_media , create_terms , and update_terms tasks, determines if Workbench will add taxonomy terms if they do not exist in the target vocabulary. See more information in the \" Taxonomy reference fields \" section. 
Note: this setting is not required in create_terms tasks unless you are adding new terms to a taxonomy reference field on the term entries. protected_vocabularies [] (empty list) Allows you to exclude vocabularies from having new terms added via allow_adding_terms . See more information in the \" Taxonomy reference fields \" section. vocab_id \u2714\ufe0f in create_terms tasks. Identifies the vocabulary you are adding terms to in create_terms tasks. See more information in the \" Creating taxonomy terms \" section. update_mode replace Determines if Workbench will replace , append (add to) , or delete field values during update tasks. See more information in the \" Updating nodes \" section. Also applies to update_media tasks. validate_terms_exist true If set to false, during --check Workbench will not query Drupal to determine if taxonomy terms exist. The structure of term values in the CSV is still validated; this option only tells Workbench to not check for each term's existence in the target Drupal. Useful to speed up the --check process if you know terms don't exist in the target Drupal. validate_parent_node_exists true If set to false, during --check Workbench will not query Drupal to determine if nodes whose node IDs are in field_member_of exist. Useful to speed up the --check process if you know the parent nodes already exist in the target Drupal. max_node_title_length 255 Set to the number of allowed characters for node titles if your Drupal uses Node Title Length . If you are unsure of the maximum node title length your site allows, check the length of the \"title\" column in your Drupal database's \"node_field_data\" table. list_missing_drupal_fields false Set to true to tell Workbench to provide a list of fields that exist in your input CSV but that cannot be matched to Drupal field names (or reserved column names such as \"file\"). If false , Workbench will still check for CSV column headers that it can't match to Drupal fields, but will exit upon finding the first such field. This option produces a list of fields instead of exiting on detecting the first field. standalone_media_url false Set to true if your Drupal instance has the \"Standalone media URL\" option at /admin/config/media/media-settings checked. The Drupal default is to have this unchecked, so you only need to use this Workbench option if you have changed Drupal's default. More information is available. require_entity_reference_views true Set to false to tell Workbench to not require a View to expose the values in an entity reference field configured to use an Entity Reference View. Additional information is available here . entity_reference_view_endpoints A list of mappings from Drupal/CSV field names to Views REST Export endpoints used to look up term names for entity reference fields configured to use an Entity Reference View. Additional information is available here . text_format_id basic_html The text format ID (machine name) to apply to all Drupal text fields that have a \"formatted\" field type. See \" Text fields with markup \" for more information. field_text_format_ids Defines a mapping between field machine names and the machine names of format IDs for \"formatted\" fields. See \" Text fields with markup \" for more information. paragraph_fields Defines the structure of paragraph fields in the input CSV. See \" Entity Reference Revisions fields (paragraphs) \" for more information.
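To illustrate how several of the Drupal settings above combine in practice, here is a minimal sketch of a create task configuration. The vocabulary name and text format are placeholders rather than recommendations, and protected_vocabularies uses the YAML list style shown earlier in this page:

task: create
host: http://localhost:8000
username: admin
password: islandora
# Machine name of the content type being created (islandora_object is the default).
content_type: islandora_object
# Let Workbench create missing taxonomy terms, except in the vocabulary listed
# under protected_vocabularies (a placeholder vocabulary ID for illustration).
allow_adding_terms: true
protected_vocabularies:
 - islandora_models
# Skip per-term existence checks during --check to speed it up.
validate_terms_exist: false
# Apply this text format to all "formatted" text fields (basic_html is the default).
text_format_id: full_html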
Input data location settings Setting Required Default value Description input_dir input_data The full or relative path to the directory containing the files and metadata CSV file. input_csv metadata.csv Path to the CSV metadata file. Can be absolute, or if just the filename is provided, will be assumed to be in the directory named in input_dir . Can also be the URL to a Google spreadsheet (see the \" Using Google Sheets as input data \" section for more information). google_sheets_csv_filename google_sheet.csv Local CSV filename for data from a Google spreadsheet. See the \" Using Google Sheets as input data \" section for more information. google_sheets_gid 0 The \"gid\" of the worksheet to use in a Google Sheet. See \" Using Google Sheets as input data \" section for more information. excel_worksheet Sheet1 If using an Excel file as your input CSV file, the name of the worksheet that the CSV data will be extracted from. input_data_zip_archives [] List of local file paths and/or URLs to .zip archives to extract into the directory defined in input_dir . See \" Using a local or remote .zip archive as input data \" for more info. delete_zip_archive_after_extraction true Tells Workbench to delete an inpu zip archive after it has been extracted. Input CSV file settings Setting Required Default value Description id_field id The name of the field in the CSV that uniquely identifies each record. delimiter , [comma] The delimiter used in the CSV file, for example, \",\" or \"\\t\" (must use double quotes with \"\\t\"). If omitted, defaults to \",\". subdelimiter | [pipe] The subdelimiter used in the CSV file to define multiple values in one field. If omitted, defaults to \"|\". Can be a string of multiple characters, e.g. \"^^^\". csv_field_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding values that are copied into the CSV input file. More detail provided in the \" CSV field templates \" section. csv_value_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding templates. More detail provided in the \" CSV value templates \" section. csv_value_templates_for_paged_content Used in create tasks only. Similar to csv_value_templates but applies to paged/child items created using the \" Using subdirectories \" method of creating paged content. More detail provided in the \" CSV value templates \" section. csv_value_templates_rand_length 5 Length of the $random_alphanumeric_string and $random_number_string variables CSV value templates. More detail provided in the \" CSV value templates \" section. allow_csv_value_templates_if_field_empty [] List of fields to populate with CSV value templates if the CSV field is empty. More detail provided in the \" CSV value templates \" section. ignore_csv_columns Used in the create and update tasks only. A list of CSV column headers that Workbench should ignore. For example, ignore_csv_columns: [Target Collection, Ready to publish] csv_start_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) before the designated row number. More information is available. csv_stop_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) after the designated row number. More information is available. csv_rows_to_process Used in create and update tasks. 
Tells Workbench to process only the rows/records in input CSV (or Google Sheet or Excel) with the specified \"id\" column values. More information is available. csv_headers names Used in \"create\", \"update\" and \"create_terms\" tasks. Set to \"labels\" to allow use of field labels (as opposed to machine names) as CSV column headers. clean_csv_values_skip [] (empty list) Used in all tasks that use CSV input files. See \" How Workbench cleans your input data \" for more information. columns_with_term_names [] (empty list) Used in all tasks that allow creation of terms on entity ingest. See \" Using numbers as term names \" for more information. Output CSV settings See \" Generating CSV files \" section for more information. Setting Required Default value Description output_csv Used in \"create\" tasks. The full or relative (to the \"workbench\" script) path to a CSV file with one record per node created by Workbench. output_csv_include_input_csv false Used in \"create\" tasks in conjunction with output_csv . Include in the output CSV all the fields (and their values) from the input CSV. export_csv_term_mode tid Used in \"export_csv\" tasks to indicate whether vocabulary term IDs or names are included in the output CSV file. Set to \"tid\" (the default) to include term IDs, or set to \"name\" to include term names. See \" Exporting field data into a CSV file \" for more information. export_csv_field_list [] (empty list) List of fields to include in exported CSV data. If empty, all fields will be included. See \" Using a Drupal View to identify content to export as CSV \" for more information. view_parameters List of URL parameter/value strings to include in requests to a View. See \" Using a Drupal View to identify content to export as CSV \" for more information. export_csv_file_path \u2714\ufe0f in get_data_from_view tasks. Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the exported CSV file. Required in the \"get_data_from_view\" task; in the \"export_csv\" task, if left empty (the default), the file will be named after the value of the input_csv with \".csv_file_with_field_values\" appended and saved in the directory identified in input_dir . export_file_directory Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the directory where files corresponding to the data in the CSV output file will be written. export_file_media_use_term_id http://pcdm.org/use#OriginalFile Used in the \"export_csv\" and \"get_data_from_view\" tasks. The term ID or URI from the Islandora Media Use vocabulary that identifies the file you want to export. Media settings Setting Required Default value Description nodes_only false Include this option in create tasks, set to true , if you want to only create nodes and not their accompanying media. See the \"Creating nodes but not media\" section for more information. allow_missing_files false Determines if file values that point to missing (not found) files are allowed. Used in the create and add_media tasks. If set to true, file values that point to missing files are allowed. For create tasks, a true value will result in nodes without attached media. For add_media tasks, a true value will skip adding a media for the missing file CSV value. Defaults to false (which means all file values must name files that exist at their specified locations). Note that this setting has no effect on empty file values; these are always logged, and their corresponding media are simply not created. 
exit_on_first_missing_file_during_check true Removed as a configuration setting November 1, 2022. Use strict_check instead. strict_check Replaced with perform_soft_checks as of commit dfa60ff (July 14, 2023). media_use_tid http://pcdm.org/use#OriginalFile The term ID for the term from the \"Islandora Media Use\" vocabulary you want to apply to the media being created in create and add_media tasks. You can provide a term URI instead of a term ID, for example \"http://pcdm.org/use#OriginalFile\" . You can specify multiple values for this setting by joining them with the subdelimiter configured in the subdelimiter setting; for example, media_use_tid: 17|18 . You can also set this at the object level by including media_use_tid in your CSV file; values there will override the value set in your configuration file. If you are \" Adding multiple media \", you define media use term IDs in a slightly different way. media_type \u2714\ufe0f in add_media and update_media tasks Overrides, for all media being created, Workbench's default definition of whether the media being created is an image, file, document, audio, or video. Used in the create , create_from_files , add_media , and update_media tasks. More detail provided in the \" Configuring Media Types \" section. Required in all update_media and add_media tasks, not to override Workbench's defaults, but to explicitly indicate the media type being updated/added. media_types_override Overrides default media type definitions on a per file extension basis. Used in the create , add_media , and create_from_files tasks. More detail provided in the \" Configuring Media Types \" section. media_type_file_fields Defines the name of the media field that references the media's file (i.e., the field on the Media type). Usually used with custom media types and accompanied by either the media_type or media_types_override option. For more information, see the \" Configuring Media Types \" section. mimetype_extensions Overrides Workbench's default mappings between MIME types and file extensions. Usually used with remote files where the remote web server returns a MIME type that is not standard. For more information, see the \" Configuring Media Types \" section. extensions_to_mimetypes Overrides Workbench's default mappings between file extensions and MIME types. For more information, see the \" Configuring Media Types \" section. delete_media_with_nodes true When a node is deleted using a delete task, by default, all of its media are automatically deleted. Set this option to false to not delete all of a node's media (you do not generally want to keep the media without the node). use_node_title_for_media_title true If set to true (default), name media the same as the parent node's title value. Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_nid_in_media_title false If set to true , assigns a name to the media following the pattern {node_id}-Original File . Set to true to use the parent node's node ID as the media title. Applies to both create and add_media tasks. field_for_media_title Identifies a CSV field name (machine name, not human readable) that will be used as the media title in create tasks. For example, field_for_media_title: id . Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_node_title_for_remote_filename false Set to true to use a version of the parent node's title as the filename for a remote (http[s]) file.
Replaces all non-alphanumeric characters with an underscore ( _ ). Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) use_nid_in_media_filename setting. field_for_remote_filename Identifies a CSV field name (machine name, not human readable) that will be used as the filename for a remote (http[s]) file. For example, field_for_remote_filename: id . Truncates the value of the field to 255 characters. If the field is empty in the CSV, the CSV ID field value will be used. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) field_for_media_filename setting. delete_tmp_upload false For remote files, if set to true , the temporary copy of the remote file is deleted after it is used to create media. If false , the copy will remain in the location defined in your temp_dir setting. If the file cannot be deleted (e.g. a virus scanner is scanning it), it will remain and an error message will be added to the log file. additional_files Maps a set of CSV field names to media use terms IDs to create additional media (additional to the media created from the file named in the \"file\" column, that is) in create and add_media tasks. See \" Adding multiple media \" for more information. fixity_algorithm None Checksum/hash algorithm to use during transmission fixity checking. Must be one of \"md5\", \"sha1\", or \"sha256\". See \" Fixity checking \" for more information. validate_fixity_during_check false Perform checksum validation during --check . See \" Fixity checking \" for more information. delete_media_by_node_media_use_tids [] (empty list) During delete_media_by_node tasks, allows you to specify which media to delete. Only media with the listed terms IDs from the Islandora Media Use vocabulary will be deleted. By default (an empty list), all media are deleted. See \" Deleting Media \" for more information. Islandora model settings Setting Required Default value Description model [singular] Used in the create_from_files task only. Defines the term ID from the the \"Islandora Models\" vocabulary for all nodes created using this task. Note: one of model or models is required. More detail provided in the \" Creating nodes from files \" section. models [plural] Used in the create_from_files task only. Provides a mapping between file extensions and terms in the \"Islandora Models\" vocabulary. Note: one of model or models is required. More detail provided in the Creating nodes from files \" section. Paged and compound content settings See the section \" Creating paged content \" for more information. Setting Required Default value Description paged_content_from_directories false Set to true if you are using the \" Using subdirectories \" method of creating paged content. page_files_source_dir_field id [or whatever is defined as your id column using the id_field configuration setting] Set to directory if your input CSV contains a \"directory\" column that names each row's page, if are using the \" Using subdirectories \" method of creating paged content. paged_content_sequence_separator - [hyphen] The character used to separate the page sequence number from the rest of the filename. Used when creating paged content with the \" Using subdirectories \" method. Note: this configuration option was originally misspelled \"paged_content_sequence_seprator\". paged_content_page_model_tid Required if paged_content_from_directories is true. 
The term ID (or term URI) from the Islandora Models taxonomy to assign to pages. paged_content_image_file_extension If the subdirectories containing your page image files also contain other files (that are not page images), you need to use this setting to tell Workbench which files to create pages from. Common values include tif , tiff , and jpg . paged_content_additional_page_media A mapping of Media Use term IDs (or URIs) to file extensions telling Workbench which Media Use term to apply to media created from additional files such as OCR text files. paged_content_page_display_hints The term ID from the Islandora Display taxonomy to assign to pages. If not included, defaults to the value of field_display_hints in the parent's record in the CSV file. paged_content_page_content_type Set to the machine name of the Drupal node content type for pages created using the \" Using subdirectories \" method if it is different than the content type of the parent (which is specified in the content_type setting). page_title_template '$parent_title, page $weight' Template used to generate the titles of pages/members in the \" Using subdirectories \" method. paged_content_ignore_files [\"Thumbs.db\"] List of filenames that you want Workbench to ignore when it scans directories to create page- or child-level nodes. See \" Using subdirectories \" for more information. Logging settings See the \" Logging \" section for more information. Setting Required Default value Description log_file_path workbench.log The path to the log file, absolute or relative to the directory Workbench is run from. log_file_mode a [append] Set to \"w\" to overwrite the log file, if it exists. log_request_url false Whether or not to log the request URL (and its method). Useful for debugging. log_json false Whether or not to log the raw request JSON POSTed, PUT, or PATCHed to Drupal. Useful for debugging. log_headers false Whether or not to log the raw HTTP headers used in all requests. Useful for debugging. log_response_status_code false Whether or not to log the HTTP response code. Useful for debugging. log_response_time false Whether or not to log the response time of each request that is slower than the average response time for the last 20 HTTP requests Workbench makes to the Drupal server. Useful for debugging. log_response_body false Whether or not to log the raw HTTP response body. Useful for debugging. log_file_name_and_line_number false Whether or not to add the filename and line number that created the log entry. Useful for debugging. log_term_creation true Whether or not to log the creation of new terms during \"create\" and \"update\" tasks (does not apply to the \"create_terms\" task). --check will still report that terms in the CSV do not exist in their respective vocabularies. HTTP settings Setting Required Default value Description user_agent Islandora Workbench String to use as the User-Agent header in HTTP requests. allow_redirects true Whether or not to allow Islandora Workbench to respond to HTTP redirects. secure_ssl_only true Whether or not to require valid SSL certificates. Set to false if you want to ignore SSL certificates. enable_http_cache true Whether or not to enable Workbench's client-side request cache. Set to false if you want to disable the cache during troubleshooting, etc. http_cache_storage memory The backend storage type for the client-side cache. Set to sqlite if you are getting out of memory errors while running Islandora Workbench.
http_cache_storage_expire_after 1200 Length of the client-side cache lifespan (in seconds). Reduce this number if you are using the sqlite storage backend and the database is using too much disk space. Note that reducing the cache lifespan will result in increased load on your Drupal server. Rollback configuration and CSV file settings See \" Rolling back \" for more information. Setting Required Default value Description rollback_dir Value of input_dir setting Absolute path to the directory where you want your \"rollback.csv\" file to be written. rollback_config_file_path rollback.yml Relative (to workbench) or absolute path to write the rollback config YAML file to. rollback_csv_file_path [input_dir/rollback.csv] Relative (to workbench) or absolute path to write the rollback CSV file to. timestamp_rollback false Set to true to add a timestamp to the \"rollback.yml\" and corresponding \"rollback.csv\" generated in \"create\" and \"create_from_files\" tasks. rollback_config_filename_template Defines a template that will be used to create the rollback configuration file. The two placeholders available in this template are $config_filename and $input_csv_filename . rollback_csv_filename_template Defines a template that will be used to create the rollback CSV file. The two placeholders available in this template are $config_filename and $input_csv_filename . rollback_file_comments Defines a list of lines to be added to both the rollback config and CSV file as comments. include_password_in_rollback_config_file false Set to true to include the value of the password configuration setting in your rollback config YAML file. Miscellaneous settings Setting Required Default value Description perform_soft_checks false If set to true, --check will not exit when it encounters an error with parent/child order, file extension registration with Drupal media file fields, missing files named in the files CSV column, or EDTF validation. Instead, it will log any errors it encounters and exit after it has checked all rows in the CSV input file. Note: this setting replaces strict_check as of commit dfa60ff (July 14, 2023). temp_dir Value of the temporary directory defined by your system, as determined by Python's gettempdir() function. Relative or absolute path to the directory where you want Workbench's temporary files to be written. These include the \".preprocessed\" version of your input CSV, remote files temporarily downloaded to create media, and the CSV ID to node ID map database. sqlite_db_filename [temp_dir]/workbench_temp_data.db Name of the SQLite database file used to store session data. pause Defines the number of seconds to pause between all 'POST', 'PUT', 'PATCH', 'DELETE' requests to Drupal. Include it in your configuration to lessen the impact of Islandora Workbench on your site during large jobs, for example pause: 1.5. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause Defines the number of seconds to pause between each REST request to Drupal. Works like \"pause\" but only takes effect when the Drupal server's response to the most recent request is slower (determined by the \"adaptive_pause_threshold\" value) than the average response time for the last 20 requests. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause_threshold 2 A weighting of the response time for the most recent request, relative to the average response times of the last 20 requests.
This weighting determines how much slower the Drupal server's response to the most recent Workbench request must be in order for adaptive pausing to take effect for the next request. For example, if set to \"1\", adaptive pausing will happen when the response time is equal to the average of the last 20 response times; if set to \"2\", adaptive pausing will take effect if the last request's response time is double the average. progress_bar false Show a progress bar when running Workbench instead of row-by-row output. bootstrap List of absolute paths to one or more scripts that execute prior to Workbench connecting to Drupal. More information is available in the \" Hooks \" documentation. shutdown List of absolute paths to one or more scripts that execute after Workbench has finished interacting with Drupal. More information is available in the \" Hooks \" documentation. preprocessors List of absolute paths to one or more scripts that are applied to CSV values prior to the values being ingested into Drupal. More information is available in the \" Hooks \" documentation. node_post_create List of absolute paths to one or more scripts that execute after a node is created. More information is available in the \" Hooks \" documentation. node_post_update List of absolute paths to one or more scripts that execute after a node is updated. More information is available in the \" Hooks \" documentation. media_post_create List of absolute paths to one or more scripts that execute after a media is created. More information is available in the \" Hooks \" documentation. drupal_8 Deprecated. path_to_python python Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the python interpreter. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". path_to_workbench_script workbench Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the workbench script. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". contact_sheet_output_dir contact_sheet_output Used in create tasks to specify the name of the directory where contact sheet output is written. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. contact_sheet_css_path assets/contact_sheet/contact-sheet.css Used in create tasks to specify the path of the CSS stylesheet to use in contact sheets. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. node_exists_verification_view_endpoint Used in create tasks to tell Workbench to check whether a node with a matching value in your input CSV already exists. See \" Checking if nodes already exist \" for more information. redirect_status_code 301 Used in create_redirects tasks to set the HTTP response code in redirects. See \" Creating redirects \" for more information. remind_user_to_run_check false Prompt the user to confirm whether they have run Workbench with --check . See \" Prompting the user \" for more information. user_prompts [] (empty list) List of custom prompts to present to the user. See \" Prompting the user \" for more information. check_lock_file_path Defines a \"lock file\" that contains data generated by a successful --check . See \" Checking configuration and input data \" for more information.
secondary_tasks A list of configuration file names that are executed as secondary tasks after the primary task to create compound objects. See \" Using a secondary task \" for more information. csv_id_to_node_id_map_path [temp_dir]/csv_id_to_node_id_map.db Name of the SQLite database filename used to store CSV row ID to node ID mappings for nodes created during create and create_from_files tasks. If you want to store your map database outside of a temporary directory, use an absolute path to the database file for this setting. If you want to disable population of this database, set this config setting to false . See \" Generating CSV files \" for more information. query_csv_id_to_node_id_map_for_parents false Queries the CSV ID to node ID map when creating compound content. Set to true if you want to use parent IDs from previous Workbench sessions. Note: this setting is automatically set to true in secondary task config files. See \" Creating parent/child relationships across Workbench sessions for more information.\" ignore_duplicate_parent_ids true Tells Workbench to ignore entries in the CSV ID to node ID map that have the same parent ID value. Set to false if you want Workbench to warn you that there are duplicate parent IDs in your CSV ID to node ID map. See \" Creating parent/child relationships across Workbench sessions for more information.\" When you run Islandora Workbench with the --check argument, it will verify that all configuration options required for the current task are present, and if they aren't tell you so. Note Islandora Workbench automatically converts any configuration keys to lowercase, e.g., Task is equivalent to task . Validating the syntax of the configuration file When you run Workbench, it confirms that your configuration file is valid YAML. This is a syntax check only, not a content check. If the file is valid YAML, Workbench then goes on to perform a long list of application-specific checks . If this syntax check fails, some detail about the problem will be displayed to the user. The same information plus the entire Python stack trace is also logged to a file named \"workbench.log\" in the directory Islandora Workbench is run from. This file name is Workbench's default log file name, but in this case (validating the config file's YAML syntax), that file name is used regardless of the log file location defined in the configuration's log_file_path option. The reason the error is logged in the default location instead of the value in the configuration file (if one is present) is that the configuration file isn't valid YAML and therefore can't be parsed. Example configuration files These examples provide inline annotations explaining why the settings are included in the configuration file. Blank rows/lines are included for readability. Create nodes only, no media task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to create nodes with no media. # Also, this tells --check to skip all validation of \"file\" locations. # Other media settings, like \"media_use_tid\", are also ignored. nodes_only: true Use a custom log file location task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to write its log file to the location specified # instead of the default \"workbench.log\" within the directory Workbench is run from. 
log_file_path: /home/mark/workbench_log.txt Include some CSV field templates task: create host: \"http://localhost:8000\" username: admin password: islandora # The values in this list of field templates are applied to every row in the # input CSV file before the CSV file is used to populate Drupal fields. The # field templates are also applied during the \"--check\" in order to validate # the values of the fields. csv_field_templates: - field_member_of: 205 - field_model: 25 Use a Google Sheet as input CSV task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit' # You only need to specify the google_sheets_gid option if the worksheet in the Google Sheet # is not the default one. google_sheets_gid: 1867618389 Create nodes and media from files (no input CSV file) task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # The files to create the nodes from are in this directory. input_dir: /tmp/sample_files # This tells Workbench to write a CSV file containing node IDs of the # created nodes, plus the field names used in the target content type # (\"islandora_object\" by default). output_csv: /tmp/sample_files.csv # All nodes should get the \"Model\" value corresponding to this URI. model: 'https://schema.org/DigitalDocument' Create taxonomy terms task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary Ignore some columns in your input CSV file task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: input.csv # This tells Workbench to ignore the 'date_generated' and 'batch_id' # columns in the input.csv file. ignore_csv_columns: ['date_generated', 'batch_id'] Generating sample Islandora content task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # This directory must match the one defined in the script's 'dest_dir' variable. input_dir: /tmp/autogen_input media_use_tid: 17 output_csv: /tmp/my_sample_content_csv.csv model: http://purl.org/coar/resource_type/c_c513 # This is the script that generates the sample content. bootstrap: - \"/home/mark/Documents/hacking/workbench/generate_image_files.py\" Running a post-action script task: create host: \"http://localhost:8000\" username: admin password: islandora node_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # node_post_update: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # media_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] Create nodes and media from files task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora input_dir: path/to/files models: - 'http://purl.org/coar/resource_type/c_1843': ['zip', 'tar', ''] - 'https://schema.org/DigitalDocument': ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 'http://purl.org/coar/resource_type/c_c513': ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 'http://purl.org/coar/resource_type/c_18cc': ['mp3', 'wav', 'aac'] - 'http://purl.org/coar/resource_type/c_12ce': ['mp4']","title":"Configuration"},{"location":"configuration/#the-configuration-file","text":"Workbench uses a YAML configuration whose location is indicated in the --config argument. This file defines the various options it will use to create, update, or delete Islandora content.
The simplest configuration file needs only the following four options: task: create host: \"http://localhost:8000\" username: admin password: islandora In this example, the task being performed is creating nodes (and optionally media). Other tasks are create_from_files , update , delete , add_media , update_media , and delete_media . Some of the configuration settings documented below are used in all tasks, while others are only used in specific tasks.","title":"The configuration file"},{"location":"configuration/#configuration-settings","text":"The settings defined in a configuration file are documented in the tables below, grouped into broad functional categories for easier reference. The order of the options in the configuration file doesn't matter, and settings do not need to be grouped together in any specific way in the configuration file.","title":"Configuration settings"},{"location":"configuration/#use-of-quotation-marks","text":"Generally speaking, you do not need to use quotation marks around values in your configuration file. You may wrap values in quotation marks if you wish, and many examples in this documentation do that (especially the host setting), but the only values that should not be wrapped in quotation marks are those that take true or false as values because in YAML, and many other computer/markup languages, \"true\" is a string (in this case, an English word that can mean many things) and true is a reserved symbol that can mean one thing and one thing only, the boolean opposite of false (I'm sorry for this explanation, I can't describe the distinction in any other way without writing a primer on symbolic logic). For example, the following is a valid configuration file: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: true csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 But the following version is not valid, since there are quotes around \"true\" in the nodes_only setting: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: \"true\" csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25","title":"Use of quotation marks"},{"location":"configuration/#use-of-spaces-and-other-syntactical-features","text":"Configuration setting names should start a new line and not have any leading spaces. The exception is illustrated in the values of the csv_field_templates setting in the above examples, where the setting's value is a list of other values. In this case the members of the list start with a dash and a space ( - ). The trailing space in these values is significant. (However, the leading space before the dash is insignificant, and is used for appearance only.) For example, this snippet is valid: csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 whereas this one is not: csv_field_templates: -field_linked_agent: relators:aut:person:Jordan, Mark -field_model: 25 Some setting values are represented in Workbench documentation using square brackets, like this one: export_csv_field_list: ['field_description', 'field_extent'] Strictly speaking, YAML lists can be represented as either a series of entries on their own lines that start with - or as entries enclosed in [ and ] . 
It's best to follow the examples provided throughout the Workbench documentation.","title":"Use of spaces and other syntactical features"},{"location":"configuration/#required-configuration-settings","text":"Setting Required Default value Description task \u2714\ufe0f One of 'create', 'create_from_files', 'update', delete', 'add_media', 'delete_media', 'update_media', 'export_csv', 'get_data_from_view', 'create_terms', or 'delete_media_by_node'. See \" Choosing a task \" for more information. host \u2714\ufe0f The hostname, including http:// or https:// of your Islandora repository, and port number if not the default 80. username \u2714\ufe0f The username used to authenticate the requests. This Drupal user should be a member of the \"Administrator\" role. If you want to create nodes that are owned by a specific Drupal user, include their numeric user ID in the uid column in your CSV. password The user's password. You can also set the password in your ISLANDORA_WORKBENCH_PASSWORD environment variable. If you do this, omit the password option in your configuration file. If a password is not available in either your configuration file or in the environment variable, Workbench will prompt for a password.","title":"Required configuration settings"},{"location":"configuration/#drupal-settings","text":"Setting Required Default value Description content_type islandora_object The machine name of the Drupal node content type you are creating or updating. Required in \"create\" and \"update\" tasks. drupal_filesystem fedora:// One of 'fedora://', 'public://', or 'private://' (the wrapping quotation marks are required). Only used with Drupal 8.x - 9.1; starting with Drupal 9.2, the filesystem is automatically detected from the media's configuration. Will eventually be deprecated. allow_adding_terms false In create , update , add_media , update_media , create_terms , and update_terms tasks, determines if Workbench will add taxonomy terms if they do not exist in the target vocabulary. See more information in the \" Taxonomy reference fields \" section. Note: this setting is not required in create_terms tasks unless you are adding new terms to a taxonomy reference field on the term entries. protected_vocabularies [] (empty list) Allows you to exclude vocabularies from having new terms added via allow_adding_terms . See more information in the \" Taxonomy reference fields \" section. vocab_id \u2714\ufe0f in create_terms tasks. Identifies the vocabulary you are adding terms to in create_tersm tasks. See more information in the \" Creating taxonomy terms \" section. update_mode replace Determines if Workbench will replace , append (add to) , or delete field values during update tasks. See more information in the \" Updating nodes \" section. Also applies to update_media tasks. validate_terms_exist true If set to false, during --check Workbench will not query Drupal to determine if taxonomy terms exist. The structure of term values in CSV are still validated; this option only tells Workbench to not check for each term's existence in the target Drupal. Useful to speed up the --check process if you know terms don't exist in the target Drupal. validate_parent_node_exists true If set to false, during --check Workbench will not query Drupal to determine if nodes whose node IDs are in field_member_of exist. Useful to speed up the --check process if you know terms already exist in the target Drupal. 
max_node_title_length 255 Set to the number of allowed characters for node titles if your Drupal uses Node Title Length . If unsure what your the maximum length of the node titles your site allows, check the length of the \"title\" column in your Drupal database's \"node_field_data\" table. list_missing_drupal_fields false Set to true to tell Workbench to provide a list of fields that exist in your input CSV but that cannot be matched to Drupal field names (or reserved column names such as \"file\"). If false , Workbench will still check for CSV column headers that it can't match to Drupal fields, but will exit upon finding the first such field. This option produces a list of fields instead of exiting on detecting the first field. standalone_media_url false Set to true if your Drupal instance has the \"Standalone media URL\" option at /admin/config/media/media-settings checked. The Drupal default is to have this unchecked, so you only need to use this Workbench option if you have changed Drupal's default. More information is available. require_entity_reference_views true Set to false to tell Workbench to not require a View to expose the values in an entity reference field configured to use an Entity Reference View. Additional information is available here . entity_reference_view_endpoints A list of mappings from Drupal/CSV field names to Views REST Export endpoints used to look up term names for entity reference field configured to use an Entity Reference View. Additional information is available here . text_format_id basic_html The text format ID (machine name) to apply to all Drupal text fields that have a \"formatted\" field type. See \" Text fields with markup \" for more information. field_text_format_ids Defines a mapping between field machine names the machine names of format IDs for \"formatted\" fields. See \" Text fields with markup \" for more information. paragraph_fields Defines structure of paragraph fields in the input CSV. See \" Entity Reference Revisions fields (paragraphs) \" for more information.","title":"Drupal settings"},{"location":"configuration/#input-data-location-settings","text":"Setting Required Default value Description input_dir input_data The full or relative path to the directory containing the files and metadata CSV file. input_csv metadata.csv Path to the CSV metadata file. Can be absolute, or if just the filename is provided, will be assumed to be in the directory named in input_dir . Can also be the URL to a Google spreadsheet (see the \" Using Google Sheets as input data \" section for more information). google_sheets_csv_filename google_sheet.csv Local CSV filename for data from a Google spreadsheet. See the \" Using Google Sheets as input data \" section for more information. google_sheets_gid 0 The \"gid\" of the worksheet to use in a Google Sheet. See \" Using Google Sheets as input data \" section for more information. excel_worksheet Sheet1 If using an Excel file as your input CSV file, the name of the worksheet that the CSV data will be extracted from. input_data_zip_archives [] List of local file paths and/or URLs to .zip archives to extract into the directory defined in input_dir . See \" Using a local or remote .zip archive as input data \" for more info. 
delete_zip_archive_after_extraction true Tells Workbench to delete an inpu zip archive after it has been extracted.","title":"Input data location settings"},{"location":"configuration/#input-csv-file-settings","text":"Setting Required Default value Description id_field id The name of the field in the CSV that uniquely identifies each record. delimiter , [comma] The delimiter used in the CSV file, for example, \",\" or \"\\t\" (must use double quotes with \"\\t\"). If omitted, defaults to \",\". subdelimiter | [pipe] The subdelimiter used in the CSV file to define multiple values in one field. If omitted, defaults to \"|\". Can be a string of multiple characters, e.g. \"^^^\". csv_field_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding values that are copied into the CSV input file. More detail provided in the \" CSV field templates \" section. csv_value_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding templates. More detail provided in the \" CSV value templates \" section. csv_value_templates_for_paged_content Used in create tasks only. Similar to csv_value_templates but applies to paged/child items created using the \" Using subdirectories \" method of creating paged content. More detail provided in the \" CSV value templates \" section. csv_value_templates_rand_length 5 Length of the $random_alphanumeric_string and $random_number_string variables CSV value templates. More detail provided in the \" CSV value templates \" section. allow_csv_value_templates_if_field_empty [] List of fields to populate with CSV value templates if the CSV field is empty. More detail provided in the \" CSV value templates \" section. ignore_csv_columns Used in the create and update tasks only. A list of CSV column headers that Workbench should ignore. For example, ignore_csv_columns: [Target Collection, Ready to publish] csv_start_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) before the designated row number. More information is available. csv_stop_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) after the designated row number. More information is available. csv_rows_to_process Used in create and update tasks. Tells Workbench to process only the rows/records in input CSV (or Google Sheet or Excel) with the specified \"id\" column values. More information is available. csv_headers names Used in \"create\", \"update\" and \"create_terms\" tasks. Set to \"labels\" to allow use of field labels (as opposed to machine names) as CSV column headers. clean_csv_values_skip [] (empty list) Used in all tasks that use CSV input files. See \" How Workbench cleans your input data \" for more information. columns_with_term_names [] (empty list) Used in all tasks that allow creation of terms on entity ingest. See \" Using numbers as term names \" for more information.","title":"Input CSV file settings"},{"location":"configuration/#output-csv-settings","text":"See \" Generating CSV files \" section for more information. Setting Required Default value Description output_csv Used in \"create\" tasks. The full or relative (to the \"workbench\" script) path to a CSV file with one record per node created by Workbench. output_csv_include_input_csv false Used in \"create\" tasks in conjunction with output_csv . 
Include in the output CSV all the fields (and their values) from the input CSV. export_csv_term_mode tid Used in \"export_csv\" tasks to indicate whether vocabulary term IDs or names are included in the output CSV file. Set to \"tid\" (the default) to include term IDs, or set to \"name\" to include term names. See \" Exporting field data into a CSV file \" for more information. export_csv_field_list [] (empty list) List of fields to include in exported CSV data. If empty, all fields will be included. See \" Using a Drupal View to identify content to export as CSV \" for more information. view_parameters List of URL parameter/value strings to include in requests to a View. See \" Using a Drupal View to identify content to export as CSV \" for more information. export_csv_file_path \u2714\ufe0f in get_data_from_view tasks. Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the exported CSV file. Required in the \"get_data_from_view\" task; in the \"export_csv\" task, if left empty (the default), the file will be named after the value of the input_csv with \".csv_file_with_field_values\" appended and saved in the directory identified in input_dir . export_file_directory Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the directory where files corresponding to the data in the CSV output file will be written. export_file_media_use_term_id http://pcdm.org/use#OriginalFile Used in the \"export_csv\" and \"get_data_from_view\" tasks. The term ID or URI from the Islandora Media Use vocabulary that identifies the file you want to export.","title":"Output CSV settings"},{"location":"configuration/#media-settings","text":"Setting Required Default value Description nodes_only false Include this option in create tasks, set to true , if you want to only create nodes and not their accompanying media. See the \"Creating nodes but not media\" section for more information. allow_missing_files false Determines if file values that point to missing (not found) files are allowed. Used in the create and add_media tasks. If set to true, file values that point to missing files are allowed. For create tasks, a true value will result in nodes without attached media. For add_media tasks, a true value will skip adding a media for the missing file CSV value. Defaults to false (which means all file values must name files that exist at their specified locations). Note that this setting has no effect on empty file values; these are always logged, and their corresponding media are simply not created. exit_on_first_missing_file_during_check true Removed as a configuration setting November 1, 2022. Use strict_check instead. strict_check Replaced with perform_soft_checks as of commit dfa60ff (July 14, 2023). media_use_tid http://pcdm.org/use#OriginalFile The term ID for the term from the \"Islandora Media Use\" vocabulary you want to apply to the media being created in create and add_media tasks. You can provide a term URI instead of a term ID, for example \"http://pcdm.org/use#OriginalFile\" . You can specify multiple values for this setting by joining them with the subdelimiter configured in the subdelimiter setting; for example, media_use_tid: 17|18 . You can also set this at the object level by including media_use_tid in your CSV file; values there will override the value set in your configuration file. If you are \" Adding multiple media \", you define media use term IDs in a slightly different way. 
media_type \u2714\ufe0f in add_media and update_media tasks Overrides, for all media being created, Workbench's default definition of whether the media being created is an image, file, document, audio, or video. Used in the create , create_from_files , add_media , and update_media tasks. More detail provided in the \" Configuring Media Types \" section. Required in all update_media and add_media tasks, not to override Workbench's defaults, but to explicitly indicate the media type being updated/added. media_types_override Overrides default media type definitions on a per file extension basis. Used in the create , add_media , and create_from_files tasks. More detail provided in the \" Configuring Media Types \" section. media_type_file_fields Defines the name of the media field that references the media's file (i.e., the field on the Media type). Usually used with custom media types and accompanied by either the media_type or media_types_override option. For more information, see the \" Configuring Media Types \" section. mimetype_extensions Overrides Workbench's default mappings between MIME types and file extensions. Usually used with remote files where the remote web server returns a MIME type that is not standard. For more information, see the \" Configuring Media Types \" section. extensions_to_mimetypes Overrides Workbench's default mappings between file extension and MIME types. For more information, see the \" Configuring Media Types \" section. delete_media_with_nodes true When a node is deleted using a delete task, by default, all of its media are automatically deleted. Set this option to false to not delete all of a node's media (you do not generally want to keep the media without the node). use_node_title_for_media_title true If set to true (default), name media the same as the parent node's title value. Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_nid_in_media_title false If set to true , assigns a name to the media following the pattern {node_id}-Original File . Set to true to use the parent node's node ID as the media title. Applies to both create and add_media tasks. field_for_media_title Identifies a CSV field name (machine name, not human readable) that will be used as the media title in create tasks. For example, field_for_media_title: id . Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_node_title_for_remote_filename false Set to true to use a version of the parent node's title as the filename for a remote (http[s]) file. Replaces all non-alphanumeric characters with an underscore ( _ ). Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) use_nid_in_media_filename setting. field_for_remote_filename Identifies a CSV field name (machine name, not human readable) that will be used as the filename for a remote (http[s]) file. For example, field_for_remote_filename: id . Truncates the value of the field to 255 characters. If the field is empty in the CSV, the CSV ID field value will be used. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) field_for_media_filename setting. delete_tmp_upload false For remote files, if set to true , the temporary copy of the remote file is deleted after it is used to create media. If false , the copy will remain in the location defined in your temp_dir setting. 
If the file cannot be deleted (e.g. a virus scanner is scanning it), it will remain and an error message will be added to the log file. additional_files Maps a set of CSV field names to media use term IDs to create additional media (additional to the media created from the file named in the \"file\" column, that is) in create and add_media tasks. See \" Adding multiple media \" for more information. fixity_algorithm None Checksum/hash algorithm to use during transmission fixity checking. Must be one of \"md5\", \"sha1\", or \"sha256\". See \" Fixity checking \" for more information. validate_fixity_during_check false Perform checksum validation during --check . See \" Fixity checking \" for more information. delete_media_by_node_media_use_tids [] (empty list) During delete_media_by_node tasks, allows you to specify which media to delete. Only media with the listed term IDs from the Islandora Media Use vocabulary will be deleted. By default (an empty list), all media are deleted. See \" Deleting Media \" for more information.","title":"Media settings"},{"location":"configuration/#islandora-model-settings","text":"Setting Required Default value Description model [singular] Used in the create_from_files task only. Defines the term ID from the \"Islandora Models\" vocabulary for all nodes created using this task. Note: one of model or models is required. More detail provided in the \" Creating nodes from files \" section. models [plural] Used in the create_from_files task only. Provides a mapping between file extensions and terms in the \"Islandora Models\" vocabulary. Note: one of model or models is required. More detail provided in the \" Creating nodes from files \" section.","title":"Islandora model settings"},{"location":"configuration/#paged-and-compound-content-settings","text":"See the section \" Creating paged content \" for more information. Setting Required Default value Description paged_content_from_directories false Set to true if you are using the \" Using subdirectories \" method of creating paged content. page_files_source_dir_field id [or whatever is defined as your id column using the id_field configuration setting] Set to directory if your input CSV contains a \"directory\" column that names each row's page, if you are using the \" Using subdirectories \" method of creating paged content. paged_content_sequence_separator - [hyphen] The character used to separate the page sequence number from the rest of the filename. Used when creating paged content with the \" Using subdirectories \" method. Note: this configuration option was originally misspelled \"paged_content_sequence_seprator\". paged_content_page_model_tid Required if paged_content_from_directories is true. The term ID (or its URI) from the Islandora Models taxonomy to assign to pages. paged_content_image_file_extension If the subdirectories containing your page image files also contain other files (that are not page images), you need to use this setting to tell Workbench which files to create pages from. Common values include tif , tiff , and jpg . paged_content_additional_page_media A mapping of Media Use term IDs (or URIs) to file extensions telling Workbench which Media Use term to apply to media created from additional files such as OCR text files. paged_content_page_display_hints The term ID from the Islandora Display taxonomy to assign to pages. If not included, defaults to the value of the field_display_hints in the parent's record in the CSV file. 
paged_content_page_content_type Set to the machine name of the Drupal node content type for pages created using the \" Using subdirectories \" method if it is different than the content type of the parent (which is specified in the content_type setting). page_title_template '$parent_title, page $weight' Template used to generate the titles of pages/members in the \" Using subdirectories \" method. paged_content_ignore_files [\"Thumbs.db\"] List of filenames that you want Workbench to ignore when it scans directories to create page- or child-level nodes. See \" Using subdirectories \" for more inforation.","title":"Paged and compound content settings"},{"location":"configuration/#logging-settings","text":"See the \" Logging \" section for more information. Setting Required Default value Description log_file_path workbench.log The path to the log file, absolute or relative to the directory Workbench is run from. log_file_mode a [append] Set to \"w\" to overwrite the log file, if it exists. log_request_url false Whether or not to log the request URL (and its method). Useful for debugging. log_json false Whether or not to log the raw request JSON POSTed, PUT, or PATCHed to Drupal. Useful for debugging. log_headers false Whether or not to log the raw HTTP headers used in all requests. Useful for debugging. log_response_status_code false Whether or not to log the HTTP response code. Useful for debugging. log_response_time false Whether or not to log the response time of each request that is slower than the average response time for the last 20 HTTP requests Workbench makes to the Drupal server. Useful for debugging. log_response_body false Whether or not to log the raw HTTP response body. Useful for debugging. log_file_name_and_line_number false Whether or not to add the filename and line number that created the log entry. Useful for debugging. log_term_creation true Whether or not to log the creation of new terms during \"create\" and \"update\" tasks (does not apply to the \"create_terms\" task). --check will still report that terms in the CSV do not exist in their respective vocabularies.","title":"Logging settings"},{"location":"configuration/#http-settings","text":"Setting Required Default value Description user_agent Islandora Workbench String to use as the User-Agent header in HTTP requests. allow_redirects true Whether or not to allow Islandora Workbench to respond to HTTP redirects. secure_ssl_only true Whether or not to require valid SSL certificates. Set to false if you want to ignore SSL certificates. enable_http_cache true Whether or not to enable Workbench's client-side request cache. Set to false if you want to disable the cache during troubleshooting, etc. http_cache_storage memory The backend storage type for the client-side cache. Set to sqlite if you are getting out of memory errors while running Islandora Workbench. http_cache_storage_expire_after 1200 Length of the client-side cache lifespan (in seconds). Reduce this number if you are using the sqlite storage backend and the database is using too much disk space. Note that reducing the cache lifespan will result in increased load on your Drupal server.","title":"HTTP settings"},{"location":"configuration/#rollback-configuration-and-csv-file-settings","text":"See \" Rolling back \" for more information. Setting Required Default value Description rollback_dir Value of input_dir setting Absolute path to the directory where you want your \"rollback.csv\" file to be written. 
rollback_config_file_path rollback.yml Relative (to workbench) or absolute path to write the rollback config YAML file to. rollback_csv_file_path [input_dir/rollback.csv] Relative (to workbench) or absolute path to write the rollback CSV file to. timestamp_rollback false Set to true to add a timestamp to the \"rollback.yml\" and corresponding \"rollback.csv\" generated in \"create\" and \"create_from_files\" tasks. rollback_config_filename_template Defines a template that will be used to create the rollback configuration file. The two placeholders available in this template are $config_filename and $input_csv_filename . rollback_csv_filename_template Defines a template that will be used to create the rollback CSV file. The two placeholders available in this template are $config_filename and $input_csv_filename . rollback_file_comments Defines a list of lines to be added to both the rollback config and CSV file as comments. include_password_in_rollback_config_file false Set to true to include the value of the password configuration setting in your rollback config YAML file.","title":"Rollback configuration and CSV file settings"},{"location":"configuration/#miscellaneous-settings","text":"Setting Required Default value Description perform_soft_checks false If set to true, --check will not exit when it encounters an error with parent/child order, file extension registration with Drupal media file fields, missing files named in the files CSV column, or EDTF validation. Instead, it will log any errors it encounters and exit after it has checked all rows in the CSV input file. Note: this setting replaces strict_check as of commit dfa60ff (July 14, 2023). temp_dir Value of the temporary directory defined by your system, as determined by Python's gettempdir() function. Relative or absolute path to the directory where you want Workbench's temporary files to be written. These include the \".preprocessed\" version of your input CSV, remote files temporarily downloaded to create media, and the CSV ID to node ID map database. sqlite_db_filename [temp_dir]/workbench_temp_data.db Name of the SQLite database file used to store session data. pause Defines the number of seconds to pause between all 'POST', 'PUT', 'PATCH', 'DELETE' requests to Drupal. Include it in your configuration to lessen the impact of Islandora Workbench on your site during large jobs, for example pause: 1.5. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause Defines the number of seconds to pause between each REST request to Drupal. Works like \"pause\" but only takes effect when the Drupal server's response to the most recent request is slower (determined by the \"adaptive_pause_threshold\" value) than the average response time for the last 20 requests. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause_threshold 2 A weighting of the response time for the most recent request, relative to the average response times of the last 20 requests. This weighting determines how much slower the Drupal server's response to the most recent Workbench request must be in order for adaptive pausing to take effect for the next request. For example, if set to \"1\", adaptive pausing will happen when the response time is equal to the average of the last 20 response times; if set to \"2\", adaptive pausing will take effect if the last request's response time is double the average. 
progress_bar false Show a progress bar when running Workbench instead of row-by-row output. bootstrap List of absolute paths to one or more scripts that execute prior to Workbench connecting to Drupal. More information is available in the \" Hooks \" documentation. shutdown List of absolute paths to one or more scripts that execute after Workbench has finished interacting with Drupal. More information is available in the \" Hooks \" documentation. preprocessors List of absolute paths to one or more scripts that are applied to CSV values prior to the values being ingested into Drupal. More information is available in the \" Hooks \" documentation. node_post_create List of absolute paths to one or more scripts that execute after a node is created. More information is available in the \" Hooks \" documentation. node_post_update List of absolute paths to one or more scripts that execute after a node is updated. More information is available in the \" Hooks \" documentation. media_post_create List of absolute paths to one or more scripts that execute after a media is created. More information is available in the \" Hooks \" documentation. drupal_8 Deprecated. path_to_python python Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the python interpreter. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". path_to_workbench_script workbench Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the workbench script. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". contact_sheet_output_dir contact_sheet_output Used in create tasks to specify the name of the directory where contact sheet output is written. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. contact_sheet_css_path assets/contact_sheet/contact-sheet.css Used in create tasks to specify the path of the CSS stylesheet to use in contact sheets. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. node_exists_verification_view_endpoint Used in create tasks to tell Workbench to check whether a node with a matching value in your input CSV already exists. See \" Checking if nodes already exist \" for more information. redirect_status_code 301 Used in create_redirect tasks to set the HTTP response code in redirects. See the documentation on creating URL redirects for more information. remind_user_to_run_check false Prompt the user to confirm whether they have run Workbench with --check . See \" Prompting the user \" for more information. user_prompts [] (empty list) List of custom prompts to present to the user. See \" Prompting the user \" for more information. check_lock_file_path Defines a \"lock file\" that contains data generated by a successful --check . See \" Checking configuration and input data \" for more information. secondary_tasks A list of configuration file names that are executed as secondary tasks after the primary task to create compound objects. See \" Using a secondary task \" for more information. csv_id_to_node_id_map_path [temp_dir]/csv_id_to_node_id_map.db Name of the SQLite database file used to store CSV row ID to node ID mappings for nodes created during create and create_from_files tasks. 
If you want to store your map database outside of a temporary directory, use an absolute path to the database file for this setting. If you want to disable population of this database, set this config setting to false . See \" Generating CSV files \" for more information. query_csv_id_to_node_id_map_for_parents false Queries the CSV ID to node ID map when creating compound content. Set to true if you want to use parent IDs from previous Workbench sessions. Note: this setting is automatically set to true in secondary task config files. See \" Creating parent/child relationships across Workbench sessions \" for more information. ignore_duplicate_parent_ids true Tells Workbench to ignore entries in the CSV ID to node ID map that have the same parent ID value. Set to false if you want Workbench to warn you that there are duplicate parent IDs in your CSV ID to node ID map. See \" Creating parent/child relationships across Workbench sessions \" for more information. When you run Islandora Workbench with the --check argument, it will verify that all configuration options required for the current task are present, and if they aren't, tell you so. Note Islandora Workbench automatically converts any configuration keys to lowercase, e.g., Task is equivalent to task .","title":"Miscellaneous settings"},{"location":"configuration/#validating-the-syntax-of-the-configuration-file","text":"When you run Workbench, it confirms that your configuration file is valid YAML. This is a syntax check only, not a content check. If the file is valid YAML, Workbench then goes on to perform a long list of application-specific checks . If this syntax check fails, some detail about the problem will be displayed to the user. The same information plus the entire Python stack trace is also logged to a file named \"workbench.log\" in the directory Islandora Workbench is run from. This file name is Workbench's default log file name, but in this case (validating the config file's YAML syntax), that file name is used regardless of the log file location defined in the configuration's log_file_path option. The reason the error is logged in the default location instead of the value in the configuration file (if one is present) is that the configuration file isn't valid YAML and therefore can't be parsed.","title":"Validating the syntax of the configuration file"},{"location":"configuration/#example-configuration-files","text":"These examples provide inline annotations explaining why the settings are included in the configuration file. Blank rows/lines are included for readability.","title":"Example configuration files"},{"location":"configuration/#create-nodes-only-no-media","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to create nodes with no media. # Also, this tells --check to skip all validation of \"file\" locations. # Other media settings, like \"media_use_tid\", are also ignored. nodes_only: true","title":"Create nodes only, no media"},{"location":"configuration/#use-a-custom-log-file-location","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to write its log file to the location specified # instead of the default \"workbench.log\" within the directory Workbench is run from. 
log_file_path: /home/mark/workbench_log.txt","title":"Use a custom log file location"},{"location":"configuration/#include-some-csv-field-templates","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora # The values in this list of field templates are applied to every row in the # input CSV file before the CSV file is used to populate Drupal fields. The # field templates are also applied during the \"--check\" in order to validate # the values of the fields. csv_field_templates: - field_member_of: 205 - field_model: 25","title":"Include some CSV field templates"},{"location":"configuration/#use-a-google-sheet-as-input-csv","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit' # You only need to specify the google_sheets_gid option if the worksheet in the Google Sheet # is not the default one. google_sheets_gid: 1867618389","title":"Use a Google Sheet as input CSV"},{"location":"configuration/#create-nodes-and-media-from-files-no-input-csv-file","text":"task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # The files to create the nodes from are in this directory. input_dir: /tmp/sample_files # This tells Workbench to write a CSV file containing node IDs of the # created nodes, plus the field names used in the target content type # (\"islandora_object\" by default). output_csv: /tmp/sample_files.csv # All nodes should get the \"Model\" value corresponding to this URI. model: 'https://schema.org/DigitalDocument'","title":"Create nodes and media from files (no input CSV file)"},{"location":"configuration/#create-taxonomy-terms","text":"task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary","title":"Create taxonomy terms"},{"location":"configuration/#ignore-some-columns-in-your-input-csv-file","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: input.csv # This tells Workbench to ignore the 'date_generated' and 'batch_id' # columns in the input.csv file. ignore_csv_columns: ['date_generated', 'batch_id']","title":"Ignore some columns in your input CSV file"},{"location":"configuration/#generating-sample-islandora-content","text":"task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # This directory must match the one defined in the script's 'dest_dir' variable. input_dir: /tmp/autogen_input media_use_tid: 17 output_csv: /tmp/my_sample_content_csv.csv model: http://purl.org/coar/resource_type/c_c513 # This is the script that generates the sample content. 
bootstrap: - \"/home/mark/Documents/hacking/workbench/generate_image_files.py\"","title":"Generating sample Islandora content"},{"location":"configuration/#running-a-post-action-script","text":"ask: create host: \"http://localhost:8000\" username: admin password: islandora node_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # node_post_update: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # media_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py']","title":"Running a post-action script"},{"location":"configuration/#create-nodes-and-media-from-files","text":"task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora input_dir: path/to/files models: - 'http://purl.org/coar/resource_type/c_1843': ['zip', 'tar', ''] - 'https://schema.org/DigitalDocument': ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 'http://purl.org/coar/resource_type/c_c513': ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 'http://purl.org/coar/resource_type/c_18cc': ['mp3', 'wav', 'aac'] - 'http://purl.org/coar/resource_type/c_12ce': ['mp4']","title":"Create nodes and media from files"},{"location":"configuration_settings_on_the_command_line/","text":"You can override the following settings in configuration files by including them as command-line arguments to Workbench: input_dir input_csv google_sheets_gid log_file_path rollback_csv_file_path rollback_config_file_path In all cases, you need to add -- to conform with Python's command-line argument syntax. For example, if you want to specify an input CSV file different from the one registered in your configuration file, include --input_csv as a command-line argument to Worbkench: python workbench --config example.yml --input_csv alternate.csv To specify a log file path, include --log_file_path as an argument: python workbench --config example.yml --check --log_file_path custom_log_during_check.log","title":"Specifying configuration settings on the command line"},{"location":"contact_sheet/","text":"During --check in create tasks, you can generate a contact sheet from your input CSV by adding the command-line option --contactsheet , like this: ./workbench --config somecreatetask.yml --check --contactsheet If you use this option, Workbench will tell you the location of your contact sheet, e.g. Contact sheet is at /home/mark/hacking/islandora_workbench/contact_sheet_output/contact_sheet.htm , which you can then open in your web browser. Contact sheets looks like this: The icons (courtesy of icons8 ) indicate if the item in the CSV is an: image video audio PDF miscellaneous binary file compound item There is also an icon that indicates that there is no file in the input CSV for the item. Fields with long values are indicated by link formatted like \"[...]\". Placing your pointer over this link will show a popup that contains the full value. Also, in fields that have multiple values, the subdelimiter defined in your configuration file is replaced by a small square (\u25a1), to improve readability. Compound items in the input CSV sheet are represented by a \"compound\" icon and a link to a separate contact sheet containing the item's members: The members contact sheet looks like this: You can define where your contact sheet is written to using the contact_sheet_output_dir configuration setting. If the directory doesn't exist, Workbench will create it. 
All the HTML, CSS, and image files that are part of the contact sheet(s) are written to this directory, so if you want to share it with someone (for quality assurance purposes, for example), you can zip up the directory and its contents and give them a copy. All the files required to render the contact sheet are in this directory: You can also define the path to an alternative CSS stylesheet to use for your contact sheets. The default CSS file is assets/contact_sheet/contact-sheet.css within the Workbench directory. To specify your own CSS file, include its relative (to the Workbench directory) or absolute path in the contact_sheet_css_path config setting. If you use an alternative CSS file, you should base yours on a copy of the default one, since the layout of the contact sheets depends heavily on the default CSS.","title":"Generating a contact sheet"},{"location":"creating_nodes_from_files/","text":"If you want to ingest some files without a metadata CSV, you can do so using the create_from_files task. A common application of this ability is in automated workflows where Islandora objects are created from files saved to a watch folder , and metadata is added later. Nodes created using this task have only the following properties/fields populated: Content type: this is defined in the configuration file, using the content_type setting. Title: this is derived from the filename minus the extension. Spaces are allowed in the filenames. Published: published by default, or overridden in the configuration file using the published setting. Model: defined in the configuration file using either the model or models setting. The media attached to the nodes is the file, with its type (image, document, audio, video, file) assigned by the media_types_override configuration setting and its Media Use tag defined in the media_use_tid setting. The configuration options for the create_from_files task are the same as the options used in the create task (with one exception: input_csv is not required). The only option specific to this task is models , which is a mapping from term IDs (or term URIs) in the \"Islandora Models\" vocabulary to file extensions. Note that either the models or model configuration option is required in the create_from_files task. Use models when your nodes will have different Islandora Model values. 
Here is a sample configuration file for this task: task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora output_csv: /tmp/output.csv models: - 23: ['zip', 'tar', ''] - 27: ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 25: ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 22: ['mp3', 'wav', 'aac'] - 26: ['mp4'] Using model is convenient when all of the objects you are creating are the same Islandora Model: task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora output_csv: /tmp/output.csv model: 25 You can also use the URIs assigned to terms in the Islandora Models vocabulary, for example: task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora output_csv: /tmp/output.csv models: - 'http://purl.org/coar/resource_type/c_1843': ['zip', 'tar', ''] - 'https://schema.org/DigitalDocument': ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 'http://purl.org/coar/resource_type/c_c513': ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 'http://purl.org/coar/resource_type/c_18cc': ['mp3', 'wav', 'aac'] - 'http://purl.org/coar/resource_type/c_12ce': ['mp4'] Note In the workflow described at the beginning of this section, you might want to include the output_csv option in the configuration file, since the resulting CSV file can be populated with metadata later and used in an update task to add it to the stub nodes.","title":"Creating nodes from files"},{"location":"creating_taxonomy_terms/","text":"Islandora Workbench lets you create vocabulary terms from CSV files. This ability is separate from creating vocabulary terms while creating the nodes in a create task, as described in the \" Field data (Drupal and CSV) \" documentation. You should create vocabulary terms using the options described here if any of these situations applies to you: you are working with a vocabulary that has fields in addition to term name you are working with a vocabulary that is hierarchical you want terms to exist before you create nodes using a create task. If you want to create terms during a create task, and if the terms you are creating don't have any additional fields or hierarchical relationships to other terms, then you don't need to use the task described here. You can create terms as described in the \"Taxonomy reference fields\" section of \" Field data (Drupal and CSV) \". The configuration and input CSV files To add terms to a vocabulary, you use a create_terms task. A typical configuration file looks like this: task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary The vocab_id config option is required. It contains the machine name of the vocabulary you are adding the terms to. The CSV file identified in the input_csv option has one required column, term_name , which contains each term's name: term_name Automobiles Sports cars SUVs Jaguar Porche Land Rover Note Unlike input CSV files used during create tasks, input CSV files for create_terms tasks do not have an \"id\" column. Instead, term_name is the column whose values are the unique identifier for each term. Workbench assumes that term names are unique within a vocabulary. If the terms in the term_name column aren't unique, Workbench only creates the term the first time it encounters it in the CSV file. Two reserved but optional columns, weight and description , are described next. 
A third reserved column header, parent is described in the \"Hierarchical vocabularies\" section. You can also add columns that correspond to a vocabulary's field names, just like you do when you assemble your CSV for create tasks, as described in the \"Vocabularies with custom fields\" section below. Term weight, description, and published columns Three other reserved CSV column headers are weight , description , and published . All Drupal taxonomy terms have these two fields but populating them is optional. weight is used to sort the terms in the vocabulary overview page in relation to their parent term (or the vocabulary root if a term has no parent). Values in the weight field are integers. The lower the weight, the earlier the term sorts. For example, a value of \"0\" (zero) sorts the term at the top in relation to its parent, and a value of \"100\" sorts the term much lower. description is, as the name suggests, a field that contains a description of the term. published takes a \"0\" or a \"1\" (like it does in create tasks for nodes) to indicate that the term is published or not. If absent, defaults to \"1\" (published). Note If you are going to use published in your create_terms tasks, you need to remove the \"Published\" filter from the \"Term from term name\" View. If you do not add weight values, Drupal sorts the terms in the vocabulary alphabetically. Vocabularies with custom fields Example column headers in a CSV file for use in create_terms tasks that has two additional fields, \"field_example\" and \"field_second_example\", in addition to the optional \"description\" column, would look like this: term_name,field_example,field_second_example,description Here is a sample CSV input file with headers for description and field_external_uri fields, and two records for terms named \"Program file\" and \"Data set\": term_name,description,field_external_uri Program file,A program file is executable source code or a binary executable file.,http://id.loc.gov/vocabulary/mfiletype/program Data set,\"A data set is raw, often tabular, data.\",https://www.wikidata.org/wiki/Q1172284 Optional fields don't need to be included in our CSV if you are not populating them, but fields that are configured as required in the vocabulary settings do need to be present, and populated (just like required fields on content types in create tasks). Running --check on a create_terms task will detect any required fields that are missing from your input CSV file. Hierarchical vocabularies If you want to create a vocabulary that is hierarchical, like this: you can add a parent column to your CSV and for each row, include the term name of the term you want as the parent. For example, the above sample vocabulary was created using this CSV input file: term_name,parent Automobiles, Sports cars,Automobiles SUVs,Automobiles Jaguar,Sports cars Porche,Sports cars Land Rover,SUVs One important aspect of creating a hierarchical vocabulary is that all parents must exist before their children are added. That means that within your CSV file, the rows for terms used as parents should be placed earlier in the file than the rows for their children. If a term is named as a parent but doesn't exist yet because it came after the child term in the CSV, Workbench will create the child term and write a warning in the log indicating that the parent didn't exist at the time of creating the child. In these cases, you can manually assign a parent to the terms using Drupal's taxonomy administration tools. 
You can include the parent column in your CSV along with Drupal field names. Workbench will not only create the hierarchy, it will also add the field data to the terms: term_name,parent,description,field_external_uri Automobiles,,, Sports cars,Automobiles,\"Sports cars focus on performance, handling, and driver experience.\",https://en.wikipedia.org/wiki/Sports_car SUVs,Automobiles,\"SUVs, or Sports Utility Vehicles, are the most popular type of automobile.\",https://en.wikipedia.org/wiki/Sport_utility_vehicle Jaguar,Sports cars,, Porche,Sports cars,, Land Rover,SUVs,,","title":"Creating taxonomy terms"},{"location":"creating_taxonomy_terms/#the-configuration-and-input-csv-files","text":"To add terms to a vocabulary, you use a create_terms task. A typical configuration file looks like this: task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary The vocab_id config option is required. It contains the machine name of the vocabulary you are adding the terms to. The CSV file identified in the input_csv option has one required column, term_name , which contains each term's name: term_name Automobiles Sports cars SUVs Jaguar Porche Land Rover Note Unlike input CSV files used during create tasks, input CSV files for create_terms tasks do not have an \"id\" column. Instead, term_name is the column whose values are the unique identifier for each term. Workbench assumes that term names are unique within a vocabulary. If the terms in the term_name column aren't unique, Workbench only creates the term the first time it encounters it in the CSV file. Two reserved but optional columns, weight , and description , are described next. A third reserved column header, parent is described in the \"Hierarchical vocabularies\" section. You can also add columns that correspond to a vocabulary's field names, just like you do when you assemble your CSV for create tasks, as described in the \"Vocabularies with custom fields\" section below.","title":"The configuration and input CSV files"},{"location":"creating_taxonomy_terms/#term-weight-description-and-published-columns","text":"Three other reserved CSV column headers are weight , description , and published . All Drupal taxonomy terms have these two fields but populating them is optional. weight is used to sort the terms in the vocabulary overview page in relation to their parent term (or the vocabulary root if a term has no parent). Values in the weight field are integers. The lower the weight, the earlier the term sorts. For example, a value of \"0\" (zero) sorts the term at the top in relation to its parent, and a value of \"100\" sorts the term much lower. description is, as the name suggests, a field that contains a description of the term. published takes a \"0\" or a \"1\" (like it does in create tasks for nodes) to indicate that the term is published or not. If absent, defaults to \"1\" (published). Note If you are going to use published in your create_terms tasks, you need to remove the \"Published\" filter from the \"Term from term name\" View. 
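As an illustrative sketch (the values here are hypothetical, not taken from the examples above), an input CSV using these optional columns might look like this: term_name,weight,description,published Automobiles,0,Parent term for all automobile types,1 Sports cars,1,\"Cars designed for performance, handling, and driver experience\",1 SUVs,2,,0 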
If you do not add weight values, Drupal sorts the terms in the vocabulary alphabetically.","title":"Term weight, description, and published columns"},{"location":"creating_taxonomy_terms/#vocabularies-with-custom-fields","text":"Example column headers in a CSV file for use in create_terms tasks that has two additional fields, \"field_example\" and \"field_second_example\", in addition to the optional \"description\" column, would look like this: term_name,field_example,field_second_example,description Here is a sample CSV input file with headers for description and field_external_uri fields, and two records for terms named \"Program file\" and \"Data set\": term_name,description,field_external_uri Program file,A program file is executable source code or a binary executable file.,http://id.loc.gov/vocabulary/mfiletype/program Data set,\"A data set is raw, often tabular, data.\",https://www.wikidata.org/wiki/Q1172284 Optional fields don't need to be included in our CSV if you are not populating them, but fields that are configured as required in the vocabulary settings do need to be present, and populated (just like required fields on content types in create tasks). Running --check on a create_terms task will detect any required fields that are missing from your input CSV file.","title":"Vocabularies with custom fields"},{"location":"creating_taxonomy_terms/#hierarchical-vocabularies","text":"If you want to create a vocabulary that is hierarchical, like this: you can add a parent column to your CSV and for each row, include the term name of the term you want as the parent. For example, the above sample vocabulary was created using this CSV input file: term_name,parent Automobiles, Sports cars,Automobiles SUVs,Automobiles Jaguar,Sports cars Porche,Sports cars Land Rover,SUVs One important aspect of creating a hierarchical vocabulary is that all parents must exist before their children are added. That means that within your CSV file, the rows for terms used as parents should be placed earlier in the file than the rows for their children. If a term is named as a parent but doesn't exist yet because it came after the child term in the CSV, Workbench will create the child term and write a warning in the log indicating that the parent didn't exist at the time of creating the child. In these cases, you can manually assign a parent to the terms using Drupal's taxonomy administration tools. You can include the parent column in your CSV along with Drupal field names. Workbench will not only create the hierarchy, it will also add the field data to the terms: term_name,parent,description,field_external_uri Automobiles,,, Sports cars,Automobiles,\"Sports cars focus on performance, handling, and driver experience.\",https://en.wikipedia.org/wiki/Sports_car SUVs,Automobiles,\"SUVs, or Sports Utility Vehicles, are the most popular type of automobile.\",https://en.wikipedia.org/wiki/Sport_utility_vehicle Jaguar,Sports cars,, Porche,Sports cars,, Land Rover,SUVs,,","title":"Hierarchical vocabularies"},{"location":"csv_value_templates/","text":"Note This section describes using CSV value templates in your configuration file. For information on CSV field templates, see the \" CSV field templates \". For information on CSV file templates, see the \" CSV file templates \" section. 
Applying CSV value templates to rows in your input CSV In create and update tasks, you can configure templates that are applied to all the values in a CSV column (or to each subvalue if you have multiple values in a single field) as if the templated text were present in the values in your CSV file. The templates are configured in the csv_value_templates option. An example looks like this: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value For each value in the named CSV field (in this example, field_linked_agent ), Workbench will apply the template text (in this example, relators:aut:person: ) to the CSV value, represented in the template as $csv_value . An input CSV file that looks like this: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"Jordan, Mark|Cantelo, M.\" will have the template relators:aut:person:$csv_value applied to it, so it is converted to: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"relators:aut:person:Jordan, Mark|relators:aut:person:Cantelo, M.\" This example configuration defines only one field/template pair, but csv_value_templates can define multiple field/template pairs, for example: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value - field_subject: subject:$csv_value - field_local_identifier: SFU-$uuid_string The following template variables are available: $csv_value : The verbatim string value of the field. $file : The verbatim value of the file column in the row. $filename_without_extension : The filename portion only (with no leading directory path or file extension) in the file column in the row. $weight : The value of the field_weight column in the CSV, or for paged content created using the \" Using subdirectories \" option, the sequence number embedded in the page's filename. $random_alphanumeric_string : A randomly generated string containing numbers and mixed-case letters. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $random_number_string : A randomly generated string containing only numbers. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $uuid_string : A valid version 4 UUID. A few things to note about CSV value templates: The variable can be anywhere in your template (beginning, middle, or end). You can only define a single template for a given field in csv_value_templates , but you can include multiple variables in a single template. If multiple variables are present in a template, they are applied in the order listed above. If a CSV field contains multiple subvalues, the same template is applied to all subvalues in the field (as illustrated above). Values in the templated CSV output are validated against Drupal's configuration in the same way that values present in your CSV are validated. By default, CSV value templates won't be applied to empty fields. However, if you want a template to be applied to a field if that field is empty, you can include the allow_csv_value_templates_if_field_empty setting in your config file defining a list of column names. 
For example, allow_csv_value_templates_if_field_empty: [field_identifier] will apply any templates defined for field_identifier in your csv_value_templates setting, even if field_identifier is empty in your input CSV; for example, the following will apply the template defined in the above example configuration even if the named fields are empty: allow_csv_value_templates_if_field_empty: ['field_local_identifier', 'field_subject'] Applying CSV value templates to paged content Paged content (or as sometimes referred to, children) created using the \" Using subdirectories \" method do not have their own rows in input CSV files. Any fields that are configured to be \"required\" in the parent and child's content type are copied from the parent's CSV row and applied to all that parent's pages/children. If you want to add non-required field data to pages/children, However, you can use CSV value templates to do that. In this case: the CSV row that is used as the source of $csv_value is the page's (or child's) parent row; in other words, the value of $csv_value is inherited from a page/child's parent row the $file variable is the name of the page/child's filename (and $filename_without_extension is derived from this value) the $weight variable is taken from the page/child's sequence indicator, e.g. a filename of page-002.jpg would result in a $weight value of \"2\". If you want to apply CSV field templates to page/child items using this technique, register the templates in the csv_value_templates_for_paged_content config setting. The structure of the field-to-template mappings is identical to those used in the csv_value_templates setting as illustrated above. For example: csv_value_templates_for_paged_content: - field_linked_agent: relators:aut:person:$csv_value - field_edtf_date_issued: $csv_value - field_local_identifier: $csv_value-$weight Even though this section documents how to apply templates with variables, you can also apply \"templates\" to pages/child items that are complete values, that do not use variables. For example, if you want to add the term \"Newspapers\" to the field_genre field in each page in a newspaper issue, you can register that string as your \"template\", e.g. csv_value_templates_for_paged_content: - field_genre: Newspapers","title":"CSV value templates"},{"location":"csv_value_templates/#applying-csv-value-templates-to-rows-in-your-input-csv","text":"In create and update tasks, you can configure templates that are applied to all the values in a CSV column (or to each subvalue if you have multiple values in a single field) as if the templated text were present in the values in your CSV file. The templates are configured in the csv_value_templates option. An example looks like this: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value For each value in the named CSV field (in this example, field_linked_agent ), Workbench will apply the template text (in this example, relators:aut:person: ) to the CSV value, represented in the template as $csv_value . 
An input CSV file that looks like this: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"Jordan, Mark|Cantelo, M.\" will have the template relators:aut:person:$csv_value applied to so it is converted to: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"relators:aut:person:Jordan, Mark|relators:aut:person:Cantelo, M.\" This example configuration defines only one field/template pair, but csv_value_templates can define multiple field/template pairs, for example: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value - field_subject: subject:$csv_value - field_local_identifier: SFU-$uuid_string The following template variables are available: $csv_value : The verbatim string value of the field. $file : The verbatim value of the file column in the row. $filename_without_extension : The filename portion only (with no leading directory path or file extension) in the file column in the row. $weight : The value of the field_weight column in the CSV, or for paged content created using the \" Using subdirectories \" option, the sequence number embedded in the page's filenme. $random_alphanumeric_string : A randomly generated string containing numbers and mixed-case letters. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $random_number_string : A randomly generated string containing only numbers. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $uuid_string : A valid version 4 UUID. A few things to note about CSV value templates: The variable can be anywhere in your template (beginning, middle, or end). You can only define a single template for a given field in csv_value_templates , but you can include multiple variables in a single template. If multiple variables are present in a template, they are applied in the order listed above. If a CSV field contains multiple subvalues, the same template is applied to all subvalues in the field (as illustrated above). Values in the templated CSV output are validated against Drupal's configuration in the same way that values present in your CSV are validated. By default, CSV value templates won't be applied to empty fields. However, if you want a template to be applied to a field if that field is empty, you can include the allow_csv_value_templates_if_field_empty setting in your config file defining a list of column names. For example, allow_csv_value_templates_if_field_empty: [field_identifier] will apply any templates defined for field_identifier in your csv_value_templates setting, even if field_identifier is empty in your input CSV; for example, the following will apply the template defined in the above example configuration even if the named fields are empty: allow_csv_value_templates_if_field_empty: ['field_local_identifier', 'field_subject']","title":"Applying CSV value templates to rows in your input CSV"},{"location":"csv_value_templates/#applying-csv-value-templates-to-paged-content","text":"Paged content (or as sometimes referred to, children) created using the \" Using subdirectories \" method do not have their own rows in input CSV files. 
Any fields that are configured to be \"required\" in the parent and child's content type are copied from the parent's CSV row and applied to all that parent's pages/children. If you want to add non-required field data to pages/children, However, you can use CSV value templates to do that. In this case: the CSV row that is used as the source of $csv_value is the page's (or child's) parent row; in other words, the value of $csv_value is inherited from a page/child's parent row the $file variable is the name of the page/child's filename (and $filename_without_extension is derived from this value) the $weight variable is taken from the page/child's sequence indicator, e.g. a filename of page-002.jpg would result in a $weight value of \"2\". If you want to apply CSV field templates to page/child items using this technique, register the templates in the csv_value_templates_for_paged_content config setting. The structure of the field-to-template mappings is identical to those used in the csv_value_templates setting as illustrated above. For example: csv_value_templates_for_paged_content: - field_linked_agent: relators:aut:person:$csv_value - field_edtf_date_issued: $csv_value - field_local_identifier: $csv_value-$weight Even though this section documents how to apply templates with variables, you can also apply \"templates\" to pages/child items that are complete values, that do not use variables. For example, if you want to add the term \"Newspapers\" to the field_genre field in each page in a newspaper issue, you can register that string as your \"template\", e.g. csv_value_templates_for_paged_content: - field_genre: Newspapers","title":"Applying CSV value templates to paged content"},{"location":"deleting_media/","text":"Deleting media using media IDs Note Drupal does not allow a user to delete or modify media files unless the user originally created (or is the owner) of the file. This means that if you created a media using \"user1\" in your Workbench configuration file, only \"user1\" can delete or modify those files. For delete_media tasks, the value of username will need to be the same as the username used to create the media. If the username defined in a delete_media task is not the same as the Drupal user who owns the files, Drupal will return a 403 response, which you will see in your Workbench logs. You can delete media and their associate files by providing a CSV file with a media_id column that contains the Drupal IDs of media you want to delete. For example, your CSV file could look like this: media_id 100 103 104 The config file looks like this (note the task option is 'delete_media'): task: delete_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: delete_media.csv Deleting media using node IDs If you want to delete media from specific nodes without having to know the media IDs as described above, you can use the delete_media_by_node task. This task takes a list of node IDs as input, like this: node_id 345 367 246 The configuration file for this task looks like this: task: delete_media_by_node host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data input_csv: delete_node_media.csv This configuration will delete all media attached to nodes 345, 367, and 246. By default, all media attached to the specified nodes are deleted. 
A \"delete_media_by_node\" configuration file can include a delete_media_by_node_media_use_tids option that lets you specify a list of Islandora Media Use term IDs that a media must have to be deleted: delete_media_by_node_media_use_tids: [17, 1] Before using this option, consult your Islandora's Islandora Media Use vocabulary page at /admin/structure/taxonomy/manage/islandora_media_use/overview to get the term IDs you need to use.","title":"Deleting media"},{"location":"deleting_media/#deleting-media-using-media-ids","text":"Note Drupal does not allow a user to delete or modify media files unless the user originally created (or is the owner) of the file. This means that if you created a media using \"user1\" in your Workbench configuration file, only \"user1\" can delete or modify those files. For delete_media tasks, the value of username will need to be the same as the username used to create the media. If the username defined in a delete_media task is not the same as the Drupal user who owns the files, Drupal will return a 403 response, which you will see in your Workbench logs. You can delete media and their associate files by providing a CSV file with a media_id column that contains the Drupal IDs of media you want to delete. For example, your CSV file could look like this: media_id 100 103 104 The config file looks like this (note the task option is 'delete_media'): task: delete_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: delete_media.csv","title":"Deleting media using media IDs"},{"location":"deleting_media/#deleting-media-using-node-ids","text":"If you want to delete media from specific nodes without having to know the media IDs as described above, you can use the delete_media_by_node task. This task takes a list of node IDs as input, like this: node_id 345 367 246 The configuration file for this task looks like this: task: delete_media_by_node host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data input_csv: delete_node_media.csv This configuration will delete all media attached to nodes 345, 367, and 246. By default, all media attached to the specified nodes are deleted. A \"delete_media_by_node\" configuration file can include a delete_media_by_node_media_use_tids option that lets you specify a list of Islandora Media Use term IDs that a media must have to be deleted: delete_media_by_node_media_use_tids: [17, 1] Before using this option, consult your Islandora's Islandora Media Use vocabulary page at /admin/structure/taxonomy/manage/islandora_media_use/overview to get the term IDs you need to use.","title":"Deleting media using node IDs"},{"location":"deleting_nodes/","text":"You can delete nodes by providing a CSV file that contains a single column, node_id , like this: node_id 95 96 200 Values in the node_id column can be numeric node IDs (as illustrated above), full URLs, or full URL aliases. The config file for update operations looks like this (note the task option is 'delete'): task: delete host: \"http://localhost:8000\" username: admin password: islandora input_csv: delete.csv Note that when you delete nodes using this method, all media associated with the nodes are also deleted, unless the delete_media_with_nodes configuration option is set to false (it defaults to true ). Typical output produced by a delete task looks like this: Node http://localhost:8000/node/89 deleted. + Media http://localhost:8000/media/329 deleted. + Media http://localhost:8000/media/331 deleted. 
+ Media http://localhost:8000/media/335 deleted. Note that taxonomy terms created with new nodes are not removed when you delete the nodes. Note Drupal does not allow a user to delete or modify media files unless the user originally created the file (or is its owner). This means that if you created a media using \"user1\" in your Workbench configuration file, only \"user1\" can delete or modify those files. For delete tasks, the value of username will need to be the same as the username used to create the original media attached to nodes. If the username defined in a delete task is not the same as the Drupal user who owns the files, Drupal will return a 403 response, which you will see in your Workbench logs.","title":"Deleting nodes"},{"location":"development_guide/","text":"This documentation is aimed at developers who want to contribute to Islandora Workbench. General Bug reports, improvements, feature requests, and PRs are welcome. Before you open a pull request, please open an issue so we can discuss your idea . In cases where the PR introduces a potentially disruptive change to Workbench, we usually want to start a discussion about its impact on users in the #islandoraworkbench Slack channel. When you open a PR, you will be asked to complete the Workbench PR template . All code must be formatted using Black . You can automatically style your code using Black in your IDE of choice . Where applicable, unit and integration tests to accompany your code are greatly appreciated. Tests in Workbench fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . workbench_utils.py provides a lot of utility functions. Before writing a new utility function, please ensure that something similar to what you want to write doesn't already exist. Running tests While developing code for Islandora Workbench, you should run tests frequently to ensure that your code hasn't introduced any regression errors. Workbench is a fairly large and complex application, and has many configuration settings. Even minor changes in the code can break things. To run unit tests that do not require a live Islandora instance: Unit tests in tests/unit_tests.py (run with python tests/unit_tests.py ) Unit tests for Workbench's Drupal fields handlers in tests/field_tests.py (run with python tests/field_tests.py ) Note that these tests are run automatically as GitHub Actions when you push to the Islandora Workbench repo or when a merge request is merged. To run integration tests that require a live Islandora instance running at https://islandora.traefik.me/ : tests/islandora_tests.py , tests/islandora_tests_check.py , tests/islandora_tests_hooks.py , and tests/islandora_tests_paged_content.py can be run with python tests/islandora_tests.py , etc. The Islandora Starter Site deployed with ISLE is the recommended way to deploy the Islandora instance used in these tests. Integration tests remove all nodes and media added during the tests, unless a test fails. Taxonomy terms created by tests are not removed. Some integration and field tests output text that begins with \"Error:\". This is normal; it's the text that Workbench outputs when it finds something wrong (which is probably what the test is testing). Successful test runs (whether they test for success or failure) will exit with \"OK\". If you can figure out how to suppress this output, please visit this issue .
If you want to run the tests within a specific class, include the class name as an argument like this: python tests/unit_tests.py TestCompareStings You can also specify multiple test classes within a single test file: python tests/islandora_tests.py TestMyNewTest TestMyOtherNewTest Writing tests Islandora Workbench's tests are written using the Python built-in module unittest , and as explained above, fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . unittest groups tests into classes. A single test file can contain one or more test classes. Within each test class, you can put one or more test methods. As shown in the second example below, two reserved methods, setUp() and tearDown() , are reserved for setup and cleanup tasks, respectively, within each class. If you are new to using unittest , this is a good tutorial. A simple unit test Islandora Workbench unit tests are much like unit tests in any Python application. The sample test below, from tests/unit_tests.py , tests the validate_latlong_value() method from the workbench_utils.py module. Since workbench_utils.validate_latlong_value() doesn't interact with Islandora, https://islandora.traefik.me/ doesn't need to be available to run this unit test. class TestValidateLatlongValue(unittest.TestCase): def test_validate_good_latlong_values(self): values = ['+90.0, -127.554334', '90.0, -127.554334', '-90,-180', '+50.25,-117.8', '+48.43333,-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertTrue(res) def test_validate_bad_latlong_values(self): values = ['+90.1 -100.111', '045, 180', '+5025,-117.8', '-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertFalse(res) This is a fairly standard Python unit test - we define a list of valid lat/long pairs and run them through the workbench_utils.validate_latlong_value() method expecting it to return True for each value, and then we define a list of bad lat/long pairs and run them through the method expecting it to return False for each value. A simple integration test The two sample integration tests provided below are copied from islandora_tests.py . The first sample integration test, TestCreate , looks like this (with line numbers added for easy reference). Configuration and CSV files used by this test are in tests/assets/create_test : 1. class TestCreate(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. 8. def test_create(self): 9. self.nids = list() 10. create_output = subprocess.check_output(self.create_cmd) 11. create_output = create_output.decode().strip() 12. create_lines = create_output.splitlines() 13. for line in create_lines: 14. if 'created at' in line: 15. nid = line.rsplit('/', 1)[-1] 16. nid = nid.strip('.') 17. self.nids.append(nid) 18. 19. self.assertEqual(len(self.nids), 2) 20. 21. def tearDown(self): 22. for nid in self.nids: 23. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 24. quick_delete_output = subprocess.check_output(quick_delete_cmd) 25. 26. 
self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'rollback.csv') 27. if os.path.exists(self.rollback_file_path): 28. os.remove(self.rollback_file_path) 29. 30. self.preprocessed_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'metadata.csv.preprocessed') 31. if os.path.exists(self.preprocessed_file_path): 32. os.remove(self.preprocessed_file_path) As you can see, this test runs Workbench using the config file create.yml (line 10), which lives at tests/assets/create_test/create.yml , relative to the workbench directory. A tricky aspect of using real config files in tests is that all paths mentioned in the config file must be relative to the workbench directory. This create.yml defines the input_dir setting to be tests/assets/create_test : task: create host: https://islandora.traefik.me username: admin password: password input_dir: \"tests/assets/create_test\" media_type: image allow_missing_files: True The test's setUp() method prepares the file paths, etc. and within the test's only test method, test_create() , runs Workbench using Python's subprocess.check_output() method, grabs the node IDs from the output from the \"created at\" strings emitted by Workbench (lines 14-17), adds them to a list, and then counts the number of members in that list. If the number of nodes created matches the expected number, the test passes. Since this test creates some nodes, we use the test class's tearDown() method to put the target Drupal back into as close a state as we started with as possible. tearDown() basically takes the list of node IDs created in test_create() and runs Workbench with the --quick_delete_node option. It then removes any temporary files created during the test. A more complex integration test Since Workbench is essentially a specialized REST client, writing integration tests that require interaction with Drupal can get a bit complex. But, the overall pattern is: Create some entities (nodes, media, taxonomy terms). Confirm that they were created in the expected way (doing this usually involves keeping track of any node IDs needed to run tests or to clean up, and in some cases involves parsing out values from raw JSON returned by Drupal). Clean up by deleting any Drupal entities created during the tests and also any temporary local files. An integration test that checks data in the node JSON is TestUpdateWithMaxNodeTitleLength() . Files that accompany this test are in tests/assets/max_node_title_length_test . Here is a copy of the test's code: 1. class TestUpdateWithMaxNodeTitleLength(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. self.nids = list() 8. 9. self.update_csv_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update_max_node_title_length.csv') 10. self.update_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update.yml') 11. self.update_cmd = [\"./workbench\", \"--config\", self.update_config_file_path] 12. 13. self.temp_dir = tempfile.gettempdir() 14. 15. def test_create(self): 16. create_output = subprocess.check_output(self.create_cmd) 17. self.create_output = create_output.decode().strip() 18. 19. create_lines = self.create_output.splitlines() 20. for line in create_lines: 21. 
if 'created at' in line: 22. nid = line.rsplit('/', 1)[-1] 23. nid = nid.strip('.') 24. self.nids.append(nid) 25. 26. self.assertEqual(len(self.nids), 6) 27. 28. # Write out an update CSV file using the node IDs in self.nids. 29. update_csv_file_rows = list() 30. test_titles = ['This title is 37 chars___________long', 31. 'This title is 39 chars_____________long', 32. 'This title is 29 _ chars long', 33. 'This title is 42 chars________________long', 34. 'This title is 44 chars__________________long', 35. 'This title is 28 chars long.'] 36. update_csv_file_rows.append('node_id,title') 37. i = 0 38. while i <= 5: 39. update_csv_file_rows.append(f'{self.nids[i]},{test_titles[i]}') 40. i = i + 1 41. with open(self.update_csv_file_path, mode='wt') as update_csv_file: 42. update_csv_file.write('\\n'.join(update_csv_file_rows)) 43. 44. # Run the update command. 45. check_output = subprocess.check_output(self.update_cmd) 46. 47. # Fetch each node in self.nids and check to see if its title is <= 30 chars long. All should be. 48. for nid_to_update in self.nids: 49. node_url = 'https://islandora.traefik.me/node/' + str(self.nids[0]) + '?_format=json' 50. node_response = requests.get(node_url) 51. node = json.loads(node_response.text) 52. updated_title = str(node['title'][0]['value']) 53. self.assertLessEqual(len(updated_title), 30, '') 54. 55. def tearDown(self): 56. for nid in self.nids: 57. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 58. quick_delete_output = subprocess.check_output(quick_delete_cmd) 59. 60. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'rollback.csv') 61. if os.path.exists(self.rollback_file_path): 62. os.remove(self.rollback_file_path) 63. 64. self.preprocessed_file_path = os.path.join(self.temp_dir, 'create_max_node_title_length.csv.preprocessed') 65. if os.path.exists(self.preprocessed_file_path): 66. os.remove(self.preprocessed_file_path) 67. 68. # Update test: 1) delete the update CSV file, 2) delete the update .preprocessed file. 69. if os.path.exists(self.update_csv_file_path): 70. os.remove(self.update_csv_file_path) 71. 72. self.preprocessed_update_file_path = os.path.join(self.temp_dir, 'update_max_node_title_length.csv.preprocessed') 73. if os.path.exists(self.preprocessed_update_file_path): 74. os.remove(self.preprocessed_update_file_path) This test: (line 16) creates some nodes that will be updated within the same test class (i.e. in line 45) (lines 28-42) writes out a temporary CSV file which will be used as the input_csv file in a subsequent update task containing the new node IDs plus some titles that are longer than max_node_title_length: 30 setting in the assets/max_node_title_length_test/update.yml file (line 45) runs self.update_cmd to execute the update task (lines 47-53) fetches the title values for each of the updated nodes and tests the length of each title string to confirm that it does not exceed the maximum allowed length of 30 characters. tearDown() removes all nodes created by the test and removes all temporary local files. Adding a new Drupal field type Overview of how Workbench handles field types Workbench and Drupal exchange field data represented as JSON, via Drupal's REST API. The specific JSON structure depends on the field type (text, entity reference, geolocation, etc.). 
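For orientation, here is roughly what that exchanged field data can look like for a node containing a simple text field and an entity reference field. This is an illustrative sketch only (the field names and the target ID are invented); the easiest way to see the exact structure your site produces is to view a real node with ?_format=json appended to its URL, as described below:

```python
import json

# Illustrative only: a node payload with a simple text field and an entity
# reference field. field_description, field_member_of, and the target ID 42
# are invented for this example.
entity = {
    "type": [{"target_id": "islandora_object"}],
    "title": [{"value": "An example node"}],
    "field_description": [{"value": "A simple text field value"}],
    "field_member_of": [{"target_id": 42, "target_type": "node"}],
}

# This is the JSON that would travel over Drupal's REST API.
print(json.dumps(entity, indent=2))
```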
Handling the details of these structures when adding new field data during create tasks, updating existing field data during update tasks, etc. is performed by code in the workbench_fields.py module. Each distinct field structure has its own class in that file, and each of the classes has the methods create() , update() , dedupe_values() , remove_invalid_values() , and serialize() . The create() and update() methods convert CSV field data in Workbench input files to Python dictionaries, which are subsequently converted into JSON for pushing up to Drupal. The serialize() method reverses this conversion, taking the JSON field data fetched from Drupal and converting it into a dictionary, and from that, into CSV data. dedupe_values() and remove_invalid_values() are utility methods that do what their names suggest. Currently, Workbench supports the following field types: \"simple\" fields for strings (for string or text fields) integers binary values 1 and 0 existing Drupal-generated entity IDs entity reference fields entity reference revision fields typed relation fields link fields geolocation fields authority link fields media track fields Eventually, classes for new Drupal field types will need to be added to Workbench as the community adopts more field types provided by Drupal contrib modules or creates new field types specific to Islandora. Note If new field types are added to workbench_utils.py, corresponding logic must be added to functions in other Workbench modules (e.g. workbench_utils, workbench) that create, update, or export Drupal entities. Those places are commented in the code with either: \"Assemble Drupal field structures from CSV data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\" or \"Assemble CSV output Drupal field data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\" Field class methods The most complex aspect of handling field data is cardinality, or in other words, whether a given field's configuration setting \"Allowed number of values\" allows for a single value, a maximum number of values (for example, a maximum of 3 values), or an unlimited number of values. Each field type's cardinality configuration complicates creating and updating field instances because Workbench must deal with situations where input CSV data for a field contains more values than are allowed, or when the user wants to append a value to an existing field instance rather than replace existing values, potentially exceeding the field's configured cardinality. Drupal's REST interface is very strict about cardinality, and if Workbench tries to push up a field's JSON that violates the field's cardinality, the HTTP request fails and returns a 422 Unprocessable Content response for the node containing the malformed field data. To prevent this from happening, code within field type classes needs to contain logic to account for the three different types of cardinality (cardinality of 1, cardinality of more than 1 but with a maximum, and unlimited) and for the specific JSON structure created from the field data in the input CSV. When Workbench detects that the number of field instances in the CSV data surpasses the field's maximum configured cardinality, it will truncate the incoming data and log that it did so via the log_field_cardinality_violation() utility function. 
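Stripped of the surrounding detail, the truncate-and-log pattern described above looks roughly like the following sketch. This is not the actual Workbench code; the real logic, including the call to log_field_cardinality_violation(), lives in the field classes in workbench_fields.py, and enforce_cardinality() is a hypothetical name used only for illustration:

```python
# Simplified sketch of the truncate-and-log behaviour described above.
def enforce_cardinality(field_name, record_id, values, cardinality, logger):
    if cardinality > 0 and len(values) > cardinality:
        # Log the violation, then keep only as many values as the field allows.
        logger.warning(
            "CSV record %s has more values in %s than the field's cardinality (%s) allows; truncating.",
            record_id, field_name, cardinality,
        )
        values = values[:cardinality]
    return values
```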
However, truncating data is a fallback behavior; --check will report intances of this cardinality violation, providing you with an opportunity to correct the CSV data before committing it to Drupal. To illustrate this complexity, let's look at the update() method within the SimpleField class, which handles field types that have the Python structure [{\"value\": value}] or, for \"formatted\" text, [{\"value\": value, \"format\": text_format}] . Note Note that the example structure in the preceding paragraph shows a single value for that field. It's a list, but a list containing a single dictionary. If there were two values in a field, the structure would be a list containing two dictionaries, like [{\"value\": value}, {\"value\": value}] . If the field contained three values, the structure would be [{\"value\": value}, {\"value\": value}, {\"value\": value}] Lines 47-167 in the sample update() method apply when the field is configured to have a limited cardinality, either 1 or a specific number higher than 1. Within that range of lines, 49-113 apply if the update_mode configuration setting is \"append\", and lines 115-167 apply if the update_mode setting is \"replace\". Lines 169-255 apply when the field's cardinality is unlimited. Within that range of lines, 171-214 apply if the update_mode is \"append\", and lines 215-255 apply if it is \"replace\". An update_mode setting of \"delete\" simply removes all values from the field, in lines 28-30. 1. def update( 2. self, config, field_definitions, entity, row, field_name, entity_field_values 3. ): 4. \"\"\"Note: this method appends incoming CSV values to existing values, replaces existing field 5. values with incoming values, or deletes all values from fields, depending on whether 6. config['update_mode'] is 'append', 'replace', or 'delete'. It does not replace individual 7. values within fields. 8. \"\"\" 9. \"\"\"Parameters 10. ---------- 11. config : dict 12. The configuration settings defined by workbench_config.get_config(). 13. field_definitions : dict 14. The field definitions object defined by get_field_definitions(). 15. entity : dict 16. The dict that will be POSTed to Drupal as JSON. 17. row : OrderedDict. 18. The current CSV record. 19. field_name : string 20. The Drupal fieldname/CSV column header. 21. entity_field_values : list 22. List of dictionaries containing existing value(s) for field_name in the entity being updated. 23. Returns 24. ------- 25. dictionary 26. A dictionary representing the entity that is PATCHed to Drupal as JSON. 27. \"\"\" 28. if config[\"update_mode\"] == \"delete\": 29. entity[field_name] = [] 30. return entity 31. 32. if row[field_name] is None: 33. return entity 34. 35. if field_name in config[\"field_text_format_ids\"]: 36. text_format = config[\"field_text_format_ids\"][field_name] 37. else: 38. text_format = config[\"text_format_id\"] 39. 40. if config[\"task\"] == \"update_terms\": 41. entity_id_field = \"term_id\" 42. if config[\"task\"] == \"update\": 43. entity_id_field = \"node_id\" 44. if config[\"task\"] == \"update_media\": 45. entity_id_field = \"media_id\" 46. 47. # Cardinality has a limit. 48. if field_definitions[field_name][\"cardinality\"] > 0: 49. if config[\"update_mode\"] == \"append\": 50. if config[\"subdelimiter\"] in row[field_name]: 51. subvalues = row[field_name].split(config[\"subdelimiter\"]) 52. subvalues = self.remove_invalid_values( 53. config, field_definitions, field_name, subvalues 54. ) 55. for subvalue in subvalues: 56. subvalue = truncate_csv_value( 57. 
field_name, 58. row[entity_id_field], 59. field_definitions[field_name], 60. subvalue, 61. ) 62. if ( 63. \"formatted_text\" in field_definitions[field_name] 64. and field_definitions[field_name][\"formatted_text\"] is True 65. ): 66. entity[field_name].append( 67. {\"value\": subvalue, \"format\": text_format} 68. ) 69. else: 70. entity[field_name].append({\"value\": subvalue}) 71. entity[field_name] = self.dedupe_values(entity[field_name]) 72. if len(entity[field_name]) > int( 73. field_definitions[field_name][\"cardinality\"] 74. ): 75. log_field_cardinality_violation( 76. field_name, 77. row[entity_id_field], 78. field_definitions[field_name][\"cardinality\"], 79. ) 80. entity[field_name] = entity[field_name][ 81. : field_definitions[field_name][\"cardinality\"] 82. ] 83. else: 84. row[field_name] = self.remove_invalid_values( 85. config, field_definitions, field_name, row[field_name] 86. ) 87. row[field_name] = truncate_csv_value( 88. field_name, 89. row[entity_id_field], 90. field_definitions[field_name], 91. row[field_name], 92. ) 93. if ( 94. \"formatted_text\" in field_definitions[field_name] 95. and field_definitions[field_name][\"formatted_text\"] is True 96. ): 97. entity[field_name].append( 98. {\"value\": row[field_name], \"format\": text_format} 99. ) 100. else: 101. entity[field_name].append({\"value\": row[field_name]}) 102. entity[field_name] = self.dedupe_values(entity[field_name]) 103. if len(entity[field_name]) > int( 104. field_definitions[field_name][\"cardinality\"] 105. ): 106. log_field_cardinality_violation( 107. field_name, 108. row[entity_id_field], 109. field_definitions[field_name][\"cardinality\"], 110. ) 111. entity[field_name] = entity[field_name][ 112. : field_definitions[field_name][\"cardinality\"] 113. ] 114. 115. if config[\"update_mode\"] == \"replace\": 116. if config[\"subdelimiter\"] in row[field_name]: 117. field_values = [] 118. subvalues = row[field_name].split(config[\"subdelimiter\"]) 119. subvalues = self.remove_invalid_values( 120. config, field_definitions, field_name, subvalues 121. ) 122. subvalues = self.dedupe_values(subvalues) 123. if len(subvalues) > int( 124. field_definitions[field_name][\"cardinality\"] 125. ): 126. log_field_cardinality_violation( 127. field_name, 128. row[entity_id_field], 129. field_definitions[field_name][\"cardinality\"], 130. ) 131. subvalues = subvalues[ 132. : field_definitions[field_name][\"cardinality\"] 133. ] 134. for subvalue in subvalues: 135. subvalue = truncate_csv_value( 136. field_name, 137. row[entity_id_field], 138. field_definitions[field_name], 139. subvalue, 140. ) 141. if ( 142. \"formatted_text\" in field_definitions[field_name] 143. and field_definitions[field_name][\"formatted_text\"] is True 144. ): 145. field_values.append( 146. {\"value\": subvalue, \"format\": text_format} 147. ) 148. else: 149. field_values.append({\"value\": subvalue}) 150. field_values = self.dedupe_values(field_values) 151. entity[field_name] = field_values 152. else: 153. row[field_name] = truncate_csv_value( 154. field_name, 155. row[entity_id_field], 156. field_definitions[field_name], 157. row[field_name], 158. ) 159. if ( 160. \"formatted_text\" in field_definitions[field_name] 161. and field_definitions[field_name][\"formatted_text\"] is True 162. ): 163. entity[field_name] = [ 164. {\"value\": row[field_name], \"format\": text_format} 165. ] 166. else: 167. entity[field_name] = [{\"value\": row[field_name]}] 168. 169. # Cardinality is unlimited. 170. else: 171. 
if config[\"update_mode\"] == \"append\": 172. if config[\"subdelimiter\"] in row[field_name]: 173. field_values = [] 174. subvalues = row[field_name].split(config[\"subdelimiter\"]) 175. subvalues = self.remove_invalid_values( 176. config, field_definitions, field_name, subvalues 177. ) 178. for subvalue in subvalues: 179. subvalue = truncate_csv_value( 180. field_name, 181. row[entity_id_field], 182. field_definitions[field_name], 183. subvalue, 184. ) 185. if ( 186. \"formatted_text\" in field_definitions[field_name] 187. and field_definitions[field_name][\"formatted_text\"] is True 188. ): 189. field_values.append( 190. {\"value\": subvalue, \"format\": text_format} 191. ) 192. else: 193. field_values.append({\"value\": subvalue}) 194. entity[field_name] = entity_field_values + field_values 195. entity[field_name] = self.dedupe_values(entity[field_name]) 196. else: 197. row[field_name] = truncate_csv_value( 198. field_name, 199. row[entity_id_field], 200. field_definitions[field_name], 201. row[field_name], 202. ) 203. if ( 204. \"formatted_text\" in field_definitions[field_name] 205. and field_definitions[field_name][\"formatted_text\"] is True 206. ): 207. entity[field_name] = entity_field_values + [ 208. {\"value\": row[field_name], \"format\": text_format} 209. ] 210. else: 211. entity[field_name] = entity_field_values + [ 212. {\"value\": row[field_name]} 213. ] 214. entity[field_name] = self.dedupe_values(entity[field_name]) 215. if config[\"update_mode\"] == \"replace\": 216. if config[\"subdelimiter\"] in row[field_name]: 217. field_values = [] 218. subvalues = row[field_name].split(config[\"subdelimiter\"]) 219. subvalues = self.remove_invalid_values( 220. config, field_definitions, field_name, subvalues 221. ) 222. for subvalue in subvalues: 223. subvalue = truncate_csv_value( 224. field_name, 225. row[entity_id_field], 226. field_definitions[field_name], 227. subvalue, 228. ) 229. if ( 230. \"formatted_text\" in field_definitions[field_name] 231. and field_definitions[field_name][\"formatted_text\"] is True 232. ): 233. field_values.append( 234. {\"value\": subvalue, \"format\": text_format} 235. ) 236. else: 237. field_values.append({\"value\": subvalue}) 238. entity[field_name] = field_values 239. entity[field_name] = self.dedupe_values(entity[field_name]) 240. else: 241. row[field_name] = truncate_csv_value( 242. field_name, 243. row[entity_id_field], 244. field_definitions[field_name], 245. row[field_name], 246. ) 247. if ( 248. \"formatted_text\" in field_definitions[field_name] 249. and field_definitions[field_name][\"formatted_text\"] is True 250. ): 251. entity[field_name] = [ 252. {\"value\": row[field_name], \"format\": text_format} 253. ] 254. else: 255. entity[field_name] = [{\"value\": row[field_name]}] 256. 257. return entity Each field type has its own structure. Within the field classes, the field structure is represented in Python dictionaries and converted to JSON when pushed up to Drupal as part of REST requests. These Python dictionaries are converted to JSON automatically as part of the HTTP request to Drupal (you do not do this within the field classes) so we'll focus only on the Python dictionary structure here. 
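To make that concrete, here is a minimal sketch of the hand-off: the field classes assemble a plain Python dictionary, and the JSON encoding happens when the HTTP request is made. The host, credentials, endpoint, and field values below are hypothetical examples, not values Workbench requires:

```python
import requests

# Hypothetical payload assembled by the field classes from a CSV row.
entity = {
    "type": [{"target_id": "islandora_object"}],
    "title": [{"value": "An example node"}],
    "field_description": [{"value": "Assembled from CSV data"}],
}

# Passing the dictionary via json= lets requests serialize it to JSON,
# so the field classes never call json.dumps() themselves.
response = requests.post(
    "https://islandora.traefik.me/node?_format=json",
    json=entity,
    auth=("admin", "password"),
)
print(response.status_code)
```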
Some examples are: SimpleField fields have the Python structure {\"value\": value} or, for \"formatted\" text, {\"value\": value, \"format\": text_format} GeolocationField fields have the Python structure {\"lat\": lat_value, \"lon\": lon_value} LinkField fields have the python structure {\"uri\": uri, \"title\": title} EntityReferenceField fields have the Python structure {\"target_id\": id, \"target_type\": target_type} TypedRelationField fields have the Python structure {\"target_id\": id, \"rel_type\": rel_type:rel_value, \"target_type\": target_type} the value of the rel_type key is the relator type (e.g. MARC relators) and the relator value (e.g. 'art') joined with a colon, e.g. relators:art AuthorityLinkField fields have the Python structure {\"source\": source, \"uri\": uri, \"title\": title} MediaTrackField fields have the Python structure {\"label\": label, \"kind\": kind, \"srclang\": lang, \"file_path\": path} To add a support for a new field type, you will need to figure out the field type's JSON structure and convert that structure into the Python dictionary equivalent in the new field class methods. The best way to inspect a field type's JSON structure is to view the JSON representation of a node that contains instances of the field by tacking ?_format=json to the end of the node's URL. Once you have an example of the field type's JSON, you will need to write the necessary class methods to convert between Workbench CSV data and the field's JSON structure as applicable in all of the field class methods, making sure you account for the field's configured cardinality, account for the update mode within the update() method, etc. Writing field classes is one aspect of Workbench development that demonstrates the value of unit tests. Without writing unit tests to accompany the development of these field classes, you will lose your mind. tests/field_tests.py contains over 80 tests in more than 5,000 lines of test code for a good reason. Islandora Workbench Integration Drupal module Islandora Workbench Integration is a Drupal module that allows Islandora Workbench to communicate with Drupal efficiently and reliably. It enables some Views and REST endpoints that Workbench expects, and also provides a few custom REST endpoints (see the module's README for details). Generally speaking, the only situation where the Integration module will need to be updated (apart from requirements imposed by new versions of Drupal) is if we add a new feature to Workbench that requires a specific View or a specific REST endpoint to be enabled and configured in the target Drupal. If a change is required in the Integration module, it is very important to communicate this to Workbench users, since if the Integration module is not updated to align with the change in Workbench, the new feature won't work. A defensive coding strategy to ensure that changes in the client-side Workbench code that depend on changes in the server-side Integration module will work is, within the Workbench code, invoke the check_integration_module_version() function to check the Integration module's version number and use conditional logic to execute the new Workbench code only if the Integration module's version number meets or exceeds a version number string defined in that section of the Workbench code (e.g., the new Workbench feature requires version 1.1.3 of the Integration module). 
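A sketch of what that guard might look like follows. check_integration_module_version() and convert_semver_to_number() are existing Workbench utility functions, but the exact call signatures shown here are assumptions, and use_new_feature() and the "1.1.3" minimum version are illustrative placeholders:

```python
import logging

# Illustrative guard around a feature that needs a newer Integration module.
# Consult workbench_utils.py for the real function signatures.
integration_version = check_integration_module_version(config)

if convert_semver_to_number(integration_version) >= convert_semver_to_number("1.1.3"):
    use_new_feature(config)  # hypothetical placeholder for the new Workbench code
else:
    logging.warning(
        "Integration module version %s is older than the required 1.1.3; skipping feature.",
        integration_version,
    )
```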
Under the hood, this function queries /islandora_workbench_integration/version on the target Drupal to get the Integration module's version number, although as a developer all you need to do is invoke the check_integration_module_version() function and inspect its return value. In this example, your code would compare the Integration module's version number with 1.1.3 (possibly using the convert_semver_to_number() utility function) and include logic so that it only executes if the minimum Integration module version is met. Note Some Views used by Islandora Workbench are defined by users and not by the Islandora Workbench Integration module. Specifically, Views described in the \" Generating CSV files \" documentation are created by users.","title":"Development guide"},{"location":"development_guide/#general","text":"Bug reports, improvements, feature requests, and PRs are welcome. Before you open a pull request, please open an issue so we can discuss your idea . In cases where the PR introduces a potentially disruptive change to Workbench, we usually want to start a discussion about its impact on users in the #islandoraworkbench Slack channel. When you open a PR, you will be asked to complete the Workbench PR template . All code must be formatted using Black . You can automatically style your code using Black in your IDE of choice . Where applicable, unit and integration tests to accompany your code are very appreciated. Tests in Workbench fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . workbench_utils.py provides a lot of utility functions. Before writing a new utility function, please ensure that something similar to what you want to write doesn't already exist.","title":"General"},{"location":"development_guide/#running-tests","text":"While developing code for Islandora Workbench, you should run tests frequently to ensure that your code hasn't introduced any regression errors. Workbench is a fairly large and complex application, and has many configuration settings. Even minor changes in the code can break things. To run unit tests that do not require a live Islandora instance: Unit tests in tests/unit_tests.py (run with python tests/unit_tests.py ) Unit tests for Workbench's Drupal fields handlers in tests/field_tests.py (run with python tests/field_tests.py ) Note that these tests are run automatically as GitHub actions when you push to the Islandora Workbench repo or when a merge request is merged. To run integration tests that require a live Islandora instance running at https://islandora.traefik.me/ tests/islandora_tests.py , tests/islandora_tests_check.py , tests/islandora_tests_hooks.py , and tests/islandora_tests_paged_content.py can be run with python tests/islandora_tests.py , etc. The Islandora Starter Site deployed with ISLE is recommended way to deploy the Islandora used in these tests. Integration tests remove all nodes and media added during the tests, unless a test fails. Taxonomy terms created by tests are not removed. Some integration and field tests output text that beings with \"Error:.\" This is normal, it's the text that Workbench outputs when it finds something wrong (which is probably what the test is testing). Successful test (whether they test for success or failure) runs will exit with \"OK\". If you can figure out how to suppress this output, please visit this issue . 
If you want to run the tests within a specific class, include the class name as an argument like this: python tests/unit_tests.py TestCompareStings You can also specify multiple test classes within a single test file: python tests/islandora_tests.py TestMyNewTest TestMyOtherNewTest","title":"Running tests"},{"location":"development_guide/#writing-tests","text":"Islandora Workbench's tests are written using the Python built-in module unittest , and as explained above, fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . unittest groups tests into classes. A single test file can contain one or more test classes. Within each test class, you can put one or more test methods. As shown in the second example below, two reserved methods, setUp() and tearDown() , are reserved for setup and cleanup tasks, respectively, within each class. If you are new to using unittest , this is a good tutorial.","title":"Writing tests"},{"location":"development_guide/#a-simple-unit-test","text":"Islandora Workbench unit tests are much like unit tests in any Python application. The sample test below, from tests/unit_tests.py , tests the validate_latlong_value() method from the workbench_utils.py module. Since workbench_utils.validate_latlong_value() doesn't interact with Islandora, https://islandora.traefik.me/ doesn't need to be available to run this unit test. class TestValidateLatlongValue(unittest.TestCase): def test_validate_good_latlong_values(self): values = ['+90.0, -127.554334', '90.0, -127.554334', '-90,-180', '+50.25,-117.8', '+48.43333,-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertTrue(res) def test_validate_bad_latlong_values(self): values = ['+90.1 -100.111', '045, 180', '+5025,-117.8', '-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertFalse(res) This is a fairly standard Python unit test - we define a list of valid lat/long pairs and run them through the workbench_utils.validate_latlong_value() method expecting it to return True for each value, and then we define a list of bad lat/long pairs and run them through the method expecting it to return False for each value.","title":"A simple unit test"},{"location":"development_guide/#a-simple-integration-test","text":"The two sample integration tests provided below are copied from islandora_tests.py . The first sample integration test, TestCreate , looks like this (with line numbers added for easy reference). Configuration and CSV files used by this test are in tests/assets/create_test : 1. class TestCreate(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. 8. def test_create(self): 9. self.nids = list() 10. create_output = subprocess.check_output(self.create_cmd) 11. create_output = create_output.decode().strip() 12. create_lines = create_output.splitlines() 13. for line in create_lines: 14. if 'created at' in line: 15. nid = line.rsplit('/', 1)[-1] 16. nid = nid.strip('.') 17. self.nids.append(nid) 18. 19. self.assertEqual(len(self.nids), 2) 20. 21. def tearDown(self): 22. for nid in self.nids: 23. 
quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 24. quick_delete_output = subprocess.check_output(quick_delete_cmd) 25. 26. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'rollback.csv') 27. if os.path.exists(self.rollback_file_path): 28. os.remove(self.rollback_file_path) 29. 30. self.preprocessed_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'metadata.csv.preprocessed') 31. if os.path.exists(self.preprocessed_file_path): 32. os.remove(self.preprocessed_file_path) As you can see, this test runs Workbench using the config file create.yml (line 10), which lives at tests/assets/create_test/create.yml , relative to the workbench directory. A tricky aspect of using real config files in tests is that all paths mentioned in the config file must be relative to the workbench directory. This create.yml defines the input_dir setting to be tests/assets/create_test : task: create host: https://islandora.traefik.me username: admin password: password input_dir: \"tests/assets/create_test\" media_type: image allow_missing_files: True The test's setUp() method prepares the file paths, etc. and within the test's only test method, test_create() , runs Workbench using Python's subprocess.check_output() method, grabs the node IDs from the output from the \"created at\" strings emitted by Workbench (lines 14-17), adds them to a list, and then counts the number of members in that list. If the number of nodes created matches the expected number, the test passes. Since this test creates some nodes, we use the test class's tearDown() method to put the target Drupal back into as close a state as we started with as possible. tearDown() basically takes the list of node IDs created in test_create() and runs Workbench with the --quick_delete_node option. It then removes any temporary files created during the test.","title":"A simple integration test"},{"location":"development_guide/#a-more-complex-integration-test","text":"Since Workbench is essentially a specialized REST client, writing integration tests that require interaction with Drupal can get a bit complex. But, the overall pattern is: Create some entities (nodes, media, taxonomy terms). Confirm that they were created in the expected way (doing this usually involves keeping track of any node IDs needed to run tests or to clean up, and in some cases involves parsing out values from raw JSON returned by Drupal). Clean up by deleting any Drupal entities created during the tests and also any temporary local files. An integration test that checks data in the node JSON is TestUpdateWithMaxNodeTitleLength() . Files that accompany this test are in tests/assets/max_node_title_length_test . Here is a copy of the test's code: 1. class TestUpdateWithMaxNodeTitleLength(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. self.nids = list() 8. 9. self.update_csv_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update_max_node_title_length.csv') 10. self.update_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update.yml') 11. 
self.update_cmd = [\"./workbench\", \"--config\", self.update_config_file_path] 12. 13. self.temp_dir = tempfile.gettempdir() 14. 15. def test_create(self): 16. create_output = subprocess.check_output(self.create_cmd) 17. self.create_output = create_output.decode().strip() 18. 19. create_lines = self.create_output.splitlines() 20. for line in create_lines: 21. if 'created at' in line: 22. nid = line.rsplit('/', 1)[-1] 23. nid = nid.strip('.') 24. self.nids.append(nid) 25. 26. self.assertEqual(len(self.nids), 6) 27. 28. # Write out an update CSV file using the node IDs in self.nids. 29. update_csv_file_rows = list() 30. test_titles = ['This title is 37 chars___________long', 31. 'This title is 39 chars_____________long', 32. 'This title is 29 _ chars long', 33. 'This title is 42 chars________________long', 34. 'This title is 44 chars__________________long', 35. 'This title is 28 chars long.'] 36. update_csv_file_rows.append('node_id,title') 37. i = 0 38. while i <= 5: 39. update_csv_file_rows.append(f'{self.nids[i]},{test_titles[i]}') 40. i = i + 1 41. with open(self.update_csv_file_path, mode='wt') as update_csv_file: 42. update_csv_file.write('\\n'.join(update_csv_file_rows)) 43. 44. # Run the update command. 45. check_output = subprocess.check_output(self.update_cmd) 46. 47. # Fetch each node in self.nids and check to see if its title is <= 30 chars long. All should be. 48. for nid_to_update in self.nids: 49. node_url = 'https://islandora.traefik.me/node/' + str(self.nids[0]) + '?_format=json' 50. node_response = requests.get(node_url) 51. node = json.loads(node_response.text) 52. updated_title = str(node['title'][0]['value']) 53. self.assertLessEqual(len(updated_title), 30, '') 54. 55. def tearDown(self): 56. for nid in self.nids: 57. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 58. quick_delete_output = subprocess.check_output(quick_delete_cmd) 59. 60. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'rollback.csv') 61. if os.path.exists(self.rollback_file_path): 62. os.remove(self.rollback_file_path) 63. 64. self.preprocessed_file_path = os.path.join(self.temp_dir, 'create_max_node_title_length.csv.preprocessed') 65. if os.path.exists(self.preprocessed_file_path): 66. os.remove(self.preprocessed_file_path) 67. 68. # Update test: 1) delete the update CSV file, 2) delete the update .preprocessed file. 69. if os.path.exists(self.update_csv_file_path): 70. os.remove(self.update_csv_file_path) 71. 72. self.preprocessed_update_file_path = os.path.join(self.temp_dir, 'update_max_node_title_length.csv.preprocessed') 73. if os.path.exists(self.preprocessed_update_file_path): 74. os.remove(self.preprocessed_update_file_path) This test: (line 16) creates some nodes that will be updated within the same test class (i.e. in line 45) (lines 28-42) writes out a temporary CSV file which will be used as the input_csv file in a subsequent update task containing the new node IDs plus some titles that are longer than max_node_title_length: 30 setting in the assets/max_node_title_length_test/update.yml file (line 45) runs self.update_cmd to execute the update task (lines 47-53) fetches the title values for each of the updated nodes and tests the length of each title string to confirm that it does not exceed the maximum allowed length of 30 characters. 
tearDown() removes all nodes created by the test and removes all temporary local files.","title":"A more complex integration test"},{"location":"development_guide/#adding-a-new-drupal-field-type","text":"","title":"Adding a new Drupal field type"},{"location":"development_guide/#overview-of-how-workbench-handles-field-types","text":"Workbench and Drupal exchange field data represented as JSON, via Drupal's REST API. The specific JSON structure depends on the field type (text, entity reference, geolocation, etc.). Handling the details of these structures when adding new field data during create tasks, updating existing field data during update tasks, etc. is performed by code in the workbench_fields.py module. Each distinct field structure has its own class in that file, and each of the classes has the methods create() , update() , dedupe_values() , remove_invalid_values() , and serialize() . The create() and update() methods convert CSV field data in Workbench input files to Python dictionaries, which are subsequently converted into JSON for pushing up to Drupal. The serialize() method reverses this conversion, taking the JSON field data fetched from Drupal and converting it into a dictionary, and from that, into CSV data. dedupe_values() and remove_invalid_values() are utility methods that do what their names suggest. Currently, Workbench supports the following field types: \"simple\" fields for strings (for string or text fields) integers binary values 1 and 0 existing Drupal-generated entity IDs entity reference fields entity reference revision fields typed relation fields link fields geolocation fields authority link fields media track fields Eventually, classes for new Drupal field types will need to be added to Workbench as the community adopts more field types provided by Drupal contrib modules or creates new field types specific to Islandora. Note If new field types are added to workbench_utils.py, corresponding logic must be added to functions in other Workbench modules (e.g. workbench_utils, workbench) that create, update, or export Drupal entities. Those places are commented in the code with either: \"Assemble Drupal field structures from CSV data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\" or \"Assemble CSV output Drupal field data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\"","title":"Overview of how Workbench handles field types"},{"location":"development_guide/#field-class-methods","text":"The most complex aspect of handling field data is cardinality, or in other words, whether a given field's configuration setting \"Allowed number of values\" allows for a single value, a maximum number of values (for example, a maximum of 3 values), or an unlimited number of values. Each field type's cardinality configuration complicates creating and updating field instances because Workbench must deal with situations where input CSV data for a field contains more values than are allowed, or when the user wants to append a value to an existing field instance rather than replace existing values, potentially exceeding the field's configured cardinality. Drupal's REST interface is very strict about cardinality, and if Workbench tries to push up a field's JSON that violates the field's cardinality, the HTTP request fails and returns a 422 Unprocessable Content response for the node containing the malformed field data. 
To prevent this from happening, code within field type classes needs to contain logic to account for the three different types of cardinality (cardinality of 1, cardinality of more than 1 but with a maximum, and unlimited) and for the specific JSON structure created from the field data in the input CSV. When Workbench detects that the number of field instances in the CSV data surpasses the field's maximum configured cardinality, it will truncate the incoming data and log that it did so via the log_field_cardinality_violation() utility function. However, truncating data is a fallback behavior; --check will report intances of this cardinality violation, providing you with an opportunity to correct the CSV data before committing it to Drupal. To illustrate this complexity, let's look at the update() method within the SimpleField class, which handles field types that have the Python structure [{\"value\": value}] or, for \"formatted\" text, [{\"value\": value, \"format\": text_format}] . Note Note that the example structure in the preceding paragraph shows a single value for that field. It's a list, but a list containing a single dictionary. If there were two values in a field, the structure would be a list containing two dictionaries, like [{\"value\": value}, {\"value\": value}] . If the field contained three values, the structure would be [{\"value\": value}, {\"value\": value}, {\"value\": value}] Lines 47-167 in the sample update() method apply when the field is configured to have a limited cardinality, either 1 or a specific number higher than 1. Within that range of lines, 49-113 apply if the update_mode configuration setting is \"append\", and lines 115-167 apply if the update_mode setting is \"replace\". Lines 169-255 apply when the field's cardinality is unlimited. Within that range of lines, 171-214 apply if the update_mode is \"append\", and lines 215-255 apply if it is \"replace\". An update_mode setting of \"delete\" simply removes all values from the field, in lines 28-30. 1. def update( 2. self, config, field_definitions, entity, row, field_name, entity_field_values 3. ): 4. \"\"\"Note: this method appends incoming CSV values to existing values, replaces existing field 5. values with incoming values, or deletes all values from fields, depending on whether 6. config['update_mode'] is 'append', 'replace', or 'delete'. It does not replace individual 7. values within fields. 8. \"\"\" 9. \"\"\"Parameters 10. ---------- 11. config : dict 12. The configuration settings defined by workbench_config.get_config(). 13. field_definitions : dict 14. The field definitions object defined by get_field_definitions(). 15. entity : dict 16. The dict that will be POSTed to Drupal as JSON. 17. row : OrderedDict. 18. The current CSV record. 19. field_name : string 20. The Drupal fieldname/CSV column header. 21. entity_field_values : list 22. List of dictionaries containing existing value(s) for field_name in the entity being updated. 23. Returns 24. ------- 25. dictionary 26. A dictionary representing the entity that is PATCHed to Drupal as JSON. 27. \"\"\" 28. if config[\"update_mode\"] == \"delete\": 29. entity[field_name] = [] 30. return entity 31. 32. if row[field_name] is None: 33. return entity 34. 35. if field_name in config[\"field_text_format_ids\"]: 36. text_format = config[\"field_text_format_ids\"][field_name] 37. else: 38. text_format = config[\"text_format_id\"] 39. 40. if config[\"task\"] == \"update_terms\": 41. entity_id_field = \"term_id\" 42. if config[\"task\"] == \"update\": 43. 
entity_id_field = \"node_id\" 44. if config[\"task\"] == \"update_media\": 45. entity_id_field = \"media_id\" 46. 47. # Cardinality has a limit. 48. if field_definitions[field_name][\"cardinality\"] > 0: 49. if config[\"update_mode\"] == \"append\": 50. if config[\"subdelimiter\"] in row[field_name]: 51. subvalues = row[field_name].split(config[\"subdelimiter\"]) 52. subvalues = self.remove_invalid_values( 53. config, field_definitions, field_name, subvalues 54. ) 55. for subvalue in subvalues: 56. subvalue = truncate_csv_value( 57. field_name, 58. row[entity_id_field], 59. field_definitions[field_name], 60. subvalue, 61. ) 62. if ( 63. \"formatted_text\" in field_definitions[field_name] 64. and field_definitions[field_name][\"formatted_text\"] is True 65. ): 66. entity[field_name].append( 67. {\"value\": subvalue, \"format\": text_format} 68. ) 69. else: 70. entity[field_name].append({\"value\": subvalue}) 71. entity[field_name] = self.dedupe_values(entity[field_name]) 72. if len(entity[field_name]) > int( 73. field_definitions[field_name][\"cardinality\"] 74. ): 75. log_field_cardinality_violation( 76. field_name, 77. row[entity_id_field], 78. field_definitions[field_name][\"cardinality\"], 79. ) 80. entity[field_name] = entity[field_name][ 81. : field_definitions[field_name][\"cardinality\"] 82. ] 83. else: 84. row[field_name] = self.remove_invalid_values( 85. config, field_definitions, field_name, row[field_name] 86. ) 87. row[field_name] = truncate_csv_value( 88. field_name, 89. row[entity_id_field], 90. field_definitions[field_name], 91. row[field_name], 92. ) 93. if ( 94. \"formatted_text\" in field_definitions[field_name] 95. and field_definitions[field_name][\"formatted_text\"] is True 96. ): 97. entity[field_name].append( 98. {\"value\": row[field_name], \"format\": text_format} 99. ) 100. else: 101. entity[field_name].append({\"value\": row[field_name]}) 102. entity[field_name] = self.dedupe_values(entity[field_name]) 103. if len(entity[field_name]) > int( 104. field_definitions[field_name][\"cardinality\"] 105. ): 106. log_field_cardinality_violation( 107. field_name, 108. row[entity_id_field], 109. field_definitions[field_name][\"cardinality\"], 110. ) 111. entity[field_name] = entity[field_name][ 112. : field_definitions[field_name][\"cardinality\"] 113. ] 114. 115. if config[\"update_mode\"] == \"replace\": 116. if config[\"subdelimiter\"] in row[field_name]: 117. field_values = [] 118. subvalues = row[field_name].split(config[\"subdelimiter\"]) 119. subvalues = self.remove_invalid_values( 120. config, field_definitions, field_name, subvalues 121. ) 122. subvalues = self.dedupe_values(subvalues) 123. if len(subvalues) > int( 124. field_definitions[field_name][\"cardinality\"] 125. ): 126. log_field_cardinality_violation( 127. field_name, 128. row[entity_id_field], 129. field_definitions[field_name][\"cardinality\"], 130. ) 131. subvalues = subvalues[ 132. : field_definitions[field_name][\"cardinality\"] 133. ] 134. for subvalue in subvalues: 135. subvalue = truncate_csv_value( 136. field_name, 137. row[entity_id_field], 138. field_definitions[field_name], 139. subvalue, 140. ) 141. if ( 142. \"formatted_text\" in field_definitions[field_name] 143. and field_definitions[field_name][\"formatted_text\"] is True 144. ): 145. field_values.append( 146. {\"value\": subvalue, \"format\": text_format} 147. ) 148. else: 149. field_values.append({\"value\": subvalue}) 150. field_values = self.dedupe_values(field_values) 151. entity[field_name] = field_values 152. else: 153. 
row[field_name] = truncate_csv_value( 154. field_name, 155. row[entity_id_field], 156. field_definitions[field_name], 157. row[field_name], 158. ) 159. if ( 160. \"formatted_text\" in field_definitions[field_name] 161. and field_definitions[field_name][\"formatted_text\"] is True 162. ): 163. entity[field_name] = [ 164. {\"value\": row[field_name], \"format\": text_format} 165. ] 166. else: 167. entity[field_name] = [{\"value\": row[field_name]}] 168. 169. # Cardinality is unlimited. 170. else: 171. if config[\"update_mode\"] == \"append\": 172. if config[\"subdelimiter\"] in row[field_name]: 173. field_values = [] 174. subvalues = row[field_name].split(config[\"subdelimiter\"]) 175. subvalues = self.remove_invalid_values( 176. config, field_definitions, field_name, subvalues 177. ) 178. for subvalue in subvalues: 179. subvalue = truncate_csv_value( 180. field_name, 181. row[entity_id_field], 182. field_definitions[field_name], 183. subvalue, 184. ) 185. if ( 186. \"formatted_text\" in field_definitions[field_name] 187. and field_definitions[field_name][\"formatted_text\"] is True 188. ): 189. field_values.append( 190. {\"value\": subvalue, \"format\": text_format} 191. ) 192. else: 193. field_values.append({\"value\": subvalue}) 194. entity[field_name] = entity_field_values + field_values 195. entity[field_name] = self.dedupe_values(entity[field_name]) 196. else: 197. row[field_name] = truncate_csv_value( 198. field_name, 199. row[entity_id_field], 200. field_definitions[field_name], 201. row[field_name], 202. ) 203. if ( 204. \"formatted_text\" in field_definitions[field_name] 205. and field_definitions[field_name][\"formatted_text\"] is True 206. ): 207. entity[field_name] = entity_field_values + [ 208. {\"value\": row[field_name], \"format\": text_format} 209. ] 210. else: 211. entity[field_name] = entity_field_values + [ 212. {\"value\": row[field_name]} 213. ] 214. entity[field_name] = self.dedupe_values(entity[field_name]) 215. if config[\"update_mode\"] == \"replace\": 216. if config[\"subdelimiter\"] in row[field_name]: 217. field_values = [] 218. subvalues = row[field_name].split(config[\"subdelimiter\"]) 219. subvalues = self.remove_invalid_values( 220. config, field_definitions, field_name, subvalues 221. ) 222. for subvalue in subvalues: 223. subvalue = truncate_csv_value( 224. field_name, 225. row[entity_id_field], 226. field_definitions[field_name], 227. subvalue, 228. ) 229. if ( 230. \"formatted_text\" in field_definitions[field_name] 231. and field_definitions[field_name][\"formatted_text\"] is True 232. ): 233. field_values.append( 234. {\"value\": subvalue, \"format\": text_format} 235. ) 236. else: 237. field_values.append({\"value\": subvalue}) 238. entity[field_name] = field_values 239. entity[field_name] = self.dedupe_values(entity[field_name]) 240. else: 241. row[field_name] = truncate_csv_value( 242. field_name, 243. row[entity_id_field], 244. field_definitions[field_name], 245. row[field_name], 246. ) 247. if ( 248. \"formatted_text\" in field_definitions[field_name] 249. and field_definitions[field_name][\"formatted_text\"] is True 250. ): 251. entity[field_name] = [ 252. {\"value\": row[field_name], \"format\": text_format} 253. ] 254. else: 255. entity[field_name] = [{\"value\": row[field_name]}] 256. 257. return entity Each field type has its own structure. Within the field classes, the field structure is represented in Python dictionaries and converted to JSON when pushed up to Drupal as part of REST requests. 
These Python dictionaries are converted to JSON automatically as part of the HTTP request to Drupal (you do not do this within the field classes) so we'll focus only on the Python dictionary structure here. Some examples are: SimpleField fields have the Python structure {\"value\": value} or, for \"formatted\" text, {\"value\": value, \"format\": text_format} GeolocationField fields have the Python structure {\"lat\": lat_value, \"lon\": lon_value} LinkField fields have the Python structure {\"uri\": uri, \"title\": title} EntityReferenceField fields have the Python structure {\"target_id\": id, \"target_type\": target_type} TypedRelationField fields have the Python structure {\"target_id\": id, \"rel_type\": rel_type:rel_value, \"target_type\": target_type} the value of the rel_type key is the relator type (e.g. MARC relators) and the relator value (e.g. 'art') joined with a colon, e.g. relators:art AuthorityLinkField fields have the Python structure {\"source\": source, \"uri\": uri, \"title\": title} MediaTrackField fields have the Python structure {\"label\": label, \"kind\": kind, \"srclang\": lang, \"file_path\": path} To add support for a new field type, you will need to figure out the field type's JSON structure and convert that structure into the Python dictionary equivalent in the new field class methods. The best way to inspect a field type's JSON structure is to view the JSON representation of a node that contains instances of the field by tacking ?_format=json onto the end of the node's URL. Once you have an example of the field type's JSON, you will need to write the necessary class methods to convert between Workbench CSV data and the field's JSON structure as applicable in all of the field class methods, making sure you account for the field's configured cardinality, account for the update mode within the update() method, etc. Writing field classes is one aspect of Workbench development that demonstrates the value of unit tests. Without writing unit tests to accompany the development of these field classes, you will lose your mind. tests/field_tests.py contains over 80 tests in more than 5,000 lines of test code for a good reason.","title":"Field class methods"},{"location":"development_guide/#islandora-workbench-integration-drupal-module","text":"Islandora Workbench Integration is a Drupal module that allows Islandora Workbench to communicate with Drupal efficiently and reliably. It enables some Views and REST endpoints that Workbench expects, and also provides a few custom REST endpoints (see the module's README for details). Generally speaking, the only situation where the Integration module will need to be updated (apart from requirements imposed by new versions of Drupal) is if we add a new feature to Workbench that requires a specific View or a specific REST endpoint to be enabled and configured in the target Drupal. If a change is required in the Integration module, it is very important to communicate this to Workbench users, since if the Integration module is not updated to align with the change in Workbench, the new feature won't work. 
To ensure that changes in the client-side Workbench code that depend on changes in the server-side Integration module will work, a useful defensive coding strategy is to invoke, within the Workbench code, the check_integration_module_version() function to check the Integration module's version number, and to use conditional logic so that the new Workbench code executes only if the Integration module's version number meets or exceeds a version number string defined in that section of the Workbench code (e.g., the new Workbench feature requires version 1.1.3 of the Integration module). Under the hood, this function queries /islandora_workbench_integration/version on the target Drupal to get the Integration module's version number, although as a developer all you need to do is invoke the check_integration_module_version() function and inspect its return value. In this example, your code would compare the Integration module's version number with 1.1.3 (possibly using the convert_semver_to_number() utility function) and include logic so that it only executes if the minimum Integration module version is met. Note Some Views used by Islandora Workbench are defined by users and not by the Islandora Workbench Integration module. Specifically, Views described in the \" Generating CSV files \" documentation are created by users.","title":"Islandora Workbench Integration Drupal module"},{"location":"drupal_and_workbench/","text":"This page highlights the most important Drupal and Islandora features relevant to the use of Workbench. Its audience is managers of Islandora repositories who want a primer on how Drupal, Islandora, and Workbench relate to each other. The Workbench-specific ideas introduced here are documented in detail elsewhere on this site. This page is not intended to be a replacement for the official Islandora documentation , which provides comprehensive and detailed information about how Islandora is structured, and about how to install, configure, and use it. Help improve this page! Your feedback on the usefulness of this page is very important! Join the #islandoraworkbench channel in the Islandora Slack , or leave a comment on this GitHub issue . Why would I want to use Islandora Workbench? Islandora Workbench lets you manage content in an Islandora repository at scale. Islandora provides web forms for creating and editing content on an item-by-item basis, but if you want to load a large number of items into an Islandora repository (or update or delete content in large numbers), you need a batch-oriented tool like Workbench. Simply put, Islandora Workbench enables you to get batches of content into an Islandora repository, and also update or delete content in large batches. How do I use Islandora Workbench? Islandora Workbench provides the ability to perform a set of \" tasks \". The focus of this page is the create task, but other tasks Workbench enables include update , delete , and add_media . To use Islandora Workbench to create new content, you need to assemble a CSV file containing metadata describing your content, and arrange the accompanying image, video, PDF, and other files in specific ways so that Workbench knows where to find them. 
Here is a very simple sample Workbench CSV file: file,id,title,field_model,field_description,date_generated,quality control by IMG_1410.tif,01,Small boats in Havana Harbour,Image,Taken on vacation in Cuba.,2021-02-12,MJ IMG_2549.jp2,02,Manhattan Island,Image,\"Taken from the ferry from downtown New York to Highlands, NJ.\",2021-02-12,MJ IMG_2940.JPG,03,Looking across Burrard Inlet,Image,View from Deep Cove to Burnaby Mountain.,2021-02-18,SP IMG_2958.JPG,04,Amsterdam waterfront,Image,Amsterdam waterfront on an overcast day.,2021-02-12,MJ IMG_5083.JPG,05,Alcatraz Island,Image,\"Taken from Fisherman's Wharf, San Francisco.\",2021-02-18,SP Then, you need to create a configuration file to tell Workbench the URL of your Islandora, which Drupal account credentials to use, and the location of your CSV file. You can customize many other aspects of Islandora Workbench by including various settings in your configuration file. Here is a very simple configuration file: task: create host: http://localhost:8000 username: admin password: islandora input_csv: input.csv content_type: islandora_object Relevance to using Workbench It is very important to run --check before you commit to having Workbench add content to your Drupal. Doing so lets Workbench find issues with your configuration and input CSV and files. See the \" Checking configuration and input data \" documentation for a complete list of the things --check looks for. When you have all these things ready, you tell Workbench to \"check\" your input data and configuration: ./workbench --config test.yml --check Workbench will provide a summary of what passed the check and what needs to be fixed. When your checks are complete, you use Workbench to push your content into your Islandora repository: ./workbench --config test.yml You can use a raw CSV file, a Google Sheet , or an Excel file as input, and your image, PDF, video, and other files can be stored locally , or at URLs on the web. Content types, fields, and nodes Below are the Drupal and Islandora concepts that will help you use Workbench effectively. Content types Relevance to using Workbench Generally speaking, Islandora Workbench can only work with a single content type at a time. You define this content type in the content_type configuration setting. Drupal categorizes what people see as \"pages\" on a Drupal website into content types. By default, Drupal provides \"Article\" and \"Basic Page\" content types, but site administrators can create custom content types. You can see the content types configured on your Drupal by logging in as an admin user and visiting /admin/structure/types . Or, you can navigate to the list of your site's content types by clicking on the Structure menu item, then the Content Types entry: Islandora, by default, creates a content type called a \"Repository Item\". But, many Islandora sites use additional content types, such as \"Collection\". To find the machine name of the content type you want to use with Workbench, visit the content type's configuration page. The machine name will be the last segment of the URL. In the following example, it's islandora_object : Fields Relevance to using Workbench The columns in your CSV file correspond to fields in your Islandora content type. The main structural difference between content types in Drupal is that each content type is configured to use a unique set of fields. A field in Drupal is the same as a \"field\" in metadata - it is a container for an individual piece of data. 
For example, all content types have a \"title\" field (although it might be labeled differently) to hold the page's title. Islandora's default content type, the Repository Item, uses metadata-oriented fields like \"Copyright date\", \"Extent\", \"Resource type\", and \"Subject\". Fields have two properties which you need to be familiar with in order to use Islandora Workbench: machine name type To help explain how these two properties work, we will use the following screenshot showing a few of the default fields in the \"Repository item\" content type: Relevance to using Workbench In most cases you can use fields' human-readable labels as column headers in your CSV file, but within Islandora Workbench configuration files, you must use field machine names. A field has a human-readable label, such as \"Copyright date\", but that label can change or can be translated, and, more significantly, doesn't need to be unique within a Drupal website. Drupal assigns each field a machine name that is more reliable for software to use than human-readable labels. These field machine names are all lower case, use underscores instead of spaces, and are guaranteed by Drupal to be unique within a content type. In the screenshot above, you can see the machine names in the middle column (you might need to zoom in!). For example, the machine name for the \"Copyright date\" field is field_copyright_date . A field's \"type\" determines the structure of the data it can hold. Some common field types used in Islandora are \"Text\" (and its derivatives \"Text (plain)\" and \"Text (plain, long)\"), \"Entity Reference\", \"Typed Relation\", \"EDTF\", and \"Link\". These field types are explained in the \" Field Data (CSV and Drupal) \" documentation, but the important point here is that they are all represented differently in your Workbench CSV. For example: EDTF fields take dates in the Library of Congress' Extended Date/Time Format (an example CSV entry is 1964/2008 ) Entity reference fields are used for taxonomy terms (an example entry is cats:Tabby , where \"cats\" is the name of the taxonomy and \"Tabby\" is the term) Typed relation fields are used for taxonomy entries that contain additional data indicating what \"type\" they are, such as using MARC relators to indicate the relationship of the taxonomy term to the item being described. An example typed relation CSV value is relators:aut:Jordan, Mark , where \"relators:aut\" indicates the term \"Jordan, Mark\" uses the MARC relator \"aut\", which stands for \"author\". Link fields take two pieces of information, a URL and the link text, like http://acme.com%%Acme Products Inc. Relevance to using Workbench Drupal fields can be configured to have multiple values. Another important aspect of Drupal fields is their cardinality, or in other words, how many individual values they are configured to have. This is similar to the \"repeatability\" of fields in metadata schemas. Some fields are configured to hold only a single value, others to hold a maximum number of values (three, for example), and others can hold an unlimited number of values. You can find each field's cardinality in its \"Field settings\" tab. Here is an example showing a field with unlimited cardinality: Drupal enforces cardinality very strictly. For this reason, if your CSV file contains more values for a field than the field's configuration allows, Workbench will truncate the number of values to match the maximum number allowed for the field. 
If it does this, it will leave an entry in its log so you know that it didn't add all the values in your CSV data. See the Islandora documentation for additional information about Drupal fields. Nodes Relevance to using Workbench In Islandora, a node is a metadata description - a grouping of data, contained in fields, that describe an item. Each row in your input CSV contains the field data that is used to create a node. Think of a \"node\" as a specific page in a Drupal website. Every node has a content type (e.g. \"Article\" or \"Repository Item\") containing content in the fields defined by its content type. It has a URL in the Drupal website, like https://mysite.org/node/3798 . The \"3798\" at the end of the URL is the node ID (also known as the \"nid\") and uniquely identifies the node within its website. In Islandora, a node is less like a \"web page\" and more like a \"catalogue record\" since Islandora-specific content types generally contain a lot of metadata-oriented fields rather than long discursive text like a blog would have. In create tasks, each row in your input CSV will create a single node. Islandora Workbench uses the node ID column in your CSV for some operations, for example updating nodes or adding media to nodes. Content in Islandora can be hierarchical. For example, collections contain items, newspapers contain issues which in turn contain pages, and compound items can contain a top-level \"parent\" node and many \"child\" nodes. Islandora defines a specific field, field_member_of (or in English, \"Member Of\") that contains the node ID of another node's parent. If this field is empty in a node, it has no parent; if this field contains a value, the node with that node ID is the first node's parent. Islandora Workbench provides several ways for you to create hierarchical content. If you want to learn more about how Drupal nodes work, consult the Islandora documentation . Taxonomies Relevance to using Workbench Drupal's taxonomy system lets you create local authority lists for names, subjects, genre terms, and other types of data. One of Drupal's most powerful features is its support for structured taxonomies (sometimes referred to as \"vocabularies\"). These can be used to maintain local authority lists of personal and corporate names, subjects, and other concepts, just like in other library/archives/museum tools. Islandora Workbench lets you create taxonomy terms in advance of the nodes they are attached to, or at the same time as the nodes. Also, within your CSV file, you can use term IDs, term URIs, or term names. You can use term names both when you are creating new terms on the fly, or if you are assigning existing terms to newly created nodes. Drupal assigns each term an ID, much like it assigns each node an ID. These are called \"term IDs\" (or \"tids\"). Like node IDs, they are unique within a single Drupal instance but they are not unique across Drupal instances. Islandora uses several specific taxonomies extensively as part of its data model. These include Islandora Models (which determines how derivatives are generated for example) and Islandora Media Use (which indicates if a file is an \"Original file\" or a \"Service file\", for example). The taxonomies created by Islandora, such as Islandora Models and Islandora Media Use, include Linked Data URIs in the taxonomy term entries. These URIs are useful because they uniquely and reliably identify taxonomy terms across Drupal instances. 
For example, the taxonomy term with the Linked Data URI http://pcdm.org/use#OriginalFile is the same in two Drupal instances even if the term ID for the term is 589 in one instance and 23 in the other, or if the name of the term is in different languages. If you create your own taxonomies, you can also assign each term a Linked Data URI. Media Relevance to using Workbench By default, the file you upload using Islandora Workbench will be assigned the \"Original file\" media use term. Islandora will then automatically generate derivatives, such as thumbnails and extracted text where applicable, from that file and create additional media. However, you can use Workbench to upload additional files or pregenerated derivatives by assigning them other media use terms. Media in Islandora are the image, video, audio, PDF, and other content files that are attached to nodes. Together, a node and its attached media make up a resource or item. Media have types. Standard media types defined by Islandora are: Audio Document Extracted text FITS Technical Metadata File Image Remote video Video In general when using Workbench you don't need to worry about assigning a type to a file. Workbench infers a media's type from the file extensions, but you can override this if necessary. Media are also assigned terms from the Islandora Media Use vocabulary. These terms, combined with the media type, determine how the files are displayed to end users and how and what types of derivatives Islandora generates. They can also be useful in exporting content from Islandora and in digital preservation workflows (for example). A selection of terms from this vocabulary is: Original file Intermediate file Preservation Master File Service file Thumbnail image Transcript Extracted text This is an example of a video media showing how the media use terms are applied: The Islandora documentation provides additional information on media . Views Relevance to using Workbench You usually don't need to know anything about Views when using Islandora Workbench, but you can use Workbench to export CSV data from Drupal via a View. Views are another powerful Drupal feature that Islandora uses extensively. A View is a Drupal configuration that generates a list of things managed by Drupal, most commonly nodes. As a Workbench user, you will probably only use a View if you want to export data from Islandora via a get_data_from_view Workbench task. Behind the scenes, Workbench depends on a Drupal module called Islandora Workbench Integration that creates a number of custom Views that Workbench uses to interact with Drupal. So even though you might only use Views directly when exporting CSV data from Islandora, behind the scenes Workbench is getting information from Drupal constantly using a set of custom Views. REST Relevance to using Workbench As a Workbench user, you don't need to know anything about REST, but if you encounter a problem using Workbench and reach out for help, you might be asked to provide your log file, which will likely contain some raw REST data. REST is the protocol that Workbench uses to interact with Drupal. Fear not: as a user of Workbench, you don't need to know anything about REST - it's Workbench's job to shield you from REST's complexity. However, if things go wrong, Workbench will include in its log file some details about the particular REST request that didn't work (such as HTTP response codes and raw JSON). If you reach out for help , you might be asked to provide your Workbench log file to aid in troubleshooting. 
It's normal to see the raw data used in REST communication between Workbench and Drupal in the log.","title":"Workbench's relationship to Drupal and Islandora"},{"location":"drupal_and_workbench/#why-would-i-want-to-use-islandora-workbench","text":"Islandora Workbench lets you manage content in an Islandora repository at scale. Islandora provides web forms for creating and editing content on an item-by-item basis, but if you want to load a large number of items into an Islandora repository (or update or delete content in large numbers), you need a batch-oriented tool like Workbench. Simply put, Islandora Workbench enables you to get batches of content into an Islandora repository, and also update or delete content in large batches.","title":"Why would I want to use Islandora Workbench?"},{"location":"drupal_and_workbench/#how-do-i-use-islandora-workbench","text":"Islandora Workbench provides the ability to perform a set of \" tasks \". The focus of this page is the create task, but other tasks Workbench enables include update , delete , and add_media . To use Islandora Workbench to create new content, you need to assemble a CSV file containing metadata describing your content, and arrange the accompanying image, video, PDF, and other files in specific ways so that Workbench knows where to find them. Here is a very simple sample Workbench CSV file: file,id,title,field_model,field_description,date_generated,quality control by IMG_1410.tif,01,Small boats in Havana Harbour,Image,Taken on vacation in Cuba.,2021-02-12,MJ IMG_2549.jp2,02,Manhatten Island,Image,\"Taken from the ferry from downtown New York to Highlands, NJ.\",2021-02-12,MJ IMG_2940.JPG,03,Looking across Burrard Inlet,Image,View from Deep Cove to Burnaby Mountain.,2021-02-18,SP IMG_2958.JPG,04,Amsterdam waterfront,Image,Amsterdam waterfront on an overcast day.,2021-02-12,MJ IMG_5083.JPG,05,Alcatraz Island,Image,\"Taken from Fisherman's Wharf, San Francisco.\",2021-02-18,SP Then, you need to create a configuration file to tell Workbench the URL of your Islandora, which Drupal account credentials to use, and the location of your CSV file. You can customize many other aspects of Islandora Workbench by including various settings in your configuration file. Here is a very simple configuration file: task: create host: http://localhost:8000 username: admin password: islandora input_csv: input.csv content_type: islandora_object Relevance to using Workbench It is very important to run --check before you commit to having Workbench add content to your Drupal. Doing so lets Workbench find issues with your configuration and input CSV and files. See the \" Checking configuration and input data \" documentation for a complete list of the things --check looks for. When you have all these things ready, you tell Workbench to \"check\" your input data and configuration: ./workbench --config test.yml --check Workbench will provide a summary of what passed the check and what needs to be fixed. 
When your checks are complete, you use Workbench to push your content into your Islandora repository: ./workbench --config test.yml You can use a raw CSV file, a Google Sheet , or an Excel file as input, and your image, PDF, video, and other files can be stored locally , or at URLs on the web.","title":"How do I use Islandora Workbench?"},{"location":"drupal_and_workbench/#content-types-fields-and-nodes","text":"Below are the Drupal and Islandora concepts that will help you use Workbench effectively.","title":"Content types, fields, and nodes"},{"location":"drupal_and_workbench/#content-types","text":"Relevance to using Workbench Generally speaking, Islandora Workbench can only work with a single content type at a time. You define this content type in the content_type configuration setting. Drupal categorizes what people see as \"pages\" on a Drupal website into content types. By default, Drupal provides \"Article\" and \"Basic Page\" content types, but site administrators can create custom content types. You can see the content types configured on your Drupal by logging in as an admin user and visiting /admin/structure/types . Or, you can navigate to the list of your site's content types by clicking on the Structure menu item, then the Content Types entry: Islandora, by default, creates a content type called a \"Repository Item\". But, many Islandora sites use additional content types, such as \"Collection\". To find the machine name of the content type you want to use with Workbench, visit the content type's configuration page. The machine name will be the last segment of the URL. In the following example, it's islandora_object :","title":"Content types"},{"location":"drupal_and_workbench/#fields","text":"Relevance to using Workbench The columns in your CSV file correspond to fields in your Islandora content type. The main structural difference between content types in Drupal is that each content type is configured to use a unique set of fields. A field in Drupal the same as a \"field\" in metadata - it is a container for an individual piece of data. For example, all content types have a \"title\" field (although it might be labeled differently) to hold the page's title. Islandora's default content type, the Repository Item, uses metadata-oriented fields like \"Copyright date\", \"Extent\", \"Resource type\", and \"Subject\". Fields have two properties which you need to be familiar with in order to use Islandora Workbench: machine name type To help explain how these two properties work, we will use the following screenshot showing a few of the default fields in the \"Repository item\" content type: Relevance to using Workbench In most cases you can use a fields' human-readable labels as column headers in your CSV file, but within Islandora Workbench configuration files, you must use field machine names. A field has a human-readable label, such as \"Copyright date\", but that label can change or can be translated, and, more significantly, doesn't need to be unique within a Drupal website. Drupal assigns each field a machine name that is more reliable for software to use than human-readable labels. These field machine names are all lower case, use underscores instead of spaces, and are guaranteed by Drupal to be unique within a content type. In the screenshot above, you can see the machine names in the middle column (you might need to zoom in!). For example, the machine name for the \"Copyright date\" field is field_copyright_date . 
A field's \"type\" determines the structure of the data it can hold. Some common field types used in Islandora are \"Text\" (and its derivatives \"Text (plain)\" and \"Text (plain, long)\"), \"Entity Reference\", \"Typed Relation\", \"EDTF\", and \"Link\". These field types are explained in the \" Field Data (CSV and Drupal) \" documentation, but the important point here is that they are all represented differently in your Workbench CSV. For example: EDTF fields take dates in the Library of Congress' Extended Date/Time Format (an example CSV entry is 1964/2008 ) Entity reference fields are used for taxonomy terms (an example entry is cats:Tabby , where \"cats\" is the name of the taxonomy and \"Tabby\" is the term) Typed relation fields are used for taxonomy entries that contain additional data indicating what \"type\" they are, such as using MARC relators to indicate the relationship of the taxonomy term to the item being described. An example typed relation CSV value is relators:aut:Jordan, Mark , where \"relators:aut\" indicates the term \"Jordan, Mark\" uses the MARC relator \"aut\", which stands for \"author\". Link fields take two pieces of information, a URL and the link text, like http://acme.com%%Acme Products Inc. Relevance to using Workbench Drupal fields can be configured to have multiple values. Another important aspect of Drupal fields is their cardinality, or in other words, how many individual values they are configured to have. This is similar to the \"repeatability\" of fields in metadata schemas. Some fields are configured to hold only a single value, others to hold a a maximum number of values (three, for example), and others can hold an unlimited number of values. You can find each field's cardinality in its \"Field settings\" tab. Here is an example showing a field with unlimited cardinality: Drupal enforces cardinality very strictly. For this reason, if your CSV file contains more values for a field than the field's configuration allows, Workbench will truncate the number of values to match the maximum number allowed for the field. If it does this, it will leave an entry in its log so you know that it didn't add all the values in your CSV data. See the Islandora documentation for additional information about Drupal fields.","title":"Fields"},{"location":"drupal_and_workbench/#nodes","text":"Relevance to using Workbench In Islandora, a node is a metadata description - a grouping of data, contained in fields, that describe an item. Each row in your input CSV contains the field data that is used to create a node. Think of a \"node\" as a specific page in a Drupal website. Every node has a content type (e.g. \"Article\" or \"Repository Item\") containing content in the fields defined by its content type. It has a URL in the Drupal website, like https://mysite.org/node/3798 . The \"3798\" at the end of the URL is the node ID (also known as the \"nid\") and uniquely identifies the node within its website. In Islandora, a node is less like a \"web page\" and more like a \"catalogue record\" since Islandora-specific content types generally contain a lot of metadata-oriented fields rather than long discursive text like a blog would have. In create tasks, each row in your input CSV will create a single node. Islandora Workbench uses the node ID column in your CSV for some operations, for example updating nodes or adding media to nodes. Content in Islandora can be hierarchical. 
For example, collections contain items, newspapers contain issues which in turn contain pages, and compound items can contain a top-level \"parent\" node and many \"child\" nodes. Islandora defines a specific field, field_member_of (or in English, \"Member Of\") that contains the node ID of another node's parent. If this field is empty in a node, it has no parent; if this field contains a value, the node with that node ID is the first node's parent. Islandora Workbench provides several ways for you to create hierarchical content. If you want to learn more about how Drupal nodes work, consult the Islandora documentation .","title":"Nodes"},{"location":"drupal_and_workbench/#taxonomies","text":"Relevance to using Workbench Drupal's taxonomy system lets you create local authority lists for names, subjects, genre terms, and other types of data. One of Drupal's most powerful features is its support for structured taxonomies (sometimes referred to as \"vocabularies\"). These can be used to maintain local authority lists of personal and corporate names, subjects, and other concepts, just like in other library/archives/museum tools. Islandora Workbench lets you create taxonomy terms in advance of the nodes they are attached to, or at the same time as the nodes. Also, within your CSV file, you can use term IDs, term URIs, or term names. You can use term names both when you are creating new terms on the fly, or if you are assigning existing terms to newly created nodes. Drupal assigns each term an ID, much like it assigns each node an ID. These are called \"term IDs\" (or \"tids\"). Like node IDs, they are unique within a single Drupal instance but they are not unique across Drupal instances. Islandora uses several specific taxonomies extensively as part of its data model. These include Islandora Models (which determines how derivatives are generated for example) and Islandora Media Use (which indicates if a file is an \"Original file\" or a \"Service file\", for example). The taxonomies created by Islandora, such as Islandora Models and Islandora Media Use, include Linked Data URIs in the taxonomy term entries. These URIs are useful because they uniquely and reliably identify taxonomy terms across Drupal instances. For example, the taxonomy term with the Linked Data URI http://pcdm.org/use#OriginalFile is the same in two Drupal instances even if the term ID for the term is 589 in one instance and 23 in the other, or if the name of the term is in different languages. If you create your own taxonomies, you can also assign each term a Linked Data URI.","title":"Taxonomies"},{"location":"drupal_and_workbench/#media","text":"Relevance to using Workbench By default, the file you upload using Islandora Workbench will be assigned the \"Original file\" media use term. Islandora will then automatically generate derivatives, such as thumbnails and extracted text where applicable, from that file and create additional media. However, you can use Workbench to upload additional files or pregenerated derivatives by assigning them other media use terms. Media in Islandora are the image, video, audio, PDF, and other content files that are attached to nodes. Together, a node and its attached media make up a resource or item. Media have types. Standard media types defined by Islandora are: Audio Document Extracted text FITS Technical Metadata File Image Remote video Video In general when using Workbench you don't need to worry about assigning a type to a file. 
Workbench infers a media's type from the file extensions, but you can override this if necessary. Media are also assigned terms from the Islandora Media Use vocabulary. These terms, combined with the media type, determine how the files are displayed to end users and how and what types of derivatives Islandora generates. They can also be useful in exporting content from Islandora and in digital preservation workflows (for example). A selection of terms from this vocabulary is: Original file Intermediate file Preservation Master File Service file Thumbnail image Transcript Extracted text This is an example of a video media showing how the media use terms are applied: The Islandora documentation provides additional information on media .","title":"Media"},{"location":"drupal_and_workbench/#views","text":"Relevance to using Workbench You usually don't need to know anything about Views when using Islandora Workbench, but you can use Workbench to export CSV data from Drupal via a View. Views are another powerful Drupal feature that Islandora uses extensively. A View is a Drupal configuration that generates a list of things managed by Drupal, most commonly nodes. As a Workbench user, you will probably only use a View if you want to export data from Islandora via a get_data_from_view Workbench task. Behind the scenes, Workbench depends on a Drupal module called Islandora Workbench Integration that creates a number of custom Views that Workbench uses to interact with Drupal. So even though you might only use Views directly when exporting CSV data from Islandora, behind the scenes Workbench is getting information from Drupal constantly using a set of custom Views.","title":"Views"},{"location":"drupal_and_workbench/#rest","text":"Relevance to using Workbench As a Workbench user, you don't need to know anything about REST, but if you encounter a problem using Workbench and reach out for help, you might be asked to provide your log file, which will likely contain some raw REST data. REST is the protocol that Workbench uses to interact with Drupal. Fear not: as a user of Workbench, you don't need to know anything about REST - it's Workbench's job to shield you from REST's complexity. However, if things go wrong, Workbench will include in its log file some details about the particular REST request that didn't work (such as HTTP response codes and raw JSON). If you reach out for help , you might be asked to provide your Workbench log file to aid in troubleshooting. It's normal to see the raw data used in REST communication between Workbench and Drupal in the log.","title":"REST"},{"location":"exporting_islandora_7_content/","text":"Overview Islandora Workbench's main purpose is to load batches of content into an Islandora 2 repository. However, loading content can also be the last step in migrating from Islandora 7 to Islandora 2. As noted in the \" Workflows \" documentation, Workbench can be used in the \"load\" phase of a typical extract, transform, load (ETL) process. Workbench comes with a standalone script, get_islandora_7_content.py , that can be used to extract (a.k.a. \"export\") metadata and OBJ datastreams from an Islandora 7 instance. This data can form the basis for Workbench input data. To run the script, change into the Workbench \"i7Import\" directory and run: python3 get_islandora_7_content.py --config The script uses a number of configuration variables, all of which come with sensible defaults. Any of the following parameters can be changed in the user-supplied config file. 
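For example, a minimal user-supplied configuration file for get_islandora_7_content.py might combine a few of these settings as follows. This is only an illustrative sketch: the URLs and namespace are placeholders for your own values, and the two pattern settings shown simply repeat the defaults; the full list of available parameters and their defaults is in the table below.

solr_base_url: http://islandora7.example.com:8080/solr
islandora_base_url: http://islandora7.example.com
csv_output_path: islandora7_metadata.csv
obj_directory: /tmp/objs
fetch_files: true
namespace: cartoons
field_pattern: mods_.*(_s|_ms)$
field_pattern_do_not_want: (marcrelator|isSequenceNumberOf)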
Parameter Default Value Description solr_base_url http://localhost:8080/solr URL of your source Islandora 7.x Solr instance. islandora_base_url http://localhost:8000 URL of your source Islandora instance. csv_output_path islandora7_metadata.csv Path to the CSV file containing values from Solr. obj_directory /tmp/objs Path to the directory where datastream files will be saved. log_file_path islandora_content.log Path to the log file. fetch_files true Whether or not to fetch and save the datastream files from the source Islandora 7.x instance. get_file_uri false Whether or not to write datastream file URLs to the CSV file instead of fetching the files. One of `get_file_uri` or `fetch_files` can be set to `true`, but not both. field_pattern mods_.*(_s|_ms)$ A regular expression pattern for matching Solr fields to include in the CSV. field_pattern_do_not_want (marcrelator|isSequenceNumberOf) A regular expression pattern for matching Solr fields to exclude from the CSV. standard_fields ['PID', 'RELS_EXT_hasModel_uri_s', 'RELS_EXT_isMemberOfCollection_uri_ms', 'RELS_EXT_isMemberOf_uri_ms', 'RELS_EXT_isConstituentOf_uri_ms', 'RELS_EXT_isPageOf_uri_ms'] List of Solr fields to include in the CSV even if they are not matched by the regular expression in `field_pattern`. id_field PID The Solr field that uniquely identifies each object in the source Islandora 7.x instance. id_start_number 1 The number to use as the first Workbench ID within the CSV file. datastreams ['OBJ', 'PDF'] List of datastream IDs to fetch from the source Islandora 7.x instance. namespace * The namespace of objects you want to export from the source Islandora 7.x instance. collection PID of a single collection limiting the objects to fetch from the source Islandora 7.x instance. Only matches objects that have the specified collection as their immediate parent. For recursive collection membership, add `ancestors_ms` as a `solr_filter`, as documented below. Note: the colon in the collection PID must be escaped with a backslash (`\\`), e.g., `cartoons\\:collection`. content_model PID of a single content model limiting the objects to fetch from the source Islandora 7.x instance. Note: the colon in the content model PID must be escaped with a backslash (`\\`), e.g., `islandora\\:sp_large_image_cmodel`. solr_filters key:value pairs to add as filters to the Solr query. See examples below. pids_to_skip List of PIDs to not export, e.g. `pids_to_skip: [\"foo:234\", \"bar:7890\"]`. Useful if you are aware of problems with specific Islandora 7 source objects and you don't want those objects to crash the script. debug false Print debug information to the console. deep_debug false Print additional debug information to the console. secure_ssl_only true Whether or not to require valid SSL certificates. Set to `false` if you want to ignore SSL certificates. Analyzing your Islandora 7 Solr index In order to use the configuration options outlined above, you will need to know what fields are in your Islandora 7 Solr index. Not all Islandoras are indexed in the same way, and since most Solr field names are derived from MODS or other XML datastream element names, there is enough variability in Solr fieldnames across Islandora 7 instances to make reliably predicting Solr fieldnames impractical. Two ways you can see the specific fields in your Solr index are 1) using the Islandora Metadata Extras module, and 2) issuing raw requests to your Islandora 7's Solr, both to get Solr content for sample objects and to get a list of all fields in your index. 
Using the Islandora Metadata Extras module The Islandora Metadata Extras module provides a \"Solr Metadata\" tab in each object's Manage menu that shows the raw Solr document for the object. The Solr field names and the values for the current object are easy to identify. Here are two screenshots showing the top section of the output and a sample from the middle of the output (broken into top and middle excerpts for illustration purposes here since the entire Solr document is very long): Top: Middle: Fetching sample Solr documents If you can't or don't want to install the Islandora Metadata Extras module, you can query Solr directly to get a sample document. To get the entire Solr document for an object in JSON format, issue the following request to your Solr, replacing km\\:10571 with the PID of your object. The -o option in the curl command (for \"output\") tells curl to save the response to the named file: curl -o km_10571.json \"http://localhost:8080/solr/select?q=PID:km\\:10571&wt=json\" The resulting JSON file will look like this . To get the Solr XML document for a specific object, remove the wt parameter from the request URL: curl -o km_10571.xml \"http://localhost:8080/solr/select?q=PID:km\\:10571\" The resulting file will look like this . In the XML version, the Solr fieldnames are the values of the \"name\" attribute of each element. Warning Solr documents for individual objects will not necessarily contain all the fieldnames in your index. In general, empty fields in the source XML (e.g. MODS) are not added to a Solr document. This means that inspecting individual Solr documents may not reveal all of the Solr fields you want to include in your Workbench CSV. You should get samples from a number of objects that you think will represent all of the Solr fields you are interested in. Fetching a CSV list of all Solr fieldnames To get a 1-row CSV file containing all of the fieldnames in your Solr index, issue the following request: curl -o allfields.csv \"http://localhost:8080/solr/select?q=*:*&wt=csv&rows=0&fl=*\" Unlike the queries for individual Islandora 7 objects' Solr documents, the results of this query will contain all the fields in your index. Configuring which Solr fields to include in the CSV As we can see from the examples above, Islandora 7's Solr schema contains a lot of fields, mirroring the richness of MODS (or other XML-based metadata) and the Fedora 3.x RELS-EXT properties. By default, this script fetches all the fields in the Islandora 7 Solr index, which will invariably be many, many more fields than you will normally want in the output CSV. You will need to tell the script which fields to exclude and which to include. The script takes the following approach to providing control over what fields end up in the CSV data it generates: It fetches a list of all fieldnames used in the Solr index. It then matches each fieldname against the regular expression pattern defined in the script's field_pattern variable, and if the match is successful, includes the fieldname in the CSV. For example, field_pattern = 'mods_.*(_s|_ms)$' will match every Solr field that starts with \"mods_\" and ends with either \"_s\" or \"_ms\". Next, it matches each remaining fieldname against the regular expression patterns defined in the script's field_pattern_do_not_want variable, and if the match is successful, removes the fieldname from the CSV. 
For example, field_pattern_do_not_want = '(marcrelator|isSequenceNumberOf)' will remove all fieldnames that contain either the string \"marcrelator\" or \"isSequenceNumberOf\". Note that the regular expression used in this configuration variable is not a negative pattern; in other words, if a fieldname matches this pattern, it is excluded from the field list. Finally, it adds to the start of the remaining list of fieldnames every Solr fieldname defined in the standard_fields configuration variable. This configuration variable provides a mechanism to ensure than any fields that are not included in step 2 are present in the generated CSV file. Warning You will always want at least the Solr fields \"PID\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_hasModel_uri_s\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_isConstituentOf_uri_ms\", and \"RELS_EXT_isPageOf_uri_ms\" in your standard_fields configuration variable since these fields contain information about objects' relationships to each other. Even with a well-configured set of pattern variables, the column headers are ugly, and there are a lot of them. Here is a sample from a minimal Islandora 7.x: file,PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isConstituentOf_uri_ms,RELS_EXT_isPageOf_uri_ms,mods_recordInfo_recordOrigin_ms,mods_name_personal_author_ms,mods_abstract_s,mods_name_aut_role_roleTerm_code_s,mods_name_personal_author_s,mods_typeOfResource_s,mods_subject_geographic_ms,mods_identifier_local_ms,mods_genre_ms,mods_name_photographer_role_roleTerm_code_s,mods_physicalDescription_form_all_ms,mods_physicalDescription_extent_ms,mods_subject_topic_ms,mods_name_namePart_s,mods_physicalDescription_form_authority_marcform_ms,mods_name_pht_s,mods_identifier_uuid_ms,mods_language_languageTerm_code_s,mods_physicalDescription_form_s,mods_accessCondition_use_and_reproduction_s,mods_name_personal_role_roleTerm_text_s,mods_name__role_roleTerm_code_ms,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_ms,mods_name_aut_s,mods_originInfo_encoding_iso8601_dateIssued_s,mods_originInfo_dateIssued_ms,mods_name_photographer_namePart_s,mods_name_pht_role_roleTerm_text_ms,mods_identifier_all_ms,mods_name_namePart_ms,mods_subject_geographic_s,mods_originInfo_publisher_ms,mods_subject_descendants_all_ms,mods_titleInfo_title_all_ms,mods_name_photographer_role_roleTerm_text_ms,mods_name_role_roleTerm_text_s,mods_titleInfo_title_ms,mods_name_photographer_s,mods_originInfo_place_placeTerm_text_s,mods_name_role_roleTerm_code_ms,mods_name_pht_role_roleTerm_code_s,mods_name_pht_namePart_s,mods_name_pht_namePart_ms,mods_name_role_roleTerm_code_s,mods_genre_all_ms,mods_physicalDescription_form_authority_marcform_s,mods_name_pht_role_roleTerm_code_ms,mods_extension_display_date_ms,mods_name_photographer_namePart_ms,mods_genre_authority_bgtchm_ms,mods_name_personal_role_roleTerm_text_ms,mods_name_pht_ms,mods_name_photographer_role_roleTerm_text_s,mods_language_languageTerm_code_ms,mods_originInfo_place_placeTerm_text_ms,mods_titleInfo_title_s,mods_identifier_uuid_s,mods_language_languageTerm_code_authority_iso639-2b_s,mods_genre_s,mods_name_aut_role_roleTerm_code_ms,mods_typeOfResource_ms,mods_originInfo_encoding_iso8601_dateIssued_ms,mods_name_personal_author_role_roleTerm_text_ms,mods_abstract_ms,mods_language_languageTerm_text_s,mods_genre_authority_bgtchm_s,mods_language_languageTerm_s,mods_language_languageTerm_ms,mods_subject_topic_s,mods_name_photographer_ms,mods_name_pht_role_roleTerm_text_s,mods_recordInfo_reco
rdOrigin_s,mods_name_aut_ms,mods_originInfo_publisher_s,mods_identifier_local_s,mods_language_languageTerm_text_ms,mods_physicalDescription_extent_s,mods_language_languageTerm_code_authority_iso639-2b_ms,mods_name__role_roleTerm_code_s,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_s,mods_name_photographer_role_roleTerm_code_ms,mods_name_role_roleTerm_text_ms,mods_name_personal_author_role_roleTerm_text_s,mods_accessCondition_use_and_reproduction_ms,mods_physicalDescription_form_ms,sequence The script-generated solr request may not in most cases be useful or even workable. You may need to experiment with the field_pattern and field_pattern_do_not_want configuration settings to reduce the number of Solr fields. Adding filters to your Solr query to limit the objects fetched from the source Islandora The namespace , collection , content_model , and solr_filters options documented above allow you to scope the set of objects exported from the source Islandora 7.x instance. The first three take simple, single values. The last option allows you to add arbitrary filters to the query sent to Solr in the form of key:value pairs, like this: solr_filters: - ancestors_ms: 'some\\:collection' - fgs_state_s: 'Active' Warning Colons within PIDs used in filters must be escaped with a backslash ( \\ ), e.g., some\\:collection . Putting your Solr request in a file You have the option of providing their own solr query in a text file and pointing to the file using the --metadata_solr_request option when running the script: python3 get_islandora_7_content.py --config --metadata_solr_request The contents of the file must contain a full HTTP request to Solr, e.g.: http://localhost:8080/solr/select?q=PID:*&wt=csv&rows=1000000&fl=PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isMemberOf_uri_ms,mods_originInfo_encoding_iso8601_dateIssued_mdt The advantage of putting your Solr request in its own file is that you have complete control over the Solr query. Requests to Solr in this file should always include the \"wt=csv\" and \"rows=1000000\" parameters (the \"rows\" parameter should have a value that is greater than the number of objects in your repository, otherwise Solr won't return all objects). Using the CSV as input for Workbench The CSV file generated by this script will almost certainly contain many more columns than you will want to ingest into Islandora 2. You will probably want to delete columns you don't need, combine the contents of several columns into one, and edit the contents of others. As we can see from the example above, the column headings in the CSV are Solr fieldnames ( RELS_EXT_hasModel_uri_s , mods_titleInfo_title_ms , etc.). You will need to replace those column headers with the equivalent fields as defined in your Drupal 9 content type . In addition, the metadata stored in Islandora 7's Solr index does not in many cases have the structure Workbench requires, so the data in the CSV file will need to be edited before it can be used by Workbench to create nodes. The content of Islandora 7 Solr fields is derived from MODS (or other) XML elements, and, with the exception of text-type fields, will not necessarily map cleanly to Drupal fields' data types. In other words, to use the CSV data generated by get_islandora_7_content.py , you will need to do some work to prepare it (or \"transform\" it, to use ETL language) to use it as input for Workbench. 
However, the script adds three columns to the CSV file that do not use Solr fieldnames and whose contents you should not edit but that you may need to rename: file , PID , and sequence . In the generated CSV (whose header will look something like file,PID,RELS_EXT_hasModel_uri_s,[...],mods_typeOfResource_s,sequence ): do not edit the name or contents of the 'file' column; rename 'PID' to the value of your 'id_field' setting but do not edit its contents; rename the 'sequence' column (e.g., to 'field_weight') but do not edit its contents; every other column will need to be deleted, renamed, or have its content edited. First, the required Workbench column \"file\" is added to the beginning of each CSV row, and is populated with the filename of the OBJ datastream. This filename is based on the object's PID, with the : replaced with an underscore, and has an extension determined by the OBJ datastream's MIME type. Second, \"PID\" is the Islandora 7.x PID of each object in the CSV file. This column header can be changed to \"id\" or whatever you have defined in your Workbench configuration file's id_field setting. Alternatively, you can set the value of id_field to PID and not rename that CSV column. Third, a \"sequence\" column is added at the end of each CSV row. This is where the get_islandora_7_content.py script stores the sequence number of each child object/page in relation to its parent . If an Islandora 7.x object has a property in its RELS-EXT datastream islandora:isSequenceNumberOfxxx (where \"xxx\" is the object's parent), the value of that property is added to the \"sequence\" column at the end of each row in the CSV. For paged content, this value is taken from the islandora:isSequenceNumber RELS-EXT property. These values are ready for use in the \"field_weight\" Drupal field created by the Islandora Defaults module; you can simply rename the \"sequence\" column header to \"field_weight\" when you use the CSV as input for Islandora Workbench. Note that you don't need to configure the script to include fields that contain \"isSequenceNumberOf\" or \"isSequenceNumber\" in your CSV; in fact, because there are so many of them in a typical Islandora 7 Solr index, you will want to exclude them using the field_pattern_do_not_want configuration variable. Excluding them is safe, since the script fetches the sequence information separately from the other CSV data. A fourth column in your Workbench CSV, field_member_of , is not added automatically. It contains the PID of the parent object, whether it is a collection, the top level object (parent) in a compound object, a book object that has pages, etc. If an Islandora object has a value that should be in the field_member_of column, it will be in one or more (usually just one) of the following columns in the CSV created by the get_islandora_7_content.py script: RELS_EXT_isMemberOfCollection RELS_EXT_isPageOf RELS_EXT_isSequenceNumberOfXXX (where XXX is the PID of the parent object) RELS_EXT_isConstituentOf All of these columns will likely be present in your CSV, but it is possible that some may not be, for example if your Islandora 7 repository did not have a module enabled that uses one of those RELS-EXT properties. 
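As a rough illustration of the kind of preparation described above, the following sketch (not part of Workbench) shows one way to begin transforming the script's output, using only Python's standard csv module. The Drupal field names in the rename map (title, field_description, field_weight) and the Solr column names are illustrative assumptions; substitute the columns actually present in your CSV and the fields defined in your content type.

import csv

# Illustrative mapping from Solr fieldnames to Drupal field machine names;
# replace these with the columns in your CSV and the fields in your content type.
rename = {
    "PID": "id",  # or leave as "PID" and set id_field: PID in your Workbench config
    "sequence": "field_weight",
    "mods_titleInfo_title_ms": "title",
    "mods_abstract_ms": "field_description",
}

# Columns (from the script's standard_fields defaults) that may hold the parent PID.
parent_columns = [
    "RELS_EXT_isMemberOfCollection_uri_ms",
    "RELS_EXT_isPageOf_uri_ms",
    "RELS_EXT_isConstituentOf_uri_ms",
]

with open("islandora7_metadata.csv", newline="") as src, \
        open("workbench_input.csv", "w", newline="") as dest:
    reader = csv.DictReader(src)
    out_columns = ["file"] + list(rename.values()) + ["field_member_of"]
    writer = csv.DictWriter(dest, fieldnames=out_columns)
    writer.writeheader()
    for row in reader:
        out = {"file": row.get("file", "")}
        # Rename the columns we are keeping; every other column is dropped.
        for solr_name, drupal_name in rename.items():
            out[drupal_name] = row.get(solr_name, "")
        # Copy the first non-empty parent value into field_member_of.
        out["field_member_of"] = next(
            (row[c] for c in parent_columns if row.get(c)), ""
        )
        writer.writerow(out)

Most of the real transformation work (reshaping MODS-derived values to match Drupal field types, combining columns, cleaning up parent values, and replacing Solr's escaped commas as noted below) will be specific to your metadata, so treat a sketch like this only as a starting point.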
Note In general, the CSV that you need to end up with to ingest content into Islandora 2 using Workbench needs to have a structure similar to that described in the \" With page/child-level metadata \" section of the Workbench documentation for \"Creating paged, compound, and collection content.\" You should review the points in the \"Some important things to note\" section of that documentation. Note Solr escapes commas in its exported CSV with backslashes ( \\ ). Should look for and replace these escaped commas (e.g. \\, ) with regular commas before using the CSV with Workbench.","title":"Exporting Islandora 7 content"},{"location":"exporting_islandora_7_content/#overview","text":"Islandora Workbench's main purpose is to load batches of content into an Islandora 2 repository. However, loading content can also be the last step in migrating from Islandora 7 to Islandora 2. As noted in the \" Workflows \" documentation, Workbench can be used in the \"load\" phase of a typical extract, transform, load (ETL) process. Workbench comes with a standalone script, get_islandora_7_content.py , that can be used to extract (a.k.a. \"export\") metadata and OBJ datastreams from an Islandora 7 instance. This data can form the basis for Workbench input data. To run the script, change into the Workbench \"i7Import\" directory and run: python3 get_islandora_7_content.py --config The script uses a number of configuration variables, all of which come with sensible defaults. Any of the following parameters can be changed in the user-supplied config file. Parameter Default Value Description solr_base_url http://localhost:8080/solr URL of your source Islandora 7.x Solr instance. islandora_base_url http://localhost:8000 URL of your source Islandora instance. csv_output_path islandora7_metadata.csv Path to the CSV file containing values from Solr. obj_directory /tmp/objs Path to the directory where datastream files will be saved. log_file_path islandora_content.log Path to the log file. fetch_files true Whether or not to fetch and save the datastream files from the source Islandora 7.x instance. get_file_uri false Whether or not to write datastream file URLs to the CSV file instead of fetching the files. One of `get_file_uri` or `fetch_files` can be set to `true`, but not both. field_pattern mods_.*(_s|_ms)$ A regular expression pattern to matching Solr fields to include in the CSV. field_pattern_do_not_want (marcrelator|isSequenceNumberOf) A regular expression pattern to matching Solr fields to not include in the CSV. standard_fields ['PID', 'RELS_EXT_hasModel_uri_s', 'RELS_EXT_isMemberOfCollection_uri_ms', 'RELS_EXT_isMemberOf_uri_ms', 'RELS_EXT_isConstituentOf_uri_ms', 'RELS_EXT_isPageOf_uri_ms'] List of fields to Solr fields to include in the CSV not matched by the regular expression in `field_pattern`. id_field PID The Solr field that uniquely identifies each object in the source Islandora 7.x instance. id_start_number 1 The number to use as the first Workbench ID within the CSV file. datastreams ['OBJ', 'PDF'] List of datastream IDs to fetch from the source Islandora 7.x instance. namespace * The namespace of objects you want to export from the source Islandora 7.x instance. collection PID of a single collection limiting the objects to fetch from the source Islandora 7.x instance. Only matches objects that have the specified collection as their immediate parent. For recursive collection membership, add `ancestors_ms` as a `solr_filter`, as documented below. 
Note: the colon in the collection PID must be escaped with a backslash (`\\`), e.g., `cartoons\\:collection`. content_model PID of a single content model limiting the objects to fetch from the source Islandora 7.x instance. Note: the colon in the content model PID must be escaped with a backslash (`\\`), e.g., `islandora\\:sp_large_image_cmodel`. solr_filters key:value pairs to add as filters to the Solr query. See examples below. pids_to_skip List of PIDs to not export, e.g. `pids_to_skip: [\"foo:234\", \"bar:7890\"]`. Useful if you are aware of problems with specifis Islandora 7 source objects and you don't want those objects to crash out the script. debug false Print debug information to the console. deep_debug false Print additional debug information to the console. secure_ssl_only true Whether or not to require valid SSL certificates. Set to `false` if you want to ignore SSL certificates.","title":"Overview"},{"location":"exporting_islandora_7_content/#analyzing-your-islandora-7-solr-index","text":"In order to use the configuration options outlined above, you will need to know what fields are in your Islandora 7 Solr index. Not all Islandoras are indexed in the same way, and since most Solr field names are derived from MODS or other XML datastream element names, there is enough variability in Solr fieldnames across Islandora 7 instances to make reliably predicting Solr fieldnames impractical. Two ways you can see the specific fields in your Solr index are 1) using the Islandora Metadata Extras module, and 2) issuing raw requests to your Islandora 7's Solr, both to get Solr content for sample objects and to get a list of all fields in your index.","title":"Analyzing your Islandora 7 Solr index"},{"location":"exporting_islandora_7_content/#using-the-islandora-metadata-extras-module","text":"The Islandora Metadata Extras module provides a \"Solr Metadata\" tab in each object's Manage menu that shows the raw Solr document for the object. The Solr field names and the values for the current object are easy to identify. Here are two screenshots showing the top section of the output and a sample from the middle of the output (broken into top and middle excerpts for illustration purposes here since the entire Solr document is very long): Top: Middle:","title":"Using the Islandora Metadata Extras module"},{"location":"exporting_islandora_7_content/#fetching-sample-solr-documents","text":"If you can't or don't want to install the Islandora Metadata Extras module, you can query Solr directly to get a sample document. To get the entire Solr document for an object in JSON format, issue the following request to your Solr, replacing km\\:10571 with the PID of your object. The -o option in the curl command (for \"output\") tells curl to save the response to the named file: curl -o km_10571.json \"http://localhost:8080/solr/select?q=PID:km\\:10571&wt=json\" The resulting JSON file will look like this . To get Solr XML document for a specific object, remove the wt parameter from the request URL: curl -o km_10571.xml \"http://localhost:8080/solr/select?q=PID:km\\:10571\" The resulting file will look like this . In the XML version, the Solr fieldnames are the values of the \"name\" attribute of each element. Warning Solr documents for individual objects will not necessarily contain all the fieldnames in your index. In general, empty fields in the source XML (e.g. MODS) are not added to a Solr document. 
This means that inspecting individual Solr documents may not reveal all of the Solr fields you want to include in your Workbench CSV. You should get samples from a number of objects that you think will represent all of the Solr fields you are interested in.","title":"Fetching sample Solr documents"},{"location":"exporting_islandora_7_content/#fetching-a-csv-list-of-all-solr-fieldnames","text":"To get a 1-row CSV file containing all of the fieldnames in your Solr index, issue the following request: curl -o allfields.csv \"http://localhost:8080/solr/select?q=*:*&wt=csv&rows=0&fl=*\" Unlike the queries for individual Islandora 7 objects' Solr documents, the results of this query will contain all the fields in your index.","title":"Fetching a CSV list of all Solr fieldnames"},{"location":"exporting_islandora_7_content/#configuring-which-solr-fields-to-include-in-the-csv","text":"As we can see from the examples above, Islandora 7's Solr schema contains a lot of fields, mirroring the richness of MODS (or other XML-based metadata) and the Fedora 3.x RELS-EXT properties. By default, this script fetches all the fields in the Islandora 7 Solr index, which will invariably be many, many more fields than you will normally want in the output CSV. You will need to tell the script which fields to exclude and which to include. The script takes the following approach to providing control over what fields end up in the CSV data it generates: It fetches a list of all fieldnames used in the Solr index. It then matches each fieldname against the regular expression pattern defined in the script's field_pattern variable, and if the match is successful, includes the fieldname in the CSV. For example, field_pattern = 'mods_.*(_s|_ms)$' will match every Solr field that starts with \"mods_\" and ends with either \"_s\" or \"_ms\". Next, it matches each remaining fieldname against the regular expression patterns defined in the script's field_pattern_do_not_want variable, and if the match is successful, removes the fieldname from the CSV. For example, field_pattern_do_not_want = '(marcrelator|isSequenceNumberOf)' will remove all fieldnames that contain either the string \"marcrelator\" or \"isSequenceNumberOf\". Note that the regular expression used in this configuration variable is not a negative pattern; in other words, if a fieldname matches this pattern, it is excluded from the field list. Finally, it adds to the start of the remaining list of fieldnames every Solr fieldname defined in the standard_fields configuration variable. This configuration variable provides a mechanism to ensure that any fields that are not included in step 2 are present in the generated CSV file. Warning You will always want at least the Solr fields \"PID\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_hasModel_uri_s\", \"RELS_EXT_isMemberOf_uri_ms\", \"RELS_EXT_isConstituentOf_uri_ms\", and \"RELS_EXT_isPageOf_uri_ms\" in your standard_fields configuration variable since these fields contain information about objects' relationships to each other. Even with a well-configured set of pattern variables, the column headers are ugly, and there are a lot of them. (The sketch below roughly illustrates the include/exclude logic described above, followed by a sample of the resulting column headers.)
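The following sketch only illustrates that include/exclude logic; it is not the script's actual code. The Solr URL, patterns, and standard_fields values are the defaults from the configuration table above.
import re
import requests

solr_base_url = 'http://localhost:8080/solr'
field_pattern = re.compile(r'mods_.*(_s|_ms)$')
field_pattern_do_not_want = re.compile(r'(marcrelator|isSequenceNumberOf)')
standard_fields = [
    'PID', 'RELS_EXT_hasModel_uri_s', 'RELS_EXT_isMemberOfCollection_uri_ms',
    'RELS_EXT_isMemberOf_uri_ms', 'RELS_EXT_isConstituentOf_uri_ms',
    'RELS_EXT_isPageOf_uri_ms',
]

# Get all fieldnames in the index (the same query as the curl example above).
response = requests.get(solr_base_url + '/select',
                        params={'q': '*:*', 'wt': 'csv', 'rows': '0', 'fl': '*'})
all_fieldnames = response.text.strip().split(',')

# Keep fieldnames that match field_pattern ...
wanted = [f for f in all_fieldnames if field_pattern.match(f)]
# ... then drop any that match field_pattern_do_not_want ...
wanted = [f for f in wanted if not field_pattern_do_not_want.search(f)]
# ... and finally put the standard_fields at the start of the list.
csv_fieldnames = standard_fields + wanted
print(csv_fieldnames)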
Here is a sample from a minimal Islandora 7.x: file,PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isConstituentOf_uri_ms,RELS_EXT_isPageOf_uri_ms,mods_recordInfo_recordOrigin_ms,mods_name_personal_author_ms,mods_abstract_s,mods_name_aut_role_roleTerm_code_s,mods_name_personal_author_s,mods_typeOfResource_s,mods_subject_geographic_ms,mods_identifier_local_ms,mods_genre_ms,mods_name_photographer_role_roleTerm_code_s,mods_physicalDescription_form_all_ms,mods_physicalDescription_extent_ms,mods_subject_topic_ms,mods_name_namePart_s,mods_physicalDescription_form_authority_marcform_ms,mods_name_pht_s,mods_identifier_uuid_ms,mods_language_languageTerm_code_s,mods_physicalDescription_form_s,mods_accessCondition_use_and_reproduction_s,mods_name_personal_role_roleTerm_text_s,mods_name__role_roleTerm_code_ms,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_ms,mods_name_aut_s,mods_originInfo_encoding_iso8601_dateIssued_s,mods_originInfo_dateIssued_ms,mods_name_photographer_namePart_s,mods_name_pht_role_roleTerm_text_ms,mods_identifier_all_ms,mods_name_namePart_ms,mods_subject_geographic_s,mods_originInfo_publisher_ms,mods_subject_descendants_all_ms,mods_titleInfo_title_all_ms,mods_name_photographer_role_roleTerm_text_ms,mods_name_role_roleTerm_text_s,mods_titleInfo_title_ms,mods_name_photographer_s,mods_originInfo_place_placeTerm_text_s,mods_name_role_roleTerm_code_ms,mods_name_pht_role_roleTerm_code_s,mods_name_pht_namePart_s,mods_name_pht_namePart_ms,mods_name_role_roleTerm_code_s,mods_genre_all_ms,mods_physicalDescription_form_authority_marcform_s,mods_name_pht_role_roleTerm_code_ms,mods_extension_display_date_ms,mods_name_photographer_namePart_ms,mods_genre_authority_bgtchm_ms,mods_name_personal_role_roleTerm_text_ms,mods_name_pht_ms,mods_name_photographer_role_roleTerm_text_s,mods_language_languageTerm_code_ms,mods_originInfo_place_placeTerm_text_ms,mods_titleInfo_title_s,mods_identifier_uuid_s,mods_language_languageTerm_code_authority_iso639-2b_s,mods_genre_s,mods_name_aut_role_roleTerm_code_ms,mods_typeOfResource_ms,mods_originInfo_encoding_iso8601_dateIssued_ms,mods_name_personal_author_role_roleTerm_text_ms,mods_abstract_ms,mods_language_languageTerm_text_s,mods_genre_authority_bgtchm_s,mods_language_languageTerm_s,mods_language_languageTerm_ms,mods_subject_topic_s,mods_name_photographer_ms,mods_name_pht_role_roleTerm_text_s,mods_recordInfo_recordOrigin_s,mods_name_aut_ms,mods_originInfo_publisher_s,mods_identifier_local_s,mods_language_languageTerm_text_ms,mods_physicalDescription_extent_s,mods_language_languageTerm_code_authority_iso639-2b_ms,mods_name__role_roleTerm_code_s,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_s,mods_name_photographer_role_roleTerm_code_ms,mods_name_role_roleTerm_text_ms,mods_name_personal_author_role_roleTerm_text_s,mods_accessCondition_use_and_reproduction_ms,mods_physicalDescription_form_ms,sequence The script-generated solr request may not in most cases be useful or even workable. You may need to experiment with the field_pattern and field_pattern_do_not_want configuration settings to reduce the number of Solr fields.","title":"Configuring which Solr fields to include in the CSV"},{"location":"exporting_islandora_7_content/#adding-filters-to-your-solr-query-to-limit-the-objects-fetched-from-the-source-islandora","text":"The namespace , collection , content_model , and solr_filters options documented above allow you to scope the set of objects exported from the source Islandora 7.x instance. 
The first three take simple, single values. The last option allows you to add arbitrary filters to the query sent to Solr in the form of key:value pairs, like this: solr_filters: - ancestors_ms: 'some\\:collection' - fgs_state_s: 'Active' Warning Colons within PIDs used in filters must be escaped with a backslash ( \\ ), e.g., some\\:collection .","title":"Adding filters to your Solr query to limit the objects fetched from the source Islandora"},{"location":"exporting_islandora_7_content/#putting-your-solr-request-in-a-file","text":"You have the option of providing their own solr query in a text file and pointing to the file using the --metadata_solr_request option when running the script: python3 get_islandora_7_content.py --config --metadata_solr_request The contents of the file must contain a full HTTP request to Solr, e.g.: http://localhost:8080/solr/select?q=PID:*&wt=csv&rows=1000000&fl=PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isMemberOf_uri_ms,mods_originInfo_encoding_iso8601_dateIssued_mdt The advantage of putting your Solr request in its own file is that you have complete control over the Solr query. Requests to Solr in this file should always include the \"wt=csv\" and \"rows=1000000\" parameters (the \"rows\" parameter should have a value that is greater than the number of objects in your repository, otherwise Solr won't return all objects).","title":"Putting your Solr request in a file"},{"location":"exporting_islandora_7_content/#using-the-csv-as-input-for-workbench","text":"The CSV file generated by this script will almost certainly contain many more columns than you will want to ingest into Islandora 2. You will probably want to delete columns you don't need, combine the contents of several columns into one, and edit the contents of others. As we can see from the example above, the column headings in the CSV are Solr fieldnames ( RELS_EXT_hasModel_uri_s , mods_titleInfo_title_ms , etc.). You will need to replace those column headers with the equivalent fields as defined in your Drupal 9 content type . In addition, the metadata stored in Islandora 7's Solr index does not in many cases have the structure Workbench requires, so the data in the CSV file will need to be edited before it can be used by Workbench to create nodes. The content of Islandora 7 Solr fields is derived from MODS (or other) XML elements, and, with the exception of text-type fields, will not necessarily map cleanly to Drupal fields' data types. In other words, to use the CSV data generated by get_islandora_7_content.py , you will need to do some work to prepare it (or \"transform\" it, to use ETL language) to use it as input for Workbench. However, the script adds three columns to the CSV file that do not use Solr fieldnames and whose contents you should not edit but that you may need to rename: file , PID , and sequence : Do not edit name Rename 'sequence' column (e.g., or contents of to 'field_weight') but do not 'file' column. edit its contents. | / Every other column | | / Rename 'PID' to the | will need to be | | | value of your 'id_field' | deleted, renamed, | | | setting but do not edit | or its content | | | column contents. | edited. | v v v v --------------------------------------------------------------------- file,PID,RELS_EXT_hasModel_uri_s,[...],mods_typeOfResource_s,sequence First, the required Workbench column \"file\" is added to the beginning of each CSV row, and is populated with the filename of the OBJ datastream. 
This filename is based on the object's PID, with the the : replaced with an underscore, and has an extension determined by the OBJ datastream's MIME type. Second, \"PID\" is the Islandora 7.x PID of each object in the CSV file. This column header can be changed to \"id\" or whatever you have defined in your Workbench configuration file's id_field setting. Alternatively, you can set the value of id_field to PID and not rename that CSV column. Third, a \"sequence\" column is added at the end of each CSV row. This is where the get_islandora_7_content.py script stores the sequence number of each child object/page in relation to its parent . If an Islandora 7.x object has a property in its RELS-EXT datastream islandora:sSequenceNumberOfxxx (where \"xxx\" is the object's parent), the value of that property is added to the \"sequence\" column at the end of each row in the CSV. For paged content, this value is taken from the islandora:isSequenceNumber RELS-EXT property. These values are ready for use in the \"field_weight\" Drupal field created by the Islandora Defaults module; you can simply rename the \"sequence\" column header to \" field_weight\" when you use the CSV as input for Islandora Workbench. Note that you don't need to configure the script to include fields that contain \"isSequenceNumberOf\" or \"isSequenceNumber\" in your CSV; in fact, because there are so many of them in a typical Islandora 7 Solr index, you will want to exclude them using the field_pattern_do_not_want configuration variable. Excluding them is safe, since the script fetches the sequence information separately from the other CSV data. A fourth column in your Workbench CSV, field_member_of , is not added automatically. It contains the PID of the parent object, whether it is a collection, the top level object (parent) in a compound object, book object that has pages, etc. If an Islandora object has a value that should be in the field_member_of column, it will be in one or more (usually just one) of the following columns in the CSV created by the get_islandora_7_content.py script: RELS_EXT_isMemberOfCollection RELS_EXT_isPageOf RELS_EXT_isSequenceNumberOfXXX (where XXX is the PID of the parent object) RELS_EXT_isConstituentOf All of these columns will likely be present in your CSV, but it is possible that some may not be, for example if your Islandora 7 repository did not have a module enabled that uses one of those RELS-EXT properties. Note In general, the CSV that you need to end up with to ingest content into Islandora 2 using Workbench needs to have a structure similar to that described in the \" With page/child-level metadata \" section of the Workbench documentation for \"Creating paged, compound, and collection content.\" You should review the points in the \"Some important things to note\" section of that documentation. Note Solr escapes commas in its exported CSV with backslashes ( \\ ). Should look for and replace these escaped commas (e.g. \\, ) with regular commas before using the CSV with Workbench.","title":"Using the CSV as input for Workbench"},{"location":"field_templates/","text":"Note This section describes using CSV field templates in your configuration file. For information on CSV value templates, see \" CSV value templates \". For information on CSV file templates, see the \" CSV file templates \" section. In create and update tasks, you can configure field templates that are applied to each node as if the fields were present in your CSV file. 
The templates are configured in the csv_field_templates option. An example looks like this: csv_field_templates: - field_rights: \"The author of this work dedicates any and all copyright interest to the public domain.\" - field_member_of: 205 - field_model: 25 - field_tags: 231|257 Values in CSV field templates are structured the same as field values in your CSV (e.g., in the example above, field_tags is multivalued), and are validated against Drupal's configuration in the same way that values present in your CSV are validated. If a column with the field name used in a template is present in the CSV file, Workbench ignores the template and uses the data in the CSV file. If a column with that field name is present in the CSV file but is listed in the ignore_csv_columns setting , the value from the template is used.","title":"CSV field templates"},{"location":"fields/","text":"Workbench uses a CSV file to populate Islandora objects' metadata. This file contains the field values that are to be added to new or existing nodes, and some additional reserved columns specific to Workbench. Data in this CSV file can be: strings (for string or text fields) like Using Islandora Workbench for Fun and Profit integers like 7281 the binary values 1 or 0 Existing Drupal-generated entity IDs (term IDs for taxonomy terms or node IDs for collections and parents), which are integers like 10 or 3549 Workbench-specific structured strings for typed relation (e.g., relators:art:30 ), link fields (e.g., https://acme.net%%Acme Products ), geolocation fields (e.g., \"49.16667,-123.93333\" ), and authority link data (e.g., viaf%%http://viaf.org/viaf/10646807%%VIAF Record ) Note As is standard with CSV data, values do not need to be wrapped in double quotation marks ( \" ) unless they contain an instance of the delimiter character (e.g., a comma) or line breaks. Spreadsheet applications such as Google Sheets, LibreOffice Calc, and Excel will output valid CSV data. If you are using a spreadsheet application, it will take care of wrapping the CSV values in double quotation marks when they are necessary - you do not need to wrap the field values yourself. Reserved CSV columns The following CSV columns are used for specific purposes and in some cases are required in your CSV file, depending on the task you are performing (see below for specific cases). Data in them does not directly populate Drupal content-type fields. CSV field name Task(s) Note id create This CSV field is used by Workbench for internal purposes, and is not added to the metadata of your Islandora objects. Therefore, it doesn't need to have any relationship to the item described in the rest of the fields in the CSV file. You can configure this CSV field name to be something other than id by using the id_field option in your configuration file. Note that if the specified field contains multiple values (e.g. 0001|spec-86389 ), the entire field value will be used as the internal Workbench identifier. Also note that it is important to use values in this column that are unique across input CSV files if you plan to create parent/child relationships across Workbench sessions . parent_id create When creating paged or compound content, this column identifies the parent of the item described in the current row. For information on how to use this field, see \" With page/child-level metadata .\" Note that this field can contain only a single value. In other words, values like id_0029|id_0030 won't work.
If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. node_id update, delete, add_media, export_csv, delete_media_by_node The ID of the node you are updating, deleting, or adding media to. Full URLs (including URL aliases) are also allowed in this CSV field. file create, add_media See detail in \"Values in the 'file' column\", below. media_use_tid create, add_media Tells Workbench which terms from the Islandora Media Use vocabulary to assign to media created in create and add_media tasks. This can be set for all new media in the configuration file; only include it in your CSV if you want row-level control over this value. More detail is available in the \" Configuration \" docs for media_use_tid . url_alias create, update See detail in \" Assigning URL aliases \". image_alt_text create See detail in \" Adding alt text to images \". checksum create See detail in \" Fixity checking \". term_name create_terms See detail in \" Creating taxonomy terms \". Values in the \"file\" column Values in the reserved file CSV field contain the location of files that are used to create Drupal Media. By default, Workbench pushes up to Drupal only one file, and creates only one resulting media per CSV record. However, it is possible to push up multiple files per CSV record (and create all of their corresponding media). File locations in the file field can be relative to the directory named in input_dir , absolute paths, or URLs. Examples of each: relative to directory named in the input_dir configuration setting: myfile.png absolute: /tmp/data/myfile.png . On Windows, you can use values like c:\\users\\mjordan\\files\\myfile.png or \\\\some.windows.file.share.org\\share_name\\files\\myfile.png . URL: http://example.com/files/myfile.png Things to note about file values in general: Relative, absolute, and URL file locations can exist within the same CSV file, or even within the same CSV value. By default, if the file value for a row is empty, Workbench will log the empty value, both in and outside of --check . file values that point to files that don't exist will result in Workbench logging the missing file and then exiting, unless allow_missing_files: true is present in your config file. Adding perform_soft_checks will also tell Workbench to not error out when the value in the file column can't be found. If you want do not want to create media for any of the rows in your CSV file, include nodes_only: true in your configuration file. More detail is available . file values that contain non-ASCII characters are normalized to their ASCII equivalents. See this issue for more information. The Drupal filesystem where files are stored is determined by each media type's file field configuration. It is not possible to override that configuration. Things to note about URLs as file values: Workbench downloads files identified by URLs and saves them in the directory named in temp_dir before processing them further; within this directory, each file is saved in a subdirectory named after the value in the row's id_field field. It does not delete the files from these locations after they have been ingested into Islandora unless the delete_tmp_upload configuration option is set to true . Files identified by URLs must be accessible to the Workbench script, which means they must not require a username/password; however, they can be protected by a firewall, etc. 
as long as the computer running Workbench is allowed to retrieve the files without authenticating. Currently Workbench requires that the URLs point directly to a file or a service that generates a file, and not a wrapper page or other indirect route to the file. Required columns A small number of columns are required in your CSV, depending on the task you are performing: Task Required in CSV Note create id See detail in \"Reserved CSV fields\", above. title The node title. file Empty values in the file field are allowed if allow_missing_files is present in your configuration file, in which case a node will be created but it will have no attached media. update node_id The node ID of an existing node you are updating. delete node_id The node ID of an existing node you are deleting. add_media node_id The node ID of an existing node you are attaching media to. file Must contain a filename, file path, or URL. allow_missing_files only works with the create task. If a required field is missing from your CSV, --check will tell you. Columns you want Workbench to ignore In some cases you might want to include columns in your CSV that you want Workbench to ignore. More information on this option is available in the \"Sharing the input CSV with other applications\" section of the Workflows documentation. CSV fields that contain Drupal field data These are of two types of Drupal fields, base fields and content-type specific fields. Base fields Base fields are basic node properties, shared by all content types. The base fields you can include in your CSV file are: title : This field is required for all rows in your CSV for the create task. Optional for the 'update' task. Drupal limits the title's length to 255 characters,unless the Node Title Length contrib module is installed. If that module is installed, you can set the maximum allowed title length using the max_node_title_length configuration setting. langcode : The language of the node. Optional. If included, use one of Drupal's language codes as values (common values are 'en', 'fr', and 'es'; the entire list can be seen here . If absent, Drupal sets the value to the default value for your content type. uid : The Drupal user ID to assign to the node and media created with the node. Optional. Only available in create tasks. If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). created : The timestamp to use in the node's \"created\" attribute and in the \"created\" attribute of the media created with the node. Optional, but if present, it must be in format 2020-11-15T23:49:22+00:00 (the +00:00 is the difference to Greenwich time/GMT). If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). published : Whether or not the node (and all accompanying media) is published. If present in add_media tasks, will override parent node's published value. Values in this field are either 1 (for published) or 0 (for unpublished). The default value for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . promote : Whether or not the node is promoted to the site's front page. 1 (for promoted) or 0 (for not promoted). 
The default vaue for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . All base fields other than uid can be included in both create and update tasks. Content type-specific fields These fields correspond directly to fields configured in Drupal nodes, and data you provide in them populates their equivalent field in Drupal entities. The column headings in the CSV file must match machine names of fields that exist in the target node content type. Fields' machine names are visible within the \"Manage fields\" section of each content type's configuration, here circled in red: These field names, plus the fields indicated in the \"Reserved CSV fields\" section above, are the column headers in your CSV file, like this: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Note If content-type field values apply to all of the rows in your CSV file, you can avoid including them in the CSV and instead use \" CSV field templates \". Using field labels as CSV column headers By default, Workbench requires that column headers in your CSV file use the machine name of Drupal fields. However, in \"create\", \"update\", and \"create_terms\" tasks, you can use the field labels if you include csv_headers: labels in your configuration file. If you do this, you can use CSV file like this: file,id,title,Model,Description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Some things to note about using field labels in your CSV: if the content type (or vocabulary) that you are populating uses the same label for multiple fields, you won't be able to use labels as your CSV column headers. --check will tell you if there are any duplicate field labels. Spaces in field labels are OK, e.g. Country of Publication . Spelling, capitalization, punctuation, etc. in CSV column headers must match the field labels exactly. If any field labels contain the character you are using as the CSV delimiter (defined in the delimiter config setting), you will need to wrap the column header in quotation marks, e.g. \"Height, length, weight\" . Single and multi-valued fields Drupal allows for fields to have a single value, a specific maximum number of values, or unlimited number of values. In the CSV input file, each Drupal field corresponds to a single CSV field. In other words, the CSV column names must be unique, even if a Drupal field allows multiple values. Populating multivalued fields is explained below. 
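Returning to the column headers discussed above: before running --check, you can do a rough sanity check of your own that every column header is either a reserved Workbench column or a field machine name you expect. This is only a hand-rolled approximation of one of the checks --check performs; the metadata.csv filename and the field machine names listed here are examples you would replace with your own.
import csv

# Reserved Workbench columns described above, plus the machine names of the
# fields on your target content type. Both sets here are examples only.
reserved_columns = {'id', 'parent_id', 'file', 'title', 'url_alias',
                    'image_alt_text', 'checksum', 'media_use_tid'}
content_type_fields = {'field_model', 'field_description', 'field_member_of',
                       'field_tags', 'field_rights'}

with open('metadata.csv', newline='', encoding='utf-8') as csv_file:
    headers = csv.DictReader(csv_file).fieldnames

for header in headers:
    if header not in reserved_columns and header not in content_type_fields:
        print(f'Unexpected column header: {header}')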
Single-valued fields In your CSV file, single-valued fields simply contain the value, which, depending on the field type, can be a string or an integer. For example, using the fields defined by the Islandora Defaults module for the \"Repository Item\" content type, your CSV file could look like this: file,title,id,field_model,field_description,field_rights,field_extent,field_access_terms,field_member_of myfile.jpg,My nice image,obj_00001,24,\"A fine image, yes?\",Do whatever you want with it.,There's only one image.,27,45 In this example, the term ID for the tag you want to assign in field_access_terms is 27, and the node ID of the collection you want to add the object to (in field_member_of ) is 45. Multivalued fields For multivalued fields, you separate the values within a field with a pipe ( | ), like this: file,title,field_something IMG_1410.tif,Small boats in Havana Harbour,One subvalue|Another subvalue IMG_2549.jp2,Manhatten Island,first subvalue|second subvalue|third subvalue This works for string fields as well as taxonomy reference fields, e.g.: file,title,field_my_multivalued_taxonomy_field IMG_1410.tif,Small boats in Havana Harbour,35|46 IMG_2549.jp2,Manhatten Island,34|56|28 Drupal strictly enforces the maximum number of values allowed in a field. If the number of values in your CSV file for a field exceed a field's configured maximum number of fields, Workbench will only populate the field to the field's configured limit. The subdelimiter character defaults to a pipe ( | ) but can be set in your config file using the subdelimiter configuration setting. Note Workbench will remove duplicate values in CSV fields. For example, if you accidentally use first subvalue|second subvalue|second subvalue in your CSV, Workbench will filter out the superfluous second subvalue . This applies to both create and update tasks, and within update tasks, replacing values and appending values to existing ones. Workbench deduplicates CVS values silently: it doesn't log the fact that it is doing it. Drupal field types The following types of Drupal fields can be populated from data in your input CSV file: text (plain, plain long, etc.) fields integer fields boolean fields, with values 1 or 0 EDTF date fields entity reference (taxonomy and linked node) fields typed relation (taxonomy) fields link fields geolocation fields Drupal is very strict about not accepting malformed data. Therefore, Islandora Workbench needs to provide data to Drupal that is consistent with field types (string, taxonomy reference, EDTF, etc.) we are populating. This applies not only to Drupal's base fields (as we saw above) but to all fields. A field's type is indicated in the same place as its machine name, within the \"Manage fields\" section of each content type's configuration. The field types are circled in red in the screen shot below: Below are guidelines for preparing CSV data that is compatible with common field types configured in Islandora repositories. Text fields Generally speaking, any Drupal field where the user enters free text into a node add/edit form is configured to be one of the Drupal \"Text\" field types. Islandora Workbench supports non-Latin characters in CSV, provided the CSV file is encoded as UTF-8. 
For example, the following non-Latin text will be added as expected to Drupal fields: \u4e00\u4e5d\u4e8c\u56db\u5e74\u516d\u6708\u5341\u4e8c\u65e5 (Traditional Chinese) \u0938\u0930\u0915\u093e\u0930\u0940 \u0926\u0938\u094d\u0924\u093e\u0935\u0947\u095b, \u0905\u0916\u092c\u093e\u0930\u094b\u0902 \u092e\u0947\u0902 \u091b\u092a\u0947 \u0932\u0947\u0916, \u0905\u0915\u093e\u0926\u092e\u093f\u0915 \u0915\u093f\u0924\u093e\u092c\u0947\u0902 (Hindi) \u140a\u1455\u1405\u14ef\u1585 \u14c4\u14c7, \u1405\u14c4\u1585\u1450\u1466 \u14c2\u1432\u1466 (Inuktitut) However, if all of your characters are Latin (basically, the characters found on a standard US keyboard) your CSV file can be encoded as ASCII. Some things to note about Drupal text fields: Some specialized forms of text fields, such as EDTF, enforce or prohibit the presence of specific types of characters (see below for EDTF's requirements). Islandora Workbench populates Drupal text fields verbatim with the content provided in the CSV file, with these exceptions . Plus, if a text value in your CSV is longer than its field's maximum configured length, Workbench will truncate the text (see the next point and warning below). Text fields may be configured to have a maximum length. Running Workbench with --check will produce a warning (both shown to the user and written to the Workbench log) if any of the values in your CSV file are longer than their field's configured maximum length. Warning If the CSV value for text field exceeds its configured maximum length, Workbench truncates the value to the maximum length before populating the Drupal field, leaving a log message indicating that it has done so. Text fields with markup Drupal text fields that are configured to contain \"formatted\" text (for example, text with line breaks or HTML markup) will have one of the available text formats, such as \"Full HTML\" or \"Basic HTML\", applied to them. Workbench treats these fields these fields the same as if they are populated using the node add/edit form, but you will have to tell Workbench, in your configuration file, which text format to apply to them. When you populate these fields using the node add/edit form, you need to select a text format within the WYSIWYG editor: When populating these fields using Workbench, you can configure which text format to use either 1) for all Drupal \"formatted\" text fields or 2) using a per-field configuration. 1) To configure the text format to use for all \"formatted\" text fields, include the text_format_id setting in your configuration file, indicating the ID of the text format to use, e.g., text_format_id: full_html . The default value for this setting is basic_html . 2) To configure text formats on a per-field basis, include the field_text_format_ids (plural) setting in your configuration file, along with a field machine name-to-format ID mapping, like this: field_text_format_ids: - field_description_long: full_html - field_abstract: restricted_html If you use both settings in your configuration file, field_text_format_ids takes precedence. You only need to configure text formats per field to override the global setting. Note Workbench has no way of knowing what text formats are configured in the target Drupal, and has no way of validating that the text format ID you use in your configuration file exists. 
However, if you use a text format ID that is invalid, Drupal will not allow nodes to be created or updated and will leave error messages in your Workbench log that contain text like Unprocessable Entity: validation failed.\\nfield_description_long.0.format: The value you selected is not a valid choice. By default, Drupal comes configured with three text formats, full_html , basic_html , and restricted_html . If you create your own text format at admin/config/content/formats , you can use its ID in the Workbench configuration settings described above. If you want to include line breaks in your CSV, they must be physical line breaks. \\n and other escaped line break characters are not recognized by Drupal's \"Convert line breaks into HTML (i.e.
<br> and <p>
    )\" text filter. CSV data containing physical line breaks must be wrapped in quotation marks, like this: id,file,title,field_model,field_description_long 01,,I am a title,Image,\"Line breaks are awesome.\" Taxonomy reference fields Note In the list of a content type's fields, as pictured above, Drupal uses \"Entity reference\" for all types of entity reference fields, of which Taxonomy references are one. The other most common kind of entity reference field is a node reference field. Islandora Workbench lets you assign both existing and new taxonomy terms to nodes. Creating new terms on demand during node creation reduces the need to prepopulate your vocabularies prior to creating nodes. In CSV columns for taxonomy fields, you can use either term IDs (integers) or term names (strings). You can even mix IDs and names in the same field: file,title,field_my_multivalued_taxonomy_field img001.png,Picture of cats and yarn,Cats|46 img002.png,Picture of dogs and sticks,Dogs|Sticks img003.png,Picture of yarn and needles,\"Yarn, Balls of|Knitting needles\" By default, if you use a term name in your CSV data that doesn't match a term name that exists in the referenced taxonomy, Workbench will detect this when you use --check , warn you, and exit. This strict default is intended to prevent users from accidentally adding unwanted terms through data entry error. Terms can be from any level in a vocabulary's hierarchy. In other words, if you have a vocabulary whose structure looks like this: you can use the terms IDs or labels for \"Automobiles\", \"Sports cars\", or \"Land Rover\" in your CSV. The term name (or ID) is all you need; no indication of the term's place in the hierarchy is required. If you add allow_adding_terms: true to your configuration file for any of the entity \"create\" or \"update\" tasks, Workbench will create the new term the first time it is used in the CSV file following these rules: If multiple records in your CSV contain the same new term name in the same field, the term is only created once. When Workbench checks to see if the term with the new name exists in the target vocabulary, it queries Drupal for the new term name, looking for an exact match against an existing term in the specified vocabulary. Therefore it is important that term names used in your CSV are identical to existing term names. The query to find existing term names follows these two rules: Leading and trailing whitespace on term names is ignored. Internal whitespace is significant. Case is ignored. Note that Drupal does not distinguish between diacritics. For example, it does not distinguish between \"Chylek\" and \"Ch\u00fdlek\". If the term name you provide in the CSV file does not match an existing term name in its vocabulary, the term name from the CSV data is used to create a new term. If it does match, Workbench populates the field in your nodes with a reference to the matching term. allow_adding_terms applies to all vocabularies. In general, you do not want to add new terms to vocabularies used by Islandora for system functions such as Islandora Models and Islandora Media Use. In order to exclude vocabularies from being added to, you can register vocabulary machine names in the protected_vocabularies setting, like this: protected_vocabularies: - islandora_model - islandora_display - islandora_media_use Adding new terms has some constraints: Terms created in this way do not have any external URIs, other fields, or if they are hierarchical. 
If you want your terms that have any of these features, you will need to either create the terms manually, through a create_terms task, or using a third-party module like Taxonomy Import prior to using their term names in an input CSV. Workbench cannot distinguish between identical term names within the same vocabulary. This means you cannot create two different terms that have the same term name (for example, two terms in the Person vocabulary that are identical but refer to two different people). The workaround for this is to create one of the terms before using Workbench and use the term ID instead of the term string. If the same term name exists multiple times in the same vocabulary (again using the example of two Person terms that describe two different people) you should be aware that when you use these identical term names within the same vocabulary in your CSV, Workbench will always choose the first one it encounters when it converts from term names to term IDs while populating your nodes. The workaround for this is to use the term ID for one (or both) of the identical terms, or to use URIs for one (or both) of the identical terms. --check will identify any new terms that exceed Drupal's maximum allowed length for term names, 255 characters. If a term name is longer than 255 characters, Workbench will truncate it at that length, log that it has done so, and create the term. Taxonomy terms created with new nodes are not removed when you delete the nodes. Using term names in multi-vocabulary fields While most node taxonomy fields reference only a single vocabulary, Drupal does allow fields to reference multiple vocabularies. This ability poses a problem when we use term names instead of term IDs in our CSV files: in a multi-vocabulary field, Workbench can't be sure which term name belongs in which of the multiple vocabularies referenced by that field. This applies to both existing terms and to new terms we want to add when creating node content. To avoid this problem, we need to tell Workbench which of the multiple vocabularies each term name should (or does) belong to. We do this by namespacing terms with the applicable vocabulary ID. For example, let's imagine we have a node field whose name is field_sample_tags , and this field references two vocabularies, \"cats\" and \"dogs\". To use the terms Tuxedo , Tabby , German Shepherd in the CSV when adding new nodes, we need to namespace them with vocabulary IDs like this: field_sample_tags cats:Tabby cats:Tuxedo dogs:German Shepherd If you want to use multiple terms in a single field, you would namespace them all: cats:Tuxedo|cats:Misbehaving|dogs:German Shepherd To find the vocabulary ID (referred to above as the \"namespace\") to use, visit the list of your site's vocabularies at admin/structure/taxonomy : Hover your pointer over the \"List terms\" button for each vocabulary to reveal the URL to its overview page. The ID for the vocabulary is the string between \"manage\" and \"overview\" in the URL. For example, in the URL admin/structure/taxonomy/manage/person/overview , the vocabulary ID is \"person\". This is the namespace you need to use to indicate which vocabulary to add new terms to. 
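A small sketch of how the namespacing convention described above can be parsed may make the structure clearer. This is an illustration of the CSV convention only, not Workbench's internal code; it splits each subvalue on the first colon, since (as noted below) namespaced term names cannot themselves contain a colon.
# Illustration only (not Workbench's internal code) of the namespacing convention:
# each subvalue is vocabulary_id:term name, and subvalues are separated by the
# subdelimiter (| by default).
def parse_namespaced_terms(csv_value, subdelimiter='|'):
    pairs = []
    for subvalue in csv_value.split(subdelimiter):
        # Split on the first colon only; the term name follows it.
        vocabulary_id, term_name = subvalue.split(':', 1)
        pairs.append((vocabulary_id.strip(), term_name.strip()))
    return pairs

print(parse_namespaced_terms('cats:Tuxedo|cats:Misbehaving|dogs:German Shepherd'))
# [('cats', 'Tuxedo'), ('cats', 'Misbehaving'), ('dogs', 'German Shepherd')]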
CSV values containing term names that have commas ( , ) in multi-valued, multi-vocabulary fields need to be wrapped in quotation marks (like any CSV value containing a comma), and in addition, the need to specify the namespace within each of the subvalues: \"tags:gum, Bubble|tags:candy, Hard\" Using these conventions, Workbench will be certain which vocabulary the term names belong to. Workbench will remind you during its --check operation that you need to namespace terms. It determines 1) if the field references multiple vocabularies, and then checks to see 2) if the field's values in the CSV are term IDs or term names. If you use term names in multi-vocabulary fields, and the term names aren't namespaced, Workbench will warn you: Error: Term names in multi-vocabulary CSV field \"field_tags\" require a vocabulary namespace; value \"Dogs\" in row 4 does not have one. Note that since : is a special character when you use term names in multi-vocabulary CSV fields, you can't add a namespaced term that itself contains a : . You need to add it manually to Drupal and then use its term ID (or name, or URI) in your CSV file. Using term URIs instead of term IDs Islandora Workbench lets you use URIs assigned to terms instead of term IDs. You can use a term URI in the media_use_tid configuration option (for example, \"http://pcdm.org/use#OriginalFile\" ) and in taxonomy fields in your metadata CSV file: field_model https://schema.org/DigitalDocument http://purl.org/coar/resource_type/c_18cc During --check , Workbench will validate that URIs correspond to existing taxonomy terms. Using term URIs has some constraints: You cannot create a new term by providing a URI like you can by providing a term name. If the same URI is registered with more than one term, Workbench will choose one and write a warning to the log indicating which term it chose and which terms the URI is registered with. However, --check will detect that a URI is registered with more than one term and warn you. At that point you can edit your CSV file to use the correct term ID rather than the URI. Using numbers as term names If you want to use a term name like \"1990\" in your CSV, you need to tell Workbench to not interpret that term name as a term ID. To do this, add a list of CSV columns to your config file using the columns_with_term_names config setting that will only contain term names (and not IDs or URIs, as explained next): columns_with_term_names: - field_subject - field_tags If you register a column name here, it can contain only terms names. Any term ID or URIs will be interpreted as term names. Note that this is only necessary if your term names are comprised entirely of integers. If they contain decimals (like \"1990.3\"), Workbench will not interpret them as term IDs and you will not need to tell Workbench to do otherwise. Entity Reference Views fields Islandora Workbench fully supports taxonomy reference fields that use the \"Default\" reference type, but only partially supports \"Views: Filter by an entity reference View\" taxonomy reference fields. To populate this type of entity reference in \"create\" and \"update\" tasks, you have two options. Warning Regardless of whether you use term IDs or term names in your CSV, Workbench will not validate values in \"Views: Filter by an entity reference View\" taxonomy reference fields. Term IDs or term names that are not in the referenced View will result in the node not being created or updated (Drupal will return a 422 response). 
However, if allow_adding_terms is set to true , terms that are not in the referenced vocabulary will be added to the vocabulary if your CSV data contains a vocabulary ID/namespace in the form vocabid:newterm . The terms will be added regardless of whether they are within the referenced View. Therefore, for this type of Drupal field, you should not include vocabulary IDs/namespaces in your CSV data for that field. Further work on supporting this type of field is being tracked in this Github issue . First, if your input CSV contains only term IDs in this type of column, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Alternatively, if you prefer to use term names instead of term IDs in CSV columns for this type of field, you will need to create a special Display for the View that is referenced from that field. To do this: In the View that is referenced, duplicate the view display as a REST Export display. When making the changes to the resulting REST Export display, be careful to modify that display only (and not \"All displays\") by always choosing \"This rest_export (override)\" during every change. Format: Serializer Settings: json Path: some_path (do not include the leading / ) Authentication: Basic Auth Access restrictions: Role > View published content (the default; \"administrator vocabularies and terms\" is needed for other endpoints used by Workbench, but this view doesn't require this.) Filter Criteria: Add \"Taxonomy term / name\" from the list of fields Expose this filter Choose the \"Is equal to\" operator Leave the Value field empty In the Field identifier field, enter \"name\" Your configuration for the new \"Taxonomy term: name\" filter should look like this: Then in your Workbench configuration file, using the entity_reference_view_endpoints setting, provide a mapping between columns in your CSV file and the applicable Views REST Export display path value (configured in step 3 above). In this example, we define three field/path mappings: entity_reference_view_endpoints: - field_linked_agent: /taxonomy/linked_agents - field_language: /language/lookup - field_subject: /taxonomy/subjects During \"create\" and \"update\" tasks, --check will tell you if the View REST Export display path is accessible. Typed Relation fields Typed relation fields contain information about the relationship (or \"relation\") between a taxonomy term and the node it is attached to. For example, a term from the Person vocabulary, \"Jordan, Mark\", can be an author, illustrator, or editor of the book described in the node. In this example, \"author\", \"illustrator\", and \"editor\" are the typed relations. Note Although Islandora supports Typed Relation fields that allow adding relations to other nodes, currently Workbench only supports adding relations to taxonomies. If you need support for adding Typed Relations to other entities, please leave a comment on this issue . The Controlled Access Terms module allows the relations to be sets of terms from external authority lists (for example, the MARC Relators list maintained by the Library of Congress). Within a Typed Relation field's configuration, the configured relations look like this: In this screenshot, \"relators\" is a namespace for the MARC Relators authority list, the codes \"acp\", \"adi\", etc. are the codes for each relator, and \"Art copyist\", \"Art director\", etc. are the human-readable labels for each relator.
Within the edit form of a node that has a Typed Relation field, the user interface adds a select list of the relation (the target taxonomy term here is \"Jordan, Mark (30))\", like this: To be able to populate Typed Relation fields using CSV data with the three pieces of required data (authority list, relation type, target term), Islandora Workbench supports CSV values that contain the corresponding namespace, relator code, and taxonomy term ID, each separated by a colon ( : ), like this: relators:art:30 In this example CSV value, relators is the namespace that the relation type art is from (the Library of Congress Relators vocabulary), and the target taxonomy term ID is 30 . Note Note that the structure required for typed relation values in the CSV file is not the same as the structure of the relations configuration depicted in the screenshot of the \"Available Relations\" list above. A second option for populating Typed Relation fields is to use taxonomy term names (as opposed to term IDs) as targets: \"relators:art:Jordan, Mark\" Warning In the next few paragraphs, the word \"namespace\" is used to describe two different kinds of namespaces - first, a vocabulary ID in the local Drupal and second, an ID for the external authority list of relators, for example by the Library of Congress. As we saw in the \"Using term names in multi-vocabulary fields\" section above, if the field that we are populating references multiple vocabularies, we need to tell Drupal which vocabulary we are referring to with a local vocabulary namespace. To add a local vocabulary namespace to Typed Relation field CSV structure, we prepend it to the term name, like this (note the addition of \"person\"): \"relators:art:person:Jordan, Mark\" (In this example, relators is the external authority lists's namespace, and person is the local Drupal vocabulary namespace, prepended to the taxonomy term name, \"Jordan, Mark\".) If this seems confusing and abstruse, don't worry. Running --check will tell you that you need to add the Drupal vocabulary namespace to values in specific CSV columns. The final option for populating Typed Relation field is to use HTTP URIs as typed relation targets: relators:art:http://markjordan.net If you want to include multiple typed relation values in a single field of your CSV file (such as in \"field_linked_agent\"), separate the three-part values with the same subdelimiter character you use in other fields, e.g. ( | ) (or whatever you have configured as your subdelimiter ): relators:art:30|relators:art:45 or \"relators:art:person:Jordan, Mark|relators:art:45\" Adding new typed relation targets Islandora Workbench allows you to add new typed relation targets while creating and updating nodes. These targets are taxonomy terms. Your configuration file must include the allow_adding_terms: true option to add new targets. In general, adding new typed relation targets is just like adding new taxonomy terms as described above in the \"Taxonomy relation fields\" section. An example of a CSV value that adds a new target term is: \"relators:art:person:Jordan, Mark\" You can also add multiple new targets: \"relators:art:person:Annez, Melissa|relators:art:person:Jordan, Mark\" Note that: For multi-vocabulary fields, new typed relator targets must be accompanied by a vocabulary namespace ( person in the above examples). You cannot add new relators (e.g. relators:foo ) in your CSV file, only new target terms. Note Adding the typed relation namespace, relators, and vocabulary names is a major hassle. 
If this information is the same for all values (in all rows) in your field_linked_agent column (or any other typed relation field), you can use CSV value templates to reduce the tedium. EDTF fields Running Islandora Workbench with --check will validate Extended Date/Time Format (EDTF) Specification dates (Levels 0, 1, and 2) in EDTF fields. Some common examples include: Type Examples Date 1976-04-23 1976-04 Qualified date 1976? 1976-04~ 1976-04-24% Date and time 1985-04-12T23:20:30 Interval 1964/2008 2004-06/2006-08 2004-06-04/2006-08-01 2004-06/2006-08-01 Set [1667,1668,1670..1672] [1672..1682] [1672,1673] [..1672] [1672..] Subvalues in multivalued CSV fields are validated separately, e.g. if your CSV value is 2004-06/2006-08|2007-01/2007-04 , 2004-06/2006-08 and 2007-01/2007-04 are validated separately. Note EDTF supports a very wide range of specific and general dates, and in some cases, valid dates can look counterintuitive. For example, \"2001-34\" is valid (it's Sub-Year Grouping meaning 2nd quarter of 2001). Link fields The link field type stores URLs (e.g. https://acme.com ) and link text in separate data elements. To add or update fields of this type, Workbench needs to provide the URL and link text in the structure Drupal expects. To accomplish this within a single CSV field, we separate the URL and link text pairs in CSV values with double percent signs ( %% ), like this: field_related_websites http://acme.com%%Acme Products Inc. You can include multiple pairs of URL/link text pairs in one CSV field if you separate them with the subdelimiter character: field_related_websites http://acme.com%%Acme Products Inc.|http://diy-first-aid.net%%DIY First Aid The URL is required, but the link text is not. If you don't have or want any link text, omit it and the double quotation marks: field_related_websites http://acme.com field_related_websites http://acme.com|http://diy-first-aid.net%%DIY First Aid Authority link fields The authority link field type stores abbreviations for authority sources (i.e., external controlled vocabularies such as national name authorities), authority URIs (e.g. http://viaf.org/viaf/153525475 ) and link text in separate data elements. Authority link fields are most commonly used on taxonomy terms, but can be used on nodes as well. To add or update fields of this type, Workbench needs to provide the authority source abbreviation, URI and link text in the structure Drupal expects. To accomplish this within a single CSV field, we separate the three parts in CSV values with double percent signs ( %% ), like this: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record You can include multiple triplets of source abbreviation/URL/link text in one CSV field if you separate them with the subdelimiter character: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record|other%%https://github.com/mjordan%%Github The authority source abbreviation and the URI are required, but the link text is not. If you don't have or want any link text, omit it: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807 field_authority_vocabs viaf%%http://viaf.org/viaf/10646807|other%%https://github.com/mjordan%%Github Geolocation fields The Geolocation field type, managed by the Geolocation Field contrib module, stores latitude and longitude coordinates in separate data elements. To add or update fields of this type, Workbench needs to provide the latitude and longitude data in these separate elements. 
To simplify entering geocoordinates in the CSV file, Workbench allows geocoordinates to be in lat,long format, i.e., the latitude coordinate followed by a comma followed by the longitude coordinate. When Workbench reads your CSV file, it will split data on the comma into the required lat and long parts. An example of a single geocoordinate in a field would be: field_coordinates \"49.16667,-123.93333\" You can include multiple pairs of geocoordinates in one CSV field if you separate them with the subdelimiter character: field_coordinates \"49.16667,-123.93333|49.25,-124.8\" Note that: Geocoordinate values in your CSV need to be wrapped in double quotation marks, unless the delimiter key in your configuration file is set to something other than a comma. If you are entering geocoordinates into a spreadsheet, you may need to escape leading + and - signs since they will make the spreadsheet application think you are entering a formula. You can work around this by escaping the + and - with a backslash ( \\ ), e.g., +49.16667,-123.93333 should be \\+49.16667,-123.93333 , and +49.16667,-123.93333|+49.25,-124.8 should be \\+49.16667,-123.93333|\\+49.25,-124.8 . Workbench will strip the leading \\ before it populates the Drupal fields. Excel: leading + and - need to be escaped Google Sheets: only + needs to be escaped LibreOffice Calc: neither + nor - needs to be escaped Paragraphs (Entity Reference Revisions fields) Entity Reference Revisions fields are similar to Drupal's core Entity Reference fields used for taxonomy terms but are intended for entities that are not meant to be referenced outside the context of the item that references them. For Islandora sites, this is used for Paragraph entities. In order to populate paragraph entities using Workbench in create and update tasks, you need to enable and configure the REST endpoints for paragraphs. To do this, ensure the REST UI module is enabled, then go to Configuration/Web Services/REST Resources ( /admin/config/services/rest ) and enable \"Paragraph\". Then edit the settings for Paragraph to the following: Granularity = Method GET formats: jsonld, json authentication: jwt_auth, basic_auth, cookie POST formats: json authentication: jwt_auth, basic_auth, cookie DELETE formats: json authentication: jwt_auth, basic_auth, cookie PATCH formats: json authentication: jwt_auth, basic_auth, cookie Note Paragraphs are locked down from REST updates by default. To add new paragraph values and update existing ones, you must enable the paragraphs_type_permissions submodule and ensure the Drupal user in your configuration file has sufficient privileges granted at /admin/people/permissions/module/paragraphs_type_permissions . Paragraphs are handy for Islandora as they provide flexibility in the creation of more complex metadata (such as complex titles, typed notes, or typed identifiers) by adding Drupal fields to paragraph entities, unlike the Typed Relation field, which hard-codes its properties. However, this flexibility makes creating a Workbench import more complicated and, as such, requires additional configuration. For example, suppose you have a \"Full Title\" field ( field_full_title ) on your Islandora Object content type referencing a paragraph type called \"Complex Title\" ( complex_title ) that contains \"main title\" ( field_main_title ) and \"subtitle\" ( field_subtitle ) text fields.
The input CSV would look like: field_full_title My Title: A Subtitle|Alternate Title In this example we have two title values, \"My Title: A Subtitle\" (where \"My Title\" is the main title and \" A Subtitle\" is the subtitle) and \"Alternate Title\" (which only has a main title). To map these CSV values to our paragraph fields, we need to add the following to our configuration file: paragraph_fields: node: field_full_title: type: complex_title field_order: - field_main_title - field_subtitle field_delimiter: ':' subdelimiter: '|' This configuration defines the paragraph field on the node ( field_full_title ) and its child fields ( field_main_title and field_subtitle ), which occur within the paragraph in the order they are named in the field_order property. Within the data in the CSV column, the values corresponding to the order of those fields are separated by the character defined in the field_delimiter property. subdelimiter here is the same as the subdelimiter configuration setting used in non-paragraph multi-valued fields; in this example it overrides the default or globally configured value. We use a colon for the field delimiter in this example as it is often used in titles to denote subtitles. Note that in the above example, the space before \"A\" in the subtitle will be preserved. Whether or not you want a space there in your data will depend on how you display the Full Title field. Warning Note that Workbench assumes all fields within a paragraph are single-valued. When using Workbench to update paragraphs using update_mode: replace , any null values for fields within the paragraph (such as the null subtitle in the second \"Alternate Title\" instance in the example) will null out existing field values. However, considering each paragraph as a whole field value, Workbench behaves the same as for all other fields - update_mode: replace will replace all paragraph entities with the ones in the CSV, but if the CSV does not contain any values for this field then the field will be left as is. Values in the \"field_member_of\" column The field_member_of column can take a node ID, a full URL to a node, or a URL alias. For instance, all of these refer to the same node and can be used in field_member_of : 648 (node ID) https://islandora.traefik.me/node/648 (full URL) https://islandora.traefik.me/mycollection (full URL using an alias) /mycollection (URL alias) If you use any of these types of values other than the bare node ID, Workbench will look up the node ID based on the URL or alias. Values in the \"field_domain_access\" column The Domain Access module, part of the Domain suite of modules, creates a required, multivalued field with the machine name field_domain_access that controls, at the node level, which domains the node shows up in. When populating this field in your Workbench CSV, replace the periods in domain names with _ . For example, if the domains you want to allow a node to show up in are test1.testing.edu and test2.testing.edu , the values in your field_domain_access field look like this: test1_testing_edu test1_testing_edu test1_testing_edu|test2_testing_edu test2_testing_edu","title":"Field data (Drupal and CSV)"},{"location":"fields/#reserved-csv-columns","text":"The following CSV columns are used for specific purposes and in some cases are required in your CSV file, depending on the task you are performing (see below for specific cases). Data in them does not directly populate Drupal content-type fields. 
CSV field name Task(s) Note id create This CSV field is used by Workbench for internal purposes, and is not added to the metadata of your Islandora objects. Therefore, it doesn't need to have any relationship to the item described in the rest of the fields in the CSV file. You can configure this CSV field name to be something other than id by using the id_field option in your configuration file. Note that if the specified field contains multiple values, (e.g. 0001|spec-86389 ), the entire field value will be used as the internal Workbench identifier. Also note that it is important to use values in this column that are unique across input CSV files if you plan to create parent/child relationships accross Workbench sessions . parent_id create When creating paged or compound content, this column identifies the parent of the item described in the current row. For information on how to use this field, see \" With page/child-level metadata .\" Note that this field can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. node_id update, delete, add_media, export_csv, delete_media_by_node The ID of the node you are updating, deleting, or adding media to. Full URLs (including URL aliases) are also allowed in this CSV field. file create, add_media See detail in \"Values in the 'file' column\", below. media_use_tid create, add_media Tells Workbench which terms from the Islandora Media Use vocabulary to assign to media created in create and add_media tasks. This can be set for all new media in the configuration file; only include it in your CSV if you want row-level control over this value. More detail is available in the \" Configuration \" docs for media_use_tid . url_alias create, update See detail in \" Assigning URL aliases \". image_alt_text create See detail in \" Adding alt text to images \". checksum create See detail in \" Fixity checking \". term_name create_terms See detail in \" Creating taxonomy terms \".","title":"Reserved CSV columns"},{"location":"fields/#values-in-the-file-column","text":"Values in the reserved file CSV field contain the location of files that are used to create Drupal Media. By default, Workbench pushes up to Drupal only one file, and creates only one resulting media per CSV record. However, it is possible to push up multiple files per CSV record (and create all of their corresponding media). File locations in the file field can be relative to the directory named in input_dir , absolute paths, or URLs. Examples of each: relative to directory named in the input_dir configuration setting: myfile.png absolute: /tmp/data/myfile.png . On Windows, you can use values like c:\\users\\mjordan\\files\\myfile.png or \\\\some.windows.file.share.org\\share_name\\files\\myfile.png . URL: http://example.com/files/myfile.png Things to note about file values in general: Relative, absolute, and URL file locations can exist within the same CSV file, or even within the same CSV value. By default, if the file value for a row is empty, Workbench will log the empty value, both in and outside of --check . file values that point to files that don't exist will result in Workbench logging the missing file and then exiting, unless allow_missing_files: true is present in your config file. Adding perform_soft_checks will also tell Workbench to not error out when the value in the file column can't be found. 
If you do not want to create media for any of the rows in your CSV file, include nodes_only: true in your configuration file. More detail is available . file values that contain non-ASCII characters are normalized to their ASCII equivalents. See this issue for more information. The Drupal filesystem where files are stored is determined by each media type's file field configuration. It is not possible to override that configuration. Things to note about URLs as file values: Workbench downloads files identified by URLs and saves them in the directory named in temp_dir before processing them further; within this directory, each file is saved in a subdirectory named after the value in the row's id_field field. It does not delete the files from these locations after they have been ingested into Islandora unless the delete_tmp_upload configuration option is set to true . Files identified by URLs must be accessible to the Workbench script, which means they must not require a username/password; however, they can be protected by a firewall, etc. as long as the computer running Workbench is allowed to retrieve the files without authenticating. Currently Workbench requires that the URLs point directly to a file or a service that generates a file, and not a wrapper page or other indirect route to the file.","title":"Values in the \"file\" column"},{"location":"fields/#required-columns","text":"A small number of columns are required in your CSV, depending on the task you are performing: Task Required in CSV Note create id See detail in \"Reserved CSV fields\", above. title The node title. file Empty values in the file field are allowed if allow_missing_files is present in your configuration file, in which case a node will be created but it will have no attached media. update node_id The node ID of an existing node you are updating. delete node_id The node ID of an existing node you are deleting. add_media node_id The node ID of an existing node you are attaching media to. file Must contain a filename, file path, or URL. allow_missing_files only works with the create task. If a required field is missing from your CSV, --check will tell you.","title":"Required columns"},{"location":"fields/#columns-you-want-workbench-to-ignore","text":"In some cases you might want to include columns in your CSV that you want Workbench to ignore. More information on this option is available in the \"Sharing the input CSV with other applications\" section of the Workflows documentation.","title":"Columns you want Workbench to ignore"},{"location":"fields/#csv-fields-that-contain-drupal-field-data","text":"These are of two types of Drupal fields: base fields and content-type specific fields.","title":"CSV fields that contain Drupal field data"},{"location":"fields/#base-fields","text":"Base fields are basic node properties, shared by all content types. The base fields you can include in your CSV file are: title : This field is required for all rows in your CSV for the create task. Optional for the 'update' task. Drupal limits the title's length to 255 characters, unless the Node Title Length contrib module is installed. If that module is installed, you can set the maximum allowed title length using the max_node_title_length configuration setting. langcode : The language of the node. Optional. If included, use one of Drupal's language codes as values (common values are 'en', 'fr', and 'es'; the entire list can be seen here ). If absent, Drupal sets the value to the default value for your content type.
uid : The Drupal user ID to assign to the node and media created with the node. Optional. Only available in create tasks. If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). created : The timestamp to use in the node's \"created\" attribute and in the \"created\" attribute of the media created with the node. Optional, but if present, it must be in format 2020-11-15T23:49:22+00:00 (the +00:00 is the difference to Greenwich time/GMT). If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). published : Whether or not the node (and all accompanying media) is published. If present in add_media tasks, will override parent node's published value. Values in this field are either 1 (for published) or 0 (for unpublished). The default value for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . promote : Whether or not the node is promoted to the site's front page. 1 (for promoted) or 0 (for not promoted). The default vaue for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . All base fields other than uid can be included in both create and update tasks.","title":"Base fields"},{"location":"fields/#content-type-specific-fields","text":"These fields correspond directly to fields configured in Drupal nodes, and data you provide in them populates their equivalent field in Drupal entities. The column headings in the CSV file must match machine names of fields that exist in the target node content type. Fields' machine names are visible within the \"Manage fields\" section of each content type's configuration, here circled in red: These field names, plus the fields indicated in the \"Reserved CSV fields\" section above, are the column headers in your CSV file, like this: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Note If content-type field values apply to all of the rows in your CSV file, you can avoid including them in the CSV and instead use \" CSV field templates \".","title":"Content type-specific fields"},{"location":"fields/#using-field-labels-as-csv-column-headers","text":"By default, Workbench requires that column headers in your CSV file use the machine name of Drupal fields. However, in \"create\", \"update\", and \"create_terms\" tasks, you can use the field labels if you include csv_headers: labels in your configuration file. If you do this, you can use CSV file like this: file,id,title,Model,Description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 
IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Some things to note about using field labels in your CSV: if the content type (or vocabulary) that you are populating uses the same label for multiple fields, you won't be able to use labels as your CSV column headers. --check will tell you if there are any duplicate field labels. Spaces in field labels are OK, e.g. Country of Publication . Spelling, capitalization, punctuation, etc. in CSV column headers must match the field labels exactly. If any field labels contain the character you are using as the CSV delimiter (defined in the delimiter config setting), you will need to wrap the column header in quotation marks, e.g. \"Height, length, weight\" .","title":"Using field labels as CSV column headers"},{"location":"fields/#single-and-multi-valued-fields","text":"Drupal allows for fields to have a single value, a specific maximum number of values, or unlimited number of values. In the CSV input file, each Drupal field corresponds to a single CSV field. In other words, the CSV column names must be unique, even if a Drupal field allows multiple values. Populating multivalued fields is explained below.","title":"Single and multi-valued fields"},{"location":"fields/#single-valued-fields","text":"In your CSV file, single-valued fields simply contain the value, which, depending on the field type, can be a string or an integer. For example, using the fields defined by the Islandora Defaults module for the \"Repository Item\" content type, your CSV file could look like this: file,title,id,field_model,field_description,field_rights,field_extent,field_access_terms,field_member_of myfile.jpg,My nice image,obj_00001,24,\"A fine image, yes?\",Do whatever you want with it.,There's only one image.,27,45 In this example, the term ID for the tag you want to assign in field_access_terms is 27, and the node ID of the collection you want to add the object to (in field_member_of ) is 45.","title":"Single-valued fields"},{"location":"fields/#multivalued-fields","text":"For multivalued fields, you separate the values within a field with a pipe ( | ), like this: file,title,field_something IMG_1410.tif,Small boats in Havana Harbour,One subvalue|Another subvalue IMG_2549.jp2,Manhatten Island,first subvalue|second subvalue|third subvalue This works for string fields as well as taxonomy reference fields, e.g.: file,title,field_my_multivalued_taxonomy_field IMG_1410.tif,Small boats in Havana Harbour,35|46 IMG_2549.jp2,Manhatten Island,34|56|28 Drupal strictly enforces the maximum number of values allowed in a field. If the number of values in your CSV file for a field exceed a field's configured maximum number of fields, Workbench will only populate the field to the field's configured limit. The subdelimiter character defaults to a pipe ( | ) but can be set in your config file using the subdelimiter configuration setting. Note Workbench will remove duplicate values in CSV fields. For example, if you accidentally use first subvalue|second subvalue|second subvalue in your CSV, Workbench will filter out the superfluous second subvalue . 
This applies to both create and update tasks, and within update tasks, replacing values and appending values to existing ones. Workbench deduplicates CVS values silently: it doesn't log the fact that it is doing it.","title":"Multivalued fields"},{"location":"fields/#drupal-field-types","text":"The following types of Drupal fields can be populated from data in your input CSV file: text (plain, plain long, etc.) fields integer fields boolean fields, with values 1 or 0 EDTF date fields entity reference (taxonomy and linked node) fields typed relation (taxonomy) fields link fields geolocation fields Drupal is very strict about not accepting malformed data. Therefore, Islandora Workbench needs to provide data to Drupal that is consistent with field types (string, taxonomy reference, EDTF, etc.) we are populating. This applies not only to Drupal's base fields (as we saw above) but to all fields. A field's type is indicated in the same place as its machine name, within the \"Manage fields\" section of each content type's configuration. The field types are circled in red in the screen shot below: Below are guidelines for preparing CSV data that is compatible with common field types configured in Islandora repositories.","title":"Drupal field types"},{"location":"fields/#text-fields","text":"Generally speaking, any Drupal field where the user enters free text into a node add/edit form is configured to be one of the Drupal \"Text\" field types. Islandora Workbench supports non-Latin characters in CSV, provided the CSV file is encoded as UTF-8. For example, the following non-Latin text will be added as expected to Drupal fields: \u4e00\u4e5d\u4e8c\u56db\u5e74\u516d\u6708\u5341\u4e8c\u65e5 (Traditional Chinese) \u0938\u0930\u0915\u093e\u0930\u0940 \u0926\u0938\u094d\u0924\u093e\u0935\u0947\u095b, \u0905\u0916\u092c\u093e\u0930\u094b\u0902 \u092e\u0947\u0902 \u091b\u092a\u0947 \u0932\u0947\u0916, \u0905\u0915\u093e\u0926\u092e\u093f\u0915 \u0915\u093f\u0924\u093e\u092c\u0947\u0902 (Hindi) \u140a\u1455\u1405\u14ef\u1585 \u14c4\u14c7, \u1405\u14c4\u1585\u1450\u1466 \u14c2\u1432\u1466 (Inuktitut) However, if all of your characters are Latin (basically, the characters found on a standard US keyboard) your CSV file can be encoded as ASCII. Some things to note about Drupal text fields: Some specialized forms of text fields, such as EDTF, enforce or prohibit the presence of specific types of characters (see below for EDTF's requirements). Islandora Workbench populates Drupal text fields verbatim with the content provided in the CSV file, with these exceptions . Plus, if a text value in your CSV is longer than its field's maximum configured length, Workbench will truncate the text (see the next point and warning below). Text fields may be configured to have a maximum length. Running Workbench with --check will produce a warning (both shown to the user and written to the Workbench log) if any of the values in your CSV file are longer than their field's configured maximum length. Warning If the CSV value for text field exceeds its configured maximum length, Workbench truncates the value to the maximum length before populating the Drupal field, leaving a log message indicating that it has done so.","title":"Text fields"},{"location":"fields/#text-fields-with-markup","text":"Drupal text fields that are configured to contain \"formatted\" text (for example, text with line breaks or HTML markup) will have one of the available text formats, such as \"Full HTML\" or \"Basic HTML\", applied to them. 
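For example, a formatted text value in your CSV might contain markup such as <p>A <em>very</em> fine image.</p> (a hypothetical field_description_long value; which tags are kept depends on the text format that is applied).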
Workbench treats these fields the same as if they are populated using the node add/edit form, but you will have to tell Workbench, in your configuration file, which text format to apply to them. When you populate these fields using the node add/edit form, you need to select a text format within the WYSIWYG editor: When populating these fields using Workbench, you can configure which text format to use either 1) for all Drupal \"formatted\" text fields or 2) using a per-field configuration. 1) To configure the text format to use for all \"formatted\" text fields, include the text_format_id setting in your configuration file, indicating the ID of the text format to use, e.g., text_format_id: full_html . The default value for this setting is basic_html . 2) To configure text formats on a per-field basis, include the field_text_format_ids (plural) setting in your configuration file, along with a field machine name-to-format ID mapping, like this: field_text_format_ids: - field_description_long: full_html - field_abstract: restricted_html If you use both settings in your configuration file, field_text_format_ids takes precedence. You only need to configure text formats per field to override the global setting. Note Workbench has no way of knowing what text formats are configured in the target Drupal, and has no way of validating that the text format ID you use in your configuration file exists. However, if you use a text format ID that is invalid, Drupal will not allow nodes to be created or updated and will leave error messages in your Workbench log that contain text like Unprocessable Entity: validation failed.\\nfield_description_long.0.format: The value you selected is not a valid choice. By default, Drupal comes configured with three text formats, full_html , basic_html , and restricted_html . If you create your own text format at admin/config/content/formats , you can use its ID in the Workbench configuration settings described above. If you want to include line breaks in your CSV, they must be physical line breaks. \\n and other escaped line break characters are not recognized by Drupal's \"Convert line breaks into HTML (i.e. <br> and <p>
    )\" text filter. CSV data containing physical line breaks must be wrapped in quotation marks, like this: id,file,title,field_model,field_description_long 01,,I am a title,Image,\"Line breaks are awesome.\"","title":"Text fields with markup"},{"location":"fields/#taxonomy-reference-fields","text":"Note In the list of a content type's fields, as pictured above, Drupal uses \"Entity reference\" for all types of entity reference fields, of which Taxonomy references are one. The other most common kind of entity reference field is a node reference field. Islandora Workbench lets you assign both existing and new taxonomy terms to nodes. Creating new terms on demand during node creation reduces the need to prepopulate your vocabularies prior to creating nodes. In CSV columns for taxonomy fields, you can use either term IDs (integers) or term names (strings). You can even mix IDs and names in the same field: file,title,field_my_multivalued_taxonomy_field img001.png,Picture of cats and yarn,Cats|46 img002.png,Picture of dogs and sticks,Dogs|Sticks img003.png,Picture of yarn and needles,\"Yarn, Balls of|Knitting needles\" By default, if you use a term name in your CSV data that doesn't match a term name that exists in the referenced taxonomy, Workbench will detect this when you use --check , warn you, and exit. This strict default is intended to prevent users from accidentally adding unwanted terms through data entry error. Terms can be from any level in a vocabulary's hierarchy. In other words, if you have a vocabulary whose structure looks like this: you can use the terms IDs or labels for \"Automobiles\", \"Sports cars\", or \"Land Rover\" in your CSV. The term name (or ID) is all you need; no indication of the term's place in the hierarchy is required. If you add allow_adding_terms: true to your configuration file for any of the entity \"create\" or \"update\" tasks, Workbench will create the new term the first time it is used in the CSV file following these rules: If multiple records in your CSV contain the same new term name in the same field, the term is only created once. When Workbench checks to see if the term with the new name exists in the target vocabulary, it queries Drupal for the new term name, looking for an exact match against an existing term in the specified vocabulary. Therefore it is important that term names used in your CSV are identical to existing term names. The query to find existing term names follows these two rules: Leading and trailing whitespace on term names is ignored. Internal whitespace is significant. Case is ignored. Note that Drupal does not distinguish between diacritics. For example, it does not distinguish between \"Chylek\" and \"Ch\u00fdlek\". If the term name you provide in the CSV file does not match an existing term name in its vocabulary, the term name from the CSV data is used to create a new term. If it does match, Workbench populates the field in your nodes with a reference to the matching term. allow_adding_terms applies to all vocabularies. In general, you do not want to add new terms to vocabularies used by Islandora for system functions such as Islandora Models and Islandora Media Use. 
In order to exclude vocabularies from being added to, you can register vocabulary machine names in the protected_vocabularies setting, like this: protected_vocabularies: - islandora_model - islandora_display - islandora_media_use Adding new terms has some constraints: Terms created in this way do not have any external URIs or other field values, and they are not assigned a parent term in hierarchical vocabularies. If you want your terms to have any of these features, you will need to create the terms in advance (manually, through a create_terms task, or using a third-party module like Taxonomy Import) prior to using their term names in an input CSV. Workbench cannot distinguish between identical term names within the same vocabulary. This means you cannot create two different terms that have the same term name (for example, two terms in the Person vocabulary that are identical but refer to two different people). The workaround for this is to create one of the terms before using Workbench and use the term ID instead of the term string. If the same term name exists multiple times in the same vocabulary (again using the example of two Person terms that describe two different people), you should be aware that when you use these identical term names within the same vocabulary in your CSV, Workbench will always choose the first one it encounters when it converts from term names to term IDs while populating your nodes. The workaround for this is to use the term ID for one (or both) of the identical terms, or to use URIs for one (or both) of the identical terms. --check will identify any new terms that exceed Drupal's maximum allowed length for term names, 255 characters. If a term name is longer than 255 characters, Workbench will truncate it at that length, log that it has done so, and create the term. Taxonomy terms created with new nodes are not removed when you delete the nodes.","title":"Taxonomy reference fields"},{"location":"fields/#using-term-names-in-multi-vocabulary-fields","text":"While most node taxonomy fields reference only a single vocabulary, Drupal does allow fields to reference multiple vocabularies. This ability poses a problem when we use term names instead of term IDs in our CSV files: in a multi-vocabulary field, Workbench can't be sure which term name belongs in which of the multiple vocabularies referenced by that field. This applies to both existing terms and to new terms we want to add when creating node content. To avoid this problem, we need to tell Workbench which of the multiple vocabularies each term name should (or does) belong to. We do this by namespacing terms with the applicable vocabulary ID. For example, let's imagine we have a node field whose name is field_sample_tags , and this field references two vocabularies, \"cats\" and \"dogs\". To use the terms Tuxedo , Tabby , German Shepherd in the CSV when adding new nodes, we need to namespace them with vocabulary IDs like this: field_sample_tags cats:Tabby cats:Tuxedo dogs:German Shepherd If you want to use multiple terms in a single field, you would namespace them all: cats:Tuxedo|cats:Misbehaving|dogs:German Shepherd To find the vocabulary ID (referred to above as the \"namespace\") to use, visit the list of your site's vocabularies at admin/structure/taxonomy : Hover your pointer over the \"List terms\" button for each vocabulary to reveal the URL to its overview page. The ID for the vocabulary is the string between \"manage\" and \"overview\" in the URL. For example, in the URL admin/structure/taxonomy/manage/person/overview , the vocabulary ID is \"person\".
This is the namespace you need to use to indicate which vocabulary to add new terms to. CSV values containing term names that have commas ( , ) in multi-valued, multi-vocabulary fields need to be wrapped in quotation marks (like any CSV value containing a comma), and in addition, the need to specify the namespace within each of the subvalues: \"tags:gum, Bubble|tags:candy, Hard\" Using these conventions, Workbench will be certain which vocabulary the term names belong to. Workbench will remind you during its --check operation that you need to namespace terms. It determines 1) if the field references multiple vocabularies, and then checks to see 2) if the field's values in the CSV are term IDs or term names. If you use term names in multi-vocabulary fields, and the term names aren't namespaced, Workbench will warn you: Error: Term names in multi-vocabulary CSV field \"field_tags\" require a vocabulary namespace; value \"Dogs\" in row 4 does not have one. Note that since : is a special character when you use term names in multi-vocabulary CSV fields, you can't add a namespaced term that itself contains a : . You need to add it manually to Drupal and then use its term ID (or name, or URI) in your CSV file.","title":"Using term names in multi-vocabulary fields"},{"location":"fields/#using-term-uris-instead-of-term-ids","text":"Islandora Workbench lets you use URIs assigned to terms instead of term IDs. You can use a term URI in the media_use_tid configuration option (for example, \"http://pcdm.org/use#OriginalFile\" ) and in taxonomy fields in your metadata CSV file: field_model https://schema.org/DigitalDocument http://purl.org/coar/resource_type/c_18cc During --check , Workbench will validate that URIs correspond to existing taxonomy terms. Using term URIs has some constraints: You cannot create a new term by providing a URI like you can by providing a term name. If the same URI is registered with more than one term, Workbench will choose one and write a warning to the log indicating which term it chose and which terms the URI is registered with. However, --check will detect that a URI is registered with more than one term and warn you. At that point you can edit your CSV file to use the correct term ID rather than the URI.","title":"Using term URIs instead of term IDs"},{"location":"fields/#using-numbers-as-term-names","text":"If you want to use a term name like \"1990\" in your CSV, you need to tell Workbench to not interpret that term name as a term ID. To do this, add a list of CSV columns to your config file using the columns_with_term_names config setting that will only contain term names (and not IDs or URIs, as explained next): columns_with_term_names: - field_subject - field_tags If you register a column name here, it can contain only terms names. Any term ID or URIs will be interpreted as term names. Note that this is only necessary if your term names are comprised entirely of integers. If they contain decimals (like \"1990.3\"), Workbench will not interpret them as term IDs and you will not need to tell Workbench to do otherwise.","title":"Using numbers as term names"},{"location":"fields/#entity-reference-views-fields","text":"Islandora Workbench fully supports taxonomy reference fields that use the \"Default\" reference type, but only partially supports \"Views: Filter by an entity reference View\" taxonomy reference fields. To populate this type of entity reference in \"create\" and \"update\" tasks, you have two options. 
Warning Regardless of whether you use term IDs or term names in your CSV, Workbench will not validate values in \"Views: Filter by an entity reference View\" taxonomy reference fields. Term IDs or term names that are not in the referenced View will result in the node not being created or updated (Drupal will return a 422 response). However, if allow_adding_terms is set to true , terms that are not in the referenced vocabulary will be added to the vocabulary if your CSV data contains a vocabulary ID/namespace in the form vocabid:newterm . The terms will be added regardless of whether they are within the referenced View. Therefore, for this type of Drupal field, you should not include vocabulary IDs/namespaces in your CSV data for that field. Further work on supporting this type of field is being tracked in this GitHub issue . First, if your input CSV contains only term IDs in this type of column, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Alternatively, if you prefer to use term names instead of term IDs in the CSV column for this type of field, you will need to create a special Display for the View that is referenced from that field. To do this: In the View that is referenced, duplicate the view display as a REST Export display. When making the changes to the resulting REST Export display, be careful to modify that display only (and not \"All displays\") by always choosing \"This rest_export (override)\" during every change. Format: Serializer Settings: json Path: some_path (do not include the leading / ) Authentication: Basic Auth Access restrictions: Role > View published content (the default; \"administrator vocabularies and terms\" is needed for other endpoints used by Workbench, but this view doesn't require this.) Filter Criteria: Add \"Taxonomy term / name\" from the list of fields Expose this filter Choose the \"Is equal to\" operator Leave the Value field empty In the Field identifier field, enter \"name\" Your configuration for the new \"Taxonomy term: name\" filter should look like this: Then in your Workbench configuration file, using the entity_reference_view_endpoints setting, provide a mapping between columns in your CSV file and the applicable Views REST Export display path value (configured in step 3 above). In this example, we define three field/path mappings: entity_reference_view_endpoints: - field_linked_agent: /taxonomy/linked_agents - field_language: /language/lookup - field_subject: /taxonomy/subjects During \"create\" and \"update\" tasks, --check will tell you if the View REST Export display path is accessible.
The Controlled Access Terms module allows the relations to be sets of terms from external authority lists (for example like the MARC Relators list maintained by the Library of Congress). Within a Typed Relation field's configuration, the configured relations look like this: In this screenshot, \"relators\" is a namespace for the MARC Relators authority list, the codes \"acp\", \"adi\", etc are the codes for each relator, and \"Art copyist\", \"Art director\", etc. are the human-readable label for each relator. Within the edit form of a node that has a Typed Relation field, the user interface adds a select list of the relation (the target taxonomy term here is \"Jordan, Mark (30))\", like this: To be able to populate Typed Relation fields using CSV data with the three pieces of required data (authority list, relation type, target term), Islandora Workbench supports CSV values that contain the corresponding namespace, relator code, and taxonomy term ID, each separated by a colon ( : ), like this: relators:art:30 In this example CSV value, relators is the namespace that the relation type art is from (the Library of Congress Relators vocabulary), and the target taxonomy term ID is 30 . Note Note that the structure required for typed relation values in the CSV file is not the same as the structure of the relations configuration depicted in the screenshot of the \"Available Relations\" list above. A second option for populating Typed Relation fields is to use taxonomy term names (as opposed to term IDs) as targets: \"relators:art:Jordan, Mark\" Warning In the next few paragraphs, the word \"namespace\" is used to describe two different kinds of namespaces - first, a vocabulary ID in the local Drupal and second, an ID for the external authority list of relators, for example by the Library of Congress. As we saw in the \"Using term names in multi-vocabulary fields\" section above, if the field that we are populating references multiple vocabularies, we need to tell Drupal which vocabulary we are referring to with a local vocabulary namespace. To add a local vocabulary namespace to Typed Relation field CSV structure, we prepend it to the term name, like this (note the addition of \"person\"): \"relators:art:person:Jordan, Mark\" (In this example, relators is the external authority lists's namespace, and person is the local Drupal vocabulary namespace, prepended to the taxonomy term name, \"Jordan, Mark\".) If this seems confusing and abstruse, don't worry. Running --check will tell you that you need to add the Drupal vocabulary namespace to values in specific CSV columns. The final option for populating Typed Relation field is to use HTTP URIs as typed relation targets: relators:art:http://markjordan.net If you want to include multiple typed relation values in a single field of your CSV file (such as in \"field_linked_agent\"), separate the three-part values with the same subdelimiter character you use in other fields, e.g. ( | ) (or whatever you have configured as your subdelimiter ): relators:art:30|relators:art:45 or \"relators:art:person:Jordan, Mark|relators:art:45\"","title":"Typed Relation fields"},{"location":"fields/#adding-new-typed-relation-targets","text":"Islandora Workbench allows you to add new typed relation targets while creating and updating nodes. These targets are taxonomy terms. Your configuration file must include the allow_adding_terms: true option to add new targets. 
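As a minimal sketch, the relevant portion of a create task configuration might look like this (the host, username, and content type values are hypothetical): task: create host: https://islandora.example.org username: admin content_type: islandora_object allow_adding_terms: true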
In general, adding new typed relation targets is just like adding new taxonomy terms as described above in the \"Taxonomy relation fields\" section. An example of a CSV value that adds a new target term is: \"relators:art:person:Jordan, Mark\" You can also add multiple new targets: \"relators:art:person:Annez, Melissa|relators:art:person:Jordan, Mark\" Note that: For multi-vocabulary fields, new typed relator targets must be accompanied by a vocabulary namespace ( person in the above examples). You cannot add new relators (e.g. relators:foo ) in your CSV file, only new target terms. Note Adding the typed relation namespace, relators, and vocabulary names is a major hassle. If this information is the same for all values (in all rows) in your field_linked_agent column (or any other typed relation field), you can use CSV value templates to reduce the tedium.","title":"Adding new typed relation targets"},{"location":"fields/#edtf-fields","text":"Running Islandora Workbench with --check will validate Extended Date/Time Format (EDTF) Specification dates (Levels 0, 1, and 2) in EDTF fields. Some common examples include: Type Examples Date 1976-04-23 1976-04 Qualified date 1976? 1976-04~ 1976-04-24% Date and time 1985-04-12T23:20:30 Interval 1964/2008 2004-06/2006-08 2004-06-04/2006-08-01 2004-06/2006-08-01 Set [1667,1668,1670..1672] [1672..1682] [1672,1673] [..1672] [1672..] Subvalues in multivalued CSV fields are validated separately, e.g. if your CSV value is 2004-06/2006-08|2007-01/2007-04 , 2004-06/2006-08 and 2007-01/2007-04 are validated separately. Note EDTF supports a very wide range of specific and general dates, and in some cases, valid dates can look counterintuitive. For example, \"2001-34\" is valid (it's Sub-Year Grouping meaning 2nd quarter of 2001).","title":"EDTF fields"},{"location":"fields/#link-fields","text":"The link field type stores URLs (e.g. https://acme.com ) and link text in separate data elements. To add or update fields of this type, Workbench needs to provide the URL and link text in the structure Drupal expects. To accomplish this within a single CSV field, we separate the URL and link text pairs in CSV values with double percent signs ( %% ), like this: field_related_websites http://acme.com%%Acme Products Inc. You can include multiple pairs of URL/link text pairs in one CSV field if you separate them with the subdelimiter character: field_related_websites http://acme.com%%Acme Products Inc.|http://diy-first-aid.net%%DIY First Aid The URL is required, but the link text is not. If you don't have or want any link text, omit it and the double quotation marks: field_related_websites http://acme.com field_related_websites http://acme.com|http://diy-first-aid.net%%DIY First Aid","title":"Link fields"},{"location":"fields/#authority-link-fields","text":"The authority link field type stores abbreviations for authority sources (i.e., external controlled vocabularies such as national name authorities), authority URIs (e.g. http://viaf.org/viaf/153525475 ) and link text in separate data elements. Authority link fields are most commonly used on taxonomy terms, but can be used on nodes as well. To add or update fields of this type, Workbench needs to provide the authority source abbreviation, URI and link text in the structure Drupal expects. 
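(As a rough sketch, assuming the authority link field type's usual \"source\", \"uri\", and \"title\" properties, a single value is sent to Drupal as something like { \"source\": \"viaf\", \"uri\": \"http://viaf.org/viaf/10646807\", \"title\": \"VIAF Record\" }.)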
To accomplish this within a single CSV field, we separate the three parts in CSV values with double percent signs ( %% ), like this: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record You can include multiple triplets of source abbreviation/URL/link text in one CSV field if you separate them with the subdelimiter character: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record|other%%https://github.com/mjordan%%Github The authority source abbreviation and the URI are required, but the link text is not. If you don't have or want any link text, omit it: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807 field_authority_vocabs viaf%%http://viaf.org/viaf/10646807|other%%https://github.com/mjordan%%Github","title":"Authority link fields"},{"location":"fields/#geolocation-fields","text":"The Geolocation field type, managed by the Geolocation Field contrib module, stores latitude and longitude coordinates in separate data elements. To add or update fields of this type, Workbench needs to provide the latitude and longitude data in these separate elements. To simplify entering geocoordinates in the CSV file, Workbench allows geocoordinates to be in lat,long format, i.e., the latitude coordinate followed by a comma followed by the longitude coordinate. When Workbench reads your CSV file, it will split data on the comma into the required lat and long parts. An example of a single geocoordinate in a field would be: field_coordinates \"49.16667,-123.93333\" You can include multiple pairs of geocoordinates in one CSV field if you separate them with the subdelimiter character: field_coordinates \"49.16667,-123.93333|49.25,-124.8\" Note that: Geocoordinate values in your CSV need to be wrapped in double quotation marks, unless the delimiter key in your configuration file is set to something other than a comma. If you are entering geocoordinates into a spreadsheet, you may need to escape leading + and - signs since they will make the spreadsheet application think you are entering a formula. You can work around this by escaping the + an - with a backslash ( \\ ), e.g., 49.16667,-123.93333 should be \\+49.16667,-123.93333 , and 49.16667,-123.93333|49.25,-124.8 should be \\+49.16667,-123.93333|\\+49.25,-124.8 . Workbench will strip the leading \\ before it populates the Drupal fields. Excel: leading + and - need to be escaped Google Sheets: only + needs to be escaped LibreOffice Calc: neither + nor - needs to be escaped","title":"Geolocation fields"},{"location":"fields/#paragraphs-entity-reference-revisions-fields","text":"Entity Reference Revisions fields are similar to Drupal's core Entity Reference fields used for taxonomy terms but are intended for entities that are not intended for reference outside the context of the item that references them. For Islandora sites, this is used for Paragraph entities. In order to populate paragraph entites using Workbench in create and update tasks, you need to enable and configure the REST endpoints for paragraphs. To do this, ensure the REST UI module is enabled, then go to Configuration/Web Services/REST Resources ( /admin/config/services/rest ) and enable \"Paragraph\". 
Then edit the settings for Paragraph to the following: Granularity = Method GET formats: jsonld, json authentication: jwt_auth, basic_auth, cookie POST formats: json authentication: jwt_auth, basic_auth, cookie DELETE formats: json authentication: jwt_auth, basic_auth, cookie PATCH formats: json authentication: jwt_auth, basic_auth, cookie Note Paragraphs are locked down from REST updates by default. To add new paragraph values and update existing ones, you must enable the paragraphs_type_permissions submodule and ensure the Drupal user in your configuration file has sufficient privileges granted at /admin/people/permissions/module/paragraphs_type_permissions . Paragraphs are handy for Islandora as they provide flexibility in the creation of more complex metadata (such as complex titles, typed notes, or typed identifiers) by adding Drupal fields to paragraph entities, unlike the Typed Relation field, which hard-codes its properties. However, this flexibility makes creating a Workbench import more complicated and, as such, requires additional configuration. For example, suppose you have a \"Full Title\" field ( field_full_title ) on your Islandora Object content type referencing a paragraph type called \"Complex Title\" ( complex_title ) that contains \"main title\" ( field_main_title ) and \"subtitle\" ( field_subtitle ) text fields. The input CSV would look like: field_full_title My Title: A Subtitle|Alternate Title In this example we have two title values, \"My Title: A Subtitle\" (where \"My Title\" is the main title and \" A Subtitle\" is the subtitle) and \"Alternate Title\" (which only has a main title). To map these CSV values to our paragraph fields, we need to add the following to our configuration file: paragraph_fields: node: field_full_title: type: complex_title field_order: - field_main_title - field_subtitle field_delimiter: ':' subdelimiter: '|' This configuration defines the paragraph field on the node ( field_full_title ) and its child fields ( field_main_title and field_subtitle ), which occur within the paragraph in the order they are named in the field_order property. Within the data in the CSV column, the values corresponding to the order of those fields are separated by the character defined in the field_delimiter property. subdelimiter here is the same as the subdelimiter configuration setting used in non-paragraph multi-valued fields; in this example it overrides the default or globally configured value. We use a colon for the field delimiter in this example as it is often used in titles to denote subtitles. Note that in the above example, the space before \"A\" in the subtitle will be preserved. Whether or not you want a space there in your data will depend on how you display the Full Title field. Warning Note that Workbench assumes all fields within a paragraph are single-valued. When using Workbench to update paragraphs using update_mode: replace , any null values for fields within the paragraph (such as the null subtitle in the second \"Alternate Title\" instance in the example) will null out existing field values.
However, considering each paragraph as a whole field value, Workbench behaves the same as for all other fields - update_mode: replace will replace all paragraph entities with the ones in the CSV, but if the CSV does not contain any values for this field then the field will be left as is.","title":"Paragraphs (Entity Reference Revisions fields)"},{"location":"fields/#values-in-the-field_member_of-column","text":"The field_member_of column can take a node ID, a full URL to a node, or a URL alias. For instance, all of these refer to the same node and can be used in field_member_of : 648 (node ID) https://islandora.traefik.me/node/648 (full URL) https://islandora.traefik.me/mycollection (full URL using an alias) /mycollection (URL alias) If you use any of these types of values other than the bare node ID, Workbench will look up the node ID based on the URL or alias.","title":"Values in the \"field_member_of\" column"},{"location":"fields/#values-in-the-field_domain_access-column","text":"The Domain Access module, part of the Domain suite of modules, creates a required, multivalued field with the machine name field_domain_access that controls, at the node level, which domains the node shows up in. When populating this field in your Workbench CSV, replace the periods in domain names with _ . For example, if the domains you want to allow a node to show up in are test1.testing.edu and test2.testing.edu , the values in your field_domain_access field look like this: test1_testing_edu test1_testing_edu test1_testing_edu|test2_testing_edu test2_testing_edu","title":"Values in the \"field_domain_access\" column"},{"location":"fixity/","text":"Transmission fixity Islandora Workbench enables transmission fixity validation, which means it can detect when files are not ingested into Islandora intact, in other words, that the files became corrupted during ingest. It does this by generating a checksum (a.k.a. \"hash\") for each file before it is ingested, and then after the file is ingested, Workbench asks Drupal for a checksum on the file generated using the same hash algorithm. If the two checksums are identical, Workbench has confirmed that the file was not corrupted during the ingest process. If they are not identical, the file became corrupted. This functionality is available within create and add_media tasks. Only files named in the file CSV column are checked. To enable this feature, include the fixity_algorithm option in your create or add_media configuration file, specifying one of \"md5\", \"sha1\", or \"sha256\" hash algorithms. For example, to use the \"md5\" algorithm, include the following in your config file: fixity_algorithm: md5 Comparing checksums to known values Comparison to known checksum values can be done both during the transmission fixity check, above, and during Workbench's --check phase, as described below. If you want Workbench to compare the checksum it generates for a file to a known checksum value (for example, one generated by a platform you are migrating from, or during some other phase of your migration workflow), include a checksum column in your create or add_media CSV input file. No further configuration other than indicating the fixity_algorithm as described above is necessary. If the checksum column is present, Workbench will compare the hash it generates with the value in that column and report matches and mismatches. Note that the checksum in your CSV must have been generated using the same algorithm specified in your fixity_algorithm configuration setting. 
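For example, with fixity_algorithm: md5 in your configuration file, a create task CSV containing pregenerated checksums might look like this (the filename and MD5 value are hypothetical): file,id,title,checksum IMG_1410.tif,01,Small boats in Havana Harbour,0cc175b9c0f1b6a831c399e269772661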
Validating checksums during --check If you have pregenerated checksum values for your files (as described in the \"Comparing checksums to known values\" section, above), you can tell Workbench to compare those checksums with the checksums it generates during its --check phase. To do this, include the following options in your create or add_media configuration file: fixity_algorithm: md5 validate_fixity_during_check: true You must also include both a file and checksum column in your input CSV, and ensure that the checksums in the CSV column were generated using the algorithm named in the fixity_algorithm setting. Results of the checks are written to the log file. Some things to note: Fixity checking is currently only available to files named in the file CSV column, and not in any \" additional files \" columns. For the purposes of fixity verification, md5 is sufficient. Using it is also faster than either sha1 or sha256. However, you will need to use sha1 or sha256 if your pregenerated checksums were created using those algorithms. If you are including pregenerated checksum values in your CSV file (in the checksum column), the checksums must have been generated using the same hash algorithm indicated in your fixity_algorithm configuration setting: \"md5\", \"sha1\", or \"sha256\". If the existing checksums were generated using a different algorithm, all of your checksum comparisons will fail. Workbench logs the outcome of all checksum comparisons, whether they result in matches or mismatches. If there is a mismatch, Workbench will continue to ingest the file and create the accompanying media. For this reason, it is prudent to perform your checksum validation during the --check phase. If any comparisons fail, you have an opportunity to replace the files before committing to ingesting them into Drupal. Validation during --check happens entirely on the computer running Workbench. During --check , Workbench does not query Drupal for the purposes of checksum validation, since the files haven't yet been ingested into Islandora at that point. Fixity checking slows Workbench down (and also Drupal if you perform transmission fixity checks) to a certain extent, especially when files are large.
For example, to use the \"md5\" algorithm, include the following in your config file: fixity_algorithm: md5","title":"Transmission fixity"},{"location":"fixity/#comparing-checksums-to-known-values","text":"Comparison to known checksum values can be done both during the transmission fixity check, above, and during Workbench's --check phase, as described below. If you want Workbench to compare the checksum it generates for a file to a known checksum value (for example, one generated by a platform you are migrating from, or during some other phase of your migration workflow), include a checksum column in your create or add_media CSV input file. No further configuration other than indicating the fixity_algorithm as described above is necessary. If the checksum column is present, Workbench will compare the hash it generates with the value in that column and report matches and mismatches. Note that the checksum in your CSV must have been generated using the same algorithm specified in your fixity_algorithm configuration setting.","title":"Comparing checksums to known values"},{"location":"fixity/#validating-checksums-during-check","text":"If you have pregenerated checksum values for your files (as described in the \"Comparing checksums to known values\" section, above), you can tell Workbench to compare those checksums with checksums during its --check phase. To do this, include the following options in your create or add_media configuration file: fixity_algorithm: md5 validate_fixity_during_check: true You must also include both a file and checksum column in your input CSV, and ensure that the checksums in the CSV column were generated using the algorithm named in the fixity_algorithm setting. Results of the checks are written to the log file. Some things to note: Fixity checking is currently only available to files named in the file CSV column, and not in any \" additional files \" columns. For the purposes of fixity verification, md5 is sufficient. Using it is also faster than either sha1 or sha256. However, you will need to use sha1 or sha256 if your pregenerated checksums were created using those algorithms. If you are including pregenerated checksum values in your CSV file (in the checksum column), the checksums must have been generated using the same has algorithm indicated in your fixity_algorithm configuration setting: \"md5\", \"sha1\", or \"sha256\". If the existing checksums were generated using a different algorithm, all of your checksum comparisons will fail. Workbench logs the outcome of all checksum comparisons, whether they result in matches or mismatches. If there is a mismatch, Workbench will continue to ingest the file and create the accompanying media. For this reason, it is prudent to perform your checksum validation during the --check phase. If any comparisons fail, you have an opportunity to replace the files before committing to ingesting them into Drupal. Validation during --check happens entirely on the computer running Workbench. During --check , Workbench does not query Drupal for the purposes of checksum validation, since the files haven't yet been ingested into Islandora at that point. Fixity checking slows Workbench down (and also Drupal if you perform transmission fixity checks) to a certain extent, especially when files are large. 
This is unavoidable since calculating a file's checksum requires reading it into memory.","title":"Validating checksums during --check"},{"location":"generating_csv_files/","text":"Generating CSV Files Islandora Workbench can generate several different CSV files you might find useful. CSV file templates Note This section describes creating CSV file templates. For information on CSV field templates, see the \" Using CSV field templates \" section. You can generate a template CSV file by running Workbench with the --get_csv_template argument: ./workbench --config config.yml --get_csv_template With this option, Workbench will fetch the field definitions for the content type named in your configuration's content_type option and save a CSV file with a column for each of the content type's fields. You can then populate this template with values you want to use in a create or update task. The template file is saved in the directory indicated in your configuration's input_dir option, using the filename defined in input_csv with .csv_file_template appended. The template also contains three additional rows: human-readable label whether or not the field is required in your CSV for create tasks sample data number of values allowed (either a specific maximum number or 'unlimited') the name of the section in the documentation covering the field type Here is a screenshot of this CSV file template loaded into a spreadsheet application: Note that the first column, and all the rows other than the field machine names, should be deleted before you use a populated version of this CSV file in a create task. Also, you can remove any columns you do not intend on populating during the create task: CSV file containing a row for every newly created node In some situations, you may want to create stub nodes that only have a small subset of fields, and then populate the remaining fields later. To facilitate this type of workflow, Workbench provides an option to generate a simple CSV file containing a record for every node created during a create task. This file can then be used later in update tasks to add additional metadata or in add_media tasks to add media. You tell Workbench to generate this file by including the optional output_csv setting in your create task configuration file. If this setting is present, Workbench will write a CSV file at the specified location containing one record per node created. This CSV file contains the following fields: id (or whatever column is specified in your id_field setting): the value in your input CSV file's ID field node_id : the node ID for the newly created node uuid : the new node's UUID status : true if the node is published, False if it is unpublished title : the node's title The file will also contain empty columns corresponding to all of the fields in the target content type. An example, generated from a 2-record input CSV file, looks like this (only left-most part of the spreadsheet shown): This CSV file is suitable as a template for subsequent update tasks, since it already contains the node_id s for all the stub nodes plus column headers for all of the fields in those nodes. You can remove from the template any columns you do not want to include in your update task. You can also use the node IDs in this file as the basis for later add_media tasks; all you will need to do is delete the other columns and add a file column containing the new nodes' corresponding filenames. 
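As a minimal sketch, a create task configuration that writes this file might look like the following (the host, credentials, and paths are hypothetical placeholders): task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: stub_metadata.csv output_csv: /tmp/newly_created_nodes.csv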
If you want to include in your output CSV all of the fields (and their values) from the input CSV, add output_csv_include_input_csv: true to your configuration file. This option is useful if you want a CSV that contains the node ID and a field such as field_identifier or other fields that contain local identifiers, DOIs, file paths, etc. If you use this option, all the fields from the input CSV are added to the output CSV; you cannot configure which fields are included. CSV file containing field data for existing nodes The export_csv task generates a CSV file that contains one row for each node identified in the input CSV file. The cells of the CSV are populated with data that is consistent with the structures that Workbench uses in update tasks. Using this CSV file, you can: see in one place all of the field values for nodes, which might be useful during quality assurance after a create task modify the data and use it as input for an update task using the update_mode: replace configuration option. The CSV file contains two of the extra rows included in the CSV file template, described above (specifically, the human-readable field label and number of values allowed), and the left-most \"REMOVE THIS COLUMN (KEEP THIS ROW)\" column. To use the file as input for an update task, simply delete the extraneous column and rows. A sample configuration file for an export_csv task is: task: export_csv host: \"http://localhost:8000\" username: admin password: islandora input_csv: nodes_to_export.csv export_csv_term_mode: name content_type: my_custom_content_type # If export_csv_field_list is not present, all fields will be exported. export_csv_field_list: ['title', 'field_description'] # Specifying the output path is optional; see below for more information. export_csv_file_path: output.csv The file identified by input_file has only one column, \"node_id\": node_id 7653 7732 7653 Some things to note: The output includes data from nodes only, not media. Unless a file path is specified in the export_csv_file_path configuration option, the output CSV file name is the name of the input CSV file (containing node IDs) with \".csv_file_with_field_values\" appended. For example, if your export_csv configuration file defines the input_csv as \"my_export_nodes.csv\", the CSV file created by the task will be named \"my_export_nodes.csv.csv_file_with_field_values\". The file is saved in the directory identified by the input_dir configuration option. You can include either vocabulary term IDs or term names (with accompanying vocabulary namespaces) in the CSV. By default, term IDs are included; to include term names instead, include export_csv_term_mode: name in your configuration file. A single export_csv job can only export nodes that have the content type identified in your Workbench configuration. By default, this is \"islandora_object\". If you include node IDs in your input file for nodes that have a different content type, Workbench will skip exporting their data and log the fact that it has done so. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. Warning Using the export_csv_term_mode: name option will slow down the export, since Workbench must query Drupal to get the name of each term. The more taxonomy or typed relation fields in your content type, the slower the export will be with export_csv_term_mode set to \"name\".
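If you plan to feed the exported CSV back into Workbench, a minimal update task configuration might look like this (a sketch only, assuming you have deleted the extra column and rows and renamed the file to the hypothetical exported_nodes.csv): task: update host: \"http://localhost:8000\" username: admin password: islandora input_csv: exported_nodes.csv update_mode: replace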
Using a Drupal View to identify content to export as CSV You can use a new or existing View to tell Workbench what nodes to export into CSV. This is done using a get_data_from_view task. A sample configuration file looks like this: task: get_data_from_view host: \"http://localhost:8000/\" view_path: '/daily_nodes_created_test' username: admin password: islandora content_type: published_work export_csv_file_path: /tmp/islandora_export.csv # If export_csv_field_list is not present, all fields will be exported. # node_id and title are always included. export_csv_field_list: ['field_description', 'field_extent'] # 'view_parameters' is optional, and used only if your View uses Contextual Filters. # In this setting you identify any URL parameters configured as Contextual Filters # for the View. Note that values in the 'view_parameters' configuration setting # are literal parameter=value strings that include the =, not YAML key: value # pairs used elsewhere in the Workbench configuration file. view_parameters: - 'date=20231202' The view_path setting should contain the value of the \"Path\" option in the Views configuration page's \"Path settings\" section. The export_csv_file_path is the location where you want your CSV file saved. In the View configuration page: Add a \"REST export\" display. Under \"Format\" > \"Serializer\" > \"Settings\", choose \"json\". In the View \"Fields\" settings, leave \"The selected style or row format does not use fields\" as is (see explanation below). Under \"Path\", add a path where your REST export will be accessible to Workbench. As noted above, this value is also what you should use in the view_path setting in your Workbench configuration file. Under \"Pager\" > \"Items to display\", choose \"Paged output, mini pager\". In \"Pager options\" choose 10 items to display. Under \"Path settings\" > \"Access\", choose \"Permission\" and \"View published content\". Under \"Authentication\", choose \"basic_auth\" and \"cookie\". Here is a screenshot illustrating these settings: To test your REST export, in your browser, join your Drupal hostname and the \"Path\" defined in your View configuration. Using the values in the configuration file above, that would be http://localhost:8000/daily_nodes_created_test . You should see raw JSON (or formatted JSON if your browser renders JSON to be human readable) that lists the nodes in your View. Warning If your View includes nodes that you do not want to be seen by anonymous users, or if it contains unpublished nodes, adjust the access permissions settings appropriately, and ensure that the user identified in your Workbench configuration file has sufficient permissions. You can optionally configure your View to use a single Contextual Filter, and expose that Contextual Filter to use one or more query parameters. This way, you can include each query parameter's name and its value in your configuration file using Workbench's view_parameters config setting, as illustrated in the sample configuration file above. The configuration in the View's Contextual Filters for this type of parameter looks like this: By adding a Contextual Filter to your View display, you can control what nodes end up in the output CSV by including the value you want to filter on in your Workbench configuration's view_parameters setting.
In the screenshot of the \"Created date\" Contextual Filter shown here, the query parameter is date , so you include that parameter in your view_parameters list in your configuration file along with the value you want to assign to the parameter (separated by an = sign), e.g.: view_parameters: - 'date=20231202' will set the value of the date query parameter in the \"Created date\" Contextual Filter to \"20231202\". Some things to note: Note that the values you include in view_parameters apply only to your View's Contextual Filter. Any \"Filter Criteria\" you include in the main part of your View configuration also take effect. In other words, both \"Filter Criteria\" and \"Contextual Filters\" determine what nodes end up in your output CSV file. You can only include a single Contextual Filter in your View, but it can have multiple query parameters. REST export Views displays don't use fields in the same way that other Views displays do. In fact, Drupal says within the Views user interface that for REST export displays, \"The selected style or row format does not use fields.\" Instead, these displays export the entire node in JSON format. Workbench iterates through all fields on the node JSON that start with field_ and includes those fields, plus node_id and title , in the output CSV. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. Using a Drupal View to generate a media report as CSV You can get a report of which media a set of nodes has using a View. This report is generated using a get_media_report_from_view task, and the View configuration it uses is the same as the View configuration described above (in fact, you can use the same View with both get_data_from_view and get_media_report_from_view tasks). A sample Workbench configuration file looks like: task: get_media_report_from_view host: \"http://localhost:8000/\" view_path: daily_nodes_created username: admin password: islandora export_csv_file_path: /tmp/media_report.csv # view_parameters is optional, and used only if your View uses Contextual Filters. view_parameters: - 'date=20231201' The output contains columns for Node ID, Title, Content Type, Islandora Model, and Media. For each node in the View, the Media column contains the media use terms for all media attached to the node separated by semicolons, and sorted alphabetically: Exporting image, video, etc. files along with CSV data In export_csv and get_data_from_view tasks, you can optionally export media files. To do this, add the following settings to your configuration file: export_file_directory : Required. This is the path to the directory where Workbench will save the exported files. export_file_media_use_term_id : Optional. This setting tells Workbench which Islandora Media Use term to use to identify the file to export. Defaults to http://pcdm.org/use#OriginalFile (for Original File). Can be either a term ID or a term URI. Note that currently only a single file per node can be exported, and that files need to be accessible to the anonymous Drupal user to be exported.
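For example, a minimal export_csv configuration that also saves the Original File attached to each node might look like this (a sketch only; the host, credentials, and paths are hypothetical placeholders): task: export_csv host: \"http://localhost:8000\" username: admin password: islandora input_csv: nodes_to_export.csv export_csv_file_path: /tmp/export.csv export_file_directory: /tmp/exported_files export_file_media_use_term_id: http://pcdm.org/use#OriginalFile Since http://pcdm.org/use#OriginalFile is the default, the last setting could be omitted or replaced with another Islandora Media Use term ID or URI.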
Using the CSV ID to node ID map By default, Workbench maintains a simple database that maps values in your CSV's ID column (or whatever column you define in the id_field config setting) to node IDs created in create tasks. Note You do not need to install anything extra for Workbench to create this database. Workbench provides a utility script, manage_csv_to_node_id_map.py (described below), for exporting and pruning the data. You only need to install the sqlite3 client or a third-party utility if you want to access the database in ways that surpass the capabilities of the manage_csv_to_node_id_map.py script. A useful third-party tool for viewing and modifying SQLite databases is DB Browser for SQLite . Here is a sample screenshot illustrating the CSV to node ID map database table in DB Browser for SQLite (CSV ID and node ID are the two right-most columns): Workbench optionally uses this database to determine the node ID of parent nodes when creating paged and compound content, so, for example, you can use parent_id values in your input CSV that refer to parents created in earlier Workbench sessions. But, you may find other uses for this data. Since it is stored in an SQLite database, it can be queried using SQL, or can be dumped into a CSV file using the dump_id_map.py script provided in Workbench's scripts directory. Note In create_from_files tasks, which don't use an input CSV file, the filename is recorded instead of an \"id\". One configuration setting applies to this feature, csv_id_to_node_id_map_path . By default, its value is [your temporary directory]/csv_id_to_node_id_map.db (see the temp_dir config setting's documentation for more information on where that directory is). This default can be overridden in your config file. If you want to disable population of this database completely, set csv_id_to_node_id_map_path to false . Warning Some systems clear out their temporary directories on restart. You may want to define the absolute path to your ID map database in the csv_id_to_node_id_map_path configuration setting so it is stored in a location that will not get deleted on system restart. The SQLite database contains one table, \"csv_id_to_node_id_map\". This table has six columns: timestamp : the current timestamp in yyyy-mm-dd hh:mm:ss format or a truncated version of that format config_file : the name of the Workbench configuration file active when the row was added parent_csv_id : if the node was created along with its parent, the parent's CSV ID parent_node_id : if the node was created along with its parent, the parent's node ID csv_id : the value in the node's CSV ID field (or in create_from_files tasks, which don't use an input CSV file, the filename) node_id : the node's Drupal node ID If you don't want to query the database directly, you can use scripts/manage_csv_to_node_id_map.py to: Export the contents of the entire database to a CSV file. To do this, in the Workbench directory, run the script, specifying the path to the database file and the path to the CSV output: python scripts/manage_csv_to_node_id_map.py --db_path /tmp/csv_id_to_node_id_map.db --csv_path /tmp/output.csv Export the rows that have duplicate (i.e., identical) CSV ID values, or duplicate values in any specific field. To limit the rows that are dumped to those with duplicate values in a specific database field, add the --nonunique argument and the name of the field, e.g., --nonunique csv_id . The resulting CSV will only contain those entries from your database.
Delete entries from the database that have a specific value in their config_file column. To do this, provide the script with the --remove_entries_with_config_files argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_with_config_files create.yml . You can also specify a comma-separated list of config file names (for example --remove_entries_with_config_files create.yml,create_testing.yml ) to remove all entries with those config file names with one command. Delete entries from the database created before a specific date. To do this, provide the script with the --remove_entries_before argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_before \"2023-05-29 19:17\" . Delete entries from the database created after a specific date. To do this, provide the script with the --remove_entries_after argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_after \"2023-05-29 19:17\" . The value of the --remove_entries_before and --remove_entries_after arguments is a date string that can take the form yyyy-mm-dd hh:mm:ss or any truncated version of that format, e.g. yyyy-mm-dd hh:mm , yyyy-mm-dd hh , or yyyy-mm-dd . Any rows in the database table that have a timestamp value that matches the date value will be deleted from the database. Note that if your timestamp value has a space in it, you need to wrap it in quotation marks as illustrated above; if you don't, the script will match only the portion of the value before the space and delete all the entries from that day.","title":"Generating CSV files"},{"location":"generating_csv_files/#generating-csv-files","text":"Islandora Workbench can generate several different CSV files you might find useful.","title":"Generating CSV Files"},{"location":"generating_csv_files/#csv-file-templates","text":"Note This section describes creating CSV file templates. For information on CSV field templates, see the \" Using CSV field templates \" section. You can generate a template CSV file by running Workbench with the --get_csv_template argument: ./workbench --config config.yml --get_csv_template With this option, Workbench will fetch the field definitions for the content type named in your configuration's content_type option and save a CSV file with a column for each of the content type's fields. You can then populate this template with values you want to use in a create or update task. The template file is saved in the directory indicated in your configuration's input_dir option, using the filename defined in input_csv with .csv_file_template appended. The template also contains the following additional rows: human-readable label whether or not the field is required in your CSV for create tasks sample data number of values allowed (either a specific maximum number or 'unlimited') the name of the section in the documentation covering the field type Here is a screenshot of this CSV file template loaded into a spreadsheet application: Note that the first column, and all the rows other than the field machine names, should be deleted before you use a populated version of this CSV file in a create task.
Also, you can remove any columns you do not intend on populating during the create task:","title":"CSV file templates"},{"location":"generating_csv_files/#csv-file-containing-a-row-for-every-newly-created-node","text":"In some situations, you may want to create stub nodes that only have a small subset of fields, and then populate the remaining fields later. To facilitate this type of workflow, Workbench provides an option to generate a simple CSV file containing a record for every node created during a create task. This file can then be used later in update tasks to add additional metadata or in add_media tasks to add media. You tell Workbench to generate this file by including the optional output_csv setting in your create task configuration file. If this setting is present, Workbench will write a CSV file at the specified location containing one record per node created. This CSV file contains the following fields: id (or whatever column is specified in your id_field setting): the value in your input CSV file's ID field node_id : the node ID for the newly created node uuid : the new node's UUID status : true if the node is published, False if it is unpublished title : the node's title The file will also contain empty columns corresponding to all of the fields in the target content type. An example, generated from a 2-record input CSV file, looks like this (only left-most part of the spreadsheet shown): This CSV file is suitable as a template for subsequent update tasks, since it already contains the node_id s for all the stub nodes plus column headers for all of the fields in those nodes. You can remove from the template any columns you do not want to include in your update task. You can also use the node IDs in this file as the basis for later add_media tasks; all you will need to do is delete the other columns and add a file column containing the new nodes' corresponding filenames. If you want to include in your output CSV all of the fields (and their values) from the input CSV, add output_csv_include_input_csv: true to your configuration file. This option is useful if you want a CSV that contains the node ID and a field such as field_identifier or other fields that contain local identifiers, DOIs, file paths, etc. If you use this option, all the fields from the input CSV are added to the output CSV; you cannot configure which fields are included.","title":"CSV file containing a row for every newly created node"},{"location":"generating_csv_files/#csv-file-containing-field-data-for-existing-nodes","text":"The export_csv task generates a CSV file that contains one row for each node identified in the input CSV file. The cells of the CSV are populated with data that is consistent with the structures that Workbench uses in update tasks. Using this CSV file, you can: see in one place all of the field values for nodes, which might be useful during quality assurance after a create task modify the data and use it as input for an update task using the update_mode: replace configuration option. The CSV file contains two of the extra rows included in the CSV file template, described above (specifically, the human-readable field label and number of values allowed), and the left-most \"REMOVE THIS COLUMN (KEEP THIS ROW)\" column. To use the file as input for an update task, simply delete the extraneous column and rows. 
A sample configuration file for an export_csv task is: task: export_csv host: \"http://localhost:8000\" username: admin password: islandora input_csv: nodes_to_export.csv export_csv_term_mode: name content_type: my_custom_content_type # If export_csv_field_list is not present, all fields will be exported. export_csv_field_list: ['title', 'field_description'] # Specifying the output path is optional; see below for more information. export_csv_file_path: output.csv The file identified by input_file has only one column, \"node_id\": node_id 7653 7732 7653 Some things to note: The output includes data from nodes only, not media. Unless a file path is specified in the export_csv_file_path configuration option, the output CSV file name is the name of the input CSV file (containing node IDs) with \".csv_file_with_field_values\" appended. For example, if you export_csv configuration file defines the input_csv as \"my_export_nodes.csv\", the CSV file created by the task will be named \"my_export_nodes.csv.csv_file_with_field_values\". The file is saved in the directory identified by the input_dir configuration option. You can include either vocabulary term IDs or term names (with accompanying vocabulary namespaces) in the CSV. By default, term IDs are included; to include term names instead, include export_csv_term_mode: name in you configuration file. A single export_csv job can only export nodes that have the content type identified in your Workbench configuration. By default, this is \"islandora_object\". If you include node IDs in your input file for nodes that have a different content type, Workbench will skip exporting their data and log the fact that it has done so. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. Warning Using the export_csv_term_mode: name option will slow down the export, since Workbench must query Drupal to get the name of each term. The more taxonomy or typed relation fields in your content type, the slower the export will be with export_csv_term_mode set to \"name\".","title":"CSV file containing field data for existing nodes"},{"location":"generating_csv_files/#using-a-drupal-view-to-identify-content-to-export-as-csv","text":"You can use a new or existing View to tell Workbench what nodes to export into CSV. This is done using a get_data_from_view task. A sample configuration file looks like this: task: get_data_from_view host: \"http://localhost:8000/\" view_path: '/daily_nodes_created_test' username: admin password: islandora content_type: pubished_work export_csv_file_path: /tmp/islandora_export.csv # If export_csv_field_list is not present, all fields will be exported. # node_id and title are always included. export_csv_field_list: ['field_description', 'field_extent'] # 'view_paramters' is optinal, and used only if your View uses Contextual Filters. # In this setting you identify any URL parameters configured as Contextual Filters # for the View. Note that values in the 'view_parameters' configuration setting # are literal parameter=value strings that include the =, not YAML key: value # pairs used elsewhere in the Workbench configuration file. view_parameters: - 'date=20231202' The view_path setting should contain the value of the \"Path\" option in the Views configuration page's \"Path settings\" section. The export_csv_file_path is the location where you want your CSV file saved. In the View configuration page: Add a \"REST export\" display. 
Under \"Format\" > \"Serializer\" > \"Settings\", choose \"json\". In the View \"Fields\" settings, leave \"The selected style or row format does not use fields\" as is (see explanation below). Under \"Path\", add a path where your REST export will be accessible to Workbench. As noted above, this value is also what you should use in the view_path setting in your Workbench configuration file. Under \"Pager\" > \"Items to display\", choose \"Paged output, mini pager\". In \"Pager options\" choose 10 items to display. Under \"Path settings\" > \"Access\", choose \"Permission\" and \"View published content\". Under \"Authentication\", choose \"basic_auth\" and \"cookie\". Here is a screenshot illustrating these settings: To test your REST export, in your browser, join your Drupal hostname and the \"Path\" defined in your View configuration. Using the values in the configuration file above, that would be http://localhost:8000/daily_nodes_created_test . You should see raw JSON (or formatted JSON if your browser renders JSON to be human readable) that lists the nodes in your View. Warning If your View includes nodes that you do not want to be seen by anonymous users, or if it contains unpublished nodes, adjust the access permissions settings appropriately, and ensure that the user identified in your Workbench configuration file has sufficient permissions. You can optionally configure your View to use a single Contextual Filter, and expose that Contextual Filter to use one or more query parameters. This way, you can include each query parameter's name and its value in your configuration file using Workbench's view_parameters config setting, as illustrated in the sample configuration file above. The configuration in the View's Contextual Filters for this type of parameter looks like this: By adding a Contextual Filter to your View display, you can control what nodes end up in the output CSV by including the value you want to filter on in your Workbench configuration's view_parameters setting. In the screenshot of the \"Created date\" Contextual Filter shown here, the query parameter is date , so you include that parameter in your view_parameters list in your configuration file along with the value you want to assign to the parameter (separated by an = sign), e.g.: view_parameters: - 'date=20231202' will set the value of the date query parameter in the \"Created date\" Contextual Filter to \"20231202\". Some things to note: Note that the values you include in view_parameters apply only to your View's Contextual Filter. Any \"Filter Criteria\" you include in the main part of your View configuration also take effect. In other words, both \"Filter Criteria\" and \"Contextual Filters\" determine what nodes end up in your output CSV file. You can only include a single Contextual Filter in your View, but it can have multiple query parameters. REST export Views displays don't use fields in the same way that other Views displays do. In fact, Drupal says within the Views user interface that for REST export displays, \"The selected style or row format does not use fields.\" Instead, these displays export the entire node in JSON format. Workbench iterates through all fields on the node JSON that start with field_ and includes those fields, plus node_id and title , in the output CSV. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option.
Only content from nodes that have the content type identified in the content_type configuration setting will be written to the CSV file. If you want to export term names instead of term IDs, include export_csv_term_mode: name in your configuration file. The warning about this option slowing down the export applies to this task and the export_csv task.","title":"Using a Drupal View to identify content to export as CSV"},{"location":"generating_csv_files/#using-a-drupal-view-to-generate-a-media-report-as-csv","text":"You can get a report of which media a set of nodes has using a View. This report is generated using a get_media_report_from_view task, and the View configuration it uses is the same as the View configuration described above (in fact, you can use the same View with both get_data_from_view and get_media_report_from_view tasks). A sample Workbench configuration file looks like: task: get_media_report_from_view host: \"http://localhost:8000/\" view_path: daily_nodes_created username: admin password: islandora export_csv_file_path: /tmp/media_report.csv # view_parameters is optional, and used only if your View uses Contextual Filters. view_parameters: - 'date=20231201' The output contains columns for Node ID, Title, Content Type, Islandora Model, and Media. For each node in the View, the Media column contains the media use terms for all media attached to the node separated by semicolons, and sorted alphabetically:","title":"Using a Drupal View to generate a media report as CSV"},{"location":"generating_csv_files/#exporting-image-video-etc-files-along-with-csv-data","text":"In export_csv and get_data_from_view tasks, you can optionally export media files. To do this, add the following settings to your configuration file: export_file_directory : Required. This is the path to the directory where Workbench will save the exported files. export_file_media_use_term_id : Optional. This setting tells Workbench which Islandora Media Use term to use to identify the file to export. Defaults to http://pcdm.org/use#OriginalFile (for Original File). Can be either a term ID or a term URI. Note that currently only a single file per node can be exported, and that files need to be accessible to the anonymous Drupal user to be exported.","title":"Exporting image, video, etc. files along with CSV data"},{"location":"generating_csv_files/#using-the-csv-id-to-node-id-map","text":"By default, Workbench maintains a simple database that maps values in your CSV's ID column (or whatever column you define in the id_field config setting) to node IDs created in create tasks. Note You do not need to install anything extra for Workbench to create this database. Workbench provides a utility script, manage_csv_to_node_id_map.py (described below), for exporting and pruning the data. You only need to install the sqlite3 client or a third-party utility if you want to access the database in ways that surpass the capabilities of the manage_csv_to_node_id_map.py script. A useful third-party tool for viewing and modifying SQLite databases is DB Browser for SQLite . Here is a sample screenshot illustrating the CSV to node ID map database table in DB Browser for SQLite (CSV ID and node ID are the two right-most columns): Workbench optionally uses this database to determine the node ID of parent nodes when creating paged and compound content, so, for example, you can use parent_id values in your input CSV that refer to parents created in earlier Workbench sessions. But, you may find other uses for this data.
Since it is stored in an SQLite database, it can be queried using SQL, or can be dumped into a CSV file using the dump_id_map.py script provided in Workbench's scripts directory. Note In create_from_files tasks, which don't use an input CSV file, the filename is recorded instead of an \"id\". One configuration setting applies to this feature, csv_id_to_node_id_map_path . By default, its value is [your temporary directory]/csv_id_to_node_id_map.db (see the temp_dir config setting's documentation for more information on where that directory is). This default can be overridden in your config file. If you want to disable population of this database completely, set csv_id_to_node_id_map_path to false . Warning Some systems clear out their temporary directories on restart. You may want to define the absolute path to your ID map database in the csv_id_to_node_id_map_path configuration setting so it is stored in a location that will not get deleted on system restart. The SQLite database contains one table, \"csv_id_to_node_id_map\". This table has six columns: timestamp : the current timestamp in yyyy-mm-dd hh:mm:ss format or a truncated version of that format config_file : the name of the Workbench configuration file active when the row was added parent_csv_id : if the node was created along with its parent, the parent's CSV ID parent_node_id : if the node was created along with its parent, the parent's node ID csv_id : the value in the node's CSV ID field (or in create_from_files tasks, which don't use an input CSV file, the filename) node_id : the node's Drupal node ID If you don't want to query the database directly, you can use scripts/manage_csv_to_node_id_map.py to: Export the contents of the entire database to a CSV file. To do this, in the Workbench directory, run the script, specifying the path to the database file and the path to the CSV output: python scripts/manage_csv_to_node_id_map.py --db_path /tmp/csv_id_to_node_id_map.db --csv_path /tmp/output.csv Export the rows that have duplicate (i.e., identical) CSV ID values, or duplicate values in any specific field. To limit the rows that are dumped to those with duplicate values in a specific database field, add the --nonunique argument and the name of the field, e.g., --nonunique csv_id . The resulting CSV will only contain those entries from your database. Delete entries from the database that have a specific value in their config_file column. To do this, provide the script with the --remove_entries_with_config_files argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_with_config_files create.yml . You can also specify a comma-separated list of config file names (for example --remove_entries_with_config_files create.yml,create_testing.yml ) to remove all entries with those config file names with one command. Delete entries from the database created before a specific date. To do this, provide the script with the --remove_entries_before argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_before \"2023-05-29 19:17\" . Delete entries from the database created after a specific date. To do this, provide the script with the --remove_entries_after argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_after \"2023-05-29 19:17\" .
The value of the --remove_entries_before and --remove_entries_after arguments is a date string that can take the form yyyy-mm-dd hh:mm:ss or any truncated version of that format, e.g. yyyy-mm-dd hh:mm , yyyy-mm-dd hh , or yyyy-mm-dd . Any rows in the database table that have a timestamp value that matches the date value will be deleted from the database. Note that if your timestamp value has a space in it, you need to wrap it quotation marks as illustrated above; if you don't, the script will delete all the entries on the timestamp value before the space, in other words, that day.","title":"Using the CSV ID to node ID map"},{"location":"generating_sample_content/","text":"If you want to quickly generate some sample images to load into Islandora, Workbench provides a utility script to do that. Running python3 scripts/generate_image_files.py from within the Islandora Workbench directory will generate PNG images from the list of titles in the sample_filenames.txt file. Running this script will result in a group of images whose filenames are normalized versions of the lines in the sample title file. You can then load this sample content into Islandora using the create_from_files task. If you want to have Workbench generate the sample content automatically, configure the generate_image_files.py script as a bootstrap script. See the autogen_content.yml configuration file for an example of how to do that.","title":"Generating sample Islandora content"},{"location":"hooks/","text":"Hooks Islandora Workbench offers three \"hooks\" that can be used to run scripts at specific points in the Workbench execution lifecycle. The three hooks are: Bootstrap CSV preprocessor Post-action Hook scripts can be in any language, and need to be executable by the user running Workbench. Execution (whether successful or not) of hook scripts is logged, including the scripts' exit code. Bootstrap and shutdown scripts Bootstrap scripts execute prior to Workbench connecting to Drupal. For an example of using this feature to run a script that generates sample Islandora content, see the \" Generating sample Islandora content \" section. To register a bootstrap script in your configuration file, add it to the bootstrap option, like this, indicating the absolute path to the script: bootstrap: [\"/home/mark/Documents/hacking/workbench/scripts/generate_image_files.py\"] Each bootstrap script gets passed a single argument, the path to the Workbench config file that was specified in Workbench's --config argument. For example, if you are running Workbench with a config file called create.yml , \"create.yml\" will automatically be passed as the argument to your bootstrap script (you do not specify it in the configuration), like this: generate_image_files.py create.yml Shutdown scripts work the same way as bootstrap scripts but they execute after Workbench has finished connecting to Drupal. Like bootstrap scripts, shutdown scripts receive a single argument from Workbench, the path to your configuration file. A common situation where a shutdown script is useful is to check the Workbench log for failures, and if any are detected, to email someone. The script email_log_if_errors.py in the scripts directory shows how this can be used for this. To register a shutdown script, add it to the shutdown option: shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example.py\"] --check will check for the existence of bootstrap and shutdown scripts, and that they are executable, but does not execute them. 
The scripts are only executed when Workbench is run without --check . A shutdown script that you might find useful is scripts/generate_iiif_manifests.py , which generates the IIIF Manifest (book-manifest) for each node in the current create task that is a parent. You might want to do this since pregenerating the manifest usually speeds up rendering of paged content in Mirador and OpenSeadragon. To use it, simply add the following to your create task config file, adjusting the path to your Workbench scripts directory: shutdown: [\"/home/mark/hacking/workbench/scripts/generate_iiif_manifests.py\"] Warning Bootstrap and shutdown scripts get passed the path to your configuration file, but they only have access to the configuration settings explicitly defined in that file. In other words, any configuration setting with a default value, and therefore no necessarily included in your configuration file, is not known to bootstrap/shutdown scripts. Therefore, it is good practice to include in your configuration file all configuration settings your script will need. The presence of a configuration setting set to its default value has no effect on Workbench. CSV preprocessor scripts CSV preprocessor scripts are applied to CSV values in a specific CSV field prior to the values being ingested into Drupal. They apply to the entire value from the CSV field and not split field values, e.g., if a field is multivalued, the preprocessor must split it and then reassemble it back into a string. Note that preprocessor scripts work only on string data and not on binary data like images, etc. and only on custom fields (so not title). Preprocessor scripts are applied in create and update tasks. Note If you are interested in seeing preprocessor scripts act on binary data such as images, see this issue . For example, if you want to convert all the values in the field_description CSV field to sentence case, you can do this by writing a small Python script that uses the capitalize() method and registering it as a preprocessor. To register a preprocessor script in your configuration file, add it to the preprocessors setting, mapping a CSV column header to the path of the script, like this: preprocessors: - field_description: /home/mark/Documents/hacking/workbench/scripts/samplepreprocessor.py You must provide the absolute path to the script, and the script must be executable. Each preprocessor script gets passed two arguments: the character used as the CSV subdelimiter (defined in the subdelimiter config setting, which defaults to | ) unlike bootstrap, shutdown, and post-action scripts, preprocessor scripts do not get passed the path to your Workbench configuration file; they only get passed the value of the subdelimiter config setting. the CSV field value When executed, the script processes the string content of the CSV field, and then replaces the original version of the CSV field value with the version processed by the script. An example preprocessor script is available in scripts/samplepreprocessor.py . Post-action scripts Post-action scripts execute after a node is created or updated, or after a media is created. 
To register post-action scripts in your configuration file, add them to either the node_post_create , node_post_update , or media_post_create configuration setting: node_post_create: [\"/home/mark/Documents/hacking/workbench/post_node_create.py\"] node_post_update: [\"/home/mark/Documents/hacking/workbench/post_node_update.py\"] media_post_create: [\"/home/mark/Documents/hacking/workbench/post_media_update.py\"] The arguments passed to each post-action hook are: the path to the Workbench config file that was specified in the --config argument the HTTP response code returned from the action (create, update), e.g. 201 or 403 . Note that this response code is a string, not an integer. the entire HTTP response body; this will be raw JSON. These arguments are passed to post-action scripts automatically. You don't specify them when you register your scripts in your config file. The script scripts/entity_post_task_example.py illustrates these arguments. Your scripts can find the entity ID and other information within the (raw JSON) HTTP response body. Using the way Python decodes JSON as an example, if the entity is a node, its nid is in entity_json['nid'][0]['value'] ; if the entity is a media, the mid is in entity_json['mid'][0]['value'] . The exact location of the nid and mid may differ if your script is written in a language that decodes JSON differently than Python (used in this example) does. Warning Not all Workbench configuration settings are available in post-action scripts. Only the settings that are explicitly defined in the configuration YAML are available. As with bootstrap and shutdown scripts, when using post-action scripts, it is good practice to include in your configuration file all configuration settings your script will need. The presence of a configuration setting set to its default value has no effect on Workbench. Running multiple scripts in one hook For all types of hooks, you can register multiple scripts, like this: bootstrap: [\"/home/mark/Documents/hacking/workbench/bootstrap_example_1.py\", \"/home/mark/Documents/hacking/workbench/bootstrap_example_2.py\"] shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example_1.py\", \"/home/mark/Documents/hacking/workbench/shutdown_example_2.py\"] node_post_create: [\"/home/mark/scripts/email_someone.py\", \"/tmp/hit_remote_api.py\"] They are executed in the order in which they are listed.","title":"Hooks"},{"location":"hooks/#hooks","text":"Hooks Islandora Workbench offers three \"hooks\" that can be used to run scripts at specific points in the Workbench execution lifecycle. The three hooks are: Bootstrap CSV preprocessor Post-action Hook scripts can be in any language, and need to be executable by the user running Workbench. Execution (whether successful or not) of hook scripts is logged, including the scripts' exit code.","title":"Hooks"},{"location":"hooks/#bootstrap-and-shutdown-scripts","text":"Bootstrap scripts execute prior to Workbench connecting to Drupal. For an example of using this feature to run a script that generates sample Islandora content, see the \" Generating sample Islandora content \" section. To register a bootstrap script in your configuration file, add it to the bootstrap option, like this, indicating the absolute path to the script: bootstrap: [\"/home/mark/Documents/hacking/workbench/scripts/generate_image_files.py\"] Each bootstrap script gets passed a single argument, the path to the Workbench config file that was specified in Workbench's --config argument.
For example, if you are running Workbench with a config file called create.yml , \"create.yml\" will automatically be passed as the argument to your bootstrap script (you do not specify it in the configuration), like this: generate_image_files.py create.yml Shutdown scripts work the same way as bootstrap scripts but they execute after Workbench has finished connecting to Drupal. Like bootstrap scripts, shutdown scripts receive a single argument from Workbench, the path to your configuration file. A common situation where a shutdown script is useful is to check the Workbench log for failures, and if any are detected, to email someone. The script email_log_if_errors.py in the scripts directory shows how this can be used for this. To register a shutdown script, add it to the shutdown option: shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example.py\"] --check will check for the existence of bootstrap and shutdown scripts, and that they are executable, but does not execute them. The scripts are only executed when Workbench is run without --check . A shutdown script that you might find useful is scripts/generate_iiif_manifests.py , which generates the IIIF Manifest (book-manifest) for each node in the current create task that is a parent. You might want to do this since pregenerating the manifest usually speeds up rendering of paged content in Mirador and OpenSeadragon. To use it, simply add the following to your create task config file, adjusting the path to your Workbench scripts directory: shutdown: [\"/home/mark/hacking/workbench/scripts/generate_iiif_manifests.py\"] Warning Bootstrap and shutdown scripts get passed the path to your configuration file, but they only have access to the configuration settings explicitly defined in that file. In other words, any configuration setting with a default value, and therefore no necessarily included in your configuration file, is not known to bootstrap/shutdown scripts. Therefore, it is good practice to include in your configuration file all configuration settings your script will need. The presence of a configuration setting set to its default value has no effect on Workbench.","title":"Bootstrap and shutdown scripts"},{"location":"hooks/#csv-preprocessor-scripts","text":"CSV preprocessor scripts are applied to CSV values in a specific CSV field prior to the values being ingested into Drupal. They apply to the entire value from the CSV field and not split field values, e.g., if a field is multivalued, the preprocessor must split it and then reassemble it back into a string. Note that preprocessor scripts work only on string data and not on binary data like images, etc. and only on custom fields (so not title). Preprocessor scripts are applied in create and update tasks. Note If you are interested in seeing preprocessor scripts act on binary data such as images, see this issue . For example, if you want to convert all the values in the field_description CSV field to sentence case, you can do this by writing a small Python script that uses the capitalize() method and registering it as a preprocessor. To register a preprocessor script in your configuration file, add it to the preprocessors setting, mapping a CSV column header to the path of the script, like this: preprocessors: - field_description: /home/mark/Documents/hacking/workbench/scripts/samplepreprocessor.py You must provide the absolute path to the script, and the script must be executable. 
Each preprocessor script gets passed two arguments: the character used as the CSV subdelimiter (defined in the subdelimiter config setting, which defaults to | ) unlike bootstrap, shutdown, and post-action scripts, preprocessor scripts do not get passed the path to your Workbench configuration file; they only get passed the value of the subdelimiter config setting. the CSV field value When executed, the script processes the string content of the CSV field, and then replaces the original version of the CSV field value with the version processed by the script. An example preprocessor script is available in scripts/samplepreprocessor.py .","title":"CSV preprocessor scripts"},{"location":"hooks/#post-action-scripts","text":"Post-action scripts execute after a node is created or updated, or after a media is created. To register post-action scripts in your configuration file, add them to either the node_post_create , node_post_update , or media_post_create configuration setting: node_post_create: [\"/home/mark/Documents/hacking/workbench/post_node_create.py\"] node_post_update: [\"/home/mark/Documents/hacking/workbench/post_node_update.py\"] media_post_create: [\"/home/mark/Documents/hacking/workbench/post_media_update.py\"] The arguments passed to each post-action hook are: the path to the Workbench config file that was specified in the --config argument the HTTP response code returned from the action (create, update), e.g. 201 or 403 . Note that this response code is a string, not an integer. the entire HTTP response body; this will be raw JSON. These arguments are passed to post-action scripts automatically. You don't specify them when you register your scripts in your config file. The script scripts/entity_post_task_example.py illustrates these arguments. Your scripts can find the entity ID and other information within the (raw JSON) HTTP response body. Using the way Python decodes JSON as an example, if the entity is a node, its nid is in entity_json['nid'][0]['value'] ; if the entity is a media, the mid is in entity_json['mid'][0]['value'] . The exact location of the nid and mid may differ if your script is written in a language that decodes JSON differently than Python (used in this example) does. Warning Not all Workbench configuration settings are available in post-action scripts. Only the settings that are explicitly defined in the configuration YAML are available. As with bootstrap and shutdown scripts, when using post-action scripts, it is good practice to include in your configuration file all configuration settings your script will need.
The presence of a configuration setting set to its default value has no effect on Workbench.","title":"Post-action scripts"},{"location":"hooks/#running-multiple-scripts-in-one-hook","text":"For all types of hooks, you can register multiple scripts, like this: bootstrap: [\"/home/mark/Documents/hacking/workbench/bootstrap_example_1.py\", \"/home/mark/Documents/hacking/workbench/bootstrap_example_2.py\"] shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example_1.py\", \"/home/mark/Documents/hacking/workbench/shutdown_example_2.py\"] node_post_create: [\"/home/mark/scripts/email_someone.py\", \"/tmp/hit_remote_api.py\"] They are executed in the order in which they are listed.","title":"Running multiple scripts in one hook"},{"location":"ignoring_csv_rows_and_columns/","text":"Commenting out CSV rows In create and update tasks, you can comment out rows in your input CSV, Excel file, or Google Sheet by adding a hash mark ( # ) as the first character of the value in the first column. Workbench ignores these rows, both when it is run with and without --check . Commenting out rows works in all tasks that use CSV data. For example, the third row in the following CSV file is commented out: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,Weather was windy. #IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Since column order doesn't matter to Workbench, the same row is commented out in both the previous example and in this one: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. 04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Commenting works the same with in Excel and Google Sheets. Here is the CSV file used above in a Google Sheet: You can also use commenting to include actual comments in your CSV/Google Sheet/Excel file: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # Let's not load the following record right now. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. 04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Using CSV row ranges The csv_start_row and csv_stop_row configuration settings allow you to tell Workbench to only process a specific subset of input CSV records. Both settings are optional and can be used in any task, and apply when using text CSV, Google Sheets, or Excel input files. Each setting takes as its value a row number (ignoring the header row). For example, row number 2 is the second row of data after the CSV header row. Below are some example configurations. 
Process CSV rows 10 to the end of the CSV file (ignoring rows 1-9): csv_start_row: 10 Process only CSV rows 10-15 (ignoring all other rows): csv_start_row: 10 csv_stop_row: 15 Process CSV from the start of the file to row 20 (ignoring rows 21 and higher): csv_stop_row: 20 If you only want to process a single row, use its position in the CSV for both csv_start_row and csv_stop_row (for example, to only process row 100): csv_start_row: 100 csv_stop_row: 100 Note When the csv_start_row or csv_stop_row options are in use, Workbench will display a message similar to the following when run: Using a subset of the input CSV (will start at row 10, stop at row 15). Processing specific CSV rows You can tell Workbench to process only specific rows in your CSV file (or, looked at from another perspective, to ignore all rows other than the specified ones). To do this, add the csv_rows_to_process setting to your config file with a list of \"id\" column values, e.g.: csv_rows_to_process: [\"test_001\", \"test_007\", \"test_103\"] This will tell Workbench to process only the CSV rows that have those values in their \"id\" column. This works with whatever you have configured as your \"id\" column header using the id_field configuration setting. Processing or ignoring rows based on field values Workbench provides a simple mechanism to filter rows in your input CSV. For example, you can tell it to process only CSV rows that have a field_model of \"Image\", or rows that have a field_model of either \"Image\" or \"Digital document\". This is done using the csv_row_filters config setting, which defines a set of filters that are applied to your CSV at runtime. There are two types of filters, is and isnot . Here is an example configuration using an is filter to reduce the input CSV to only rows that have a field_model of \"Image\": csv_row_filters: - field_model:is:Image This example filters the CSV input to only rows that have a field_model of \"Image\" or \"Digital document\": csv_row_filters: - field_model:is:Image - field_model:is:Digital document Multiple filters are ORed together, i.e. in the above example, the row is kept in the input if its field_model is either \"Image\" or \"Digital document\". isnot filters work similarly, but they exclude rows that match ( is filters include rows that match). For example, csv_row_filters: - field_model:isnot:Image will filter out rows that have a field_model of \"Image\". Some things to keep in mind when using CSV row filters: You can filter on any column in your input CSV, not just field_model as used in examples above. You can use filters on multivalued fields; Workbench will check each value in the field against each filter. You can add as many filters as you want (both is and isnot ) but as the number of filters increases, the contents of your resulting input CSV become less predictable. In other words, these filters are intended for, and work best in, configurations that have a small number of filters (e.g., one, two, or three). isnot filters are applied before is filters. Within is and isnot filters, each filter is not necessarily applied in the order they appear in your configuration file. Filters are applied every time you run Workbench, regardless of whether you are in --check mode or not. Regardless of whether you are using CSV row filters, or any other technique of ignoring CSV rows or columns, Workbench converts your input CSV into a \"preprocessed\" version and uses it to perform its task.
This file can be found in the temporary directory defined in your Workbench config's temp_dir setting, which by default is your computer's temporary directory. If you want to inspect this file after running --check , you can see which rows result after the filters have been applied. Ignoring CSV columns Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names and reserved Workbench column names. To accommodate CSV columns that do not correspond to either of those types, or to eliminate a column during testing or troubleshooting, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the column headers in the ignore_csv_columns configuration setting. The value of this setting is a list. For example, if you want to include a date_generated column in your CSV (which is neither a Workbench reserved column nor a Drupal field name), include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated'] If you want Workbench to ignore the \"date_generated\" and \"field_description\" columns, your configuration would look like this: ignore_csv_columns: ['date_generated', 'field_description'] Note that if a column name is listed in the csv_field_templates setting , the value for the column defined in the template is used. In other words, the values in the CSV are ignored, but the field in Drupal is still populated using the value from the field template.","title":"Ignoring CSV rows and columns"},{"location":"ignoring_csv_rows_and_columns/#commenting-out-csv-rows","text":"In create and update tasks, you can comment out rows in your input CSV, Excel file, or Google Sheet by adding a hash mark ( # ) as the first character of the value in the first column. Workbench ignores these rows, both when it is run with and without --check . Commenting out rows works in all tasks that use CSV data. For example, the third row in the following CSV file is commented out: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,Weather was windy. #IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Since column order doesn't matter to Workbench, the same row is commented out in both the previous example and in this one: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. 04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Commenting works the same way in Excel and Google Sheets. Here is the CSV file used above in a Google Sheet: You can also use commenting to include actual comments in your CSV/Google Sheet/Excel file: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # Let's not load the following record right now. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain.
04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\"","title":"Commenting out CSV rows"},{"location":"ignoring_csv_rows_and_columns/#using-csv-row-ranges","text":"The csv_start_row and csv_stop_row configuration settings allow you to tell Workbench to only process a specific subset of input CSV records. Both settings are optional and can be used in any task, and apply when using text CSV, Google Sheets, or Excel input files. Each setting takes as its value a row number (ignoring the header row). For example, row number 2 is the second row of data after the CSV header row. Below are some example configurations. Process CSV rows 10 to the end of the CSV file (ignoring rows 1-9): csv_start_row: 10 Process only CSV rows 10-15 (ignoring all other rows): csv_start_row: 10 csv_stop_row: 15 Process CSV from the start of the file to row 20 (ignoring rows 21 and higher): csv_stop_row: 20 If you only want to process a single row, use its position in the CSV for both csv_start_row or csv_stop_row (for example, to only process row 100): csv_start_row: 100 csv_stop_row: 100 Note When the csv_start_row or csv_stop_row options are in use, Workbench will display a message similar to the following when run: Using a subset of the input CSV (will start at row 10, stop at row 15).","title":"Using CSV row ranges"},{"location":"ignoring_csv_rows_and_columns/#processing-specific-csv-rows","text":"You can tell Workbench to process only specific rows in your CSV file (or, looked at from another perspective, to ignore all rows other than the specified ones). To do this, add the csv_rows_to_process setting to your config file with a list of \"id\" column values, e.g.: csv_rows_to_process: [\"test_001\", \"test_007\", \"test_103\"] This will tell Workbench to process only the CSV rows that have those values in their \"id\" column. This works with whatever you have configured as your \"id\" column header using the id_field configuration setting.","title":"Processing specific CSV rows"},{"location":"ignoring_csv_rows_and_columns/#processing-or-ignoring-rows-based-on-field-values","text":"Workbench provides a simple mechanism to filter rows in your input CSV. For example, you can tell it to process only CSV rows that have a field_model of \"Image\", or fields that have a field_model of either \"Image\" or \"Digital document\". This is done using the csv_row_filters config setting, which defines a set of filters that are applied to your CSV at runtime. There are two types of filters, is and isnot . Here is an example configuration using an is filter to reduce the input CSV to only rows that have a field_model of \"Image\": csv_row_filters: - field_model:is:Image This example filters the CSV input to only rows that have a field_model of \"Image\" or \"Digital document\": csv_row_filters: - field_model:is:Image - field_model:is:Digital document Multiple filters are ORed together, i.e. in the above example, the row in kept in the input if its field_model is either \"Image\" or \"Digital document\". isnot filters work similarly, but they exclude rows that match ( is filters include rows that match). For example, csv_row_filters: - field_model:isnot:Image will filter out rows that have a field_model of \"Image\". Some things to keep in mind when using CSV row filters: You can filter on any column in your input CSV, not just field_model as used in examples above. 
You can use filters on multivalued fields; Workbench will check each value in the field against each filter. You can add as many filters as you want (both is and isnot ) but as the number of filters increases, the contents of your resulting input CSV becomes less predictable. In other words, these filters are intended for, and work best in, configurations that have a small number (e.g., one, two, or three) filters. isnot filters are applied before is filters. Within is and isnot filters, each filter is not necessarily applied in the order they appear in your configuration file. Filters are applied every time you run Workbench, regardless of whether you are in --check mode or not. Regardless of whether you are using CSV row filters, or any other technique of ignoring CSV rows or columns, Workbench converts your input CSV into a \"preprocessed\" version and uses it to perform its task. This file can be found in the temporary directory defined in your Workbench config's temp_dir setting, which by default is your computer's temporary directory. If you want to inspect this file after running --check , you can see which rows result after the filters have been applied.","title":"Processing or ignoring rows based on field values"},{"location":"ignoring_csv_rows_and_columns/#ignoring-csv-columns","text":"Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names and reserved Workbench column names. To accommodate CSV columns that do not correspond to either of those types, or to eliminate a column during testing or troubleshooting, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the column headers in the ignore_csv_columns configuration setting. The value of this setting is a list. For example, if you want to include a date_generated column in your CSV (which is neither a Workbench reserved column or a Drupal field name), include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated'] If you want Workbench to ignore the \"data_generated\" column and the \"field_description\" columns, your configuration would look like this: ignore_csv_columns: ['date_generated', 'field_description'] Note that if a column name is listed in the csv_field_templates setting , the value for the column defined in the template is used. In other words, the values in the CSV are ignored, but the field in Drupal is still populated using the value from the field template.","title":"Ignoring CSV columns"},{"location":"installation/","text":"Requirements An Islandora repository using Drupal 8 or 9, with the Islandora Workbench Integration module enabled. If you are using Drupal 8.5 or earlier, please refer to the \"Using Drupal 8.5 or earlier\" section below. Python 3.7 or higher The following Python libraries: ruamel.yaml Requests Requests-Cache progress_bar openpyxl unidecode edtf-validate rich If you want to have these libraries automatically installed, you will need Python's setuptools Islandora Workbench has been installed and used on Linux, Mac, and Windows. Warning Some systems have both Python 2 and Python 3 installed. It's a good idea to check which version is used when you run python . To do this, run python --version , which will output something like \"Python 2.7.17\" or \"Python 3.8.10\". If python --version indicates you're running version 2, try running python3 --version to see if you have version 3 installed. 
Also, if you installed an alternate version of Python 3.x on your system (for example via Homebrew on a Mac), you may need to run Workbench by calling that Python interpreter directly. For Python 3.x installed via Homebrew, that will be at /opt/homebrew/bin/python3 , so to run Workbench you would use /opt/homebrew/bin/python3 workbench while in the islandora_workbench directory. Installing Islandora Workbench Installation involves two steps: cloning the Islandora Workbench Github repo running setup.py to install the required Python libraries (listed above) Step 1: cloning the Islandora Workbench Github repo In a terminal, run: git clone https://github.com/mjordan/islandora_workbench.git This will create a directory named islandora_workbench where you will run the ./workbench command. Step 2: running setup.py to install the required Python libraries For most people, the preferred place to install Python libraries is in your user directory. To do this, change into the \"islandora_workbench\" directory created by cloning the repo, and run the following command: python3 setup.py install --user A less common method is to install the required Python libraries into your computer's central Python environment. To do this, omit the --user (note: you must have administrator privileges on the computer to do this): sudo python3 setup.py install Updating Islandora Workbench Since Islandora Workbench is under development, you will want to update it often. To do this, within the islandora_workbench directory, run the following git command: git pull origin main After you pull in the latest changes using git , it's a good idea to rerun the setup tools in case new Python libraries have been added since you last ran the setup tools (same command as above): python3 setup.py install --user or if you originally installed the required Python libraries centrally, without the --user option (again, you will need administrator privileges on the machine): sudo python3 setup.py install Keeping the Islandora Workbench Integration Drupal module up to date Islandora Workbench communicates with Drupal using REST endpoints and Views. The Islandora Workbench Integration module (linked above in the \"Requirements\" section) ensures that the target Drupal has all required REST endpoints and Views enabled. Therefore, keeping it in sync with Islandora Workbench is important. Workbench checks the version of the Integration module and tells you if you need to upgrade it. To upgrade the module, update its code via Git or Composer, and follow the instructions in the \"Updates\" section of its README . Configuring Drupal's media URLs Islandora Workbench uses Drupal's default form of media URLs. You should not need to do anything to allow this, since the admin setting in admin/config/media/media-settings (under \"Security\") that determines what form of media URLs your site uses defaults to the correct setting (unchecked): If your site needs to have this option checked (so it supports URLs like /media/{id} ), you will need to add the following entry to all configuration files for tasks that create or delete media: standalone_media_url: true Note If you change the checkbox in Drupal's media settings admin page, be sure you clear your Drupal cache to make the new media URLs work. Using Drupal 8.5 or earlier When ingesting media in Drupal versions 8.5 and earlier, Islandora Workbench has two significant limitations/bugs that you should be aware of: Approximately 10% of media creation attempts will likely fail. 
Workbench will log these failures. Additional information is available in this issue . A file with a filename that already exists in Islandora will overwrite the existing file, as reported in this issue . To avoid these issues, you need to be running Drupal version 8.6 or higher. Warning If you are using Drupal 8.5 or earlier, you need to use the version of Workbench tagged with drupal_8.5_and_lower (commit 542325fb6d44c2ac84a4e2965289bb9f9ed9bf68). Later versions no longer support Drupal 8.5 and earlier. Password management Islandora Workbench requires user credentials that have administrator-level permissions in the target Drupal. Therefore you should exercise caution when managing those credentials. Workbench configuration files must contain a username setting, but you can provide the corresponding password in three ways: in the password setting in your YAML configuration file in the ISLANDORA_WORKBENCH_PASSWORD environment variable in response to a prompt when you run Workbench. If the password setting is present in your configuration files, Workbench will use its value as the user password and will ignore the other two methods of providing a password. If the password setting is absent, Workbench will look for the ISLANDORA_WORKBENCH_PASSWORD environment variable and if it is present, use its value. If both the password setting and the ISLANDORA_WORKBENCH_PASSWORD environment variable are absent, Workbench will prompt the user for a password before proceeding. Warning If you put the password in configuration files, you should not leave the files in directories that are widely readable, send them in emails or share them in Slack, commit the configuration files to public Git repositories, etc.","title":"Requirements and installation"},{"location":"installation/#requirements","text":"An Islandora repository using Drupal 8 or 9, with the Islandora Workbench Integration module enabled. If you are using Drupal 8.5 or earlier, please refer to the \"Using Drupal 8.5 or earlier\" section below. Python 3.7 or higher The following Python libraries: ruamel.yaml Requests Requests-Cache progress_bar openpyxl unidecode edtf-validate rich If you want to have these libraries automatically installed, you will need Python's setuptools Islandora Workbench has been installed and used on Linux, Mac, and Windows. Warning Some systems have both Python 2 and Python 3 installed. It's a good idea to check which version is used when you run python . To do this, run python --version , which will output something like \"Python 2.7.17\" or \"Python 3.8.10\". If python --version indicates you're running version 2, try running python3 --version to see if you have version 3 installed. Also, if you installed an alternate version of Python 3.x on your system (for example via Homebrew on a Mac), you may need to run Workbench by calling that Python interpreter directly. 
For Python 3.x installed via Homebrew, that will be at /opt/homebrew/bin/python3 , so to run Workbench you would use /opt/homebrew/bin/python3 workbench while in the islandora_workbench directory.","title":"Requirements"},{"location":"installation/#installing-islandora-workbench","text":"Installation involves two steps: cloning the Islandora Workbench Github repo running setup.py to install the required Python libraries (listed above)","title":"Installing Islandora Workbench"},{"location":"installation/#step-1-cloning-the-islandora-workbench-github-repo","text":"In a terminal, run: git clone https://github.com/mjordan/islandora_workbench.git This will create a directory named islandora_workbench where you will run the ./workbench command.","title":"Step 1: cloning the Islandora Workbench Github repo"},{"location":"installation/#step-2-running-setuppy-to-install-the-required-python-libraries","text":"For most people, the preferred place to install Python libraries is in your user directory. To do this, change into the \"islandora_workbench\" directory created by cloning the repo, and run the following command: python3 setup.py install --user A less common method is to install the required Python libraries into your computer's central Python environment. To do this, omit the --user (note: you must have administrator privileges on the computer to do this): sudo python3 setup.py install","title":"Step 2: running setup.py to install the required Python libraries"},{"location":"installation/#updating-islandora-workbench","text":"Since Islandora Workbench is under development, you will want to update it often. To do this, within the islandora_workbench directory, run the following git command: git pull origin main After you pull in the latest changes using git , it's a good idea to rerun the setup tools in case new Python libraries have been added since you last ran the setup tools (same command as above): python3 setup.py install --user or if you originally installed the required Python libraries centrally, without the --user option (again, you will need administrator privileges on the machine): sudo python3 setup.py install","title":"Updating Islandora Workbench"},{"location":"installation/#keeping-the-islandora-workbench-integration-drupal-module-up-to-date","text":"Islandora Workbench communicates with Drupal using REST endpoints and Views. The Islandora Workbench Integration module (linked above in the \"Requirements\" section) ensures that the target Drupal has all required REST endpoints and Views enabled. Therefore, keeping it in sync with Islandora Workbench is important. Workbench checks the version of the Integration module and tells you if you need to upgrade it. To upgrade the module, update its code via Git or Composer, and follow the instructions in the \"Updates\" section of its README .","title":"Keeping the Islandora Workbench Integration Drupal module up to date"},{"location":"installation/#configuring-drupals-media-urls","text":"Islandora Workbench uses Drupal's default form of media URLs. 
You should not need to do anything to allow this, since the admin setting in admin/config/media/media-settings (under \"Security\") that determines what form of media URLs your site uses defaults to the correct setting (unchecked): If your site needs to have this option checked (so it supports URLs like /media/{id} ), you will need to add the following entry to all configuration files for tasks that create or delete media: standalone_media_url: true Note If you change the checkbox in Drupal's media settings admin page, be sure you clear your Drupal cache to make the new media URLs work.","title":"Configuring Drupal's media URLs"},{"location":"installation/#using-drupal-85-or-earlier","text":"When ingesting media in Drupal versions 8.5 and earlier, Islandora Workbench has two significant limitations/bugs that you should be aware of: Approximately 10% of media creation attempts will likely fail. Workbench will log these failures. Additional information is available in this issue . A file with a filename that already exists in Islandora will overwrite the existing file, as reported in this issue . To avoid these issues, you need to be running Drupal version 8.6 or higher. Warning If you are using Drupal 8.5 or earlier, you need to use the version of Workbench tagged with drupal_8.5_and_lower (commit 542325fb6d44c2ac84a4e2965289bb9f9ed9bf68). Later versions no longer support Drupal 8.5 and earlier.","title":"Using Drupal 8.5 or earlier"},{"location":"installation/#password-management","text":"Islandora Workbench requires user credentials that have administrator-level permissions in the target Drupal. Therefore you should exercise caution when managing those credentials. Workbench configuration files must contain a username setting, but you can provide the corresponding password in three ways: in the password setting in your YAML configuration file in the ISLANDORA_WORKBENCH_PASSWORD environment variable in response to a prompt when you run Workbench. If the password setting is present in your configuration files, Workbench will use its value as the user password and will ignore the other two methods of providing a password. If the password setting is absent, Workbench will look for the ISLANDORA_WORKBENCH_PASSWORD environment variable and if it is present, use its value. If both the password setting and the ISLANDORA_WORKBENCH_PASSWORD environment variable are absent, Workbench will prompt the user for a password before proceeding. Warning If you put the password in configuration files, you should not leave the files in directories that are widely readable, send them in emails or share them in Slack, commit the configuration files to public Git repositories, etc.","title":"Password management"},{"location":"limitations/","text":"Note If you are encountering problems not described here, please open an issue and help improve Islandora Workbench! parent_id CSV column can only contain one ID The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. Non-ASCII filenames are normalized to their ASCII equivalents. The HTTP client library Workbench uses, Requests, requires filenames to be encoded as Latin-1 , while Drupal requires filenames to be encoded as UTF-8. Normalizing filenames that contain diacritics or non-Latin characters to their ASCII equivalents is a compromise. 
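The effect of this normalization can be previewed with a short sketch using the unidecode library listed in the requirements (an illustration only, not necessarily the exact routine Workbench runs):

from unidecode import unidecode

# Preview how a filename containing diacritics is normalized to its ASCII equivalent.
original_filename = "Montréal_façade_détail.jp2"
normalized_filename = unidecode(original_filename)
print(original_filename, "->", normalized_filename)  # Montreal_facade_detail.jp2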
See this issue for more information. If Workbench normalizes a filename, it logs the original and the normalized version. Updating nodes does not create revisions. This is limitation of Drupal (see this issue ). Password prompt always fails first time, and prompts a second time (which works) More information is available at this issue . Workbench doesn't support taxonomy reference fields that use the \"Filter by an entity reference View\" reference type Only taxonomy reference fields that use the \"Default\" reference type are currently supported. As a work around, to populate a \"Filter by an entity reference View\" field, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Note that Workbench will not validate values in fields that are configured to use this type of reference. Also, term IDs that are not in the View results will result in the node not being created (Drupal will return a 422 response). Setting destination filesystem for media is not possible Drupal's REST interface for file fields does not allow overriding the \"upload destination\" (filesystem) that is defined in a media type's file field configuration. For example, if a file field is configured to use the \"Flysystem: fedora\" upload destination, you cannot tell Workbench to use the \"Public Files\" upload destination instead. Note that the drupal_filesystem configuration setting only applied to Drupal versions 8.x - 9.1. In Drupal 9.2 or later, this setting is ignored. Workbench cannot modify media automatically generated by Islandora's microservices Islandora uses Contexts to initiate the generation of derivative media. Configuration for these Contexts is available in the \"Derivatives\" section of admin/structure/context . One example of a derivative media is a thumbnail generated on the ingestion of a JPEG original file. During create or add_media tasks, Workbench cannot modify or even be aware of derivative media automatically generated by Islandora. For example, Workbench can't add alt text to a thumbnail image automatically created by the \"Image derivatives\" Context. This is because the derivative media are generated asynchronously by Islandora's job processing queue. In other words, there is no predictable relationship between when an \"Original file\" media is created by Workbench (or uploaded by a user in the Drupal content management forms) and when the \"Thumbnail\" media is generated by Islandora's microservices. This unpredictability prevents Workbench from knowing when the derivative will be or has been created, making it impossible to have Workbench automatically add alt text to that thumbnail.","title":"Known limitations"},{"location":"limitations/#parent_id-csv-column-can-only-contain-one-id","text":"The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field.","title":"parent_id CSV column can only contain one ID"},{"location":"limitations/#non-ascii-filenames-are-normalized-to-their-ascii-equivalents","text":"The HTTP client library Workbench uses, Requests, requires filenames to be encoded as Latin-1 , while Drupal requires filenames to be encoded as UTF-8. Normalizing filenames that contain diacritics or non-Latin characters to their ASCII equivalents is a compromise. 
See this issue for more information. If Workbench normalizes a filename, it logs the original and the normalized version.","title":"Non-ASCII filenames are normalized to their ASCII equivalents."},{"location":"limitations/#updating-nodes-does-not-create-revisions","text":"This is limitation of Drupal (see this issue ).","title":"Updating nodes does not create revisions."},{"location":"limitations/#password-prompt-always-fails-first-time-and-prompts-a-second-time-which-works","text":"More information is available at this issue .","title":"Password prompt always fails first time, and prompts a second time (which works)"},{"location":"limitations/#workbench-doesnt-support-taxonomy-reference-fields-that-use-the-filter-by-an-entity-reference-view-reference-type","text":"Only taxonomy reference fields that use the \"Default\" reference type are currently supported. As a work around, to populate a \"Filter by an entity reference View\" field, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Note that Workbench will not validate values in fields that are configured to use this type of reference. Also, term IDs that are not in the View results will result in the node not being created (Drupal will return a 422 response).","title":"Workbench doesn't support taxonomy reference fields that use the \"Filter by an entity reference View\" reference type"},{"location":"limitations/#setting-destination-filesystem-for-media-is-not-possible","text":"Drupal's REST interface for file fields does not allow overriding the \"upload destination\" (filesystem) that is defined in a media type's file field configuration. For example, if a file field is configured to use the \"Flysystem: fedora\" upload destination, you cannot tell Workbench to use the \"Public Files\" upload destination instead. Note that the drupal_filesystem configuration setting only applied to Drupal versions 8.x - 9.1. In Drupal 9.2 or later, this setting is ignored.","title":"Setting destination filesystem for media is not possible"},{"location":"limitations/#workbench-cannot-modify-media-automatically-generated-by-islandoras-microservices","text":"Islandora uses Contexts to initiate the generation of derivative media. Configuration for these Contexts is available in the \"Derivatives\" section of admin/structure/context . One example of a derivative media is a thumbnail generated on the ingestion of a JPEG original file. During create or add_media tasks, Workbench cannot modify or even be aware of derivative media automatically generated by Islandora. For example, Workbench can't add alt text to a thumbnail image automatically created by the \"Image derivatives\" Context. This is because the derivative media are generated asynchronously by Islandora's job processing queue. In other words, there is no predictable relationship between when an \"Original file\" media is created by Workbench (or uploaded by a user in the Drupal content management forms) and when the \"Thumbnail\" media is generated by Islandora's microservices. 
This unpredictability prevents Workbench from knowing when the derivative will be or has been created, making it impossible to have Workbench automatically add alt text to that thumbnail.","title":"Workbench cannot modify media automatically generated by Islandora's microservices"},{"location":"logging/","text":"Islandora Workbench writes a log file for all tasks to a file named \"workbench.log\" in the directory Workbench is run from, unless you specify an alternative log file location using the log_file_path configuration option, e.g.: log_file_path: /tmp/mylogfilepath.log Note The only times that the default log file name is used instead of one defined in log_file_path is 1) when Workbench can't find the specified configuration file and 2) when Workbench finds the configuration file but detects that the file is not valid YAML, and therefore can't understand the value of log_file_path . The log contains information that is similar to what you see when you run Workbench, but with time stamps: 24-Dec-20 15:05:06 - INFO - Starting configuration check for \"create\" task using config file create.yml. 24-Dec-20 15:05:07 - INFO - OK, configuration file has all required values (did not check for optional values). 24-Dec-20 15:05:07 - INFO - OK, CSV file input_data/metadata.csv found. 24-Dec-20 15:05:07 - INFO - OK, all 5 rows in the CSV file have the same number of columns as there are headers (5). 24-Dec-20 15:05:21 - INFO - OK, CSV column headers match Drupal field names. 24-Dec-20 15:05:21 - INFO - OK, required Drupal fields are present in the CSV file. 24-Dec-20 15:05:23 - INFO - OK, term IDs/names in CSV file exist in their respective taxonomies. 24-Dec-20 15:05:23 - INFO - OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies. 24-Dec-20 15:05:23 - INFO - OK, files named in the CSV \"file\" column are all present. 24-Dec-20 15:05:23 - INFO - Configuration checked for \"create\" task using config file create.yml, no problems found. It may also contain additional detail that would clutter up the console output, for example which term is being added to a vocabulary. Appending to vs. overwriting your log file By default, new entries are appended to this log, unless you indicate that the log file should be overwritten each time Workbench is run by providing the log_file_mode configuration option with a value of \"w\": log_file_mode: w Logging debugging information Workbench doesn't provide a way to set the amount of detail in its log, but several options are available that are useful for debugging and troubleshooting. These options, when set to true , write raw values used in the REST requests to Drupal: log_request_url : Logs the request URL and its method (GET, POST, etc.). log_json : Logs the raw JSON that Workbench uses in POST, PUT, and PATCH requests. log_headers : Logs the raw HTTP headers used in all requests. log_response_status_code : Logs the HTTP response code. log_response_body : Logs the raw HTTP response body. Another configuration setting that is useful during debugging is log_file_name_and_line_number , which, as the name suggests, adds to all log entries the filename and line number where the entry was generated. These options can be used independently of each other, but they are often more useful for debugging when used together. 
Warning Using these options, especially log_json and log_response_body , can add a lot of data to your log file.","title":"Logging"},{"location":"logging/#appending-to-vs-overwriting-your-log-file","text":"By default, new entries are appended to this log, unless you indicate that the log file should be overwritten each time Workbench is run by providing the log_file_mode configuration option with a value of \"w\": log_file_mode: w","title":"Appending to vs. overwriting your log file"},{"location":"logging/#logging-debugging-information","text":"Workbench doesn't provide a way to set the amount of detail in its log, but several options are available that are useful for debugging and troubleshooting. These options, when set to true , write raw values used in the REST requests to Drupal: log_request_url : Logs the request URL and its method (GET, POST, etc.). log_json : Logs the raw JSON that Workbench uses in POST, PUT, and PATCH requests. log_headers : Logs the raw HTTP headers used in all requests. log_response_status_code : Logs the HTTP response code. log_response_body : Logs the raw HTTP response body. Another configuration setting that is useful during debugging is log_file_name_and_line_number , which, as the name suggests, adds to all log entries the filename and line number where the entry was generated. These options can be used independently of each other, but they are often more useful for debugging when used together. Warning Using these options, especially log_json and log_response_body , can add a lot of data to your log file.","title":"Logging debugging information"},{"location":"media_track_files/","text":"Video and audio service file media can have accompanying track files. These files can be added in \"create\" tasks by including a column in your CSV file that contains the information Islandora Workbench needs to create these files. Note 1) This feature of Workbench only works with media track files that are stored in a file field on Audio and Video media. In the Starter Site, these fields are named field_track . If you store your track files in other fields on the Audio or Video media, you can identify those fields using the instructions below. 2) In order to create media track files as described below, you must configure Drupal in the following ways. First, the \"File\" media type's \"field_media_file\" field must be configured to accept files with the \".vtt\" extension. To do this, visit /admin/structure/media/manage/file/fields/media.file.field_media_file and add \"vtt\" to the list of allowed file extensions. 3) You must ensure that captions are enabled in the respective media display configurations. To do this for video, visit /admin/structure/media/manage/video/display and select \"Video with Captions\" in the \"Video file\" entry. For audio, visit /admin/structure/media/manage/audio/display and select \"Audio with Captions\" in the \"Audio file\" entry. Since track files are part of the media's service file, Workbench will only add track files to media that are tagged as \"Service File\" in your Workbench CSV using a \"media_use_tid\" column (or using the media_use_tid configuration setting). Tagging video and audio files as both \"Original File\" and \"Service File\" will also work. Running --check will tell you if your media_use_tid values (either in the media_use_tid configuration setting or in row-level values in your CSV) are compatible with creating media track files.
Workbench cannot add track files to service files generated by Islandora - it can only create track files that accompany service files it creates. The header for this CSV column looks like media:video:field_track - the string \"media\" followed by a colon, which is then followed by a media type (in a standard Islandora configuration, either \"video\" or \"audio\"), which is then followed by another colon and the machine name of the field on the media that holds the track file (in a standard Islandora configuration, this is \"field_track\" for both video and audio). If you have a custom setup and need to override these defaults, you can do so using the media_track_file_fields configuration setting: media_track_file_fields: audio: field_my_custom_track_file_field mycustommmedia: field_another_custom_track_file_field In this case, your column header would look like media:mycustommmedia:field_another_custom_track_file_field . You only need to use this configuration setting if you have a custom media type, or you have configured your Islandora so that audio or video uses a nonstandard track file field. Note that this setting replaces the default values of video: field_track and audio: field_track , so if you wanted to retain those values and add your own, your configuration file would need to contain something like this: media_track_file_fields: audio: field_my_custom_track_file_field video: field_track mycustommmediatype: field_another_custom_track_file_field In the track column in your input CSV, you specify, in the following order and separated by colons, the label for the track, its \"kind\" (one of \"subtitles\", \"descriptions\", \"metadata\", \"captions\", or \"chapters\"), the language code for the track (\"en\", \"fr\", \"es\", etc.), and the absolute path to the track file, which must have the extension \".vtt\" (the extension may be in upper or lower case). Here is an example CSV file that creates track files for accompanying video files: id,field_model,title,file,media:video:field_track 001,Video,First video,first.mp4,Transcript:subtitles:en:/path/to/video001/vtt/file.vtt 002,Image,An image,test.png, 003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt 004,Video,Third video,third.mp4, Warning Since a colon is used to separate the parts of the track data, you can't use a colon in the label. A label value like \"Transcript: The Making Of\" will invalidate the entire track data for that CSV row, and Workbench will skip creating it. If you need to have a colon in your track label, you will need to update the label value manually in the video or audio media's add/edit form. However, if you are running Workbench on Windows, you can use absolute file paths that contain colons, such as c:\\Users\\mark\\Documents\\my_vtt_file.vtt . You can mix CSV entries that contain track file information with those that do not (as with row 002 above, which is for an image), and also omit the track data for video and audio files that don't have an accompanying track file. If there is no value in the media track column, Workbench will not attempt to create a media track file. 
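The structure of a track value can be illustrated with a short sketch (hypothetical code, not Workbench's own parser) showing why a colon inside the label is a problem while the colon in a Windows file path is not:

# Hypothetical illustration of how a track value decomposes into its four parts.
value = "Transcript:subtitles:en:c:\\Users\\mark\\Documents\\my_vtt_file.vtt"

# Splitting on at most three colons yields label, kind, language code, and file path;
# a colon later in the path (as in a Windows drive letter) survives intact.
label, kind, language, path = value.split(":", 3)
print(label, kind, language, path, sep=" | ")

# A colon inside the label (e.g. "Transcript: The Making Of") would shift every part,
# which is why such a value invalidates the whole track entry.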
You can also add multiple track files for a single video or audio file in the same way you add multiple values in any field: id,field_model,title,file,media:video:field_track 003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt|Transcript:subtitles:en:/path/to/another/track_file.vtt Note If you add multiple track files to a single video or audio, only the first one listed in the CSV data for that entry will be configured as the \"Default\" track. If your CSV has entries for both audio and video files, you will need to include separate track columns, one for each media type: id,field_model,title,file,media:video:field_track,media:audio:field_track 001,Video,First video,first.mp4,Transcript:subtitles:en:/path/to/video001/vtt/file.vtt, 002,Image,An image,test.png,, 003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt, 004,Audio,An audio,,Transcript:subtitles:en:/path/to/audio004/vtt/file.vtt","title":"Creating media track files"},{"location":"media_types/","text":"Overriding Workbench's default extension to media type mappings Note Drupal's use of Media types (image, video, document, etc.) is distinct from Islandora's use of \"model\", which identifies an intellectual entity as an image, video, collection, compound object, newspaper, etc. By default Workbench defines the following file extension to media type mapping: File extensions Media type png, gif, jpg, jpeg image pdf, doc, docx, ppt, pptx document tif, tiff, jp2, zip, tar file mp3, wav, aac audio mp4 video txt extracted_text If a file's extension is not defined in this default mapping, the media is assigned the \"file\" type. If you need to override this default mapping, you can do so in two ways: If the override applies to all files named in your CSV's file column, use the media_type configuration option, for example media_type: document ). Use this option if all of the files in your batch are to be assigned the same media type, but their extensions are not defined in the default mapping or you wish to override the default mapping. On a per file extension basis, via a mapping in the media_types_override option in your configuration file like this one: media_types_override: - video: ['mp4', 'ogg'] Use the media_types_override option if each of the files named in your CSV's file column are to be assigned an extension-specific media type, and their extensions are not defined in the default mapping (or add to the extensions in the default mapping, as in this example). Note that: If a file's extension is not present in the default mapping or in the media_types_override custom mapping, the media is assigned the \"file\" type. If you use the media_types_override configuration option, your mapping replaces Workbench's default mappings for the specified media type. This means that if you want to retain the default media type mapping for a file extension, you need to include it in the mapping, as illustrated by the presence of \"mp4\" in the example above. If both media_type and media_types_override are included in the config file, the mapping in media_types_override is ignored and the media type assigned in media_type is used. Overriding Workbench's default MIME type to file extension mappings For remote files, in other words files that start with http or https , Workbench relies on the MIME type provided by the remote web server to determine the extension of the temporary file that it writes locally. 
If you are getting errors indicating that a file extension is not registered for a given media type, and you suspect that the extensions are wrong, you can include the mimetype_extensions setting in your config file to tell Workbench which extensions to use for a given MIME type. Here is a (hypothetical) example that tells Workbench to assign the '.foo' extension to files with the MIME type 'image/jp2' and the extension '.bar' to files with the MIME type 'image/jpeg': mimetype_extensions: 'image/jp2': '.foo' 'image/jpeg': '.bar' Overriding Workbench's default file extension to MIME type mappings It may be necessary to assign a media a MIME type that is different from the MIME type that Drupal ordinarily derives from a given extension. The best example of this is that media for hOCR files need to have the MIME type text/vnd.hocr+html . Without explicitly indicating that MIME type, Drupal will assign the media for an .hocr file, on creation, the catch-all MIME type application/octet-stream . Workbench automatically assigns files ending in .hocr the correct MIME type. But, if you want to override that for some reason, or want to tell Workbench to create a media with a specific MIME type from a file with a specific extension, you can add to your configuration file an extension-to-MIME-type mapping like this (the leading . in the extension on the left is optional): extensions_to_mimetypes: 'mbox': 'application/mbox' Assigning a media type by Media Use URI We typically assign a media type that corresponds with the asset's MIME type. However, there are instances where two media types share the same MIME type. A notable example is FITS XML: while assets with the MIME type application/xml are usually assigned a file media type, the fits_technical_metadata designation is more suitable in this case. To assign a specific media type to an asset based on its intended use, use media_type_by_media_use . Below is an example demonstrating how to assign a FITS-tagged XML file to the fits_technical_metadata media type . media_type_by_media_use: - https://projects.iq.harvard.edu/fits: fits_technical_metadata Configuring a custom media type Islandora ships with a set of default media types, including audio, document, extracted text, file, FITS technical metadata, image, and video. If you want to add your own custom media type, you need to tell Workbench two things: which file extension(s) should map to the new media type, and which field on the new media type is used to store the file associated with the media. To satisfy the first requirement, use the media_type or media_types_override option as described above. To satisfy the second requirement, use Workbench's media_type_file_fields option. The values in the media_type_file_fields option are the machine name of the media type and the machine name of the \"File\" field configured for that media. To determine the machine name of your media type, go to the field configuration of your media types (Admin > Structure > Media types) choose your custom media type choose the \"Manage fields\" operation for the media type. The URL of the Drupal page you are now at should look like /admin/structure/media/manage/my_custom_media/fields . The machine name of the media type is in the second-last position in the URL. In this example, it's my_custom_media . in the list of fields, look for the one that says \"File\" in the \"Field type\" column the field machine name you want is in that row's \"Machine name\" column.
Here's an example that tells Workbench that the custom media type \"Custom media\" uses the \"field_media_file\" field: media_type_file_fields: - my_custom_media: field_media_file Put together, the two configuration options would look like this: media_types_override: - my_custom_media: ['cus'] media_type_file_fields: - my_custom_media: field_media_file In this example, your Workbench job is creating media of varying types (for example, images, videos, and documents, all using the default extension-to-media type mappings. If all the files you are adding in the Workbench job all have the same media type (in the following example, your \"my_custom_media\" type), you could use this configuration: media_type: my_custom_media media_type_file_fields: - my_custom_media: field_media_file","title":"Configuring media types"},{"location":"media_types/#overriding-workbenchs-default-extension-to-media-type-mappings","text":"Note Drupal's use of Media types (image, video, document, etc.) is distinct from Islandora's use of \"model\", which identifies an intellectual entity as an image, video, collection, compound object, newspaper, etc. By default Workbench defines the following file extension to media type mapping: File extensions Media type png, gif, jpg, jpeg image pdf, doc, docx, ppt, pptx document tif, tiff, jp2, zip, tar file mp3, wav, aac audio mp4 video txt extracted_text If a file's extension is not defined in this default mapping, the media is assigned the \"file\" type. If you need to override this default mapping, you can do so in two ways: If the override applies to all files named in your CSV's file column, use the media_type configuration option, for example media_type: document ). Use this option if all of the files in your batch are to be assigned the same media type, but their extensions are not defined in the default mapping or you wish to override the default mapping. On a per file extension basis, via a mapping in the media_types_override option in your configuration file like this one: media_types_override: - video: ['mp4', 'ogg'] Use the media_types_override option if each of the files named in your CSV's file column are to be assigned an extension-specific media type, and their extensions are not defined in the default mapping (or add to the extensions in the default mapping, as in this example). Note that: If a file's extension is not present in the default mapping or in the media_types_override custom mapping, the media is assigned the \"file\" type. If you use the media_types_override configuration option, your mapping replaces Workbench's default mappings for the specified media type. This means that if you want to retain the default media type mapping for a file extension, you need to include it in the mapping, as illustrated by the presence of \"mp4\" in the example above. If both media_type and media_types_override are included in the config file, the mapping in media_types_override is ignored and the media type assigned in media_type is used.","title":"Overriding Workbench's default extension to media type mappings"},{"location":"media_types/#overriding-workbenchs-default-mime-type-to-file-extension-mappings","text":"For remote files, in other words files that start with http or https , Workbench relies on the MIME type provided by the remote web server to determine the extension of the temporary file that it writes locally. 
If you are getting errors indicating that a file extension is not registered for a given media type, and you suspect that the extensions are wrong, you can include the mimetype_extensions setting in your config file to tell Workbench which extensions to use for a given MIME type. Here is a (hypothetical) example that tells Workbench to assign the '.foo' extension to files with the MIME type 'image/jp2' and the extension '.bar' to files with the MIME type 'image/jpeg': mimetype_extensions: 'image/jp2': '.foo' 'image/jpeg': '.bar'","title":"Overriding Workbench's default MIME type to file extension mappings"},{"location":"media_types/#overriding-workbenchs-default-file-extension-to-mime-type-mappings","text":"It may be necessary to assign a media a MIME type that is different from the MIME type that Drupal ordinarily derives from a given extension. The best example of this is that media for hOCR files need to have the MIME type text/vnd.hocr+html . Without explicitly indicating that MIME type, Drupal will assign the media for an .hocr file, on creation, the catch-all MIME type application/octet-stream . Workbench automatically assigns files ending in .hocr the correct MIME type. But, if you want to override that for some reason, or want to tell Workbench to create a media with a specific MIME type from a file with a specific extension, you can add to your configuration file an extension-to-MIME-type mapping like this (the leading . in the extension on the left is optional): extensions_to_mimetypes: 'mbox': 'application/mbox'","title":"Overriding Workbench's default file extension to MIME type mappings"},{"location":"media_types/#assigning-a-media-type-by-media-use-uri","text":"We typically assign a media type that corresponds with the asset's MIME type. However, there are instances where two media types share the same MIME type. A notable example is FITS XML: while assets with the MIME type application/xml are usually assigned a file media type, the fits_technical_metadata designation is more suitable in this case. To assign a specific media type to an asset based on its intended use, use media_type_by_media_use . Below is an example demonstrating how to assign a FITS-tagged XML file to the fits_technical_metadata media type . media_type_by_media_use: - https://projects.iq.harvard.edu/fits: fits_technical_metadata","title":"Assigning a media type by Media Use URI"},{"location":"media_types/#configuring-a-custom-media-type","text":"Islandora ships with a set of default media types, including audio, document, extracted text, file, FITS technical metadata, image, and video. If you want to add your own custom media type, you need to tell Workbench two things: which file extension(s) should map to the new media type, and which field on the new media type is used to store the file associated with the media. To satisfy the first requirement, use the media_type or media_types_override option as described above. To satisfy the second requirement, use Workbench's media_type_file_fields option. The values in the media_type_file_fields option are the machine name of the media type and the machine name of the \"File\" field configured for that media. To determine the machine name of your media type, go to the field configuration of your media types (Admin > Structure > Media types), choose your custom media type, then choose the \"Manage fields\" operation for the media type. The URL of the Drupal page you are now at should look like /admin/structure/media/manage/my_custom_media/fields .
The machine name of the media type is in the second-last position in the URL. In this example, it's my_custom_media . In the list of fields, look for the one that says \"File\" in the \"Field type\" column; the field machine name you want is in that row's \"Machine name\" column. Here's an example that tells Workbench that the custom media type \"Custom media\" uses the \"field_media_file\" field: media_type_file_fields: - my_custom_media: field_media_file Put together, the two configuration options would look like this: media_types_override: - my_custom_media: ['cus'] media_type_file_fields: - my_custom_media: field_media_file In this example, your Workbench job is creating media of varying types (for example, images, videos, and documents), all using the default extension-to-media type mappings. If all the files you are adding in the Workbench job have the same media type (in the following example, your \"my_custom_media\" type), you could use this configuration: media_type: my_custom_media media_type_file_fields: - my_custom_media: field_media_file","title":"Configuring a custom media type"},{"location":"nodes_only/","text":"During a create task, if you want to create nodes but not any accompanying media, for example if you are testing your metadata values or creating collection nodes, you can include the nodes_only: true option in your configuration file: task: create host: \"http://localhost:8000\" username: admin password: islandora nodes_only: true If this is present, Islandora Workbench will only create nodes and will skip all media creation. During --check , it will ignore anything in your CSV's files field (in fact, your CSV doesn't even need a file column). If nodes_only is true , your configuration file for the create task doesn't need a media_use_tid , drupal_filesystem , or media_type / media_types_override option.","title":"Creating nodes without media"},{"location":"paged_and_compound/","text":"Islandora Workbench provides three ways to create paged and compound content: using a subdirectory structure to define the relationship between the parent item and its children using page-level metadata in the CSV to establish that relationship using a secondary task. Using subdirectories Note Information in this section applies to all compound content, not just \"paged content\". That term is used here since the most common use of this method will be for creating paged content. In other words, where \"page\" is used below, it can be substituted with \"child\". Enable this method by including paged_content_from_directories: true in your configuration file. Use this method when you are creating books, newspaper issues, or other paged content where your pages don't have their own metadata. CSV and directory structure This method groups page-level files into subdirectories that correspond to each parent, and does not require (or allow) page-level metadata in the CSV file. Only the parent (book, newspaper issue, etc.) has a row on the CSV file, e.g.: id,title,field_model book1,How to Use Islandora Workbench like a Pro,Paged Content book2,Using Islandora Workbench for Fun and Profit,Paged Content Note Unlike every other Islandora Workbench \"create\" configuration, the metadata CSV should not contain a file column (however, you can include a directory column as described below). This means that content created using this method cannot be created using the same CSV file as other content.
Each parent's pages are located in a subdirectory of the input directory that is named by default to match the value of the id field of the parent item they are pages of: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv If you don't want to use your id column to name the directory that stores pages, you can include a directory column in your input CSV and add the page_files_source_dir_field: directory setting to your config file. The values in the directory column can then contain the names of the page directories. If you do that, your CSV would look like this: id,title,field_model,directory sfu_book_1,How to Use Islandora Workbench like a Pro,Paged Content,book1 sfu_book_2,Using Islandora Workbench for Fun and Profit,Paged Content,book2 Filename conventions The page filenames have significance. The sequence of the page is determined by the last segment of each filename before the extension, and is separated from the rest of the filename by a dash ( - ), although you can use another character by setting the paged_content_sequence_separator option in your configuration file. These sequence indicators are essentially physical page numbers, starting a \"1\" (not \"0\"). For example, using the filenames for \"book1\" above, the sequence of \"page-001.jpg\" is \"001\". Dashes (or whatever your separator character is) can exist elsewhere in filenames, since Workbench will always use the string after the last dash as the sequence number; for example, the sequence of \"isbn-1843341778-001.jpg\" for \"book2\" is also \"001\". Workbench takes this sequence number, strips all leading zeros, and uses it to populate the field_weight in the page nodes, so \"001\" becomes a weight value of 1, \"002\" becomes a weight value of 2, and so on. Important things to note when using this method: To use this method of creating paged content, you must include paged_content_page_model_tid in your configuration file and set it to your Islandora's term ID for the \"Page\" term in the Islandora Models vocabulary (or to http://id.loc.gov/ontologies/bibframe/part ). The Islandora model of the parent is not set automatically. You need to include a field_model value for each item in your CSV file, commonly \"Paged content\" or \"Publication issue\". You can apply CSV value templates to paged/child items using values from their respective parents. See the \" CSV value templates \" documentation for more information. You should also include a field_display_hints column in your CSV. This value is applied to the parent nodes and also the page nodes, unless the paged_content_page_display_hints setting is present in you configuration file. However, if you normally don't set the \"Display hints\" field in your objects but use a Context to determine how objects display, you should not include a field_display_hints column in your CSV file. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/paged item relationships will still work. 
The Drupal content type for page nodes is inherited from the parent, unless you specify a different content type in the paged_content_page_content_type setting in your configuration file. If your page directories contain files other than page images, you need to include the paged_content_image_file_extension setting in your configuration. Otherwise, Workbench can't tell which files to create pages from. If you don't want to use your id column to name the directories that contain each item's pages, you can include page_files_source_dir_field: directory to your config file and add a directory column to your input CSV to name the directories. Applying field data to pages/children created from subdirectories Titles for pages/children created from subdirectories are generated automatically using the pattern parent_title + , page + sequence_number , where \"parent title\" is inherited from the page's parent node and \"sequence number\" is the page's sequence. For example, if a page's parent has the title \"How to Write a Book\" and its sequence number is 450, its automatically generated title will be \"How to Write a Book, page 450\". You can override this pattern by including the page_title_template setting in your configuration file. The value of this setting is a simple string template. The default, which generates the page title pattern described above, is '$parent_title, page $weight' . There are only two variables you can include in the template, $parent_title and $weight , although you do not need to include either one if you don't want that information appearing in your page titles. The Islandora Model applied to all page/child nodes is the one defined in the paged_content_page_model_tid configuration setting. This model is automatically applied to all pages/children created from subdirectories. Fields on pages/children that are configured as required in the parent and page content type are automatically inherited from the parent. No special configuration is necessary. You can add additional (non-required field) metadata to pages/children using CSV value templates during the create task that creates the pages/children from subdirectories. Ingesting pages, their parents, and their \"grandparents\" using a single CSV file In the \"books\" example above, each row in the CSV (i.e., book1, book2) describes a node with the \"Paged Content\" Islandora model; each of the books is the direct parent of the individual page nodes. However, in some cases, you may want to create the pages, their direct parents (each book), and a parent of the parents (let's call it a \"grandparent\" of the pages) at the same time, using the same Workbench job and the same input CSV. Some common use cases for this ability are: creating a node describing a periodical, some nodes describing issues of the periodical, and the pages of each issue, and creating a node describing a book series, a set of nodes describing books in the series, and page nodes for each book. paged_content_from_directories: true in your config file tells Workbench to look in a directory containing page files for each row in your input CSV. If you want to include the pages, the immediate parent of the pages, and the grandparent of the pages in the same CSV, you can create an empty directory for the grandparent node, named after its id value like the other items in your CSV. 
In addition, and importantly, you also need to include a parent_id column in your CSV file to define the relationship between the grandparent and its direct children (in our example, the book nodes). The presence of the parent_id column does not have impact on the parent-child relationship between the books and their pages; that relationship is created automatically, like it is in the \"books\" example above. To illustrate this, let's extend the \"books\" example above to include a higher-level (grandparent to the pages) node that describes the series of books used in that example. Here is the CSV with the new top-level item, and with the addition of the parent_id column to indicate that the paged content items are children of the new \"book000\" node: id,parent_id,title,field_model book000,,How-to Books: A Best-Selling Genre of Books,Compound Object book1,book000,How to Use Islandora Workbench like a Pro,Paged Content book2,book000,Using Islandora Workbench for Fun and Profit,Paged Content The directory structure looks like this (note that the book000 directory should be empty since it doesn't have any pages as direct children): books/ \u251c\u2500\u2500 book000 \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Workbench will warn you that the book000 directory is empty, but that's OK - it will look for, but not find, any pages for that item. The node corresponding to that directory will be created as expected, and values in the parent_id column will ensure that the intended hierarchical relationship between \"book000\" and its child items (the book nodes) is created. Ingesting OCR (and other) files with page images You can tell Workbench to add OCR and other media related to page images when using the \"Using subdirectories\" method of creating paged content. To do this, add the OCR files to your subdirectories, using the base filenames of each page image plus an extension like .txt : books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 isbn-1843341778-001.txt \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Then, add the following settings to your configuration file: paged_content_from_directories: true (as described above) paged_content_page_model_tid (as described above) paged_content_image_file_extension : this is the file extension, without the leading . , of the page images, for example tif , jpg , etc. paged_content_additional_page_media : this is a list of mappings from Media Use term IDs or URIs to the file extensions of the OCR or other files you are ingesting. See the example below. 
An example configuration is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data/paged_content_example standalone_media_url: true paged_content_from_directories: true paged_content_page_model_tid: http://id.loc.gov/ontologies/bibframe/part paged_content_image_file_extension: jpg paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt You can add multiple additional files (for example, OCR and hOCR) if you provide a Media Use term-to-file-extension mapping for each type of file: paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt - https://discoverygarden.ca/use#hocr: hocr You can also use your Drupal's numeric Media Use term IDs in the mappings, like: paged_content_additional_page_media: - 354: txt - 429: hocr Note Using hOCR media for Islandora paged content nodes may not be configured on your Islandora repository; hOCR and the corresponding URI are used here as an example only. In this case, Workbench looks for files with the extensions txt and hocr and creates media for them with respective mapped Media Use terms. The paged content input directory would look like this: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-001.hocr \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-002.hocr \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u251c\u2500\u2500 page-003.hocr \u2502 \u2514\u2500\u2500 page-003.jpg Warning It is important to temporarily disable actions in Contexts that generate media/derivatives that would conflict with additional media you are adding using the method described here. For example, if you are adding OCR files, in the \"Page Derivatives\" Context listed at /admin/structure/context , disable the \"Extract text from PDF or image\" action prior to running Workbench, and be sure to re-enable it afterwards. If you do not do this, the OCR media added by Workbench will get overwritten with the one that Islandora generates using the \"Extract text from PDF or image\" action. Ignoring files in page directories Sometimes files such as \"Thumbs.db\" (on Windows) can creep into page directories. You can tell Workbench to ignore specific files within directories by including the paged_content_ignore_files configuration setting in your config file. Note that the default setting is to ignore \"Thumbs.db\" files. If you want to add additional files, or override that default setting, include the paged_content_ignore_files followed by a list of filenames, e.g.: paged_content_ignore_files: [\"Thumbs.db\", \"scanning_manifest.txt\"] Note that Workbench converts all filenames in the directories and filenames listed in the paged_content_ignore_files setting to lower case before checking to see if they are in this list. For example, if Workbench encounters a filename Scanning_Manifest.TXT , it will match \"scanning_manifest.txt\" in the configuration above configuration. With page/child-level metadata Using this method, the metadata CSV file contains a row for every item, both parents and children. You should use this method when you are creating books, newspaper issues, or other paged or compound content where each page has its own metadata, or when you are creating compound objects of any Islandora model. The file for each page/child is named explicitly in the page/child's file column rather than being in a subdirectory. 
To link the pages to the parent, Workbench establishes parent/child relationships between items with a special parent_id CSV column. Values in the parent_id column, which only apply to rows describing pages/children, are the id value of their parent. For this to work, your CSV file must contain a parent_id field plus the standard Islandora fields field_weight , field_member_of , and field_model (the role of these last three fields will be explained below). The id field is required in all CSV files used to create content, so in this case, your CSV needs both an id field and a parent_id field. The following example illustrates how this works. Here is the raw CSV data: id,parent_id,field_weight,file,title,field_description,field_model,field_member_of 001,,,,Postcard 1,The first postcard,28,197 003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29, 004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29, 002,,,,Postcard 2,The second postcard,28,197 006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29, 007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29, The empty cells make this CSV difficult to read. Here is the same data in a spreadsheet: The data contains rows for two postcards (rows with id values \"001\" and \"002\") plus a back and front for each (the remaining four rows). The parent_id value for items with id values \"003\" and \"004\" is the same as the id value for item \"001\", which will tell Workbench to make both of those items children of item \"001\"; the parent_id value for items with id values \"006\" and \"007\" is the same as the id value for item \"002\", which will tell Workbench to make both of those items children of the item \"002\". We can't populate field_member_of for the child pages in our CSV because we won't have node IDs for the parents until they are created as part of the same batch as the children. In this example, the rows for our postcard objects have empty parent_id , field_weight , and file columns because our postcards are not children of other nodes and don't have their own media. (However, the records for our postcard objects do have a value in field_member_of , which is the node ID of the \"Postcards\" collection that already/hypothetically exists.) Rows for the postcard front and back image objects have a value in their field_weight field, and they have values in their file column because we are creating objects that contain image media. Importantly, they have no value in their field_member_of field because the node ID of the parent isn't known when you create your CSV; instead, Islandora Workbench assigns each child's field_member_of dynamically, just after its parent node is created. Some important things to note: The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. Currently, you need to include the option allow_missing_files: true in your configuration file when using this method to create paged/compound content. See this issue for more information. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/child relationships will still work. The values of the id and parent_id columns do not have to follow any sequential pattern. 
Islandora Workbench treats them as simple strings and matches them on that basis, without looking for sequential relationships of any kind between the two fields. The CSV records for children items don't need to come immediately after the record for their parent, but they do need to come after that CSV record. ( --check will tell you if it finds any child rows that come before their parent rows.) This is because Workbench creates nodes in the order their records are in the CSV file (top to bottom). As long as the parent node has already been created when a child node is created, the parent/child relationship via the child's field_member_of will be correct. See the next paragraph for some suggestions on planning for large ingests of paged or compound items. Currently, you must include values in the children's field_weight column (except when creating a collection and its members at the same time; see below). It may be possible to automatically generate values for this field (see this issue ). Currently, Islandora model values (e.g. \"Paged Content\", \"Page\") are not automatically assigned. You must include the correct \"Islandora Models\" taxonomy term IDs in your field_model column for all parent and child records, as you would for any other Islandora objects you are creating. Like for field_weight , it may be possible to automatically generate values for this field (see this issue ). Since parent items (collections, book-level items, newspaper issue-level items, top-level items in compound items, etc.) need to exist in Drupal before their children can be ingested, you need to plan your \"create\" tasks accordingly. For example: If you want to use a single \"create\" task to ingest all the parents and children at the same time, for each compound item, the parent CSV record must come before the records for the children/pages. If you would rather use multiple \"create\" tasks, you can create all your collections first, then, in subsequent \"create\" tasks, use their respective node IDs in the field_member_of CSV column for their members. If you use a separate \"create\" task to create members of a single collection, you can define the value of field_member_of in a CSV field template . If you are ingesting a large set of books, you can ingest the book-level items first, then use their node IDs in a separate CSV for the pages of all books (each using their parent book node's node ID in their field_member_of column). Or, you could run a separate \"create\" task for each book, and use a CSV field template containing a field_member_of entry containing the book item's node ID. For newspapers, you could create the top-level newspaper first, then use its node ID in a subsequent \"create\" task for both newspaper issues and pages. In this task, the field_member_of column in rows for newspaper issues would contain the newspaper's node ID, but the rows for newspaper pages would have a blank field_member_of and a parent_id using the parent issue's id value. Using a secondary task You can configure Islandora Workbench to execute two \"create\" tasks - a primary and a secondary - that will result in all of the objects described in both CSV files being ingested during the same Workbench job. Parent/child relationships between items are created by referencing the row IDs in the primary task's CSV file from the secondary task's CSV file. 
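As a rough sketch (the IDs, filenames, and titles below are invented for illustration), the primary CSV might look like this: id,title 001,First compound object 003,Second compound object and the secondary CSV might look like this: id,parent_id,field_weight,field_member_of,file,title child_001,001,1,,front.jpg,First child of the first compound object child_002,003,1,,back.jpg,First child of the second compound object The field_member_of column is present but left empty; as described below, Workbench populates it with the node ID of each child's parent.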
The benefit of using this method is that each task has its own configuration file, allowing you to create children that have a different Drupal content type than their parents. The primary task's CSV describes the parent objects, and the secondary task's CSV describes the children. The two are linked via references from children CSV's parent_id values to their parent's id values, much the same way as in the \"With page/child-level metadata\" method described above. The difference is that the references span CSV files. The parents and children each have their own CSV input file (and also their own configuration file). Each task is a standard Islandora Workbench \"create\" task, joined by one setting in the primary's configuration file, secondary_tasks , as described below. In the following example, the top CSV file (the primary) describes the parents, and the bottom CSV file (the secondary) describes the children: As you can see, values in the parent_id column in the secondary CSV reference values in the id column in the primary CSV: parent_id 001 in the secondary CSV matches id 001 in the primary, parent_id 003 in the secondary matches id 003 in the primary, and so on. You configure secondary tasks by adding the secondary_tasks setting to your primary configuration file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora # This is the setting that links the two configuration files together. secondary_tasks: ['children.yml'] input_csv: parents.csv nodes_only: true In the secondary_tasks setting, you name the configuration file of the secondary task. The secondary task's configuration file (in this example, named \"children.yml\") contains no indication that it's a secondary task: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: kids.csv csv_field_templates: - field_model: http://purl.org/coar/resource_type/c_c513 query_csv_id_to_node_id_map_for_parents: true Note The CSV ID to node ID map is required in secondary create tasks. Workbench will automatically change the query_csv_id_to_node_id_map_for_parents to true , regardless of whether that setting is in your secondary task's config file. Note The nodes_only setting in the above example primary configuration file and the csv_field_templates setting in the secondary configuration file are not relevant to the primary/secondary task functionality; they're included to illustrate that the two configuration files can differ. When you run Workbench, it executes the primary task first, then the secondary task. Workbench keeps track of pairs of id + node IDs created in the primary task, and during the execution of the secondary task, uses these to populate the field_member_of values in the secondary task with the node IDs corresponding to the referenced primary id values. Some things to note about secondary tasks: Only \"create\" tasks can be used as the primary and secondary tasks. When you have a secondary task configured, running --check will validate both tasks' configuration and input data. The secondary CSV must contain parent_id and field_member_of columns. field_member_of must be empty, since it is auto-populated by Workbench using node IDs from the newly created parent objects. If you want to assign an order to the child objects within each parent object, include field_weight with the appropriate values (1, 2, 3, etc., the lower numbers being earlier/higher in sort order). 
If a row in the secondary task CSV does not have a parent_id that matches an id of a row in the primary CSV, or if there is a matching row in the primary CSV and Workbench failed to create the described node, Workbench will skip creating the child and add an entry to the log indicating it did so. As already stated, each task has its own configuration file, which means that you can specify a content_type value in your secondary configuration file that differs from the content_type of the primary task. You can include more than one secondary task in your configuration. For example, secondary_tasks: ['first.yml', 'second.yml'] will execute the primary task, then the \"first.yml\" secondary task, then the \"second.yml\" secondary task in that order. You would use multiple secondary tasks if you wanted to add children of different content types to the parent nodes. Specifying paths to the python interpreter and to the workbench script When using secondary tasks, there are a couple of situations where you may need to tell Workbench where the python interpreter is located, and where the \"workbench\" script is located. The first is when you use a secondary task within a scheduled job (such as running Workbench via Linux's cron). Depending on how you configure the cron job, you will likely need to tell Workbench what the absolute path to the python interpreter is and what the path to the workbench script is. This is because, unless your cronjob changes into Workbench's working directory, Workbench will be looking in the wrong directory for the secondary task. The two config options you should use are: path_to_python path_to_workbench_script An example of using these settings is: secondary_tasks: ['children.yml'] path_to_python: '/usr/bin/python' path_to_workbench_script: '/home/mark/islandora_workbench/workbench' The second situation is when using a secondary task when running Workbench in Windows and \"python.exe\" is not in the PATH of the user running the scheduled job. Specifying the absolute path to \"python.exe\" will ensure that Workbench can execute the secondary task properly, like this: secondary_tasks: ['children.yml'] path_to_python: 'c:/program files/python39/python.exe' path_to_workbench_script: 'd:/users/mark/islandora_workbench/workbench' Creating parent/child relationships across Workbench sessions It is possible to use parent_id values in your CSV that refer to id values from earlier Workbench sessions. In other words, you don't need to create parents and their member/child nodes within the same Workbench job; you can create parents in an earlier job and refer to their id values in later jobs. This is possible because during create tasks, Workbench records each newly created node ID and its corresponding value from the input CSV's id (or configured equivalent) column. It also records any values from the CSV parent_id column, if they exist. This data is stored in a simple SQLite database called the \" CSV ID to node ID map \". Because this database persists across Workbench sessions, you can use id values in your input CSV's parent_id column from previously loaded CSV files. The mapping between the previously loaded parents' id values and the values in your current CSV's parent_id column are stored in the CSV ID to node ID map database. Note It is important to use unique values in your CSV id (or configured equivalent) column, since if duplicate ID values exist in this database, Workbench can't know which corresponding node ID to use. 
In this case, Workbench will create the child node, but it won't assign a parent to it. --check will inform you if this happens with messages like Warning: Query of ID map for parent ID \"0002\" returned multiple node IDs: (771, 772, 773, 774, 778, 779). , and your Workbench log will also document that there are duplicate IDs. Warning By default, Workbench only checks the CSV ID to node ID map for parent IDs created in the same session as the children. If you want to assign children to parents created in previous Workbench sessions, you need to set the query_csv_id_to_node_id_map_for_parents configuration setting to true . Creating collections and members together Using a variation of the \"With page/child-level metadata\" approach, you can create a collection node and assign members to it at the same time (i.e., in a single Workbench job). Here is a simple example CSV which shows the references from the members' parent_id field to the collections' id field: id,parent_id,file,title,field_model,field_member_of,field_weight 1,,,A collection of animal photos,24,, 2,1,cat.jpg,Picture of a cat,25,, 3,1,dog.jpg,Picture of a dog,25,, 3,1,horse.jpg,Picture of a horse,25,, The use of the parent_id and field_member_of fields is the same here as when creating paged or compound children. However, unlike with paged or compound objects, in this case we leave the values in field_weight empty, since Islandora collections don't use field_weight to determine order of members. Collection Views are sorted using other fields. Warning Creating collection nodes and member nodes using this method assumes that collection nodes and member nodes have the same Drupal content type. If your collection objects have a Drupal content type that differs from their members' content type, you need to use the \"Using a secondary task\" method to ingest collections and members in the same Workbench job. Summary The following table summarizes the different ways Workbench can be used to create parent/child relationships between nodes: Method Relationships created by field_weight Advantage Subdirectories Directory structure Do not include column in CSV; autopopulated. Useful for creating paged content where pages don't have their own metadata. Parent/child-level metadata in same CSV References from child's parent_id to parent's id in same CSV data Column required; values required in child rows Allows including parent and child metadata in same CSV. Secondary task References from parent_id in child CSV file to id in parent CSV file Column and values recommended in secondary (child) CSV data Primary and secondary tasks have their own configuration and CSV files, which allows children to have a Drupal content type that differs from their parents' content type. Allows creation of parents and children in same Workbench job. Collections and members together References from child (member) parent_id fields to parent (collection) id fields in same CSV data Column required in CSV but must be empty (collections do not use weight to determine sort order) Allows creation of collection and members in same Islandora Workbench job.","title":"Creating paged, compound, and collection content"},{"location":"paged_and_compound/#using-subdirectories","text":"Note Information in this section applies to all compound content, not just \"paged content\". That term is used here since the most common use of this method will be for creating paged content. In other words, where \"page\" is used below, it can be substituted with \"child\". 
Enable this method by including paged_content_from_directories: true in your configuration file. Use this method when you are creating books, newspaper issues, or other paged content where your pages don't have their own metadata.","title":"Using subdirectories"},{"location":"paged_and_compound/#csv-and-directory-structure","text":"This method groups page-level files into subdirectories that correspond to each parent, and does not require (or allow) page-level metadata in the CSV file. Only the parent (book, newspaper issue, etc.) has a row on the CSV file, e.g.: id,title,field_model book1,How to Use Islandora Workbench like a Pro,Paged Content book2,Using Islandora Workbench for Fun and Profit,Paged Content Note Unlike every other Islandora Workbench \"create\" configuration, the metadata CSV should not contain a file column (however, you can include a directory column as described below). This means that content created using this method cannot be created using the same CSV file as other content. Each parent's pages are located in a subdirectory of the input directory that is named by default to match the value of the id field of the parent item they are pages of: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv If you don't want to use your id column to name the directory that stores pages, you can include a directory column in your input CSV and add the page_files_source_dir_field: directory setting to your config file. The values in the directory column can then contain the names of the page directories. If you do that, your CSV would look like this: id,title,field_model,directory sfu_book_1,How to Use Islandora Workbench like a Pro,Paged Content,book1 sfu_book_2,Using Islandora Workbench for Fun and Profit,Paged Content,book2","title":"CSV and directory structure"},{"location":"paged_and_compound/#filename-conventions","text":"The page filenames have significance. The sequence of the page is determined by the last segment of each filename before the extension, and is separated from the rest of the filename by a dash ( - ), although you can use another character by setting the paged_content_sequence_separator option in your configuration file. These sequence indicators are essentially physical page numbers, starting a \"1\" (not \"0\"). For example, using the filenames for \"book1\" above, the sequence of \"page-001.jpg\" is \"001\". Dashes (or whatever your separator character is) can exist elsewhere in filenames, since Workbench will always use the string after the last dash as the sequence number; for example, the sequence of \"isbn-1843341778-001.jpg\" for \"book2\" is also \"001\". Workbench takes this sequence number, strips all leading zeros, and uses it to populate the field_weight in the page nodes, so \"001\" becomes a weight value of 1, \"002\" becomes a weight value of 2, and so on. Important things to note when using this method: To use this method of creating paged content, you must include paged_content_page_model_tid in your configuration file and set it to your Islandora's term ID for the \"Page\" term in the Islandora Models vocabulary (or to http://id.loc.gov/ontologies/bibframe/part ). The Islandora model of the parent is not set automatically. 
You need to include a field_model value for each item in your CSV file, commonly \"Paged content\" or \"Publication issue\". You can apply CSV value templates to paged/child items using values from their respective parents. See the \" CSV value templates \" documentation for more information. You should also include a field_display_hints column in your CSV. This value is applied to the parent nodes and also the page nodes, unless the paged_content_page_display_hints setting is present in your configuration file. However, if you normally don't set the \"Display hints\" field in your objects but use a Context to determine how objects display, you should not include a field_display_hints column in your CSV file. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/paged item relationships will still work. The Drupal content type for page nodes is inherited from the parent, unless you specify a different content type in the paged_content_page_content_type setting in your configuration file. If your page directories contain files other than page images, you need to include the paged_content_image_file_extension setting in your configuration. Otherwise, Workbench can't tell which files to create pages from. If you don't want to use your id column to name the directories that contain each item's pages, you can add page_files_source_dir_field: directory to your config file and include a directory column in your input CSV to name the directories.","title":"Filename conventions"},{"location":"paged_and_compound/#applying-field-data-to-pageschildren-created-from-subdirectories","text":"Titles for pages/children created from subdirectories are generated automatically using the pattern parent_title + , page + sequence_number , where \"parent title\" is inherited from the page's parent node and \"sequence number\" is the page's sequence. For example, if a page's parent has the title \"How to Write a Book\" and its sequence number is 450, its automatically generated title will be \"How to Write a Book, page 450\". You can override this pattern by including the page_title_template setting in your configuration file. The value of this setting is a simple string template. The default, which generates the page title pattern described above, is '$parent_title, page $weight' . There are only two variables you can include in the template, $parent_title and $weight , although you do not need to include either one if you don't want that information appearing in your page titles. The Islandora Model applied to all page/child nodes is the one defined in the paged_content_page_model_tid configuration setting. This model is automatically applied to all pages/children created from subdirectories. Fields on pages/children that are configured as required in the parent and page content type are automatically inherited from the parent. No special configuration is necessary.
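As an aside on the page_title_template setting described above, a hypothetical override that rearranges the two available variables could be: page_title_template: 'Leaf $weight of $parent_title' With that template, a page whose parent is titled \"How to Write a Book\" and whose sequence number is 450 would be titled \"Leaf 450 of How to Write a Book\".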
You can add additional (non-required field) metadata to pages/children using CSV value templates during the create task that creates the pages/children from subdirectories.","title":"Applying field data to pages/children created from subdirectories"},{"location":"paged_and_compound/#ingesting-pages-their-parents-and-their-grandparents-using-a-single-csv-file","text":"In the \"books\" example above, each row in the CSV (i.e., book1, book2) describes a node with the \"Paged Content\" Islandora model; each of the books is the direct parent of the individual page nodes. However, in some cases, you may want to create the pages, their direct parents (each book), and a parent of the parents (let's call it a \"grandparent\" of the pages) at the same time, using the same Workbench job and the same input CSV. Some common use cases for this ability are: creating a node describing a periodical, some nodes describing issues of the periodical, and the pages of each issue, and creating a node describing a book series, a set of nodes describing books in the series, and page nodes for each book. paged_content_from_directories: true in your config file tells Workbench to look in a directory containing page files for each row in your input CSV. If you want to include the pages, the immediate parent of the pages, and the grandparent of the pages in the same CSV, you can create an empty directory for the grandparent node, named after its id value like the other items in your CSV. In addition, and importantly, you also need to include a parent_id column in your CSV file to define the relationship between the grandparent and its direct children (in our example, the book nodes). The presence of the parent_id column does not have impact on the parent-child relationship between the books and their pages; that relationship is created automatically, like it is in the \"books\" example above. To illustrate this, let's extend the \"books\" example above to include a higher-level (grandparent to the pages) node that describes the series of books used in that example. Here is the CSV with the new top-level item, and with the addition of the parent_id column to indicate that the paged content items are children of the new \"book000\" node: id,parent_id,title,field_model book000,,How-to Books: A Best-Selling Genre of Books,Compound Object book1,book000,How to Use Islandora Workbench like a Pro,Paged Content book2,book000,Using Islandora Workbench for Fun and Profit,Paged Content The directory structure looks like this (note that the book000 directory should be empty since it doesn't have any pages as direct children): books/ \u251c\u2500\u2500 book000 \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Workbench will warn you that the book000 directory is empty, but that's OK - it will look for, but not find, any pages for that item. 
The node corresponding to that directory will be created as expected, and values in the parent_id column will ensure that the intended hierarchical relationship between \"book000\" and its child items (the book nodes) is created.","title":"Ingesting pages, their parents, and their \"grandparents\" using a single CSV file"},{"location":"paged_and_compound/#ingesting-ocr-and-other-files-with-page-images","text":"You can tell Workbench to add OCR and other media related to page images when using the \"Using subdirectories\" method of creating paged content. To do this, add the OCR files to your subdirectories, using the base filenames of each page image plus an extension like .txt : books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 isbn-1843341778-001.txt \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Then, add the following settings to your configuration file: paged_content_from_directories: true (as described above) paged_content_page_model_tid (as described above) paged_content_image_file_extension : this is the file extension, without the leading . , of the page images, for example tif , jpg , etc. paged_content_additional_page_media : this is a list of mappings from Media Use term IDs or URIs to the file extensions of the OCR or other files you are ingesting. See the example below. An example configuration is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data/paged_content_example standalone_media_url: true paged_content_from_directories: true paged_content_page_model_tid: http://id.loc.gov/ontologies/bibframe/part paged_content_image_file_extension: jpg paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt You can add multiple additional files (for example, OCR and hOCR) if you provide a Media Use term-to-file-extension mapping for each type of file: paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt - https://discoverygarden.ca/use#hocr: hocr You can also use your Drupal's numeric Media Use term IDs in the mappings, like: paged_content_additional_page_media: - 354: txt - 429: hocr Note Using hOCR media for Islandora paged content nodes may not be configured on your Islandora repository; hOCR and the corresponding URI are used here as an example only. In this case, Workbench looks for files with the extensions txt and hocr and creates media for them with respective mapped Media Use terms. 
The paged content input directory would look like this: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-001.hocr \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-002.hocr \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u251c\u2500\u2500 page-003.hocr \u2502 \u2514\u2500\u2500 page-003.jpg Warning It is important to temporarily disable actions in Contexts that generate media/derivatives that would conflict with additional media you are adding using the method described here. For example, if you are adding OCR files, in the \"Page Derivatives\" Context listed at /admin/structure/context , disable the \"Extract text from PDF or image\" action prior to running Workbench, and be sure to re-enable it afterwards. If you do not do this, the OCR media added by Workbench will get overwritten with the one that Islandora generates using the \"Extract text from PDF or image\" action.","title":"Ingesting OCR (and other) files with page images"},{"location":"paged_and_compound/#ignoring-files-in-page-directories","text":"Sometimes files such as \"Thumbs.db\" (on Windows) can creep into page directories. You can tell Workbench to ignore specific files within directories by including the paged_content_ignore_files configuration setting in your config file. Note that the default setting is to ignore \"Thumbs.db\" files. If you want to add additional files, or override that default setting, include the paged_content_ignore_files setting followed by a list of filenames, e.g.: paged_content_ignore_files: [\"Thumbs.db\", \"scanning_manifest.txt\"] Note that Workbench converts all filenames in the directories, as well as the filenames listed in the paged_content_ignore_files setting, to lower case before checking to see if they are in this list. For example, if Workbench encounters a filename Scanning_Manifest.TXT , it will match \"scanning_manifest.txt\" in the configuration above.","title":"Ignoring files in page directories"},{"location":"paged_and_compound/#with-pagechild-level-metadata","text":"Using this method, the metadata CSV file contains a row for every item, both parents and children. You should use this method when you are creating books, newspaper issues, or other paged or compound content where each page has its own metadata, or when you are creating compound objects of any Islandora model. The file for each page/child is named explicitly in the page/child's file column rather than being in a subdirectory. To link the pages to the parent, Workbench establishes parent/child relationships between items with a special parent_id CSV column. Values in the parent_id column, which only apply to rows describing pages/children, are the id value of their parent. For this to work, your CSV file must contain a parent_id field plus the standard Islandora fields field_weight , field_member_of , and field_model (the role of these last three fields will be explained below). The id field is required in all CSV files used to create content, so in this case, your CSV needs both an id field and a parent_id field. The following example illustrates how this works.
Here is the raw CSV data: id,parent_id,field_weight,file,title,field_description,field_model,field_member_of 001,,,,Postcard 1,The first postcard,28,197 003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29, 004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29, 002,,,,Postcard 2,The second postcard,28,197 006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29, 007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29, The empty cells make this CSV difficult to read. Here is the same data in a spreadsheet: The data contains rows for two postcards (rows with id values \"001\" and \"002\") plus a back and front for each (the remaining four rows). The parent_id value for items with id values \"003\" and \"004\" is the same as the id value for item \"001\", which will tell Workbench to make both of those items children of item \"001\"; the parent_id value for items with id values \"006\" and \"007\" is the same as the id value for item \"002\", which will tell Workbench to make both of those items children of the item \"002\". We can't populate field_member_of for the child pages in our CSV because we won't have node IDs for the parents until they are created as part of the same batch as the children. In this example, the rows for our postcard objects have empty parent_id , field_weight , and file columns because our postcards are not children of other nodes and don't have their own media. (However, the records for our postcard objects do have a value in field_member_of , which is the node ID of the \"Postcards\" collection that already/hypothetically exists.) Rows for the postcard front and back image objects have a value in their field_weight field, and they have values in their file column because we are creating objects that contain image media. Importantly, they have no value in their field_member_of field because the node ID of the parent isn't known when you create your CSV; instead, Islandora Workbench assigns each child's field_member_of dynamically, just after its parent node is created. Some important things to note: The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. Currently, you need to include the option allow_missing_files: true in your configuration file when using this method to create paged/compound content. See this issue for more information. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/child relationships will still work. The values of the id and parent_id columns do not have to follow any sequential pattern. Islandora Workbench treats them as simple strings and matches them on that basis, without looking for sequential relationships of any kind between the two fields. The CSV records for children items don't need to come immediately after the record for their parent, but they do need to come after that CSV record. ( --check will tell you if it finds any child rows that come before their parent rows.) This is because Workbench creates nodes in the order their records are in the CSV file (top to bottom). As long as the parent node has already been created when a child node is created, the parent/child relationship via the child's field_member_of will be correct. 
See the next paragraph for some suggestions on planning for large ingests of paged or compound items. Currently, you must include values in the children's field_weight column (except when creating a collection and its members at the same time; see below). It may be possible to automatically generate values for this field (see this issue ). Currently, Islandora model values (e.g. \"Paged Content\", \"Page\") are not automatically assigned. You must include the correct \"Islandora Models\" taxonomy term IDs in your field_model column for all parent and child records, as you would for any other Islandora objects you are creating. Like for field_weight , it may be possible to automatically generate values for this field (see this issue ). Since parent items (collections, book-level items, newspaper issue-level items, top-level items in compound items, etc.) need to exist in Drupal before their children can be ingested, you need to plan your \"create\" tasks accordingly. For example: If you want to use a single \"create\" task to ingest all the parents and children at the same time, for each compound item, the parent CSV record must come before the records for the children/pages. If you would rather use multiple \"create\" tasks, you can create all your collections first, then, in subsequent \"create\" tasks, use their respective node IDs in the field_member_of CSV column for their members. If you use a separate \"create\" task to create members of a single collection, you can define the value of field_member_of in a CSV field template . If you are ingesting a large set of books, you can ingest the book-level items first, then use their node IDs in a separate CSV for the pages of all books (each using their parent book node's node ID in their field_member_of column). Or, you could run a separate \"create\" task for each book, and use a CSV field template containing a field_member_of entry containing the book item's node ID. For newspapers, you could create the top-level newspaper first, then use its node ID in a subsequent \"create\" task for both newspaper issues and pages. In this task, the field_member_of column in rows for newspaper issues would contain the newspaper's node ID, but the rows for newspaper pages would have a blank field_member_of and a parent_id using the parent issue's id value.","title":"With page/child-level metadata"},{"location":"paged_and_compound/#using-a-secondary-task","text":"You can configure Islandora Workbench to execute two \"create\" tasks - a primary and a secondary - that will result in all of the objects described in both CSV files being ingested during the same Workbench job. Parent/child relationships between items are created by referencing the row IDs in the primary task's CSV file from the secondary task's CSV file. The benefit of using this method is that each task has its own configuration file, allowing you to create children that have a different Drupal content type than their parents. The primary task's CSV describes the parent objects, and the secondary task's CSV describes the children. The two are linked via references from children CSV's parent_id values to their parent's id values, much the same way as in the \"With page/child-level metadata\" method described above. The difference is that the references span CSV files. The parents and children each have their own CSV input file (and also their own configuration file). 
Each task is a standard Islandora Workbench \"create\" task, joined by one setting in the primary's configuration file, secondary_tasks , as described below. In the following example, the top CSV file (the primary) describes the parents, and the bottom CSV file (the secondary) describes the children: As you can see, values in the parent_id column in the secondary CSV reference values in the id column in the primary CSV: parent_id 001 in the secondary CSV matches id 001 in the primary, parent_id 003 in the secondary matches id 003 in the primary, and so on. You configure secondary tasks by adding the secondary_tasks setting to your primary configuration file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora # This is the setting that links the two configuration files together. secondary_tasks: ['children.yml'] input_csv: parents.csv nodes_only: true In the secondary_tasks setting, you name the configuration file of the secondary task. The secondary task's configuration file (in this example, named \"children.yml\") contains no indication that it's a secondary task: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: kids.csv csv_field_templates: - field_model: http://purl.org/coar/resource_type/c_c513 query_csv_id_to_node_id_map_for_parents: true Note The CSV ID to node ID map is required in secondary create tasks. Workbench will automatically change the query_csv_id_to_node_id_map_for_parents to true , regardless of whether that setting is in your secondary task's config file. Note The nodes_only setting in the above example primary configuration file and the csv_field_templates setting in the secondary configuration file are not relevant to the primary/secondary task functionality; they're included to illustrate that the two configuration files can differ. When you run Workbench, it executes the primary task first, then the secondary task. Workbench keeps track of pairs of id + node IDs created in the primary task, and during the execution of the secondary task, uses these to populate the field_member_of values in the secondary task with the node IDs corresponding to the referenced primary id values. Some things to note about secondary tasks: Only \"create\" tasks can be used as the primary and secondary tasks. When you have a secondary task configured, running --check will validate both tasks' configuration and input data. The secondary CSV must contain parent_id and field_member_of columns. field_member_of must be empty, since it is auto-populated by Workbench using node IDs from the newly created parent objects. If you want to assign an order to the child objects within each parent object, include field_weight with the appropriate values (1, 2, 3, etc., the lower numbers being earlier/higher in sort order). If a row in the secondary task CSV does not have a parent_id that matches an id of a row in the primary CSV, or if there is a matching row in the primary CSV and Workbench failed to create the described node, Workbench will skip creating the child and add an entry to the log indicating it did so. As already stated, each task has its own configuration file, which means that you can specify a content_type value in your secondary configuration file that differs from the content_type of the primary task. You can include more than one secondary task in your configuration. 
For example, secondary_tasks: ['first.yml', 'second.yml'] will execute the primary task, then the \"first.yml\" secondary task, then the \"second.yml\" secondary task in that order. You would use multiple secondary tasks if you wanted to add children of different content types to the parent nodes.","title":"Using a secondary task"},{"location":"paged_and_compound/#specifying-paths-to-the-python-interpreter-and-to-the-workbench-script","text":"When using secondary tasks, there are a couple of situations where you may need to tell Workbench where the python interpreter is located, and where the \"workbench\" script is located. The first is when you use a secondary task within a scheduled job (such as running Workbench via Linux's cron). Depending on how you configure the cron job, you will likely need to tell Workbench what the absolute path to the python interpreter is and what the path to the workbench script is. This is because, unless your cronjob changes into Workbench's working directory, Workbench will be looking in the wrong directory for the secondary task. The two config options you should use are: path_to_python path_to_workbench_script An example of using these settings is: secondary_tasks: ['children.yml'] path_to_python: '/usr/bin/python' path_to_workbench_script: '/home/mark/islandora_workbench/workbench' The second situation is when using a secondary task when running Workbench in Windows and \"python.exe\" is not in the PATH of the user running the scheduled job. Specifying the absolute path to \"python.exe\" will ensure that Workbench can execute the secondary task properly, like this: secondary_tasks: ['children.yml'] path_to_python: 'c:/program files/python39/python.exe' path_to_workbench_script: 'd:/users/mark/islandora_workbench/workbench'","title":"Specifying paths to the python interpreter and to the workbench script"},{"location":"paged_and_compound/#creating-parentchild-relationships-across-workbench-sessions","text":"It is possible to use parent_id values in your CSV that refer to id values from earlier Workbench sessions. In other words, you don't need to create parents and their member/child nodes within the same Workbench job; you can create parents in an earlier job and refer to their id values in later jobs. This is possible because during create tasks, Workbench records each newly created node ID and its corresponding value from the input CSV's id (or configured equivalent) column. It also records any values from the CSV parent_id column, if they exist. This data is stored in a simple SQLite database called the \" CSV ID to node ID map \". Because this database persists across Workbench sessions, you can use id values in your input CSV's parent_id column from previously loaded CSV files. The mapping between the previously loaded parents' id values and the values in your current CSV's parent_id column are stored in the CSV ID to node ID map database. Note It is important to use unique values in your CSV id (or configured equivalent) column, since if duplicate ID values exist in this database, Workbench can't know which corresponding node ID to use. In this case, Workbench will create the child node, but it won't assign a parent to it. --check will inform you if this happens with messages like Warning: Query of ID map for parent ID \"0002\" returned multiple node IDs: (771, 772, 773, 774, 778, 779). , and your Workbench log will also document that there are duplicate IDs. 
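For example, a minimal configuration for a \"create\" task that assigns children to parents created in an earlier Workbench session might look like this (a sketch; the host, credentials, and CSV filename are placeholders): task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: children.csv query_csv_id_to_node_id_map_for_parents: true In this sketch, the parent_id values in \"children.csv\" refer to id values recorded in the CSV ID to node ID map during earlier sessions.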
Warning By default, Workbench only checks the CSV ID to node ID map for parent IDs created in the same session as the children. If you want to assign children to parents created in previous Workbench sessions, you need to set the query_csv_id_to_node_id_map_for_parents configuration setting to true .","title":"Creating parent/child relationships across Workbench sessions"},{"location":"paged_and_compound/#creating-collections-and-members-together","text":"Using a variation of the \"With page/child-level metadata\" approach, you can create a collection node and assign members to it at the same time (i.e., in a single Workbench job). Here is a simple example CSV which shows the references from the members' parent_id field to the collections' id field: id,parent_id,file,title,field_model,field_member_of,field_weight 1,,,A collection of animal photos,24,, 2,1,cat.jpg,Picture of a cat,25,, 3,1,dog.jpg,Picture of a dog,25,, 3,1,horse.jpg,Picture of a horse,25,, The use of the parent_id and field_member_of fields is the same here as when creating paged or compound children. However, unlike with paged or compound objects, in this case we leave the values in field_weight empty, since Islandora collections don't use field_weight to determine order of members. Collection Views are sorted using other fields. Warning Creating collection nodes and member nodes using this method assumes that collection nodes and member nodes have the same Drupal content type. If your collection objects have a Drupal content type that differs from their members' content type, you need to use the \"Using a secondary task\" method to ingest collections and members in the same Workbench job.","title":"Creating collections and members together"},{"location":"paged_and_compound/#summary","text":"The following table summarizes the different ways Workbench can be used to create parent/child relationships between nodes: Method Relationships created by field_weight Advantage Subdirectories Directory structure Do not include column in CSV; autopopulated. Useful for creating paged content where pages don't have their own metadata. Parent/child-level metadata in same CSV References from child's parent_id to parent's id in same CSV data Column required; values required in child rows Allows including parent and child metadata in same CSV. Secondary task References from parent_id in child CSV file to id in parent CSV file Column and values recommended in secondary (child) CSV data Primary and secondary tasks have their own configuration and CSV files, which allows children to have a Drupal content type that differs from their parents' content type. Allows creation of parents and children in same Workbench job. Collections and members together References from child (member) parent_id fields to parent (collection) id fields in same CSV data Column required in CSV but must be empty (collections do not use weight to determine sort order) Allows creation of collection and members in same Islandora Workbench job.","title":"Summary"},{"location":"preparing_data/","text":"Islandora Workbench allows you to arrange your input data in a variety of ways. The two basic sets of data you need to prepare (depending on what task you are performing) are: a CSV file, containing data that will populate node fields (or do other things depending on what task you are performing), described here files that will be used as Drupal media. The options for arranging your data are detailed below. 
Note Some of Workbench's functionality depends on a specific directory structure not described here, for example \" Creating nodes from files \" and \" Creating paged, compound, and collection content .\" However, the information on this page applies to the vast majority of Workbench usage. Using an input directory In this configuration, you define an input directory (identified by the input_dir config option) that contains a CSV file with field content (identified by the input_csv config option) and any accompanying media files you want to add to the newly created nodes: input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv Here is the same input directory, with some explanation of how the files relate to each other: input_data/ <-- This is the directory named in the \"input_dir\" configuration setting. \u251c\u2500\u2500 image1.JPG <-- This and the other JPEG files are named in the \"file\" column in the CSV file. \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv <-- This is the CSV file named in the \"input_csv\" configuration setting. The names of the image/PDF/video/etc. files are included in the file column of the CSV file. Files with any extension that you can upload to Drupal are allowed. Islandora Workbench reads the CSV file and iterates through it, performing the current task for each record. In this configuration, files other than the CSV and your media files are allowed in this directory (although for some configurations, your input directory should not contain any files that are not going to be ingested). This is Islandora Workbench's default configuration. If you do not specify an input_dir or an input_csv , as illustrated in following minimal configuration file, Workbench will assume your files are in a directory named \"input_data\" in the same directory as the Workbench script, and that within that directory, your CSV file is named \"metadata.csv\": task: create host: \"http://localhost:8000\" username: admin password: islandora Workbench ignores the other files in the input directory, and only looks for files in that directory if the filename alone (no directory component) is in file column. workbench <-- The \"workbench\" script. \u251c\u2500\u2500 input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv For example, in this configuration, in the following \"metadata.csv\" file, Workbench looks for \"image1.JPG\", \"image-27626.jpg\", and \"someimage.jpg\" at \"input_data/image1.JPG\", \"input_data/image1.JPG\", and \"input_data/someimage.jpg\" respectively, relative to the location of the \"workbench\" script: id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog Workbench complete ignores \"pic_saturday.jpg\" and \"IMG_2958.JPG\" because they are not named in any of the file columns in the \"metadata.csv\" file. If the configuration file specified an input_dir value, or identified a CSV file in input_csv , Workbench would use those values: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: myfiles input_csv: mymetadata.csv workbench <-- The \"workbench\" script. 
\u251c\u2500\u2500 myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv The value of input_dir doesn't need to be relative to the workbench script, it can be absolute: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: /tmp/myfiles \u251c\u2500\u2500 /tmp/myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog In this case, even though only the CSV file entries contain only filenames and no path information, Workbench looks for the image files at \"/tmp/myfiles/image1.JPG\", \"/tmp/myfiles/image1.JPG\", and \"/tmp/myfiles/someimage.jpg\". Using absolute file paths We saw in the previous section that the path specified in your configuration file's input_dir configuration option need not be relative to the location of the workbench script, it can be absolute. That is also true for both the configuration value of input_csv and for the values in your input CSV's file column. You can also mix absolute and relative filenames in the same CSV file, but all relative filenames are considered to be in the directory named in input_dir . An example configuration file for this is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: media_files input_csv: /tmp/input.csv And within the file column of the CSV, values like: id,file,title 001,/tmp/mydata/file01.png,A very good file 0002,/home/me/Documents/files/cat.jpg,My cat 003,dog.png,My dog Notice that the file values in the first two rows are absolute, but the file value in the last row is relative. Workbench will look for that file at \"media_files/dog.png\". Note In general, Workbench doesn't care if any file path used in configuration or CSV data is relative or absolute, but if it's relative, it's relative to the directory where the workbench script lives. Note Most of the example paths used in this documentation are Linux paths. In general, paths on Mac computers look and work the same way. On Windows, relative paths and absolute paths like C:\\Users\\Mark\\Downloads\\myfile.pdf and UNC paths like \\\\some.windows.file.share.org\\share_name\\files\\myfile.png work fine. These paths also work in Workbench configuration files in settings such as input_dir . Using URLs as file paths In the file column, you can also use URLs to files, like this: id,file,title 001,http://www.mysite.com/file01.png,A very good file 0002,https://mycatssite.org/images/cat.jpg,My cat 003,dog.png,My dog More information is available on using URLs in your file column. 
Using a local or remote .zip archive as input data If you register the location of a local .zip archive or a remote (available over http(s)) zip archive in your configuration file, Workbench will unzip the contents of the archive into the directory defined in your input_dir setting: input_data_zip_archives: - /tmp/mytest.zip - https://myremote.host.org/zips/another_zip.zip The archive is unzipped with its internal directory structure intact; for example, if your zip has the following structure: bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 In this case, your CSV file column values should include the intermediate directory's path, e.g. bpnichol/003 Partial Side A.mp3 . You can also include an input CSV in your zip archive if you want: bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Within input_data , the unzipped content would look like: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Alternatively, all of your files can also be at the root of the zip archive. In that case, they would be unzipped into the directory named in your input_dir setting. A zip archive with this structure: \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 would be unzipped into: input_data/ \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 If you are zipping up directories to create paged content , all of the directories containing page files should be at the root of your zip archive, with no intermediate parent directory: \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif This is because if you include paged_content_from_directories: true in your configuration file, Workbench looks within your input_dir for a directory named after the paged content item's id value, without an intermediate directory. 
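For example, a configuration that loads paged content from a zip archive might look like this (a sketch; the archive location and CSV filename are placeholders): task: create host: \"http://localhost:8000\" username: admin password: islandora input_data_zip_archives: - /tmp/book_batch.zip input_csv: books.csv paged_content_from_directories: true With settings like these, Workbench would extract the archive into the default \"input_data\" directory and then look there for page directories named after the id values in \"books.csv\".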
A few things to note if you are using a zip archive as your input data: Remote URLs to zip archives do not need to end in \".zip\", but the remote files must be directly accessible for downloading without any authentication. You can register a single or multiple zip file in your input_data_zip_archives setting. Workbench doesn't check for the existence of files at extracted destination paths, so if a file with the same extracted path exists in more than one archive (or is already at a path the same as that of a file from an archive), the file from the last archive in the input_data_zip_archives list will overwrite existing files at the same path. You can include in your zip archive(s) any files that you want to put in the directory indicated in your input_dir config setting, including files named in your CSV file column, files named in columns defined by your additional_files configuration, or the CSV or Excel file named in your input_csv setting (as illustrated in the Rungh example above). Workbench will automatically delete the archive file after extracting it unless you add delete_zip_archive_after_extraction: false to your config file. Using a Google Sheet as the input CSV file With this option, your configuration's input_csv option contains the URL to a publicly readable Google Sheet. To do this, simply provide the URL to the Google spreadsheet in your configuration file's input_csv option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit google_sheets_gid: 430969348 That's all you need to do. It's most reliable to have your Google Sheets URLs end in \"/edit\" and explicitly add the \"gid\" value using the google_sheets_gid config setting. Every time Workbench runs, it fetches the CSV content of the spreadsheet and saves it to a local file in the directory named in your input_directory configuration option, and from that point onward in its execution, uses the locally saved version of the spreadsheet. The default filename for this CSV file is google_sheet.csv but you can change it if you need to by including the google_sheets_csv_filename option in your configuration file, e.g., google_sheets_csv_filename: my_filename.csv . Islandora Workbench fetches a new copy of the CSV data every time it runs (even with the --check option), so if you make changes to the contents of that local file, the changes will be overwritten with the data from the Google spreadsheet the next time you run Workbench. If you don't want to overwrite your local copy of the data, rename the local CSV file manually before running Workbench, and update the input_csv option in your configuration file to use the name of the CSV file you copied. Note Using a Google Sheet is currently the fastest and most convenient way of managing CSV data for use with Islandora Workbench. Since Sheets saves changes in realtime, and since Workbench fetches a fresh copy of the CSV data every time you run it, it's easy to iterate by making changes to your data in Sheets, running Workbench (don't forget to use --check first to identify any problems!), seeing the effects of your changes in the nodes you've just created, rolling back your nodes , tweaking your data in Sheets, and starting a new cycle. If you are focused on refining your CSV metadata, you can save time by skipping the creation of media by including nodes_only: true in your configuration file. 
Some things to note about using Google Sheets: You can use a Google Sheet in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Google spreadsheet. The URL in the configuration file needs single or double quotes around it, like any other value that contains a colon. You can use either the URL you copy from your browser when you are viewing the spreadsheet (which ends in \"/edit#gid=0\" or something similar), or the \"sharing\" URL you copy into your clipboard from within the \"Share\" dialog box (which ends in \"edit?usp=sharing\"). Either is OK, but, as noted above, it's best to have your URL end in \"/edit\" and explicitly state the gid using the google_sheets_gid config setting. The Google spreadsheet must be publicly readable, e.g. with \"Anyone on the Internet with this link can view\" permission. Spreadsheets work best for descriptive metadata if all cells are formatted as \"Plain text\". To do this in Google Sheets, select all cells, then choose the menu items Format > Number > Plain text before adding any content to the cells . If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. However, you can also use absolute file paths and URLs in the file column, as described above. Selecting a specific worksheet within a Google Sheet Worksheets within a given Google Sheet are identified by a \"gid\". If a Sheet has only a single worksheet, its \"gid\" is \"0\" (zero): https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=0 If you add additional worksheets, they get a randomly generated \"gid\", such as \"1094504353\". You can see this \"gid\" in the URL when you are in the worksheet: https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=1094504353 By default, Workbench extracts CSV data from the worksheet with a \"gid\" of \"0\". If you want Workbench to extract the CSV data from a specific worksheet that is not the one with a \"gid\" of \"0\", specify the \"gid\" in your configuration file using the google_sheets_gid option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit?usp=sharing' google_sheets_gid: 1094504353 Using an Excel file as the input CSV file With this option, your configuration's input_csv option contains the filename of an Excel 2010 (or higher) file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx Islandora Workbench extracts the content of this file as CSV data, and uses that extracted data as its input the same way it would use a raw CSV file. Note that: You can use an Excel file in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Excel spreadsheet. Spreadsheets work best for descriptive metadata if all cells are formatted as \"text\". To do this, in Excel, select all cells, alt-click on the selected area, then choose the \"Format Cells\" context menu item. In the \"Number\" tab, choose \"Text\", then click on the \"OK\" button. If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. 
However, you can also use absolute file paths and URLs in the file column, as described above. Selecting a specific worksheet within an Excel file The worksheet that the CSV data is taken from is the one named \"Sheet1\", unless you specify another worksheet using the excel_worksheet configuration option. As with Google Sheets, you can tell Workbench to use a specific worksheet in an Excel file. Here is an example of a config file using that setting: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx excel_worksheet: Second sheet How Workbench cleans your input data Regardless of whether your input data is raw CSV, a Google Sheet, or Excel, Workbench applies a small number of cleansing operations on it. These are: replaces smart/curly quotes (both double and single) with regular quotes replaces multiple whitespaces within strings with a single space removes leading and trailing spaces (including newlines). removes leading and trailing subdelimter characters (i.e., the value of the subdelimiter config setting, default of | ). If you do not want Workbench to do one or more of these cleanups, include the clean_csv_values_skip setting in your configuration, specifying in a list one or more of the following: smart_quotes inside_spaces outside_spaces outside_subdelimiters An example of using this configuration setting is: clean_csv_values_skip: [\"smart_quotes\", \"inside_spaces\"] When Workbench skips invalid CSV data Running --check will tell you when any of the data in your CSV file is invalid, or in other words, does not conform to its target Drupal field's configuration and is likely to cause the creation/updating of content to fail. Currently, for the following types of fields: text geolocation link Workbench will validate CSV values and skip values that fail its validation tests. Work is underway to complete this feature, including skipping of invalid entity reference and typed relation fields. Blank or missing \"file\" values By default, if the file value for a row is empty, Workbench's --check option will show an error. But, in some cases you may want to create nodes but not add any media. If you add allow_missing_files: true to your config file for \"create\" tasks, you can leave the file column in your CSV empty. Creating nodes but not media If you want to only create nodes and not media, you can do so by including nodes_only: true in your configuration file. More detail is available . Encoding of text files All text files used as input to Islandora Workbench, including CSV data, files that are going to be use to create \"Extracted Text\" media such as OCR/hOCR files, and media track files, must use a standard UTF-8 encoding. This is generally not a problem other than if you have created any of these files on Microsoft Windows. Even on Windows, UTF-8 files can contain a Microsoft-specifc feature called a \"Byte-order mark\", or \"BOM\". 
This type of UTF-8 file is not valid; only standard UTF-8 files are.","title":"Preparing your data"},{"location":"preparing_data/#using-an-input-directory","text":"In this configuration, you define an input directory (identified by the input_dir config option) that contains a CSV file with field content (identified by the input_csv config option) and any accompanying media files you want to add to the newly created nodes: input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv Here is the same input directory, with some explanation of how the files relate to each other: input_data/ <-- This is the directory named in the \"input_dir\" configuration setting. \u251c\u2500\u2500 image1.JPG <-- This and the other JPEG files are named in the \"file\" column in the CSV file. \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv <-- This is the CSV file named in the \"input_csv\" configuration setting. The names of the image/PDF/video/etc. files are included in the file column of the CSV file. Files with any extension that you can upload to Drupal are allowed. Islandora Workbench reads the CSV file and iterates through it, performing the current task for each record. In this configuration, files other than the CSV and your media files are allowed in this directory (although for some configurations, your input directory should not contain any files that are not going to be ingested). This is Islandora Workbench's default configuration. If you do not specify an input_dir or an input_csv , as illustrated in following minimal configuration file, Workbench will assume your files are in a directory named \"input_data\" in the same directory as the Workbench script, and that within that directory, your CSV file is named \"metadata.csv\": task: create host: \"http://localhost:8000\" username: admin password: islandora Workbench ignores the other files in the input directory, and only looks for files in that directory if the filename alone (no directory component) is in file column. workbench <-- The \"workbench\" script. \u251c\u2500\u2500 input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv For example, in this configuration, in the following \"metadata.csv\" file, Workbench looks for \"image1.JPG\", \"image-27626.jpg\", and \"someimage.jpg\" at \"input_data/image1.JPG\", \"input_data/image1.JPG\", and \"input_data/someimage.jpg\" respectively, relative to the location of the \"workbench\" script: id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog Workbench complete ignores \"pic_saturday.jpg\" and \"IMG_2958.JPG\" because they are not named in any of the file columns in the \"metadata.csv\" file. If the configuration file specified an input_dir value, or identified a CSV file in input_csv , Workbench would use those values: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: myfiles input_csv: mymetadata.csv workbench <-- The \"workbench\" script. 
\u251c\u2500\u2500 myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv The value of input_dir doesn't need to be relative to the workbench script, it can be absolute: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: /tmp/myfiles \u251c\u2500\u2500 /tmp/myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog In this case, even though only the CSV file entries contain only filenames and no path information, Workbench looks for the image files at \"/tmp/myfiles/image1.JPG\", \"/tmp/myfiles/image1.JPG\", and \"/tmp/myfiles/someimage.jpg\".","title":"Using an input directory"},{"location":"preparing_data/#using-absolute-file-paths","text":"We saw in the previous section that the path specified in your configuration file's input_dir configuration option need not be relative to the location of the workbench script, it can be absolute. That is also true for both the configuration value of input_csv and for the values in your input CSV's file column. You can also mix absolute and relative filenames in the same CSV file, but all relative filenames are considered to be in the directory named in input_dir . An example configuration file for this is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: media_files input_csv: /tmp/input.csv And within the file column of the CSV, values like: id,file,title 001,/tmp/mydata/file01.png,A very good file 0002,/home/me/Documents/files/cat.jpg,My cat 003,dog.png,My dog Notice that the file values in the first two rows are absolute, but the file value in the last row is relative. Workbench will look for that file at \"media_files/dog.png\". Note In general, Workbench doesn't care if any file path used in configuration or CSV data is relative or absolute, but if it's relative, it's relative to the directory where the workbench script lives. Note Most of the example paths used in this documentation are Linux paths. In general, paths on Mac computers look and work the same way. On Windows, relative paths and absolute paths like C:\\Users\\Mark\\Downloads\\myfile.pdf and UNC paths like \\\\some.windows.file.share.org\\share_name\\files\\myfile.png work fine. 
These paths also work in Workbench configuration files in settings such as input_dir .","title":"Using absolute file paths"},{"location":"preparing_data/#using-urls-as-file-paths","text":"In the file column, you can also use URLs to files, like this: id,file,title 001,http://www.mysite.com/file01.png,A very good file 0002,https://mycatssite.org/images/cat.jpg,My cat 003,dog.png,My dog More information is available on using URLs in your file column.","title":"Using URLs as file paths"},{"location":"preparing_data/#using-a-local-or-remote-zip-archive-as-input-data","text":"If you register the location of a local .zip archive or a remote (available over http(s)) zip archive in your configuration file, Workbench will unzip the contents of the archive into the directory defined in your input_dir setting: input_data_zip_archives: - /tmp/mytest.zip - https://myremote.host.org/zips/another_zip.zip The archive is unzipped with its internal directory structure intact; for example, if your zip has the following structure: bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 In this case, your CSV file column values should include the intermediate directory's path, e.g. bpnichol/003 Partial Side A.mp3 . You can also include an input CSV in your zip archive if you want: bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Within input_data , the unzipped content would look like: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Alternatively, all of your files can also be at the root of the zip archive. In that case, they would be unzipped into the directory named in your input_dir setting. 
A zip archive with this structure: \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 would be unzipped into: input_data/ \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 If you are zipping up directories to create paged content , all of the directories containing page files should be at the root of your zip archive, with no intermediate parent directory: \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif This is because if you include paged_content_from_directories: true in your configuration file, Workbench looks within your input_dir for a directory named after the paged content item's id value, without an intermediate directory. A few things to note if you are using a zip archive as your input data: Remote URLs to zip archives do not need to end in \".zip\", but the remote files must be directly accessible for downloading without any authentication. You can register a single or multiple zip file in your input_data_zip_archives setting. Workbench doesn't check for the existence of files at extracted destination paths, so if a file with the same extracted path exists in more than one archive (or is already at a path the same as that of a file from an archive), the file from the last archive in the input_data_zip_archives list will overwrite existing files at the same path. You can include in your zip archive(s) any files that you want to put in the directory indicated in your input_dir config setting, including files named in your CSV file column, files named in columns defined by your additional_files configuration, or the CSV or Excel file named in your input_csv setting (as illustrated in the Rungh example above). Workbench will automatically delete the archive file after extracting it unless you add delete_zip_archive_after_extraction: false to your config file.","title":"Using a local or remote .zip archive as input data"},{"location":"preparing_data/#using-a-google-sheet-as-the-input-csv-file","text":"With this option, your configuration's input_csv option contains the URL to a publicly readable Google Sheet. 
To do this, simply provide the URL to the Google spreadsheet in your configuration file's input_csv option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit google_sheets_gid: 430969348 That's all you need to do. It's most reliable to have your Google Sheets URLs end in \"/edit\" and explicitly add the \"gid\" value using the google_sheets_gid config setting. Every time Workbench runs, it fetches the CSV content of the spreadsheet and saves it to a local file in the directory named in your input_directory configuration option, and from that point onward in its execution, uses the locally saved version of the spreadsheet. The default filename for this CSV file is google_sheet.csv but you can change it if you need to by including the google_sheets_csv_filename option in your configuration file, e.g., google_sheets_csv_filename: my_filename.csv . Islandora Workbench fetches a new copy of the CSV data every time it runs (even with the --check option), so if you make changes to the contents of that local file, the changes will be overwritten with the data from the Google spreadsheet the next time you run Workbench. If you don't want to overwrite your local copy of the data, rename the local CSV file manually before running Workbench, and update the input_csv option in your configuration file to use the name of the CSV file you copied. Note Using a Google Sheet is currently the fastest and most convenient way of managing CSV data for use with Islandora Workbench. Since Sheets saves changes in realtime, and since Workbench fetches a fresh copy of the CSV data every time you run it, it's easy to iterate by making changes to your data in Sheets, running Workbench (don't forget to use --check first to identify any problems!), seeing the effects of your changes in the nodes you've just created, rolling back your nodes , tweaking your data in Sheets, and starting a new cycle. If you are focused on refining your CSV metadata, you can save time by skipping the creation of media by including nodes_only: true in your configuration file. Some things to note about using Google Sheets: You can use a Google Sheet in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Google spreadsheet. The URL in the configuration file needs single or double quotes around it, like any other value that contains a colon. You can use either the URL you copy from your browser when you are viewing the spreadsheet (which ends in \"/edit#gid=0\" or something similar), or the \"sharing\" URL you copy into your clipboard from within the \"Share\" dialog box (which ends in \"edit?usp=sharing\"). Either is OK, but, as noted above, it's best to have your URL end in \"/edit\" and explicitly state the gid using the google_sheets_gid config setting. The Google spreadsheet must be publicly readable, e.g. with \"Anyone on the Internet with this link can view\" permission. Spreadsheets work best for descriptive metadata if all cells are formatted as \"Plain text\". To do this in Google Sheets, select all cells, then choose the menu items Format > Number > Plain text before adding any content to the cells . If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. 
However, you can also use absolute file paths and URLs in the file column, as described above.","title":"Using a Google Sheet as the input CSV file"},{"location":"preparing_data/#selecting-a-specific-worksheet-within-a-google-sheet","text":"Worksheets within a given Google Sheet are identified by a \"gid\". If a Sheet has only a single worksheet, its \"gid\" is \"0\" (zero): https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=0 If you add additional worksheets, they get a randomly generated \"gid\", such as \"1094504353\". You can see this \"gid\" in the URL when you are in the worksheet: https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=1094504353 By default, Workbench extracts CSV data from the worksheet with a \"gid\" of \"0\". If you want Workbench to extract the CSV data from a specific worksheet that is not the one with a \"gid\" of \"0\", specify the \"gid\" in your configuration file using the google_sheets_gid option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit?usp=sharing' google_sheets_gid: 1094504353","title":"Selecting a specific worksheet within a Google Sheet"},{"location":"preparing_data/#using-an-excel-file-as-the-input-csv-file","text":"With this option, your configuration's input_csv option contains the filename of an Excel 2010 (or higher) file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx Islandora Workbench extracts the content of this file as CSV data, and uses that extracted data as its input the same way it would use a raw CSV file. Note that: You can use an Excel file in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Excel spreadsheet. Spreadsheets work best for descriptive metadata if all cells are formatted as \"text\". To do this, in Excel, select all cells, alt-click on the selected area, then choose the \"Format Cells\" context menu item. In the \"Number\" tab, choose \"Text\", then click on the \"OK\" button. If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. However, you can also use absolute file paths and URLs in the file column, as described above.","title":"Using an Excel file as the input CSV file"},{"location":"preparing_data/#selecting-a-specific-worksheet-within-an-excel-file","text":"The worksheet that the CSV data is taken from is the one named \"Sheet1\", unless you specify another worksheet using the excel_worksheet configuration option. As with Google Sheets, you can tell Workbench to use a specific worksheet in an Excel file. Here is an example of a config file using that setting: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx excel_worksheet: Second sheet","title":"Selecting a specific worksheet within an Excel file"},{"location":"preparing_data/#how-workbench-cleans-your-input-data","text":"Regardless of whether your input data is raw CSV, a Google Sheet, or Excel, Workbench applies a small number of cleansing operations on it. These are: replaces smart/curly quotes (both double and single) with regular quotes replaces multiple whitespaces within strings with a single space removes leading and trailing spaces (including newlines). 
removes leading and trailing subdelimiter characters (i.e., the value of the subdelimiter config setting, default of | ). If you do not want Workbench to do one or more of these cleanups, include the clean_csv_values_skip setting in your configuration, specifying in a list one or more of the following: smart_quotes inside_spaces outside_spaces outside_subdelimiters An example of using this configuration setting is: clean_csv_values_skip: [\"smart_quotes\", \"inside_spaces\"]","title":"How Workbench cleans your input data"},{"location":"preparing_data/#when-workbench-skips-invalid-csv-data","text":"Running --check will tell you when any of the data in your CSV file is invalid, or in other words, does not conform to its target Drupal field's configuration and is likely to cause the creation/updating of content to fail. Currently, for the following types of fields: text geolocation link Workbench will validate CSV values and skip values that fail its validation tests. Work is underway to complete this feature, including skipping of invalid entity reference and typed relation fields.","title":"When Workbench skips invalid CSV data"},{"location":"preparing_data/#blank-or-missing-file-values","text":"By default, if the file value for a row is empty, Workbench's --check option will show an error. But, in some cases you may want to create nodes but not add any media. If you add allow_missing_files: true to your config file for \"create\" tasks, you can leave the file column in your CSV empty.","title":"Blank or missing \"file\" values"},{"location":"preparing_data/#creating-nodes-but-not-media","text":"If you want to only create nodes and not media, you can do so by including nodes_only: true in your configuration file. More detail is available .","title":"Creating nodes but not media"},{"location":"preparing_data/#encoding-of-text-files","text":"All text files used as input to Islandora Workbench, including CSV data, files that are going to be used to create \"Extracted Text\" media such as OCR/hOCR files, and media track files, must use a standard UTF-8 encoding. This is generally not a problem unless you have created any of these files on Microsoft Windows. Even on Windows, UTF-8 files can contain a Microsoft-specific feature called a \"Byte-order mark\", or \"BOM\". This type of UTF-8 file is not valid; only standard UTF-8 files are.","title":"Encoding of text files"},{"location":"prompts/","text":"You can define prompts to the user by adding settings to your configuration file. There are two types: 1) a prompt to ask the user whether they have run --check , and 2) configurable prompts that allow you to define your own \"y/n\" questions to the user. The \"Have you run --check? (y/n)\" prompt simply displays that question to the end user and asks for a \"y\" or \"n\" response. \"y\" resumes normal operation, \"n\" (or any other response) causes Workbench to exit. To show the user this prompt, add the following to your config file: remind_user_to_run_check: true If this configuration setting is present, the user will be prompted to answer \"Have you run --check? (y/n)\" when Workbench is run without the --check argument. (This prompt is separate from the next type of prompt because it contains special logic such that it is never presented to the user in --check mode.) The second type of prompt, the configurable prompts, allows the creation of custom \"y/n\" questions that the user must answer \"y\" to in order to proceed.
These prompts are shown to the user regardless of whether they are running Workbench with the --check argument. A situation where you may want to ask the user a question of this sort is when creating media that are normally created by Islandora as derivatives. In that case, it is useful to remind the user that they need to temporarily disable the Contexts that generate the derivatives. These \"y/n\" questions you want to prompt users with are listed within the user_prompts config setting like this: user_prompts: - Have you inspected your Workbench log to look for issues you can fix? (y/n) - If you are adding \"additional files\", have you temporarily disabled the Contexts that would generate corresponding derivatives? (y/n) Each prompt is shown to the user in the order they are listed, each asking for a \"y/n\" response. If the user responds by entering \"y\", the next prompt is shown. If they respond with \"n\" or any other key, Workbench logs the prompt and response, and exits. Note that you must include the \"(y/n)\" text in your prompts; Workbench doesn't add it automatically. A couple of notes: The user must hit Enter after entering \"y\" or \"n\". You can include both the remind_user_to_run_check setting and the user_prompts settings in your config file. If you include both, \"Have you run --check? (y/n)\" is presented to the user first, then the customizable prompts. If the user responds \"n\" to any prompt (or any response other than \"y\"), the prompt along with their response is logged before Workbench exits. Prompts can be overridden with an implied \"y\" answer by including the --skip_user_prompts argument when running Workbench. This is useful if your config file includes either the remind_user_to_run_check or user_prompts settings during testing or scripted use of Workbench.","title":"Prompting the user"},{"location":"quick_delete/","text":"You can delete individual nodes and media by using the --quick_delete_node and --quick_delete_media command-line options, respectively. In both cases, you need to provide the full URL (or URL alias) of the node or media you are deleting. For example: ./workbench --config anyconfig.yml --quick_delete_node http://localhost:8000/node/393 ./workbench --config anyconfig.yml --quick_delete_media http://localhost:8000/media/552/edit In the case of the --quick_delete_node option, all associated media and files are also deleted (but see the exception below); in the case of the --quick_delete_media option, all associated files are deleted. You can use any valid Workbench configuration file regardless of the task setting, since the only settings the quick delete tool requires are username and password . All other configuration settings are ignored with the exception of: delete_media_with_nodes . If you do not want to delete a node's media when using the --quick_delete_node option, your configuration file should include delete_media_with_nodes: false . standalone_media_url . If your Drupal instance has the \"Standalone media URL\" option at /admin/config/media/media-settings unchecked (the default), you will need to include /edit at the end of media URLs. Workbench will tell you if you include or exclude the /edit incorrectly. If your Drupal has this option checked, you will need to include standalone_media_url: true in your configuration file.
In this case, you should not include the /edit at the end of the media URL.","title":"Quick delete"},{"location":"redirects/","text":"Islandora Workbench enables you to create redirects managed by the Redirect contrib module. One of the most common uses for redirects is to retain old URLs that have been replaced by newer ones. Within the Islandora context, this means being able to redirect from Islandora 7.x islandora/object/[PID] style URLs to their new Islandora 2 replacements such as node/[NODEID] . Using redirects, you can ensure that the old URLs continue to lead the user to the expected content. In order to use Workbench for this, you must install the Redirect module and configure the \"Redirect\" REST endpoint at /admin/config/services/rest so that it has the following settings: A sample configuration file for a create_redirects task looks like this: task: create_redirects host: https://islandora.traefik.me username: admin password: password input_csv: my_redirects.csv # This setting is explained below. # redirect_status_code: 302 The input CSV contains only two columns, redirect_source and redirect_target . Each value in the redirect_source column is a relative (to the hostname running Drupal) URL path that, when visited, will automatically redirect the user to the relative (to the hostname running Drupal) URL path (or external absolute URL) named in the redirect_target column. Here is a brief example CSV: redirect_source,redirect_target contact_us,how_can_we_help islandora/object/alpine:482,node/1089 islandora/object/awesomeimage:collection,https://galleries.sfu.ca node/16729,node/3536 Assuming that the hostname of the Drupal instance running the Redirects module is https://mydrupal.org , when these redirects are created, the following will happen: When a user visits https://mydrupal.org/contact_us , they will automatically be redirected to https://mydrupal.org/how_can_we_help . When a user visits https://mydrupal.org/islandora/object/alpine:482 , they will be automatically redirected to https://mydrupal.org/node/1089 . When a user visits https://mydrupal.org/islandora/object/awesomeimage:collection , they will be automatically redirected to the external (to https://mydrupal.org ) URL https://galleries.sfu.ca . When a user visits https://mydrupal.org/node/16729 , they will be automatically redirected to https://mydrupal.org/node/3536 . A few things to note about creating redirects using Islandora Workbench: The values in the redirect_source column are always relative to the root of the Drupal hostname URL. Drupal expects them to not begin with a leading / , but if you include it, Workbench will trim it off automatically. The redirect_source values do not need to represent existing nodes. For example, a value like islandora/object/awesomeimage:collection has no underlying node; it's just a path that Drupal listens at, and when requested, redirects the user to the corresponding target URL. Values in redirect_source that do have underlying nodes will redirect users to the corresponding redirect_target but don't make a lot of sense, since the user will always be automatically redirected and never get to see the underlying node at the source URL. However, you may want to use a redirect_source value that has a local node if you don't want users to see that node temporarily for some reason (after that reason is no longer valid, you would remove the redirect). Currently, Islandora Workbench can only create redirects; it can't update or delete them.
If you have a need for either of those funtionalities, open a Github issue. HTTP redirects work by issuing a special response code to the browser. In most cases, and by default in Workbench, this is 301 . However, you can change this to another redirect HTTP status code by including the redirect_status_code setting in your config file specifying the code you want Drupal to send to the user's browser, e.g., redirect_status_code: 302 .","title":"Creating redirects"},{"location":"reducing_load/","text":"Workbench can put substantial stress on Drupal. In some cases, this stress can lead to instability and errors. Note The options described below modify Workbench's interaction with Drupal. They do not have a direct impact on the load experienced by the microservices Islandora uses to do things like indexing node metadata in its triplestore, extracting full text for indexing, and generating thumbnails. However, since Drupal is the \"controller\" for these microservices, reducing the number of Workbench requests to Drupal will also indirectly reduce the load experienced by Islandora's microservices. Workbench provides two ways to reduce this stress: pausing between Workbench's requests to Drupal, and caching Workbench's requests to Drupal. The first way to reduce stress on Drupal is by telling Workbench to pause between each request it makes. There are two types of pausing, 1) basic and 2) adaptive. Both types of pausing improve stability and reliability by slowing down Workbench's overall execution time. Pausing Basic pause The pause configuration setting tells Workbench to temporarily halt execution before every request, thereby spreading load caused by the requests over a longer period of time. To enable pause , include the setting in your configuration file, indicating the number of seconds to wait between requests: pause: 2 Using pause will help decrease load-induced errors, but it is inefficient because it causes Workbench to pause between all requests, even ones that are not putting stress on Drupal. A useful strategy for refining Workbench's load-reduction capabilities is to try pause first, and if it reduces errors, then disable pause and try adaptive_pause instead. pause will confirm that Workbench is adding load to Drupal, but adaptive_pause will tell Workbench to pause only when it detects its requests are putting load on Drupal. Note pause and adaptive_pause are mutually exclusive. If you include one in your configuration files, you should not include the other. Adaptive pause Adaptive pause only halts execution between requests if Workbench detects that Drupal is slowing down. It does this by comparing Drupal's response time for the most recent request to the average response time of the 20 previous requests made by Islandora Workbench. If the response time for the most recent request reaches a specific threshold, Workbench's adaptive pause will kick in and temporarily halt execution to allow Drupal to catch up. The number of previous requests used to determine the average response time, 20, cannot be changed with a configuration setting. The threshold that needs to be met is configured using the adaptive_pause_threshold setting. This setting's default value is 2, which means that the adaptive pause will kick in if the response time for the most recent request Workbench makes to Drupal is 2 times (double) the average of Workbench's last 20 requests. 
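To make the threshold arithmetic concrete, here is a hypothetical example using made-up response times: if Workbench's last 20 requests to Drupal took an average of 0.8 seconds and the most recent request took 1.7 seconds, the most recent response time is 1.7 / 0.8 ≈ 2.1 times the average, which meets the default adaptive_pause_threshold of 2, so Workbench would pause (for the number of seconds set in adaptive_pause , described next) before its next request. If the most recent request had instead taken 1.5 seconds (1.5 / 0.8 ≈ 1.9 times the average), no pause would occur.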
The amount of time that Workbench will pause is determined by the value of adaptive_pause , which, like the value for pause , is a number of seconds (e.g., adaptive_pause: 3 ). You enable adaptive pausing by adding the adaptive_pause setting to your configuration file. Here are a couple of examples. Keep in mind that adaptive_pause_threshold has a default value (2), but adaptive_pause does not have a default value. The first example enables adaptive_pause using the default value for adaptive_pause_threshold , telling it to pause for 3 seconds between requests if Drupal's response time to the last request is 2 times slower ( adaptive_pause_threshold 's default value) than the average of the last 20 requests: adaptive_pause: 3 In the next example, we override adaptive_pause_threshold 's default by including the setting in the configuration: adaptive_pause: 2 adaptive_pause_threshold: 2.5 In this example, adaptive pausing kicks in only if the response time for the most recent request is 2.5 times the average of the response time for the last 20 requests. You can increment adaptive_pause_threshold 's value by .5 (e.g., 2.5, 3, 3.5, etc.) until you find a sweet spot that balances reliability with overall execution time. You can also decrease or increase the value of adaptive_pause incrementally by intervals of .5 to further refine the balance - increasing adaptive_pause 's value lessens Workbench's impact on Drupal at the expense of speed, and decreasing its value increases speed but also impact on Drupal. Since adaptive_pause doesn't have a default value, you need to define its value in your configuration file. Because of this, using adaptive_pause_threshold on its own in a configuration, e.g.: adaptive_pause_threshold: 3 doesn't do anything because adaptive_pause has no value. In other words, you can use adaptive_pause on its own, or you can use adaptive_pause and adaptive_pause_threshold together, but it's pointless to use adaptive_pause_threshold on its own. Logging Drupal's response time If a request if paused by adaptive pausing, Workbench will automatically log the response time for the next request, indicating that adaptive_pause has temporarily halted execution. If you want to log Drupal's response time regardless of whether adaptive_pause had kicked in or not, add log_response_time: true to your configuration file. All logging of response time includes variation from the average of the last 20 response times. Caching The second way that Workbench reduces stress on Drupal is by caching its outbound HTTP requests, thereby reducing the overall number of requests. This caching both reduces the load on Drupal and speeds up Workbench considerably. By default, this caching is enabled for requests that Workbench makes more than once and that are expected to have the same response each time, such as requests for field configurations or for checks for the existence of taxonomy terms. Note that you should not normally have to disable this caching, but if you do (for example, if you are asked to during troubleshooting), you can do so by including the following setting in your configuration file: enable_http_cache: false","title":"Reducing Workbench's impact on Drupal"},{"location":"reducing_load/#pausing","text":"","title":"Pausing"},{"location":"reducing_load/#basic-pause","text":"The pause configuration setting tells Workbench to temporarily halt execution before every request, thereby spreading load caused by the requests over a longer period of time. 
To enable pause , include the setting in your configuration file, indicating the number of seconds to wait between requests: pause: 2 Using pause will help decrease load-induced errors, but it is inefficient because it causes Workbench to pause between all requests, even ones that are not putting stress on Drupal. A useful strategy for refining Workbench's load-reduction capabilities is to try pause first, and if it reduces errors, then disable pause and try adaptive_pause instead. pause will confirm that Workbench is adding load to Drupal, but adaptive_pause will tell Workbench to pause only when it detects its requests are putting load on Drupal. Note pause and adaptive_pause are mutually exclusive. If you include one in your configuration files, you should not include the other.","title":"Basic pause"},{"location":"reducing_load/#adaptive-pause","text":"Adaptive pause only halts execution between requests if Workbench detects that Drupal is slowing down. It does this by comparing Drupal's response time for the most recent request to the average response time of the 20 previous requests made by Islandora Workbench. If the response time for the most recent request reaches a specific threshold, Workbench's adaptive pause will kick in and temporarily halt execution to allow Drupal to catch up. The number of previous requests used to determine the average response time, 20, cannot be changed with a configuration setting. The threshold that needs to be met is configured using the adaptive_pause_threshold setting. This setting's default value is 2, which means that the adaptive pause will kick in if the response time for the most recent request Workbench makes to Drupal is 2 times (double) the average of Workbench's last 20 requests. The amount of time that Workbench will pause is determined by the value of adaptive_pause , which, like the value for pause , is a number of seconds (e.g., adaptive_pause: 3 ). You enable adaptive pausing by adding the adaptive_pause setting to your configuration file. Here are a couple of examples. Keep in mind that adaptive_pause_threshold has a default value (2), but adaptive_pause does not have a default value. The first example enables adaptive_pause using the default value for adaptive_pause_threshold , telling it to pause for 3 seconds between requests if Drupal's response time to the last request is 2 times slower ( adaptive_pause_threshold 's default value) than the average of the last 20 requests: adaptive_pause: 3 In the next example, we override adaptive_pause_threshold 's default by including the setting in the configuration: adaptive_pause: 2 adaptive_pause_threshold: 2.5 In this example, adaptive pausing kicks in only if the response time for the most recent request is 2.5 times the average of the response time for the last 20 requests. You can increment adaptive_pause_threshold 's value by .5 (e.g., 2.5, 3, 3.5, etc.) until you find a sweet spot that balances reliability with overall execution time. You can also decrease or increase the value of adaptive_pause incrementally by intervals of .5 to further refine the balance - increasing adaptive_pause 's value lessens Workbench's impact on Drupal at the expense of speed, and decreasing its value increases speed but also impact on Drupal. Since adaptive_pause doesn't have a default value, you need to define its value in your configuration file. 
Because of this, using adaptive_pause_threshold on its own in a configuration, e.g.: adaptive_pause_threshold: 3 doesn't do anything because adaptive_pause has no value. In other words, you can use adaptive_pause on its own, or you can use adaptive_pause and adaptive_pause_threshold together, but it's pointless to use adaptive_pause_threshold on its own.","title":"Adaptive pause"},{"location":"reducing_load/#logging-drupals-response-time","text":"If a request is paused by adaptive pausing, Workbench will automatically log the response time for the next request, indicating that adaptive_pause has temporarily halted execution. If you want to log Drupal's response time regardless of whether adaptive_pause has kicked in or not, add log_response_time: true to your configuration file. All logging of response time includes variation from the average of the last 20 response times.","title":"Logging Drupal's response time"},{"location":"reducing_load/#caching","text":"The second way that Workbench reduces stress on Drupal is by caching its outbound HTTP requests, thereby reducing the overall number of requests. This caching both reduces the load on Drupal and speeds up Workbench considerably. By default, this caching is enabled for requests that Workbench makes more than once and that are expected to have the same response each time, such as requests for field configurations or for checks for the existence of taxonomy terms. Note that you should not normally have to disable this caching, but if you do (for example, if you are asked to during troubleshooting), you can do so by including the following setting in your configuration file: enable_http_cache: false","title":"Caching"},{"location":"roadmap/","text":"Islandora Workbench development priorities for Fall 2021 are: ability to create taxonomy terms with fields ( issue ) - Done. ability to create hierarchical taxonomy terms ( issue ) - Done. ability to add/update remote video and audio ( issue ) ability to add/update multilingual field data ( issue ) ability to add/update a non-base Drupal title field ( issue ) - Done. allow --check to report more than one error at a time ( issue ) - Done. a desktop spreadsheet editor ( issue ) ability to add/update Paragraph fields ( issue ) - Done. support the TUS resumable upload protocol ( issue )","title":"Roadmap (Fall 2021)"},{"location":"rolling_back/","text":"In the create and create_from_files tasks, Workbench generates a configuration file and accompanying input CSV file in the format described in the \"Deleting nodes\" documentation. These files allow you to easily roll back (i.e., delete) all the nodes and accompanying media you just created. Specifically, this configuration file defines a delete task. See the \" Deleting nodes \" section for more information. When you run Workbench with --check , it will also write an entry in your log file indicating the location of the rollback config and CSV files, and it will test to ensure that the rollback config and CSV files can be written. If either one cannot, Workbench exits with an error. Warning The rollback configuration file contains the username used in the create or create_from_files task that generates it. If you also want to include the accompanying password, add include_password_in_rollback_config_file: true to your configuration. Also note that rollback configuration and CSV files are overwritten every time you run a create or create_from_files task. It is highly recommended that you timestamp your rollback files (see below) to produce unique filenames.
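To give a rough idea of what Workbench generates, the rollback configuration file is essentially a small delete task pointing at the rollback CSV, and the rollback CSV lists the IDs of the nodes that were created. The following is only an illustrative sketch (the real files are written by Workbench, their exact contents depend on your original configuration, and the hostname and node IDs shown here are made up):

task: delete
host: https://islandora.example.org
username: admin
input_csv: rollback.csv

and the accompanying rollback CSV:

node_id
52
53
54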
By default, the configuration file is named \"rollback.yml\" and is written into the Workbench directory. The input CSV file is named \"rollback.csv\" and is written into the directory defined in your input_dir configuration setting. If either of these files exists, it is overwritten during the next create or create_from_files task. You can also specify the relative (to workbench) or absolute path to your rollback config and CSV files, by including rollback_config_file_path and rollback_csv_file_path , respectively, in your configuration. To roll back all the nodes and media you just created, run ./workbench --config rollback.yml . There are several configuration settings that let you control the names of these two files, and there is also an option to include comments in the files, as described below. Note When secondary tasks are configured, each task will get its own rollback file. Each secondary task's rollback file will have a normalized version of the path to the task's configuration file appended to the rollback filename, e.g., rollback.csv.home_mark_hacking_islandora_workbench_example_secondary_task . Using these separate rollback files, you can delete only the nodes created in a specific secondary task. Setting the directory where the rollback CSV file is written You can determine where the rollback CSV file is written by including the rollback_dir setting in your configuration. This overrides the default location defined in input_dir . The rollback configuration file is always written to the Workbench working directory unless that behavior is overridden by rollback_config_file_path . Using rollback filename templates For both the rollback config file and the rollback CSV file, you can define a template that provides four placeholders: $config_filename , which contains the name of the create (or create_from_files ) configuration file $input_csv_filename , which contains the name of the input CSV file (only available in create tasks) csv_start_row , which contains the value of the csv_start_row configuration setting, or \"0\" if none is set (only available in create tasks) csv_stop_row , which contains the value of the csv_stop_row configuration setting, or \"None\" if not set (only available in create tasks). You can embed these placeholders in your template, which is then used to create the names of the two files. The templates for the two filenames are defined in two separate configuration settings. For the rollback configuration file, this setting is rollback_config_filename_template , e.g.: rollback_config_filename_template: my-custom_filename_${config_filename}_$input_csv_filename Assuming the configuration file for the create task that generates this rollback configuration has the name \"mjtest.yml\", and the input CSV has the filename \"sample.csv\", this template will result in a configuration file named my-custom_filename_mjtest_sample.yml . Note that the \".yml\" extension is added automatically; you do not need to include it in your template. For the rollback CSV file, the configuration setting is rollback_csv_filename_template , e.g.: rollback_csv_filename_template: my_custom_rollback_filename_${config_filename}_$input_csv_filename Using the same create task configuration filename and input CSV filename as in the previous example, this template will result in a CSV file named my_custom_rollback_filename_mjtest_sample.csv . The \".csv\" extension is added automatically; you do not need to include it in your template.
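The csv_start_row and csv_stop_row placeholders are handy when you process a large input CSV in batches, because each batch's rollback files then record which rows they cover. As a hypothetical illustration (the template and resulting filename below are an assumption, not output copied from Workbench), a template such as:

rollback_csv_filename_template: rollback_${config_filename}_rows_${csv_start_row}_to_${csv_stop_row}

used with a create task whose config file is named \"mjtest.yml\" and which sets csv_start_row: 1 and csv_stop_row: 500 would produce a rollback CSV named something like rollback_mjtest_rows_1_to_500.csv (the \".csv\" extension is again added automatically).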
Warning Python's built-in templating system has a quirk: when a character that is valid in a Python variable name immediately follows a template placeholder, that character is treated as part of the placeholder name. When templating filenames, the two most common characters of this type are _ and - . When this happens, Workbench will exit with the error message \"One or more parts of the configured rollback configuration filename template ([your template here]) need adjusting.\" You can work around this behavior by wrapping your template variable name, without the leading $ , in {} . The example rollback config filename template above ( my-custom_filename_$config_filename_$input_csv_filename ) will trigger this error because the _ following the $config_filename placeholder is valid in Python variable names. If you see this type of message, adjust your template to my-custom_filename_${config_filename}_$input_csv_filename . Adding a timestamp to the rollback filenames By default, Workbench overwrites the rollback configuration and CSV files each time it runs, so these files only apply to the most recent create and create_from_files runs. If you add timestamp_rollback: true to your configuration file, a (to-the-second) timestamp will be added to the rollback.yml and corresponding rollback.csv files, for example, rollback.2024_11_03_21_10_28.yml and rollback.2024_11_03_21_10_28.csv . The name of the CSV is also written to workbench.log . Running ./workbench --config rollback.2024_11_03_21_10_28.yml will delete the nodes identified in input_data/rollback.2024_11_03_21_10_28.csv . Timestamps are added in the same way to custom rollback configuration and CSV filenames created using templates. Adding custom comments to your rollback configuration and CSV files Workbench always adds two lines of comments to rollback configuration and CSV files, indicating when the files were generated and the names of the configuration and input CSV files they were generated from, like this: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". You can add additional, custom comment lines by including the rollback_file_comments configuration setting in your create or create_from_files configuration, like this: rollback_file_comments: - Keep this file! It might be useful if something goes wrong with this job. - Have a nice day! This will result in the following comments in your rollback configuration and CSV files: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". # Keep this file! It might be useful if something goes wrong with this job. # Have a nice day!","title":"Rolling back nodes and media"},{"location":"rolling_back/#setting-the-directory-where-the-rollback-cvs-file-is-written","text":"You can determine where the rollback CSV file is written by including the rollback_dir setting in your configuration. This overrides the default location defined in input_dir .
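For example, adding rollback_dir: /tmp/workbench_rollbacks to your configuration (the path here is purely hypothetical) would cause the rollback CSV to be written into /tmp/workbench_rollbacks rather than the directory named in input_dir .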
The rollback configuration file is always written to the Workbench working directory unless that behavior is overwritten by rollback_config_file_path .","title":"Setting the directory where the rollback CVS file is written"},{"location":"rolling_back/#using-rollback-filename-templates","text":"For both the rollback config file and the rollback CSV file, you can define a template that provides four placeholders: $config_filename , which contains the name of the create (or create_from_files ) configuration file $input_csv_filename , which contain the name of the input CSV file (only available in create tasks) csv_start_row , which contains the value of the csv_start_row configuration setting, or \"0\" if none is set (only available in create tasks) csv_stop_row , which contains the value of the csv_stop_row configuration setting, or \"None\" if not set (only available in create tasks). You can embed these placeholders in your template, which is then used to create the names of the two files. The templates for the two filenames are defined in two separate configuration settings. For the rollback configuration file, this setting is rollback_config_filename_template , e.g.: rollback_config_filename_template: my-custom_filename_${config_filename}_$input_csv_filename Assuming the configuration file for the create task that generates this rollback configuration has the name \"mjtest.yml\", and the input_csv filename has the filename \"sample.csv\", this template will result in a configuration file named my-custom_filename_mjtest_sample.yml . Note that the \".yml\" extension is added automatically; you do not need to include it in your template. For the rollback CSV file, the configuration setting is rollback_csv_filename_template , e.g.: rollback_csv_filename_template: my_custom_rollback_filename_${config_filename}_$input_csv_filename Using the same create task configuration filename and input CSV filename as in the previous example, this template will result in a CSV file named my_custom_rollback_filename_mjtest_sample.csv . The \".csv\" extension is added automatically; you do not need to include it in your template. Warning Python's built-in templating system has a quirk where when a character that is valid in a Python variable name follows a template placeholder, that character is added to the template placeholder. When templating filenames, the two most common characters of this type are _ and - . When this happens, Workbench will exit with the error message \"One or more parts of the configured rollback configuration filename template ([your template here]) need adjusting.\" You can work around this behavior by wrapping your template variable name, without the leading $ , in {} . The example rollback config filename template above ( my-custom_filename_$config_filename_$input_csv_filename ) will trigger this error because the _ following the $config_filename placeholder is valid in Python variable names. If you see this type of message, adjust your template to my-custom_filename_${config_filename}_$input_csv_filename .","title":"Using rollback filename templates"},{"location":"rolling_back/#adding-a-timestamp-to-the-rollback-filenames","text":"By default, Workbench overwrites the rollback configuration and CSV files each time it runs, so these files only apply to the most recent create and create_from_files runs. 
If you add timestamp_rollback: true to your configuration file, a (to-the-second) timestamp will be added to the rollback.yml and corresponding rollback.csv files, for example, rollback.2024_11_03_21_10_28.yml and rollback.2024_11_03_21_10_28.csv . The name of the CSV is also written to workbench.log . Running ./workbench --config rollback.2024_11_03_21_10_28.yml will delete the nodes identified in input_data/rollback.2024_11_03_21_10_28.csv . Timestamps are added in the same way to custom rollback configuration and CSV filenames create using templates.","title":"Adding a timestamp to the rollback filenames"},{"location":"rolling_back/#adding-custom-comments-to-your-rollback-configuration-and-csv-files","text":"Workbench always adds to lines of comments to rollback configuration and CSV files, indicating when the files were generated and the names of the configuration and input CSV files they were generated from, like this: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". You can add additional, custom comment lines by including the rollback_file_comments configuration setting in your create or create_from_files configuration, like this: rollback_file_comments: - Keep this file! It might be useful if something goes wrong with this job. - Have a nice day! This will result in the following comments in your rollback configuration and CSV files: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". # Keep this file! It might be useful if something goes wrong with this job. # Have a nice day!","title":"Adding custom comments to your rollback configuration and CSV files"},{"location":"sample_data/","text":"Islandora Workbench comes with some sample data. Running ./workbench --config create.yml --check will result in the following output: OK, connection to Drupal at http://localhost:8000 verified. OK, configuration file has all required values (did not check for optional values). OK, CSV file input_data/metadata.csv found. OK, all 5 rows in the CSV file have the same number of columns as there are headers (5). OK, CSV column headers match Drupal field names. OK, required Drupal fields are present in the CSV file. OK, term IDs/names in CSV file exist in their respective taxonomies. OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies. OK, files named in the CSV \"file\" column are all present. Configuration and input data appear to be valid. Then running workbench Workbench without --check will result in something like: Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52. +File media for IMG_1410.tif created. Node for 'Manhatten Island' created at http://localhost:8000/node/53. +File media for IMG_2549.jp2 created. Node for 'Looking across Burrard Inlet' created at http://localhost:8000/node/54. +Image media for IMG_2940.JPG created. Node for 'Amsterdam waterfront' created at http://localhost:8000/node/55. +Image media for IMG_2958.JPG created. Node for 'Alcatraz Island' created at http://localhost:8000/node/56. +Image media for IMG_5083.JPG created.","title":"Creating nodes from the sample data"},{"location":"troubleshooting/","text":"If you encounter a problem, take a look at the \"things that might sound familiar\" section below. But, if the problem you're encountering isn't described below, you can ask for help. 
Ask for help The #islandoraworkbench Slack channel is a good place to ask a question if Workbench isn't working the way you expect or if it crashes. You can also open a Github issue . If Workbench \"isn't working the way you expect\", the documentation is likely unclear. Crashes are usually caused by sloppy Python coding. Reporting either is a great way to contribute to Islandora Workbench. But before you ask... The first step you should take while troubleshooting a Workbench failure is to use Islandora's graphical user interface to create/edit/delete a node/media/taxonomy term (or whatever it is you're trying to do with Workbench). If Islandora works without error, you have confirmed that the problem you are experiencing is likely isolated to Workbench and is not being caused by an underlying problem with Islandora. Next, if you have eliminated Islandora as the cause of the Workbench problem you are experiencing, you might be able to fix the problem by pulling in the most recent Workbench code. The best way to keep it up to date is to pull in the latest commits from the Github repository periodically, but if you haven't done that in a while, within the \"islandora_workbench\" directory, run the following git commands: git branch , which should tell you whether you're currently on the \"main\" branch. If you are: git pull , which will fetch the most recent code and merge it into the code you are running. If git tells you it has pulled in any changes to Workbench, you will be running the latest code. If you get an error while running git, ask for help. Also, you might be asked to provide one or more of the following: your configuration file (without username and password!). You can also print your configuration to your terminal by including the --print_config argument to workbench, e.g. python workbench --config test.yml --check --print_config . some sample input CSV your Workbench log file details from your Drupal log, available at Admin > Reports > Recent log messages whether you have certain contrib modules installed, or other questions about how your Drupal is configured. Some things that might sound familiar Running Workbench results in lots of messages like InsecureRequestWarning: Unverified HTTPS request is being made to host 'islandora.dev' . If you see this, and you are using an ISLE installation whose Drupal hostname uses the traefik.me domain (for example, https://islandora.traefik.me), the HTTPS certificate for the domain has expired. This problem will be widespread so please check the Islandora Slack for any current discussion about it. If you don't get any help in Slack, try redownloading the certificates following the instructions in the Islandora documentation . If that doesn't work, you can temporarily avoid the warning messages by adding secure_ssl_only: false to your Workbench config file and updating an environment variable using the following command: export PYTHONWARNINGS=\"ignore:Unverified HTTPS request\" Workbench is slow. True, it can be slow. However, users have found that the following strategies increase Workbench's speed substantially: Running Workbench on the same server that Drupal is running on (e.g. using \"localhost\" as the value of host in your config file). While doing this negates Workbench's most important design principle - that it does not require access to the Drupal server's command line - during long-running jobs such as those that are part of migrations, this is the best way to speed up Workbench. Using local instead of remote files.
If you populate your file or \"additional files\" fields with filenames that start with \"http\", Workbench downloads each of those files before ingesting them. Providing local copies of those files in advance of running Workbench will eliminate the time it takes Workbench to download them. Avoid confirming taxonomy terms' existence during --check . If you add validate_terms_exist: false to your configuration file, Workbench will not query Drupal for each taxonomy term during --check . This option is suitable if you know that the terms don't exist in the target Drupal. Note that this option only speeds up --check ; it does not have any effect when creating new nodes. Generate FITS XML files offline and then add them as media like any other file. Doing this allows Islandora to not generate FITS data, which can slow down ingests substantially. If you do this, be sure to disable the \"FITS derivatives\" Context first. I've pulled in updates to Islandora Workbench from Github but when I run it, Python complains about not being able to find a library. This won't happen very often, and the cause of this message will likely have been announced in the #islandoraworkbench Slack channel. This error is caused by the addition of a new Python library to Workbench. Running setup.py will install the missing library. Details are available in the \"Updating Islandora Workbench\" section of the Requirements and Installation docs. You do not need to run setup.py every time you update the Workbench code. Introducing a new library is not a common occurance. Workbench is failing to ingest some nodes and is leaving messages in the log mentioning HTTP response code 422. This is probably caused by unexpected data in your CSV file that Workbench's --check validation is not finding. If you encounter these messages, please open an issue and share any relevant entries in your Workbench log and Drupal log (as an admin user, go to Admin > Reports > Recent log messages) so we can track down the problem. One of the most common causes of this error is that one or more of the vocabularies being populated in your create task CSV contain required fields other than the default term name. It is possible to have Workbench create these fields, but you must do so as a separate create_terms task. See \" Creating taxonomy terms \" for more information. Workbench is crashing and telling me there are problems with SSL certificates. To determine if this issue is specific to Workbench, from the same computer Workbench is running on, try hitting your Drupal server (or server your remote files are on) with curl . If curl also complains about SSL certificates, the problem lies in the SSL/HTTPS configuration on the server. An example curl command is curl https://wwww.lib.sfu.ca . If curl doesn't complain, the problem is specific to Workbench. --check is telling me that one the rows in my CSV file has more columns than headers. The most likely problem is that one of your CSV values contains a comma but is not wrapped in double quotes. My Drupal has the \"Standalone media URL\" option at /admin/config/media/media-settings checked, and I'm using Workbench's standalone_media_url: true option in my config, but I'm still getting lots of errors. Be sure to clear Drupal's cache every time you change the \"Standalone media URL\" option. More information can be found here . Workbench crashes or slows down my Drupal server. If Islandora Workbench is putting too much strain on your Drupal server, you should try enabling the pause configuration option. 
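In practice, the update sequence is usually along these lines, run from inside your islandora_workbench directory (this is only a sketch; see the \"Updating Islandora Workbench\" section of the Requirements and Installation docs for the exact command for your setup):

git pull
python setup.py install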
If that works, replace it with the adaptive_pause option and see if that also works. The former option pauses between all communication requests between Workbench and Drupal, while the latter pauses only if the server's response time for the last request is longer than the average of the last 20 requests. Note that both of these settings will slow Workbench down, which is their purpose. However, adaptive_pause should have less impact on overall speed since it only pauses between requests if it detects the server is getting slower over time. If you use adaptive_pause , you can also tune the adaptive_pause_threshold option by incrementing the value by .5 intervals (for example, from the default of 2 to 2.5, then 3, etc.) to see if doing so reduces strain on your Drupal server while keeping overall speed acceptable. You can also lower the value of adaptive_pause incrementally to balance strain with overall speed. Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file. Some web applications, including Drupal 7, return a human-readable HTML page instead of an expected HTTP response code when they encounter an error. If Workbench is complaining that a remote file in your file (or other file-related) column in your input CSV has an extension of \".htm\" or \".html\" and you know that the file is not an HTML page, what Workbench is seeing is probably an error message. For example, Workbench might leave a message like this in your log: Error: File \"https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view\" in CSV row \"text:302175\" has an extension (html) that is not allowed in the \"field_media_file\" field of the \"file\" media type. This error can be challenging to track down since the HTML error page might have been specific to the request that Workbench just made (e.g. a timeout or some other temporary server condition). One way of determining if the error is temporary (i.e. specific to the request) is to use curl to fetch the file (e.g., curl -o test.tif https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view ). If the returned file (in this example, it will be named test.tif ) is in fact HTML, the error is probably permanent or at least persistent; if the file is the one you expected to retrieve, the error was temporary and you can ignore it. The text in my CSV does not match how it looks when I view it in Drupal. If a field is configured in Drupal to use text filters , the HTML that is displayed to the user may not be exactly the same as the content of the node add/edit form field. If you check the node add/edit form, the content of the field should match the content of the CSV field. If it does, it is likely that Drupal is applying a text filter. See this issue for more information. My Islandora uses a custom media type and I need to tell Workbench what file field to use.
If you need to create a media that is not one of the standard Islandora types (Image, File, Digital Document, Video, Audio, Extracted Text, or FITS Technical metadata), you will need to include the media_file_fields setting in your config file, like this: media_file_fields: - mycustommedia_machine_name: field_custom_file - myothercustommedia_machine_name: field_other_custom_file This configuration setting adds entries to the following default mapping of media types to file field names: 'file': 'field_media_file', 'document': 'field_media_document', 'image': 'field_media_image', 'audio': 'field_media_audio_file', 'video': 'field_media_video_file', 'extracted_text': 'field_media_file', 'fits_technical_metadata': 'field_media_file' EDTF 'interval' values are not rendering in Islandora properly. Islandora can display EDTF interval values (e.g., 2004-06/2006-08 , 193X/196X ) properly, but by default, the configuration that allows this is disabled (see this issue for more information). To enable it, for each field in your Islandora content types that use EDTF fields, visit the \"Manage form display\" configuration for the content type, and for each field that uses the \"Default EDTF widget\", within the widget configuration (click on the gear), check the \"Permit date intervals\" option and click \"Update\": My CSV file has a url_alias column, but the aliases are not being created. First thing to check is whether you are using the Pathauto module. It also creates URL aliases, and since by default Drupal only allows one URL alias, in most cases, the aliases it creates will take precedence over aliases created by Workbench. I'm having trouble getting Workbench to work in a cronjob The most common problem you will encounter when running Islandora Workbench in a cronjob is that Workbench can't find its configuration file, or input/output directories. The easiest way to avoid this is to use absolute file paths everywhere, including as the value of Workbench's --config parameter, in configuration files, and in file and additional_files columns in your input CSV files. In some cases, particularly if you are using a secondary task to create pages or child items, you many need to use the path_to_python and path_to_workbench_script configuration settings. get_islandora_7_content.py crashes with the error \"illegal request: Server returned status of 400. The default query may be too long for a url request.\" Islandora 7's Solr contains a lot of redundant fields. You need to reduce the number of fields to export. See the \" Exporting Islandora 7 content \" documentation for ways to reduce the number of fields. Ask in the #islandoraworkbench Slack channel if you need help.","title":"Troubleshooting"},{"location":"troubleshooting/#ask-for-help","text":"The #islandoraworkbench Slack channel is a good place to ask a question if Workbench isn't working the way you expect or if it crashes. You can also open a Github issue . If Workbench \"isn't working the way you expect\", the documentation is likely unclear. Crashes are usually caused by sloppy Python coding. Reporting either is a great way to contribute to Islandora Workbench.","title":"Ask for help"},{"location":"troubleshooting/#but-before-you-ask","text":"The first step you should take while troubleshooting a Workbench failure is to use Islandora's graphical user interface to create/edit/delete a node/media/taxonomy term (or whatever it is you're trying to do with Workbench). 
If Islandora works without error, you have confirmed that the problem you are experiencing is likely isolated to Workbench and is not being caused by an underlying problem with Islandora. Next, if you have eliminated Islandora as the cause of the Workbench problem you are experiencing, you might be able to fix the problem by pulling in the most recent Workbench code. The best way to keep it up to date is to pull in the latest commits from the Github repository periodically, but if you haven't done that in a while, within the \"islandora_workbench\" directory, run the following git commands: git branch , which should tell whether you're currently in the \"main\" branch. If you are: git pull , which will fetch the most recent code and and merge it into the code you are running. If git tells you it has pulled in any changes to Workbench, you will be running the latest code. If you get an error while running git, ask for help. Also, you might be asked to provide one or more of the following: your configuration file (without username and password!). You can also print your configuration to your terminal by includeing the --print_config argument to workbench, e.g. python workbench --config test.yml --check --print_config . some sample input CSV your Workbench log file details from your Drupal log, available at Admin > Reports > Recent log messages whether you have certain contrib modules installed, or other questions about how your Drupal is configured.","title":"But before you ask..."},{"location":"troubleshooting/#some-things-that-might-sound-familiar","text":"","title":"Some things that might sound familiar"},{"location":"troubleshooting/#running-workbench-results-in-lots-of-messages-like-insecurerequestwarning-unverified-https-request-is-being-made-to-host-islandoradev","text":"If you see this, and you are using an ISLE istallation whose Drupal hostname uses the traefik.me domain (for example, https://islandora.traefik.me), the HTTPS certificate for the domain has expired. This problem will be widespread so please check the Islandora Slack for any current discussion about it. If you don't get any help in Slack, try redownloading the certificates following the instructions in the Islandora documentation . If that doesn't work, you can temporarily avoid the warning messages by adding secure_ssl_only: false to your Workbench config file and updating an environment variable using the following command: export PYTHONWARNINGS=\"ignore:Unverified HTTPS request\"","title":"Running Workbench results in lots of messages like InsecureRequestWarning: Unverified HTTPS request is being made to host 'islandora.dev'."},{"location":"troubleshooting/#workbench-is-slow","text":"True, it can be slow. However, users have found that the following strategies increase Workbench's speed substantially: Running Workbench on the same server that Drupal is running on (e.g. using \"localhost\" as the value of host in your config file). While doing this negates Workbench's most important design principle - that it does not require access to the Drupal server's command line - during long-running jobs such as those that are part of migrations, this is the best way to speed up Workbench. Using local instead of remote files. If you populate your file or \"additional files\" fields with filenames that start with \"http\", Workbench downloads each of those files before ingesting them. Providing local copies of those files in advance of running Workbench will eliminate the time it takes Workbench to download them. 
Avoid confirming taxonomy terms' existence during --check . If you add validate_terms_exist: false to your configuration file, Workbench will not query Drupal for each taxonomy term during --check . This option is suitable if you know that the terms don't exist in the target Drupal. Note that this option only speeds up --check ; it does not have any effect when creating new nodes. Generate FITS XML files offline and then add them as media like any other file. Doing this allows Islandora to not generate FITS data, which can slow down ingests substantially. If you do this, be sure to disable the \"FITS derivatives\" Context first.","title":"Workbench is slow."},{"location":"troubleshooting/#ive-pulled-in-updates-to-islandora-workbench-from-github-but-when-i-run-it-python-complains-about-not-being-able-to-find-a-library","text":"This won't happen very often, and the cause of this message will likely have been announced in the #islandoraworkbench Slack channel. This error is caused by the addition of a new Python library to Workbench. Running setup.py will install the missing library. Details are available in the \"Updating Islandora Workbench\" section of the Requirements and Installation docs. You do not need to run setup.py every time you update the Workbench code. Introducing a new library is not a common occurance.","title":"I've pulled in updates to Islandora Workbench from Github but when I run it, Python complains about not being able to find a library."},{"location":"troubleshooting/#workbench-is-failing-to-ingest-some-nodes-and-is-leaving-messages-in-the-log-mentioning-http-response-code-422","text":"This is probably caused by unexpected data in your CSV file that Workbench's --check validation is not finding. If you encounter these messages, please open an issue and share any relevant entries in your Workbench log and Drupal log (as an admin user, go to Admin > Reports > Recent log messages) so we can track down the problem. One of the most common causes of this error is that one or more of the vocabularies being populated in your create task CSV contain required fields other than the default term name. It is possible to have Workbench create these fields, but you must do so as a separate create_terms task. See \" Creating taxonomy terms \" for more information.","title":"Workbench is failing to ingest some nodes and is leaving messages in the log mentioning HTTP response code 422."},{"location":"troubleshooting/#workbench-is-crashing-and-telling-me-there-are-problems-with-ssl-certificates","text":"To determine if this issue is specific to Workbench, from the same computer Workbench is running on, try hitting your Drupal server (or server your remote files are on) with curl . If curl also complains about SSL certificates, the problem lies in the SSL/HTTPS configuration on the server. An example curl command is curl https://wwww.lib.sfu.ca . 
If curl doesn't complain, the problem is specific to Workbench.","title":"Workbench is crashing and telling me there are problems with SSL certificates."},{"location":"troubleshooting/#-check-is-telling-me-that-one-the-rows-in-my-csv-file-has-more-columns-than-headers","text":"The most likely problem is that one of your CSV values contains a comma but is not wrapped in double quotes.","title":"--check is telling me that one the rows in my CSV file has more columns than headers."},{"location":"troubleshooting/#my-drupal-has-the-standalone-media-url-option-at-adminconfigmediamedia-settings-checked-and-im-using-workbenchs-standalone_media_url-true-option-in-my-config-but-im-still-getting-lots-of-errors","text":"Be sure to clear Drupal's cache every time you change the \"Standalone media URL\" option. More information can be found here .","title":"My Drupal has the \"Standalone media URL\" option at /admin/config/media/media-settings checked, and I'm using Workbench's standalone_media_url: true option in my config, but I'm still getting lots of errors."},{"location":"troubleshooting/#workbench-crashes-or-slows-down-my-drupal-server","text":"If Islandora Workbench is putting too much strain on your Drupal server, you should try enabling the pause configuration option. If that works, replace it with the adaptive_pause option and see if that also works. The former option pauses between all communication requests between Workbench and Drupal, while the latter pauses only if the server's response time for the last request is longer than the average of the last 20 requests. Note that both of these settings will slow Workbench down, which is their purpose. However, adaptive_pause should have less impact on overall speed since it only pauses between requests if it detects the server is getting slower over time. If you use adaptive_pause , you can also tune the adaptive_pause_threshold option by incrementing the value by .5 intervals (for example, from the default of 2 to 2.5, then 3, etc.) to see if doing so reduces strain on your Drupal server while keeping overall speed acceptable. You can also lower the value of adaptive_pause incrementally to balance strain with overall speed.","title":"Workbench crashes or slows down my Drupal server."},{"location":"troubleshooting/#workbench-thinks-that-a-remote-file-is-an-html-file-when-i-know-its-a-video-or-audio-or-image-etc-file","text":"Some web applications, including Drupal 7, return a human-readable HTML page instead of a expected HTTP response code when they encounter an error. If Workbench is complaining that a remote file in your file other file column in your input CSV has an extension of \".htm\" or \".html\" and you know that the file is not an HTML page, what Workbench is seeing is probably an error message. For example, Workbench might leave a message like this in your log: Error: File \"https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view\" in CSV row \"text:302175\" has an extension (html) that is not allowed in the \"field_media_file\" field of the \"file\" media type. This error can be challenging to track down since the HTML error page might have been specific to the request that Workbench just made (e.g. a timeout or some other temporary server condition). One way of determining if the error is temporary (i.e. specific to the request) is to use curl to fetch the file (e.g., curl -o test.tif https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view ). 
If the returned file (in this example, it will be named test.tif ) is in fact HTML, the error is probably permanent or at least persistent; if the file is the one you expected to retrieve, the error was temporary and you can ignore it.","title":"Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file."},{"location":"troubleshooting/#the-text-in-my-csv-does-not-match-how-it-looks-when-i-view-it-in-drupal","text":"If a field is configured in Drupal to use text filters , the HTML that is displayed to the user may not be exactly the same as the content of the node add/edit form field. If you check the node add/edit form, the content of the field should match the content of the CSV field. If it does, it is likely that Drupal is apply a text filter. See this issue for more information.","title":"The text in my CSV does not match how it looks when I view it in Drupal."},{"location":"troubleshooting/#my-islandora-uses-a-custom-media-type-and-i-need-to-tell-workbench-what-file-field-to-use","text":"If you need to create a media that is not one of the standard Islandora types (Image, File, Digital Document, Video, Audio, Extracted Text, or FITS Technical metadata), you will need to include the media_file_fields setting in your config file, like this: media_file_fields: - mycustommedia_machine_name: field_custom_file - myothercustommedia_machine_name: field_other_custom_file This configuration setting adds entries to the following default mapping of media types to file field names: 'file': 'field_media_file', 'document': 'field_media_document', 'image': 'field_media_image', 'audio': 'field_media_audio_file', 'video': 'field_media_video_file', 'extracted_text': 'field_media_file', 'fits_technical_metadata': 'field_media_file'","title":"My Islandora uses a custom media type and I need to tell Workbench what file field to use."},{"location":"troubleshooting/#edtf-interval-values-are-not-rendering-in-islandora-properly","text":"Islandora can display EDTF interval values (e.g., 2004-06/2006-08 , 193X/196X ) properly, but by default, the configuration that allows this is disabled (see this issue for more information). To enable it, for each field in your Islandora content types that use EDTF fields, visit the \"Manage form display\" configuration for the content type, and for each field that uses the \"Default EDTF widget\", within the widget configuration (click on the gear), check the \"Permit date intervals\" option and click \"Update\":","title":"EDTF 'interval' values are not rendering in Islandora properly."},{"location":"troubleshooting/#my-csv-file-has-a-url_alias-column-but-the-aliases-are-not-being-created","text":"First thing to check is whether you are using the Pathauto module. It also creates URL aliases, and since by default Drupal only allows one URL alias, in most cases, the aliases it creates will take precedence over aliases created by Workbench.","title":"My CSV file has a url_alias column, but the aliases are not being created."},{"location":"troubleshooting/#im-having-trouble-getting-workbench-to-work-in-a-cronjob","text":"The most common problem you will encounter when running Islandora Workbench in a cronjob is that Workbench can't find its configuration file, or input/output directories. The easiest way to avoid this is to use absolute file paths everywhere, including as the value of Workbench's --config parameter, in configuration files, and in file and additional_files columns in your input CSV files. 
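For example, a crontab entry along the following lines (all paths shown are hypothetical and need to be adjusted for your environment) avoids relative-path problems by spelling everything out:

30 2 * * * /home/user/islandora_workbench/workbench --config /home/user/jobs/nightly_create.yml >> /home/user/jobs/nightly_create.log 2>&1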
In some cases, particularly if you are using a secondary task to create pages or child items, you may need to use the path_to_python and path_to_workbench_script configuration settings.","title":"I'm having trouble getting Workbench to work in a cronjob"},{"location":"troubleshooting/#get_islandora_7_contentpy-crashes-with-the-error-illegal-request-server-returned-status-of-400-the-default-query-may-be-too-long-for-a-url-request","text":"Islandora 7's Solr contains a lot of redundant fields. You need to reduce the number of fields to export. See the \" Exporting Islandora 7 content \" documentation for ways to reduce the number of fields. Ask in the #islandoraworkbench Slack channel if you need help.","title":"get_islandora_7_content.py crashes with the error \"illegal request: Server returned status of 400. The default query may be too long for a url request.\""},{"location":"updating_media/","text":"You can update media by providing a CSV file with a media_id column and other columns representing the fields of the media that you wish to update. Here is a very simple example CSV that updates the file attached to the media with ID 100: media_id,file 100,test.txt Values in the media_id column can be numeric media IDs (as illustrated above), full URLs, or full URL aliases. The minimum configuration file for \"update_media\" tasks looks like this (note the task option is update_media ): task: update_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: update_media.csv media_type: file media_type is required, and its value is the Drupal machine name of the type of media you are updating (e.g. image , document , file , video , etc.) Currently, the update_media routine has support for the following operations: Updating files attached to media Updating the set of track files attached to media Updating the Media Use TIDs associated with media Updating the published status of media Updating custom fields of any supported field type When updating field values on the media, the update_mode configuration option allows you to determine whether the field values are replaced, appended, or deleted: replace (the default) will replace all existing values in a field with the values in the input CSV. append will add values in the input CSV to any existing values in a field. delete will delete all values in a field. Updating files attached to media Note Updating files attached to media is currently only supported for media attached to a node. Warning This operation will delete the existing file attached to the media and replace it with the file specified in the CSV file. The update_mode setting has no effect on replacing files. To update the file attached to a media, you must provide a CSV file with, at minimum, a media_id column and a file column. The media_id column should contain the ID of the media you wish to update, and the file column should contain the path to the file you wish to attach to the media (always use file and not the media-type-specific file fieldname). Here is an example CSV that updates the file attached to the media with ID 100: media_id,file 100,test.txt Values in the file column can be paths to files on the local filesystem, full URLs, or full URL aliases. Updating the track files attached to media Note This functionality is currently only supported for media attached to a node. Note This operation will delete all existing track files attached to the media and replace them with the track files specified in the CSV file.
To update the set of track files attached to a media, you must provide a CSV file with, at minimum, a media_id column and a column with a name that matches the media_track_file_fields setting in the configuration file. By default, the media_track_file_fields setting in the configuration file is set to field_track for both audio and video. If you have a custom setup that has a different machine name of the field on the media that holds the track file and need to override these defaults, you can do so using the media_track_file_fields configuration setting: media_track_file_fields: audio: field_my_custom_track_file_field mycustommmedia: field_another_custom_track_file_field For the purposes of this example, we will assume that the media_track_file_fields setting is set to field_track for both audio and video. The media_id column should contain the ID of the media you wish to update, and the field_track column should contain information related to the track files you wish to attach to the media. The field_track has a special format, which requires you to specify the following information separated by colons, in exactly the following order: - The label for the track - Its \"kind\" (one of \"subtitles\", \"descriptions\", \"metadata\", \"captions\", or \"chapters\") - The language code for the track (\"en\", \"fr\", \"es\", etc.) - The absolute path to the track file, which must have the extension \".vtt\" (the extension may be in upper or lower case) Here is an example CSV that updates the set of track files attached to the media with ID 100: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt You may also wish to attach multiple track files to a single media. To do this, you can specify items in the field_track column separated by the subdelimeter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the set of track files attached to the media with ID 100 to have multiple track files: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt|French Subtitles:subtitles:fr:/path/to/french_subtitles.vtt Updating the media use TIDs associated with media To update the Media Use TIDs associated with media, you must provide a CSV file with, at minimum, a media_id column and a media_use_tid column. The media_id column should contain the ID of the media you wish to update, and the media_use_tid column should contain the TID(s) of the media use term(s) you wish to associate with the media. If a value is not specified for the media_use_tid column in a particular row, the value for the media_use_tid setting in the configuration file (Service File by default) will be used. Here is an example CSV that updates the Media Use TID associated with the media with ID 100: media_id,media_use_tid 100,17 You may also wish to associate multiple Media Use TIDs with a single media. To do this, you can specify items in the media_use_tid column separated by the subdelimeter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the Media Use TIDs associated with the media with ID 100 to have multiple Media Use TIDs: media_id,media_use_tid 100,17|18 Values in the media_use_tid column can be the taxonomy term ID of the media use or the taxonomy term URL alias. Updating the published status of media To update the published status of media, you must provide a CSV file with, at minimum, a media_id column and a status column. 
The media_id column should contain the ID of the media you wish to update, and the status column should contain one of the following case-insensitive values: \"1\" or \"True\" (to publish the media) \"0\" or \"False\" (to unpublish the media) Here is an example CSV that updates the published status of some media: media_id,status 100,tRuE 101,0 Updating custom fields attached to media To update custom fields attached to media, you must provide a CSV file with, at minimum, a media_id column and columns with the machine names of the fields you wish to update. The media_id column should contain the ID of the media you wish to update, and the other columns should contain the values you wish to set for the fields. Here is an example CSV that updates the name and a custom field of a media: media_id,name,field_my_custom_field 100,My Media,My Custom Value Leaving fields unchanged If you wish to leave a field unchanged, you can leave it blank in the column for that field. Here is an example CSV that updates the published status of some media and leaves others unchanged: media_id,status 100,1 101, 102,0","title":"Updating media"},{"location":"updating_media/#updating-files-attached-to-media","text":"Note Updating files attached to media is currently only supported for media attached to a node. Warning This operation will delete the existing file attached to the media and replace it with the file specified in the CSV file. The update_mode setting has no effect on replacing files. To update the file attached to a media, you must provide a CSV file with, at minimum, a media_id column and a file column. The media_id column should contain the ID of the media you wish to update, and the file column should contain the path to the file you wish to attach to the media (always use file and not the media-type-specific file fieldname). Here is an example CSV that updates the file attached to the media with ID 100: media_id,file 100,test.txt Values in the file column can be paths to files on the local filesystem, full URLs, or full URL aliases.","title":"Updating files attached to media"},{"location":"updating_media/#updating-the-track-files-attached-to-media","text":"Note This functionality is currently only supported for media attached to a node. Note This operation will delete all existing track files attached to the media and replace them with the track files specified in the CSV file. To update the set of track files attached to a media, you must provide a CSV file with, at minimum, a media_id column and a column with a name that matches the media_track_file_fields setting in the configuration file. By default, the media_track_file_fields setting in the configuration file is set to field_track for both audio and video. If you have a custom setup that has a different machine name of the field on the media that holds the track file and need to override these defaults, you can do so using the media_track_file_fields configuration setting: media_track_file_fields: audio: field_my_custom_track_file_field mycustommmedia: field_another_custom_track_file_field For the purposes of this example, we will assume that the media_track_file_fields setting is set to field_track for both audio and video. The media_id column should contain the ID of the media you wish to update, and the field_track column should contain information related to the track files you wish to attach to the media.
The field_track has a special format, which requires you to specify the following information separated by colons, in exactly the following order: - The label for the track - Its \"kind\" (one of \"subtitles\", \"descriptions\", \"metadata\", \"captions\", or \"chapters\") - The language code for the track (\"en\", \"fr\", \"es\", etc.) - The absolute path to the track file, which must have the extension \".vtt\" (the extension may be in upper or lower case) Here is an example CSV that updates the set of track files attached to the media with ID 100: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt You may also wish to attach multiple track files to a single media. To do this, you can specify items in the field_track column separated by the subdelimeter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the set of track files attached to the media with ID 100 to have multiple track files: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt|French Subtitles:subtitles:fr:/path/to/french_subtitles.vtt","title":"Updating the track files attached to media"},{"location":"updating_media/#updating-the-media-use-tids-associated-with-media","text":"To update the Media Use TIDs associated with media, you must provide a CSV file with, at minimum, a media_id column and a media_use_tid column. The media_id column should contain the ID of the media you wish to update, and the media_use_tid column should contain the TID(s) of the media use term(s) you wish to associate with the media. If a value is not specified for the media_use_tid column in a particular row, the value for the media_use_tid setting in the configuration file (Service File by default) will be used. Here is an example CSV that updates the Media Use TID associated with the media with ID 100: media_id,media_use_tid 100,17 You may also wish to associate multiple Media Use TIDs with a single media. To do this, you can specify items in the media_use_tid column separated by the subdelimeter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the Media Use TIDs associated with the media with ID 100 to have multiple Media Use TIDs: media_id,media_use_tid 100,17|18 Values in the media_use_tid column can be the taxonomy term ID of the media use or the taxonomy term URL alias.","title":"Updating the media use TIDs associated with media"},{"location":"updating_media/#updating-the-published-status-of-media","text":"To update the published status of media, you must provide a CSV file with, at minimum, a media_id column and a status column. The media_id column should contain the ID of the media you wish to update, and the status column should contain one of the following case-insensitive values: \"1\" or \"True\" (to publish the media) \"0\" or \"False\" (to unpublish the media) Here is an example CSV that updates the published status of some media: media_id,status 100,tRuE 101,0","title":"Updating the published status of media"},{"location":"updating_media/#updating-custom-fields-attached-to-media","text":"To update custom fields attached to media, you must provide a CSV file with, at minimum, a media_id column and columns with the machine names of the fields you wish to update. The media_id column should contain the ID of the media you wish to update, and the other columns should contain the values you wish to set for the fields. 
Here is an example CSV that updates the name and a custom field of a media: media_id,name,field_my_custom_field 100,My Media,My Custom Value","title":"Updating custom fields attached to media"},{"location":"updating_media/#leaving-fields-unchanged","text":"If you wish to leave a field unchanged, you can leave it blank in the column for that field. Here is an example CSV that updates the published status of some media and leaves others unchanged: media_id,status 100,1 101, 102,0","title":"Leaving fields unchanged"},{"location":"updating_nodes/","text":"You can update existing nodes by providing a CSV file with a node_id column plus field data you want to update. The type of update is determined by the value of the update_mode configuration option: replace (the default) will replace all existing values in a field with the values in the input CSV. append will add values in the input CSV to any existing values in a field. delete will delete all values in a field. The column headings in the CSV file other than node_id must match machine names of fields that exist in the target Islandora content type. Only include fields that you want to update. Workbench can update fields following the same CSV conventions used when creating nodes as described in the \" Fields \" documentation. For example, using the fields defined by the Islandora Defaults module for the \"Repository Item\" content type, your CSV file could look like this: node_id,field_description,field_rights,field_access_terms,field_member_of 100,This is my new title,I have changed my mind. This item is yours to keep.,27,45 The config file for update operations looks like this (note the task option is 'update'): task: update host: \"http://localhost:8000\" username: admin password: islandora content_type: my_content_type input_csv: update.csv If you want to append the values in the CSV to values that already exist in the target nodes, add the update_mode configuration option: task: update host: \"http://localhost:8000\" username: admin password: islandora content_type: my_content_type input_csv: update.csv update_mode: append Some things to note: If your target Drupal content type is not islandora_object (the default value), you must include content_type in your configuration file as illustrated above. The update_mode applies to all rows in your CSV; it cannot be specified for particular rows. Updates apply to entire fields. Workbench cannot replace individual values in a field. Values in the node_id column can be numeric node IDs (e.g. 467 ) or full URLs, including URL aliases. If a node you are updating doesn't have a field named in your input CSV, Workbench will skip updating the node and add a log entry to that effect. For update tasks where the update_mode is \"replace\" or \"append\", blank/empty CSV values will do nothing; in other words, empty CSV values tell Workbench to not update the field. For update tasks where the update_mode is \"delete\", it doesn't matter if the column(s) in the input CSV are blank or contain values - the values in the corresponding Drupal fields are deleted in both cases. Islandora Workbench will never allow a field to contain more values than the field's configuration allows. Attempts to update a field with more values than the maximum number allowed will result in the surplus values being ignored during the \"update\" task.
If Workbench does this, it will write an entry to the log indicating it has done so.","title":"Updating nodes"},{"location":"updating_terms/","text":"You can update existing taxonomy terms in an update_terms task by providing a CSV file with a term_id column plus field data you want to update. The type of update is determined by the value of the update_mode configuration option: replace (the default) will replace all existing values in a field with the values in the input CSV. append will add values in the input CSV to any existing values in a field. delete will delete all values in a field. Islandora Workbench will never allow a field to contain more values than the field's configuration allows. Attempts to update a field with more values than the maximum number allowed will result in the surplus values being ignored during the \"update\" task. If Workbench does this, it will write an entry to the log indicating it has done so. The column headings in the CSV file other than term_id must match either the machine names of fields that exist in the target vocabulary, or their human-readable labels, with exceptions for the following fields: term_name , description , parent , weight , and published (more information about these fields is available in the \" Creating taxonomy terms \" documentation). Only include fields that you want to update. Currently, fields with the data types as described in the \" Fields \" documentation can be updated. Note If you are going to use published in your update_terms tasks, you need to remove the \"Published\" filter from the \"Term from term name\" View. For example, using the fields defined by the Islandora Defaults module for the \"Person\" vocabulary, your CSV file could look like this: term_id,term_name,description,field_authority_link 100,\"Jordan, Mark\",Mark Jordan's Person term.,http://viaf.org/viaf/106468077%%VIAF Record The config file for update_terms operations looks like this (note the task option is update_terms ): task: update_terms host: \"http://localhost:8000\" username: admin password: islandora # vocab_id is required. vocab_id: person input_csv: update.csv If you want to append the values in the CSV to values that already exist in the target terms, add the update_mode configuration option: task: update_terms host: \"http://localhost:8000\" username: admin password: islandora vocab_id: person input_csv: update.csv update_mode: append Some things to note: The vocab_id config setting is required. The update_mode applies to all rows in your CSV; it cannot be specified for particular rows. Updates apply to entire fields. Workbench cannot replace individual values in a field. Values in the term_id column can be numeric term IDs (e.g. 467 ) or strings (e.g. Dogs ). If a string is used, it must match the existing term name exactly, apart from leading and trailing whitespace. In tasks where you want to update the values in term_name , you should use term_id to identify the term entity. For update tasks where the update_mode is \"replace\" or \"append\", blank/empty CSV values will do nothing; in other words, empty CSV values tell Workbench to not update the field. For update tasks where the update_mode is \"delete\", it doesn't matter if the column(s) in the input CSV are blank or contain values - the values in the corresponding Drupal fields are deleted in both cases.","title":"Updating taxonomy terms"},{"location":"workflows/","text":"Islandora Workbench can be used in a variety of content ingest workflows. Several are outlined below.
Batch ingest This is the most common workflow. A user prepares a CSV file and accompanying media files, and runs Workbench to ingest the content: Note that within this basic workflow, options exist for creating nodes with no media , and creating stub nodes from files (i.e., no accompanying CSV file). Distributed batch ingest It is possible to separate the tasks of creating a node and its accompanying media. This can be done in a couple of ways: creating the nodes first, using the nodes_only: true configuration option, and adding media to those nodes separately creating stub nodes directly from media files , and updating the nodes separately In this workflow, the person creating the nodes and the person updating them later need not be the same. In both cases, Workbench can create an output CSV that can be used in the second half of the workflow. Migrations Islandora Workbench is not intended to replace Drupal's Migrate framework, but it can be used in conjunction with other tools and processes as part of an \" extract, transform, load \" (ETL) workflow. The source could be any platform. If it is Islandora 7, several tools exist to extract content, including the get_islandora_7_content.py script that comes with Workbench or the Islandora Get CSV module for Islandora 7. This content can then be used as input for Islandora Workbench, as illustrated here: On the left side of the diagram, get_islandora_7_content.py or the Islandora Get CSV module are used in the \"extract\" phase of the ETL workflow, and on the right side, running on the user's computer, Islandora Workbench is used in the \"load\" phase. Before loading the content, the user would modify the extracted CSV file to conform to Workbench's CSV content requirements. The advantage of migrating to Islandora in this way is that the exported CSV file can be cleaned or supplemented (manually or otherwise) prior to using it as Workbench's input. The specific tasks required during this \"transform\" phase will vary depending on the quality and consistency of metadata and other factors. Note Workbench's ability to add multiple media to a node at one time is useful during migrations, if you want to reuse derivatives such as thumbnails and OCR transcripts from the source platform. Using this ability can speed up ingest substantially, since Islandora won't need to generate derivative media that are added this way . See the \" Adding multiple media \" section for more information. Watch folders Since Islandora Workbench is a command-line tool, it can be run in a scheduled job such as Linux \"cron\". If CSV and file content are present when Workbench runs, Workbench will operate on them in the same way as if a person ran Workbench manually. Note Islandora Workbench does not detect changes in directories. While tools to do so exist, Workbench's ability to ingest Islandora content in batches makes it well suited to scheduled jobs, as opposed to real-time detection of new files in a directory. An example of this workflow is depicted in the diagrams below, where the source of the files is the daily output of someone scanning images.
If these images are saved in the directory that is specified in Workbench's input_dir configuration option, and Workbench is run in a cron job using the \" create_from_files \" task, nodes will be created when the cron job executes (over night, for example): A variation on this workflow is to combine it with the \"Distributed\" workflow described above: In this workflow, the nodes are created overnight and then updated with CSV data the next day. Note If you are using a CSV file as input (that is, a standard create task), a feature that is useful in this workflow is that Workbench can check to see if a node already exists in the target Drupal before it creates the node. Using this feature, you could continue to append rows to your input CSV and not worry about accidentally creating duplicate nodes. Metadata maintenance Workbench can help you maintain your metadata using a variation of the extract, transform, load pattern mentioned above. For example, Rosie Le Faive demonstrates this round-tripping technique in this video (no need to log in), in which they move publisher data from the Linked Agent field to a dedicated Publisher field. Rosie uses a get_data_from_view task to export the Linked Agent field data from a group of nodes, then does some offline transformation of that data into a separate Publisher field (in this case, a Python script, but any suitable tool could be used), then finally uses a pair of update tasks to put the modified data back into Drupal. Another example of round-tripping metadata is if you need to change a Drupal field's configuration (for example, change a text field's maximum length) but Drupal won't allow you to do that directly. Using Workbench, you could export all the data in the field you want to modify, create a new Drupal field to replace it, and then use an update task to populate the replacement field. Drupal's Views Bulk Operations module (documented here ) lets you do simple metadata maintenance, but round-tripping techniques like the ones described here enables you to do things that VBO simply can't. Integrations with other systems A combination of the \"Migrations\" workflow and the \"Watch folder\" workflow can be used to automate the periodic movement of content from a source system (in the diagram below, Open Journal Systems or Archivematica) into Islandora: The extraction of data from the source system, conversion of it into the CSV and file arrangement Workbench expects, and running of Workbench can all be scripted and executed in sequence using scheduled jobs. The case study below provides a full, operational example of this workflow. Using hooks Islandora Workbench provides several \" hooks \" that allow you to execute external scripts at specific times. For example, the \"post-action script\" enables you to execute scripts immediately after a node is created or updated, or a media is created. Drupal informs Workbench if an action was successful or not, and in either case, post-action hook scripts registered in the Workbench configuration file execute. These scripts can interact with external applications: Potential uses for this ability include adding new Islandora content to external processing queues, or informing upstream applications like those described in the \"Integrations with other systems\" section above that content they provide has been (or has not been) ingested into Islandora. As a simpler example, post-action hook scripts can be used to write custom or special-purpose log files. 
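Hook scripts are registered in the Workbench configuration file by listing their paths; a minimal sketch using the node_post_create setting (which also appears in the case study configuration below), with a hypothetical script name: node_post_create: ['/home/mark/islandora_workbench/add_to_processing_queue.py'] Workbench runs each registered script once the corresponding action has completed, whether or not the action succeeded.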
Warning Schedulers such as Linux cron usually require that all file paths are absolute, unless the scheduler changes its current working directory when running a job. When running Islandora Workbench in a scheduled job, all paths to files and directories included in configuration files should be absolute, not relative to Workbench. Also, the path to the Workbench configuration file used as the value of --config should be absolute. If a scheduled job is not executing the way you expect, the first thing you should check is whether all paths to files and directories are expressed as absolute paths, not relative paths. Sharing the input CSV with other applications Some workflows can benefit from having Workbench share its input CSV with other scripts or applications. For example, you might use Workbench to ingest nodes into Islandora but want to use the same CSV file in a script to create metadata for loading into another application such as a library discovery layer. Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names. To accommodate CSV columns that do not correspond to Drupal field names, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the non-Workbench column headers in the ignore_csv_columns configuration setting. For example, if you want to include a date_generated and a qa by column in your CSV, include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated', 'qa by'] With this setting in place, Workbench will ignore the date_generated and qa by columns in the input CSV. More information on this feature is available . Sharing configuration files with other applications Islandora Workbench ignores entries in its YAML configuration files it doesn't recognize. This means that you can include YAML data that you may need for an application you are using in conjunction with Workbench. For example, in an automated deployment, you may need to unzip an archive containing an input CSV and sample images, PDFs, etc. that you want to load into Islandora as part of the deployment. If you put something like mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip in your config file, your deployment scripts could read the same config file used by Workbench, pick out the mylib_zip_location entry and get its value, download and unpack the content of the file into the Workbench input_dir location, and then run Workbench: task: create host: https://islandora.traefik.me username: admin password: password input_dir: /tmp input_csv: deployment_sample_data.csv mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip It's probably a good idea to namespace your custom/non-Workbench config entries so they are easy to identify and to reduce the chance they conflict with Workbench config settings, but it's not necessary to do so. Cross-environment deployment / Continuous Integration Workbench can be used to create the same content across different development, testing, and deployment environments. One application of this is to allow a team of developers to be sure they are all using the same Islandora content. For example, it is possible to commit one or more Workbench config files to a shared Git repository, and when a developer rebuilds their environment, automatically run Workbench using those configuration files to load the shared content.
Some tips on making Islandora more portable across environments include: Adding the Workbench user's password to an envrionment variable , eliminating the need to include the password in the configuration file Using a Google Sheet as the input_csv value Using a remote Zip archive as the input images, PDFs, etc. This Zip archive can also contain in input CSV, eliminating the need to point to a Google Sheet. Using URL aliases in your field_member_of CSV column to avoid relying on Drupal instance-specific node IDs (note that this assumes the node aliases will be consistent across Drupal instances) Configuring a View to check if nodes already exist , if you want to rerun Workbench using the same input data but avoid creating duplicate nodes and media The same capabilities apply to using Workbench to load data for automated testing during Continuous Integration workflows and configurations. An interesting facet of using the combination of remote Zip and (optionally) a Google Sheet as input is that people other than developers, such as content managers or testers, can add content to ingest without having to commit to the Git repository. All they need to do is update the Google Sheet and Zip file. As long as the URLs of these inputs do not change, the next time the developers run Workbench, the new content will be ingested. Automatically populating staging and production environments on build is a useful way to test that these environments have deployed successfully. Running a rollback task can then delete the sample content once deployment has been confirmed. The Islandora Sandbox is populated on build using Workbench. Case study Simon Fraser University Library uses Islandora Workbench to automate the transfer of theses from its locally developed thesis registration application (called, unsurprisingly, the Thesis Registration System , or TRS) to Summit , the SFU institutional research repository running Islandora. This transfer happens through a series of scheduled tasks that run every evening. This diagram depicts the automated workflow, with an explanation of each step below the diagram. This case study is an example of the \" Integration with other systems \" workflow described above. Steps 1 and 2 do not involve Workbench directly and run as separate cron jobs an hour apart - step 1 runs at 7:00 PM, step 2 runs at 8:00 PM. Steps 3 and 4 are combined into a single Workbench \"create\" task which is run as a cron job at 9:00 PM. Step 1: Fetch the day's theses from the TRS The first scheduled task runs a script that fetches a list of all the theses approved by the SFU Library Thesis Office staff during that day. Every thesis that has been approved and does not have in its metadata a URL in Summit is in the daily list. (The fact that a thesis doesn't have a Summit URL in its TRS metadata yet will come up again in Step 4; we'll come back to that later.) After retrieving the list of theses, the script retrieves the metadata for each thesis (as a JSON file), the thesis PDF file, and, if they are present, any supplemental data files such as Excel, CSV, or video files attached to the thesis. All the data for each thesis is written to a temporary directory, where it becomes input to the script described in Step 2. Step 2: Convert the TRS data into CSV The script executed in this step converts the thesis data into a Workbench input CSV file. 
If there are any theses in a daily batch that have supplemental files, the script generates a second CSV file that is used in a Workbench secondary task to create compound objects (described in more detail in the next step). Step 3: Run Islandora Workbench With the thesis CSV created in step 2 in place (and the accompanying supplemental file CSV, if there are any supplemental files in the day's batch), a scheduled job executes Islandora Workbench. The main Workbench configuration file looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: sfu_thesis allow_adding_terms: true require_entity_reference_views: false subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/theses_daily.csv secondary_tasks: ['/home/utilityuser/islandora_workbench/supplemental_files_secondary_task.yml'] log_file_path: /home/zlocal/islandora_workbench/theses_daily.log node_post_create: ['/home/utilityuser/islandora_workbench/patch_summit.py'] path_to_python: /opt/rh/rh-python38/root/usr/bin/python path_to_workbench_script: /home/utilityuser/islandora_workbench/workbench The input CSV, which describes the theses (and is named in the input_csv setting in the above config file), looks like this (a single CSV row shown here): id,file,title,field_sfu_abstract,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_sfu_department,field_identifier,field_tags,field_language,field_member_of,field_sfu_permissions,field_sfu_thesis_advisor,field_sfu_thesis_type,field_resource_type,field_extent,field_display_hints,field_model 6603,/home/utilityuser/summit_data/tmp/etd21603/etd21603.pdf,\"Additively manufactured digital microfluidics\",\"With the development of lithography techniques, microfluidic systems have drastically evolved in the past decades. Digital microfluidics (DMF), which enables discrete droplet actuation without any carrying liquid as opposed to the continuous-fluid-based microfluidics, emerged as the candidate for the next generation of lab-on-a-chip systems. The DMF has been applied to a wide variety of fields including electrochemical and biomedical assays, drug delivery, and point-of-care diagnosis of diseases. Most of the DMF devices are made with photolithography which requires complicated processes, sophisticated equipment, and cleanroom setting. Based on the fabrication technology being used, these DMF manipulate droplets in a planar format that limits the increase of chip density. The objective of this study is to introduce additive manufacturing (AM) into the fabrication process of DMF to design and build a 3D-structured DMF platform for droplet actuation between different planes. The creation of additively manufactured DMF is demonstrated by fabricating a planar DMF device with ion-selective sensing functions. Following that, the design of vertical DMF electrodes overcomes the barrier for droplets to move between different actuation components, and the application of AM helps to construct free-standing xylem DMF to drive the droplet upward. To form a functional system, the horizontal and xylem DMF are integrated so that a three-dimensional (3D) droplet manipulation is demonstrated. The integrated system performs a droplet actuation speed of 1 mm/s from horizontal to vertical with various droplet sizes. 
It is highly expected that the 3D-structured DMF open new possibilities for the design of DMF devices that can be used in many practical applications.\",\"relators:aut:Min, Xin\",575,2021-08-27,\"Applied Sciences: School of Mechatronic Systems Engineering\",etd21603,\"Digital microfluidics%%%Additive manufacturing%%%3D printing%%%Electrowetting\",531,30044%%%30035,\"This thesis may be printed or downloaded for non-commercial research and scholarly purposes.\",\"relators:ths:Soo, Kim, Woo\",\"(Thesis) Ph.D.\",545,\"125 pages.\",519,512 Supplemental files for a thesis are created as child nodes, with the thesis node containing the PDF media as the parent. Here is the thesis in Summit created from the CSV data above. This thesis has four supplemental video files, which are created as the child nodes using a Workbench secondary task . The configuration file for this secondary task looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: islandora_object allow_adding_terms: true subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] # In case the supplemental file doesn't download, etc. we still create the node. allow_missing_files: true input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/supplemental_daily.csv log_file_path: /home/utilityuser/islandora_workbench/theses_daily.log The input CSV for this secondary task, which describes the supplemental files (and is named in the input_csv setting in the \"secondary\" config file), looks like this (only rows for children of the above item shown here): id,parent_id,title,file,field_model,field_member_of,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_description,field_display_hints 6603.1,6603,\"DMF droplet actuation_Planar\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_Planar.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.2,6603,\"DMF droplet actuation_3D electrode\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_3D electrode.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.3,6603,\"DMF droplet actuation_xylem DMF\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_xylem DMF.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.4,6603,\"Horizontal to vertical movement\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-Horizontal to vertical movement.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, Step 4: Update the TRS If a thesis (and any supplemental files, if present) have been successfully added to Summit, Workbench uses a post-node-create hook to run a script that updates the TRS in two ways: it populates the \"Summit URL\" field in the thesis record in the TRS, and it updates the student's user record in the TRS to prevent the student from logging into the TRS. The \"Summit URL\" for a thesis not only links the metadata in the TRS with the migrated item in Summit, it is also used to prevent theses from entering the daily feed described in Step 1. Specifically, theses that have been migrated to Summit (and therefore have a \"Summit URL\") are excluded from the daily feed generated by the TRS. This prevents a thesis from being migrated more than once. 
If for some reason the Thesis Office wants a thesis to be re-migrated to Summit, all they need to do is delete the first copy from Summit, then remove the \"Summit URL\" from the thesis metadata in the TRS. Doing so will ensure that the thesis gets into the next day's feed. Thesis Office staff do not want students logging into the TRS after their theses have been published in Summit. To prevent a student from logging in, once a thesis has been successfully ingested, Workbench executes a post-node-creation hook script that disables the student's user record in the TRS. If a student wants to log into the TRS after their thesis has been migrated, they need to contact the Thesis Office. Not depicted in the diagram After Workbench has completed executing, a final daily job runs that parses out entries from the Workbench log and the log written by the TRS daily script, and emails those entries to the Summit administrator to warn them of any errors that may have occurred. There is an additional monthly scheduled job (that runs on the first day of each month) that generates MARC records for each thesis added to Summit in the previous month. Before it finishes executing, this script emails the resulting MARC communications file to the Library's metadata staff, who load it into the Library's catalogue.","title":"Workflows"},{"location":"workflows/#batch-ingest","text":"This is the most common workflow. A user prepares a CSV file and accompanying media files, and runs Workbench to ingest the content: Note that within this basic workflow, options exist for creating nodes with no media , and creating stub nodes from files (i.e., no accompanying CSV file).","title":"Batch ingest"},{"location":"workflows/#distributed-batch-ingest","text":"It is possible to separate the tasks of creating a node and its accompanying media. This can be done in a couple of ways: creating the nodes first, using the nodes_only: true configuration option, and adding media to those nodes separately creating stub nodes directly from media files , and updating the nodes separately In this workflow, the person creating the nodes and the person updating them later need not be the same. In both cases, Workbench can create an output CSV that can be used in the second half of the workflow.","title":"Distributed batch ingest"},{"location":"workflows/#migrations","text":"Islandora Workbench is not intended to replace Drupal's Migrate framework, but it can be used in conjunction with other tools and processes as part of an \" extract, transform, load \" (ETL) workflow. The source could be any platform. If it is Islandora 7, several tools exist to extract content, including the get_islandora_7_content.py script that comes with Workbench or the Islandora Get CSV module for Islandora 7. This content can then be used as input for Islandora Workbench, as illustrated here: On the left side of the diagram, get_islandora_7_content.py or the Islandora Get CSV module are used in the \"extract\" phase of the ETL workflow, and on the right side, running the user's computer, Islandora Workbench is used in the \"load\" phase. Before loading the content, the user would modify the extracted CSV file to confirm with Workbench's CSV content requirements. The advantage of migrating to Islandora in this way is that the exported CSV file can be cleaned or supplemented (manually or otherwise) prior to using it as Workbench's input. 
The specific tasks required during this \"transform\" phase will vary depending on the quality and consistency of metadata and other factors. Note Workbench's ability to add multiple media to a node at one time is useful during migrations, if you want to reuse derivatives such as thumbnails and OCR transcripts from the source platform. Using this ability can speed up ingest substantially, since Islandora won't need to generate derivative media that are added this way . See the \" Adding multiple media \" section for more information.","title":"Migrations"},{"location":"workflows/#watch-folders","text":"Since Islandora workbench is a command-line tool, it can be run in a scheduled job such as Linux \"cron\". If CSV and file content are present when Workbench runs, Workbench will operate on them in the same way as if a person ran Workbench manually. Note Islandora Workbench does not detect changes in directories. While tools to do so exist, Workbench's ability to ingest Islandora content in batches makes it useful to scheduled jobs, as opposed to realtime detection of new files in a directory. An example of this workflow is depicted in the diagrams below, the source of the files is the daily output of someone scanning images. If these images are saved in the directory that is specified in Workbench's input_dir configuration option, and Workbench is run in a cron job using the \" create_from_files \" task, nodes will be created when the cron job executes (over night, for example): A variation on this workflow is to combine it with the \"Distributed\" workflow described above: In this workflow, the nodes are created overnight and then updated with CSV data the next day. Note If you are using a CSV file as input (that is, a standard create task), a feature that is useful in this workflow is that Workbench can check to see if a node already exists in the target Drupal before it creates the node. Using this feature, you could continue to append rows to your input CSV and not worry about accidentally creating duplicate nodes.","title":"Watch folders"},{"location":"workflows/#metadata-maintenance","text":"Workbench can help you maintain your metadata using a variation of the extract, transform, load pattern mentioned above. For example, Rosie Le Faive demonstrates this round-tripping technique in this video (no need to log in), in which they move publisher data from the Linked Agent field to a dedicated Publisher field. Rosie uses a get_data_from_view task to export the Linked Agent field data from a group of nodes, then does some offline transformation of that data into a separate Publisher field (in this case, a Python script, but any suitable tool could be used), then finally uses a pair of update tasks to put the modified data back into Drupal. Another example of round-tripping metadata is if you need to change a Drupal field's configuration (for example, change a text field's maximum length) but Drupal won't allow you to do that directly. Using Workbench, you could export all the data in the field you want to modify, create a new Drupal field to replace it, and then use an update task to populate the replacement field. 
Drupal's Views Bulk Operations module (documented here ) lets you do simple metadata maintenance, but round-tripping techniques like the ones described here enables you to do things that VBO simply can't.","title":"Metadata maintenance"},{"location":"workflows/#integrations-with-other-systems","text":"A combination of the \"Migrations\" workflow and the \"Watch folder\" workflow can be used to automate the periodic movement of content from a source system (in the diagram below, Open Journal Systems or Archivematica) into Islandora: The extraction of data from the source system, conversion of it into the CSV and file arrangement Workbench expects, and running of Workbench can all be scripted and executed in sequence using scheduled jobs. The case study below provides a full, operational example of this workflow.","title":"Integrations with other systems"},{"location":"workflows/#using-hooks","text":"Islandora Workbench provides several \" hooks \" that allow you to execute external scripts at specific times. For example, the \"post-action script\" enables you to execute scripts immediately after a node is created or updated, or a media is created. Drupal informs Workbench if an action was successful or not, and in either case, post-action hook scripts registered in the Workbench configuration file execute. These scripts can interact with external applications: Potential uses for this ability include adding new Islandora content to external processing queues, or informing upstream applications like those described in the \"Integrations with other systems\" section above that content they provide has been (or has not been) ingested into Islandora. As a simpler example, post-action hook scripts can be used to write custom or special-purpose log files. Warning Schedulers such as Linux cron usually require that all file paths are absolute, unless the scheduler changes its current working directory when running a job. When running Islandora Workbench in a scheduled job, all paths to files and directories included in configuration files should be absolute, not relative to Workbench. Also, the path to Workbench configuration file used as the value of --config should be absolute. If a scheduled job is not executing the way you expect, the first thing you should check is whether all paths to files and directories are expressed as absolute paths, not relative paths.","title":"Using hooks"},{"location":"workflows/#sharing-the-input-csv-with-other-applications","text":"Some workflows can benefit from having Workbench share its input CSV with other scripts or applications. For example, you might use Workbench to ingest nodes into Islandora but want to use the same CSV file in a script to create metadata for loading into another application such as a library discovery layer. Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names. To accommodate CSV columns that do not correspond to Drupal field names, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the non-Workbench column headers in the ignore_csv_columns configuration setting. For example, if you want to include a date_generated and a qa by column in your CSV, include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated', 'qa by'] With this setting in place, Workbench will ignore the date_generated column in the input CSV. 
More information on this feature is available .","title":"Sharing the input CSV with other applications"},{"location":"workflows/#sharing-configuration-files-with-other-applications","text":"Islandora Workbench ignores entries in its YAML configuration files it doesn't recognize. This means that you can include YAML data that you may need for an application you are using in conjuction with Workbench. For example, in an automated deployment, you may need to unzip an archive containing inut CSV and sample images, PDFs, etc. that you want to load into Islandora as part of the deployment. If you put something like mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip in your config file, your deployment scripts could read the same config file used by Workbench, pick out the mylib_zip_location entry and get its value, download and unpack the content of the file into the Workbench input_dir location, and then run Workbench: task: create host: https://islandora.traefik.me username: admin password: password input_dir: /tmp input_csv: deployment_sample_data.csv mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip It's probably a good idea to namespace your custom/non-Workbench config entries so they are easy to identify and to reduce the chance they conflict with Workbench config settings, but it's not necessary to do so.","title":"Sharing configuration files with other applications"},{"location":"workflows/#cross-environment-deployment-continuous-integration","text":"Workbench can be used to create the same content across different development, testing, and deployment environments. One application of this is to allow a team of developers can be sure they are all using the same Islandora content. For example, it is possible to commit one or more Workbench config files to a shared Git repository, and when a developer rebuilds their envrionment, automatically run Workbench using those configuration files to load the shared content. Some tips on making Islandora more portable across environments include: Adding the Workbench user's password to an envrionment variable , eliminating the need to include the password in the configuration file Using a Google Sheet as the input_csv value Using a remote Zip archive as the input images, PDFs, etc. This Zip archive can also contain in input CSV, eliminating the need to point to a Google Sheet. Using URL aliases in your field_member_of CSV column to avoid relying on Drupal instance-specific node IDs (note that this assumes the node aliases will be consistent across Drupal instances) Configuring a View to check if nodes already exist , if you want to rerun Workbench using the same input data but avoid creating duplicate nodes and media The same capabilities apply to using Workbench to load data for automated testing during Continuous Integration workflows and configurations. An interesting facet of using the combination of remote Zip and (optionally) a Google Sheet as input is that people other than developers, such as content managers or testers, can add content to ingest without having to commit to the Git repository. All they need to do is update the Google Sheet and Zip file. As long as the URLs of these inputs do not change, the next time the developers run Workbench, the new content will be ingested. Automatically populating staging and production environments on build is a useful way to test that these environments have deployed successfully. 
Running a rollback task can then delete the sample content once deployment has been confirmed. The Islandora Sandbox is populated on build using Workbench.","title":"Cross-environment deployment / Continuous Integration"},{"location":"workflows/#case-study","text":"Simon Fraser University Library uses Islandora Workbench to automate the transfer of theses from its locally developed thesis registration application (called, unsurprisingly, the Thesis Registration System , or TRS) to Summit , the SFU institutional research repository running Islandora. This transfer happens through a series of scheduled tasks that run every evening. This diagram depicts the automated workflow, with an explanation of each step below the diagram. This case study is an example of the \" Integration with other systems \" workflow described above. Steps 1 and 2 do not involve Workbench directly and run as separate cron jobs an hour apart - step 1 runs at 7:00 PM, step 2 runs at 8:00 PM. Steps 3 and 4 are combined into a single Workbench \"create\" task which is run as a cron job at 9:00 PM.","title":"Case study"},{"location":"workflows/#step-1-fetch-the-days-theses-from-the-trs","text":"The first scheduled task runs a script that fetches a list of all the theses approved by the SFU Library Thesis Office staff during that day. Every thesis that has been approved and does not have in its metadata a URL in Summit is in the daily list. (The fact that a thesis doesn't have a Summit URL in its TRS metadata yet will come up again in Step 4; we'll come back to that later.) After retrieving the list of theses, the script retrieves the metadata for each thesis (as a JSON file), the thesis PDF file, and, if they are present, any supplemental data files such as Excel, CSV, or video files attached to the thesis. All the data for each thesis is written to a temporary directory, where it becomes input to the script described in Step 2.","title":"Step 1: Fetch the day's theses from the TRS"},{"location":"workflows/#step-2-convert-the-trs-data-into-csv","text":"The script executed in this step converts the thesis data into a Workbench input CSV file. If there are any theses in a daily batch that have supplemental files, the script generates a second CSV file that is used in a Workbench secondary task to create compound objects (described in more detail in the next step).","title":"Step 2: Convert the TRS data into CSV"},{"location":"workflows/#step-3-run-islandora-workbench","text":"With the thesis CSV created in step 2 in place (and the accompanying supplemental file CSV, if there are any supplemental files in the day's batch), a scheduled job executes Islandora Workbench. 
The main Workbench configuration file looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: sfu_thesis allow_adding_terms: true require_entity_reference_views: false subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/theses_daily.csv secondary_tasks: ['/home/utilityuser/islandora_workbench/supplemental_files_secondary_task.yml'] log_file_path: /home/zlocal/islandora_workbench/theses_daily.log node_post_create: ['/home/utilityuser/islandora_workbench/patch_summit.py'] path_to_python: /opt/rh/rh-python38/root/usr/bin/python path_to_workbench_script: /home/utilityuser/islandora_workbench/workbench The input CSV, which describes the theses (and is named in the input_csv setting in the above config file), looks like this (a single CSV row shown here): id,file,title,field_sfu_abstract,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_sfu_department,field_identifier,field_tags,field_language,field_member_of,field_sfu_permissions,field_sfu_thesis_advisor,field_sfu_thesis_type,field_resource_type,field_extent,field_display_hints,field_model 6603,/home/utilityuser/summit_data/tmp/etd21603/etd21603.pdf,\"Additively manufactured digital microfluidics\",\"With the development of lithography techniques, microfluidic systems have drastically evolved in the past decades. Digital microfluidics (DMF), which enables discrete droplet actuation without any carrying liquid as opposed to the continuous-fluid-based microfluidics, emerged as the candidate for the next generation of lab-on-a-chip systems. The DMF has been applied to a wide variety of fields including electrochemical and biomedical assays, drug delivery, and point-of-care diagnosis of diseases. Most of the DMF devices are made with photolithography which requires complicated processes, sophisticated equipment, and cleanroom setting. Based on the fabrication technology being used, these DMF manipulate droplets in a planar format that limits the increase of chip density. The objective of this study is to introduce additive manufacturing (AM) into the fabrication process of DMF to design and build a 3D-structured DMF platform for droplet actuation between different planes. The creation of additively manufactured DMF is demonstrated by fabricating a planar DMF device with ion-selective sensing functions. Following that, the design of vertical DMF electrodes overcomes the barrier for droplets to move between different actuation components, and the application of AM helps to construct free-standing xylem DMF to drive the droplet upward. To form a functional system, the horizontal and xylem DMF are integrated so that a three-dimensional (3D) droplet manipulation is demonstrated. The integrated system performs a droplet actuation speed of 1 mm/s from horizontal to vertical with various droplet sizes. 
It is highly expected that the 3D-structured DMF open new possibilities for the design of DMF devices that can be used in many practical applications.\",\"relators:aut:Min, Xin\",575,2021-08-27,\"Applied Sciences: School of Mechatronic Systems Engineering\",etd21603,\"Digital microfluidics%%%Additive manufacturing%%%3D printing%%%Electrowetting\",531,30044%%%30035,\"This thesis may be printed or downloaded for non-commercial research and scholarly purposes.\",\"relators:ths:Soo, Kim, Woo\",\"(Thesis) Ph.D.\",545,\"125 pages.\",519,512 Supplemental files for a thesis are created as child nodes, with the thesis node containing the PDF media as the parent. Here is the thesis in Summit created from the CSV data above. This thesis has four supplemental video files, which are created as the child nodes using a Workbench secondary task . The configuration file for this secondary task looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: islandora_object allow_adding_terms: true subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] # In case the supplemental file doesn't download, etc. we still create the node. allow_missing_files: true input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/supplemental_daily.csv log_file_path: /home/utilityuser/islandora_workbench/theses_daily.log The input CSV for this secondary task, which describes the supplemental files (and is named in the input_csv setting in the \"secondary\" config file), looks like this (only rows for children of the above item shown here): id,parent_id,title,file,field_model,field_member_of,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_description,field_display_hints 6603.1,6603,\"DMF droplet actuation_Planar\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_Planar.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.2,6603,\"DMF droplet actuation_3D electrode\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_3D electrode.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.3,6603,\"DMF droplet actuation_xylem DMF\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_xylem DMF.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.4,6603,\"Horizontal to vertical movement\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-Horizontal to vertical movement.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,,","title":"Step 3: Run Islandora Workbench"},{"location":"workflows/#step-4-update-the-trs","text":"If a thesis (and any supplemental files, if present) have been successfully added to Summit, Workbench uses a post-node-create hook to run a script that updates the TRS in two ways: it populates the \"Summit URL\" field in the thesis record in the TRS, and it updates the student's user record in the TRS to prevent the student from logging into the TRS. The \"Summit URL\" for a thesis not only links the metadata in the TRS with the migrated item in Summit, it is also used to prevent theses from entering the daily feed described in Step 1. Specifically, theses that have been migrated to Summit (and therefore have a \"Summit URL\") are excluded from the daily feed generated by the TRS. This prevents a thesis from being migrated more than once. 
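As a rough illustration only (this is not SFU's actual patch_summit.py; the command-line argument handling and the TRS endpoint below are assumptions made for this sketch, so consult the Hooks documentation for the arguments Workbench actually passes to post-node-create scripts), such a hook script might look like this:

```python
#!/usr/bin/env python3
"""Illustrative post-node-create hook script. The argument layout and the
TRS API endpoint are assumptions, not the real integration."""

import json
import sys

import requests

# Assumption for this sketch: the HTTP response code (a string) and the
# created node's JSON are the last two arguments passed to the script.
http_response_code = sys.argv[-2]
node = json.loads(sys.argv[-1])

if http_response_code == "201":
    nid = node["nid"][0]["value"]
    summit_url = f"https://summit.sfu.ca/node/{nid}"
    # Hypothetical TRS API call: record the Summit URL on the thesis record
    # and disable the student's TRS login.
    requests.post(
        "https://trs.example.ca/api/theses/update",
        json={"summit_url": summit_url, "disable_student_login": True},
        timeout=30,
    )
```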
If for some reason the Thesis Office wants a thesis to be re-migrated to Summit, all they need to do is delete the first copy from Summit, then remove the \"Summit URL\" from the thesis metadata in the TRS. Doing so will ensure that the thesis gets into the next day's feed. Thesis Office staff do not want students logging into the TRS after their theses have been published in Summit. To prevent a student from logging in, once a thesis has been successfully ingested, Workbench executes a post-node-creation hook script that disables the student's user record in the TRS. If a student wants to log into the TRS after their thesis has been migrated, they need to contact the Thesis Office.","title":"Step 4: Update the TRS"},{"location":"workflows/#not-depicted-in-the-diagram","text":"After Workbench has completed executing, a final daily job runs that parses out entries from the Workbench log and the log written by the TRS daily script, and emails those entries to the Summit administrator to warn them of any errors that may have occurred. There is an additional monthly scheduled job (that runs on the first day of each month) that generates MARC records for each thesis added to Summit in the previous month. Before it finishes executing, this script emails the resulting MARC communications file to the Library's metadata staff, who load it into the Library's catalogue.","title":"Not depicted in the diagram"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Overview Islandora Workbench is a command-line tool that allows creation, updating, and deletion of Islandora content from CSV data. It is an alternative to using Drupal's built-in Migrate framework for ingesting Islandora content from CSV files . Unlike the Migrate tools, Islandora Workbench can be run anywhere - it does not need to run on the Drupal server. Drupal's Migrate framework, however, is much more flexible than Islandora Workbench, and can be extended using plugins in ways that Workbench cannot. Note that Islandora Workbench is not related in any way to the Drupal contrib module called Workbench . 
Features Allows creation of Islandora nodes and media, updating of nodes, and deletion of nodes and media from CSV files Allows creation of paged/compound content Can run from anywhere - it does not need to be run from the Drupal server's command line Provides both sensible default configuration values and rich configuration options for power users Provides robust data validation functionality Supports a variety of Drupal entity field types (text, integer, term reference, typed relation, geolocation) Can generate a CSV file template based on Drupal content type Can generate a contact sheet from CSV data Can use a Google Sheet or an Excel file instead of a CSV file as input Allows assignment of Drupal vocabulary terms using term IDs, term names, or term URIs Allows creation of new taxonomy terms from CSV field data Allows the assignment of URL aliases Allows creation of URL redirects Allows adding alt text to images Supports transmission fixity auditing for media files Cross platform (Windows, Mac, and Linux) Well documented Well tested Usage Within the islandora_workbench directory, run the following command, providing the name of your configuration file (\"config.yml\" in this example): ./workbench --config config.yml --check Note If you're on Windows, you will likely need to run Workbench by explicitly invoking Python, e.g. python workbench --config config.yml --check instead of using ./workbench as illustrated above. --check validates your configuration and input data. Typical output looks like: OK, connection to Drupal at http://localhost:8000 verified. OK, configuration file has all required values (did not check for optional values). OK, CSV file input_data/metadata.csv found. OK, all 5 rows in the CSV file have the same number of columns as there are headers (5). OK, CSV column headers match Drupal field names. OK, required Drupal fields are present in the CSV file. OK, term IDs/names in CSV file exist in their respective taxonomies. OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies. OK, files named in the CSV \"file\" column are all present. Configuration and input data appear to be valid. If your configuration file is not in the same directory as the workbench script, use its absolute path, e.g.: ./workbench --config /home/mark/config.yml --check If --check hasn't identified any problems, you can then rerun Islandora Workbench without the --check option to create the nodes: ./workbench --config config.yml Workbench will then create a node and attached media for each record in your input CSV file. Typical output looks like: Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52. +File media for IMG_1410.tif created. Node for 'Manhatten Island' created at http://localhost:8000/node/53. +File media for IMG_2549.jp2 created. Node for 'Looking across Burrard Inlet' created at http://localhost:8000/node/54. +Image media for IMG_2940.JPG created. Node for 'Amsterdam waterfront' created at http://localhost:8000/node/55. +Image media for IMG_2958.JPG created. Node for 'Alcatraz Island' created at http://localhost:8000/node/56. +Image media for IMG_5083.JPG created. If you'd rather not see all this detail, you can set an option in your configuration file to see a progress bar instead: [==================================> 40.0% ] License Islandora Workbench's documentation is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License . 
Contributing Contributions to this documentation are welcome. If you have a suggestion, please open an issue on the Islandora Workbench GitHub repository's queue and tag your issue \"documentation\".","title":"Home"},{"location":"#overview","text":"Islandora Workbench is a command-line tool that allows creation, updating, and deletion of Islandora content from CSV data. It is an alternative to using Drupal's built-in Migrate framework for ingesting Islandora content from CSV files . Unlike the Migrate tools, Islandora Workbench can be run anywhere - it does not need to run on the Drupal server. Drupal's Migrate framework, however, is much more flexible than Islandora Workbench, and can be extended using plugins in ways that Workbench cannot. Note that Islandora Workbench is not related in any way to the Drupal contrib module called Workbench .","title":"Overview"},{"location":"#features","text":"Allows creation of Islandora nodes and media, updating of nodes, and deletion of nodes and media from CSV files Allows creation of paged/compound content Can run from anywhere - it does not need to be run from the Drupal server's command line Provides both sensible default configuration values and rich configuration options for power users Provides robust data validation functionality Supports a variety of Drupal entity field types (text, integer, term reference, typed relation, geolocation) Can generate a CSV file template based on Drupal content type Can generate a contact sheet from CSV data Can use a Google Sheet or an Excel file instead of a CSV file as input Allows assignment of Drupal vocabulary terms using term IDs, term names, or term URIs Allows creation of new taxonomy terms from CSV field data Allows the assignment of URL aliases Allows creation of URL redirects Allows adding alt text to images Supports transmission fixity auditing for media files Cross platform (Windows, Mac, and Linux) Well documented Well tested","title":"Features"},{"location":"#usage","text":"Within the islandora_workbench directory, run the following command, providing the name of your configuration file (\"config.yml\" in this example): ./workbench --config config.yml --check Note If you're on Windows, you will likely need to run Workbench by explicitly invoking Python, e.g. python workbench --config config.yml --check instead of using ./workbench as illustrated above. --check validates your configuration and input data. Typical output looks like: OK, connection to Drupal at http://localhost:8000 verified. OK, configuration file has all required values (did not check for optional values). OK, CSV file input_data/metadata.csv found. OK, all 5 rows in the CSV file have the same number of columns as there are headers (5). OK, CSV column headers match Drupal field names. OK, required Drupal fields are present in the CSV file. OK, term IDs/names in CSV file exist in their respective taxonomies. OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies. OK, files named in the CSV \"file\" column are all present. Configuration and input data appear to be valid. If your configuration file is not in the same directory as the workbench script, use its absolute path, e.g.: ./workbench --config /home/mark/config.yml --check If --check hasn't identified any problems, you can then rerun Islandora Workbench without the --check option to create the nodes: ./workbench --config config.yml Workbench will then create a node and attached media for each record in your input CSV file. 
Typical output looks like: Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52. +File media for IMG_1410.tif created. Node for 'Manhatten Island' created at http://localhost:8000/node/53. +File media for IMG_2549.jp2 created. Node for 'Looking across Burrard Inlet' created at http://localhost:8000/node/54. +Image media for IMG_2940.JPG created. Node for 'Amsterdam waterfront' created at http://localhost:8000/node/55. +Image media for IMG_2958.JPG created. Node for 'Alcatraz Island' created at http://localhost:8000/node/56. +Image media for IMG_5083.JPG created. If you'd rather not see all this detail, you can set an option in your configuration file to see a progress bar instead: [==================================> 40.0% ]","title":"Usage"},{"location":"#license","text":"Islandora Workbench's documentation is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .","title":"License"},{"location":"#contributing","text":"Contributions to this documentation are welcome. If you have a suggestion, please open an issue on the Islandora Workbench GitHub repository's queue and tag your issue \"documentation\".","title":"Contributing"},{"location":"adding_media/","text":"Adding media from files You can add media to existing nodes by providing a CSV file with a node_id column plus a file field that contains the name of the file you want to add: node_id,file 100,test.txt Values in the node_id column can be numeric node IDs (as illustrated above), full URLs, or full URL aliases. The config file for \"add_media\" tasks like this (note the task option is 'add_media'): task: add_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: add_media.csv # media_use_tid is optional, it defaults to \"Original file\". media_use_tid: 21 media_type: file This is the same configuration file using a term URI in media_use_tid rather than a term ID: task: add_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: add_media.csv media_use_tid: \"http://pcdm.org/use#Transcript\" media_type: file If you want to specify a media_use_tid per CSV row, you can include that column in your CSV (in either \"add_media\" or \"create\" tasks): node_id,file,media_use_tid 100,test.txt,21 110,test2.txt,35 If you include media_use_tid values in your CSV file, they override the media_use_tid value set in your configuration file. Note If you create an \"Extracted Text\" media, the contents of the specified text file will be added to the media's 'field_edited_text' field, allowing it to be indexed in Solr. Note The Drupal filesystem where files are stored is determined by each media type's file field configuration. It is not possible to override that configuration. Adding references to media for use with DGI's Image Discovery module DiscoveryGarden's DGI Image Discovery module provides a way to assign the same thumbnail image to multiple nodes. This is not the module's main purpose, but reusing a thumbnail image on many nodes is easy to accomplish using this module and Islandora Workbench. Essentially, the Image Discovery module defines a specific Drupal field on a node, field_representative_image , that contains a reference to an existing media, for example a media with an Islandora Media Use term \"Thumbnail image\". This approach to defining a thumbnail is different than Islandora's normal node/media relationship, where the media entity references its parent node in a field_media_of field attached to the media. 
To use Workbench to populate field_representative_image , simply include that field in your Workbench create task CSV and populate it with the media ID of the thumbnail media you want to use. The following example CSV populates the field with the media ID \"3784\": id,file,title,field_representative_image,field_model test-001,Test node 001,3784,Image test-002,Test node 002,3784,Digital Document Note that this approach to assigning a thumbnail image to a node does not use Workbench's additional_fields configuration setting to define a CSV column containing the filename of a thumbnail image. It simply populates a node's field_representative_image field with a media ID. DGI's module ensures that the referenced media is used as the node's thumbnail image. No other Workbench configuration is necessary. It's just as easy to use an update task to add the referenced media ID to nodes: node_id,field_representative_image 1089,3784 1093,3784","title":"Adding media to nodes"},{"location":"adding_media/#adding-media-from-files","text":"You can add media to existing nodes by providing a CSV file with a node_id column plus a file field that contains the name of the file you want to add: node_id,file 100,test.txt Values in the node_id column can be numeric node IDs (as illustrated above), full URLs, or full URL aliases. The config file for \"add_media\" tasks like this (note the task option is 'add_media'): task: add_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: add_media.csv # media_use_tid is optional, it defaults to \"Original file\". media_use_tid: 21 media_type: file This is the same configuration file using a term URI in media_use_tid rather than a term ID: task: add_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: add_media.csv media_use_tid: \"http://pcdm.org/use#Transcript\" media_type: file If you want to specify a media_use_tid per CSV row, you can include that column in your CSV (in either \"add_media\" or \"create\" tasks): node_id,file,media_use_tid 100,test.txt,21 110,test2.txt,35 If you include media_use_tid values in your CSV file, they override the media_use_tid value set in your configuration file. Note If you create an \"Extracted Text\" media, the contents of the specified text file will be added to the media's 'field_edited_text' field, allowing it to be indexed in Solr. Note The Drupal filesystem where files are stored is determined by each media type's file field configuration. It is not possible to override that configuration.","title":"Adding media from files"},{"location":"adding_media/#adding-references-to-media-for-use-with-dgis-image-discovery-module","text":"DiscoveryGarden's DGI Image Discovery module provides a way to assign the same thumbnail image to multiple nodes. This is not the module's main purpose, but reusing a thumbnail image on many nodes is easy to accomplish using this module and Islandora Workbench. Essentially, the Image Discovery module defines a specific Drupal field on a node, field_representative_image , that contains a reference to an existing media, for example a media with an Islandora Media Use term \"Thumbnail image\". This approach to defining a thumbnail is different than Islandora's normal node/media relationship, where the media entity references its parent node in a field_media_of field attached to the media. 
To use Workbench to populate field_representative_image , simply include that field in your Workbench create task CSV and populate it with the media ID of the thumbnail media you want to use. The following example CSV populates the field with the media ID \"3784\": id,file,title,field_representative_image,field_model test-001,Test node 001,3784,Image test-002,Test node 002,3784,Digital Document Note that this approach to assigning a thumbnail image to a node does not use Workbench's additional_fields configuration setting to define a CSV column containing the filename of a thumbnail image. It simply populates a node's field_representative_image field with a media ID. DGI's module ensures that the referenced media is used as the node's thumbnail image. No other Workbench configuration is necessary. It's just as easy to use an update task to add the referenced media ID to nodes: node_id,field_representative_image 1089,3784 1093,3784","title":"Adding references to media for use with DGI's Image Discovery module"},{"location":"adding_multiple_media/","text":"By default, Islandora Workbench adds only one media per node. This applies to both create and add_media tasks, and the file that is used to create the media is the one named in the \"file\" CSV column. The media use term ID is defined in either the media_use_tid configuration option or in the \"media_use_tid\" CSV field. However, it is possible to add more than a single media per node in create and add_media tasks. This ability might be useful if you have pre-generated derivatives (thumbnails, OCR, etc.) or if you want to add the main file (e.g. with a media use term of \"Original File\") and one or more files that are not generated by Islandora's microservices. Workbench looks for the additional_files configuration option, which maps CSV field names to media use term IDs. If the field names defined in this option exist in your CSV (in either a create task or an add_media task), Workbench creates a media for each file and assigns it the corresponding media use term. Here is an example of this configuration option: additional_files: - thumbnail: 20 - rightsdocs: 280 Note The syntax of the additional_files entries is important. Be sure to include a space after the dash at the start of each entry, like - rightsdocs , not -rightsdocs . The accompanying create CSV would look like this: file,id,title,thumbnail,rightsdocs main.jpg,036,Testing,/tmp/thumbs/main_tn.jpg,/tmp/rightsdocs/036.txt The accompanying add_media CSV using the same sample file paths would look like this: file,node_id,thumbnail,rightsdocs main.jpg,2078,/tmp/thumbs/main_tn.jpg,/tmp/rightsdocs/036.txt You can use multiple media use term IDs by subdelimiting them, e.g.: additional_files: - thumbnail: 20 - rightsdocs: 280|465 Warning If you are creating Media using this feature, you should temporarily disable Contexts that would normally generate derivatives equivalent to the additional Media you are creating. Do this at admin/structure/context by choosing \"Disable\" in the \"Operations\" list for each applicable Context, and be sure to re-enable them after running Workbench. Because Islandora creates derivatives asynchronously, it is impossible to guarantee that Contexts will not overwrite additional Media created using this feature. 
Running --check will warn you of this by issuing a message like the following: \"Warning: Term ID '18' registered in the 'additional_files' config option for field 'thumbnail' will assign an Islandora Media Use term that might conflict with derivative media. You should temporarily disable the Context or Action that generates those derivatives.\" A few notes: If the additional_files columns are empty in any rows, or if the reserved file column is empty because all file-related columns are listed in additional_files, you will need to add the allow_missing_files: true option to your configuration file. Fixity checking is only available for files named in the file CSV column, not in the additional columns described here. See this issue for more information. If you create an \"Extracted Text\" media, the contents of the specified text file will be added to the media's \"field_edited_text\" field, allowing it to be indexed in Solr.","title":"Adding multiple media"},{"location":"aliases/","text":"In create tasks, you can assign URL aliases to nodes by adding the url_alias field to your CSV file, like this: file,title,field_description,url_alias IMG_1410.tif,Small boats in Havana Harbour,Some are blue.,/havana_boats IMG_2549.jp2,Manhatten Island,Manhatten is part of New York City.,/manhatten No other configuration is required. URL aliases must start with a forward slash ( / ). When you run Workbench with its --check option, it will check whether each alias starts with this character, and whether the alias already exists. Note This method of assigning URL aliases is useful if, for example, you are migrating from another platform and want to retain the URLs of items from the source platform. If you want to assign URL aliases that are derived from node-specific field data (like title, date, taxonomy terms, etc.), you can use the Drupal contrib module Pathauto instead. But note also that any URL alias created through Drupal's core URL alias functionality, which the method described above uses, is overwritten by Pathauto. This means that if you use Pathauto to create aliases, any URL aliases created by Workbench will likely not work. You can also assign URL aliases in update tasks: node_id,url_alias 345,/this_is_a_cool_item 367,/soisthisone However, in update tasks, you can only assign/update the url_alias for nodes that do not already have an alias. Your update_mode setting can be either append or replace .","title":"Assigning URL aliases"},{"location":"alt_text/","text":"Islandora image media require a value in their \"Alternative text\" field. This text is used as the alt text in the HTML markup rendering the image. You can assign alt text values to image media by adding the image_alt_text field to your CSV file, like this: file,id,title,field_model,image_alt_text IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" The value will only be applied to image media that Workbench creates. If you do not include this field in your CSV file, or the field is present but empty, Workbench will use the node's title as the alt text. Warning Workbench only adds alt text to the image media it creates. So if the image files in your create task are \"Original file\" media use (the default), those are the image media that will get the alt text value from your CSV. If the image files you create are \"Service file\" media (e.g. via an additional_files configuration), those are the ones that get the alt text.
A couple notes: Assigning alt text to images is only available in create tasks (but see this issue ). Workbench strips out all HTML markup within the alt text to prevent potential cross-site scripting vulnerabilities.","title":"Adding alt text to images"},{"location":"cancelling/","text":"Cancelling Workbench You can cancel/quit Islandora Workbench manually, before it completes running. You normally wouldn't do this, but if you do need to cancel/quit a Workbench job, press ctrl-c (the standard way to exit a running script) while Workbench is running. To illustrate what happens when you do this, let's use the following simple CSV input file, which we'll use to create some nodes: file,id,title IMG_1410.tif,01,Small boats in Havana Harbour IMG_2549.jp2,02,Manhatten Island IMG_2940.JPG,03,Looking across Burrard Inlet IMG_2958.JPG,04,Amsterdam waterfront IMG_5083.JPG,05,Alcatraz Island We run Workbench, and after two nodes have been created, we issue a ctrl-c to cancel: OK, connection to Drupal at http://localhost verified. Node for \"Small boats in Havana Harbour\" (record 01) created at http://localhost/node/33. Node for \"Manhatten Island\" (record 02) created at http://localhost/node/34. ^CExiting before entire CSV processed. See log for more info. workbench.log will contain the following entries: 07-Nov-21 09:39:30 - INFO - OK, connection to Drupal at http://localhost verified. 07-Nov-21 09:39:31 - INFO - Writing rollback CSV to input_data/rollback.csv 07-Nov-21 09:39:31 - INFO - \"Create\" task started using config file foo.yml 07-Nov-21 09:39:39 - INFO - \"nodes_only\" option in effect. No media will be created. 07-Nov-21 09:39:40 - INFO - Node for Small boats in Havana Harbour (record 01) created at http://localhost/node/33. 07-Nov-21 09:39:41 - INFO - Node for Manhatten Island (record 02) created at http://localhost/node/34. 07-Nov-21 09:39:42 - WARNING - Workbench exiting after receiving \"ctrl c.\" Consult the documentation to learn how to resume your batch. Resuming a cancelled job/batch The \"documentation\" you are referred to is this page! The log shows that the last row in the CSV that resulted in a node being created is the row for record 02; after that row was processed, the user issued a ctrl-c to stop Workbench. To process the remaining CSV records (\"resume your batch\"), you need to remove from the input CSV the rows that were processed (according to the example log above, the rows for record 01 and record 02), and run Workbench on the resulting unprocessed records: file,id,title IMG_2940.JPG,03,Looking across Burrard Inlet IMG_2958.JPG,04,Amsterdam waterfront IMG_5083.JPG,05,Alcatraz Island Note that cancelling Workbench simply stops it from executing. It doesn't use a transaction to ensure that all child objects or media that are being processed are also processed: If it stops while creating a media associated with a node, it might stop before the media is created. If it stops while creating a compound object such as a book, it might stop before the children/pages are being processed. If you cancel Workbench while it is running, you should always inspect the last object created and any of its media/children to ensure that they were all created. Use information in the log to see what was processed just prior to exiting. Note You can also issue a ctrl-c while running a --check . 
If you do so, Workbench just logs the action and exits.","title":"Cancelling Workbench"},{"location":"cancelling/#cancelling-workbench","text":"You can cancel/quit Islandora Workbench manually, before it completes running. You normally wouldn't do this, but if you do need to cancel/quit a Workbench job, press ctrl-c (the standard way to exit a running script) while Workbench is running. To illustrate what happens when you do this, let's use the following simple CSV input file, which we'll use to create some nodes: file,id,title IMG_1410.tif,01,Small boats in Havana Harbour IMG_2549.jp2,02,Manhatten Island IMG_2940.JPG,03,Looking across Burrard Inlet IMG_2958.JPG,04,Amsterdam waterfront IMG_5083.JPG,05,Alcatraz Island We run Workbench, and after two nodes have been created, we issue a ctrl-c to cancel: OK, connection to Drupal at http://localhost verified. Node for \"Small boats in Havana Harbour\" (record 01) created at http://localhost/node/33. Node for \"Manhatten Island\" (record 02) created at http://localhost/node/34. ^CExiting before entire CSV processed. See log for more info. workbench.log will contain the following entries: 07-Nov-21 09:39:30 - INFO - OK, connection to Drupal at http://localhost verified. 07-Nov-21 09:39:31 - INFO - Writing rollback CSV to input_data/rollback.csv 07-Nov-21 09:39:31 - INFO - \"Create\" task started using config file foo.yml 07-Nov-21 09:39:39 - INFO - \"nodes_only\" option in effect. No media will be created. 07-Nov-21 09:39:40 - INFO - Node for Small boats in Havana Harbour (record 01) created at http://localhost/node/33. 07-Nov-21 09:39:41 - INFO - Node for Manhatten Island (record 02) created at http://localhost/node/34. 07-Nov-21 09:39:42 - WARNING - Workbench exiting after receiving \"ctrl c.\" Consult the documentation to learn how to resume your batch.","title":"Cancelling Workbench"},{"location":"cancelling/#resuming-a-cancelled-jobbatch","text":"The \"documentation\" you are referred to is this page! The log shows that the last row in the CSV that resulted in a node being created is the row for record 02; after that row was processed, the user issued a ctrl-c to stop Workbench. To process the remaining CSV records (\"resume your batch\"), you need to remove from the input CSV the rows that were processed (according to the example log above, the rows for record 01 and record 02), and run Workbench on the resulting unprocessed records: file,id,title IMG_2940.JPG,03,Looking across Burrard Inlet IMG_2958.JPG,04,Amsterdam waterfront IMG_5083.JPG,05,Alcatraz Island Note that cancelling Workbench simply stops it from executing. It doesn't use a transaction to ensure that all child objects or media that are being processed are also processed: If it stops while creating a media associated with a node, it might stop before the media is created. If it stops while creating a compound object such as a book, it might stop before the children/pages are being processed. If you cancel Workbench while it is running, you should always inspect the last object created and any of its media/children to ensure that they were all created. Use information in the log to see what was processed just prior to exiting. Note You can also issue a ctrl-c while running a --check . If you do so, Workbench just logs the action and exits.","title":"Resuming a cancelled job/batch"},{"location":"changelog/","text":"main branch (no tag/release) December 3, 2024: Resolved (commit eb5a849) issue 856 ; resolved (commit 22e5d1e) issue 857 and issue 858 . 
December 1, 2024: Resolved (commit f50e9c2) issue 855 . November 26, 2024: Resolved (commit d1b6ea4) issue 845 . November 19, 2024: Resolved (commit 12a81dc) issue 844 , (commit 53801bb) issue 847 , (commit f86d56d) issue 850 , (commit 9b15fb4) issue 852 , and (commit 72ccb97) issue 792 . November 12, 2024: Resolved (commit 584ca09) issue 838 . November 11, 2024: Merged (c1f6cd4) PR 837 . Thanks ajstanley! Resolved (commit c8bc120) issue 847 . November 10, 2024: Resolved (commit 12a81dc) issue 844 . October 20, 2024: Resolved (commit 4358d41) issue 832 and issue 833 . October 15, 2024: Merged (c1f6cd4) PR 828 . Thanks @newzealandpaul! October 14, 2024: Resolved (c1f6cd4) issue 796 . Thanks g011um! September 23, 2024: Resolved (commit e855926) issue 825 and (commit 538b10a) issue 826 . September 17, 2024: Resolved (commit 4f3b5e0) issue 822 . September 3, 2024: Resolved (commit c787245) issue 807 . September 2, 2024: Resolved (commit 8fe842c) issue 816 and (commit d70e8b5) issue 820 . August 26, 2024: Resolved (commit fe614563) issue 819 . August 23, 2024: Merged (commit 2e0f7f5) PR 815 and resolved (commit 3c14eac) issue 817 . August 20, 2024: Resolved (commit 525c1c9) issue 812 . August 9, 2024: Resolved (commit 8117024) issue 806 . August 6, 2024: Resolved (commit ff901d8) issue 791 and (commit ef01052) issue 808 . July 9, 2024: Resolved (commit 7f5f814) issue 337 and issue 752 . July 7, 2024: Resolved issue 798 . July 3, 2024: (commit d32acdb): Relocated \" in console output version of the 'Node for \"[title]\"' log entry to match the console output. June 30, 2024: (commit ecd5d86): Resolved issue 795 . June 24, 2024: (commit 90c112c): Resolved issue 793 . June 20, 2024: (commit 5a049bb): Resolved issue 398 . June 7, 2024: (commit 986e8ba) Merged PR 788 ; resolved (commit 7101c2a) issue 787 . May 21, 2024: (commit be71dd5) Resolved issue 782 and (commit fff678e) issue 783 . May 20, 2024: (commit 3acddfa) Resolved issue 745 , (commit 3f7b966) issue 773 , (commit f667639) issue 766 , and (commit e268f97) issue 781 . May 15, 2024: (commit e7e55ca) Merged PR 778 . April 23, 2024: (commit c197de3) Added email_log_if_errors.py script. April 22, 2024: (commit edd8870) Resolved issue 771 . April 15, 2024: (commit 19ffa9c) Resolved issue 770 . April 14, 2024: (commit 0824988) Resolved issue 749 . APril 10, 2024: (commit 13f3618) Merged Seth Shaw's work to allow using term names in Entity Reference Views fields ( issue 642 ). April 9, 2024: (commit f751ad8) Resolved issue 767 and (commit 253f2d6) issue 768 . April 4, 2024: (commit 0824988) Resolved issue 765 . April 2, 2024: (commit 0777318) Resolved issue 763 . March 28, 2024: (commit 76736ba) Work on issue 762 . March 27, 2024: (commit 1cf0717) Resolved issue 756 and (commit 514b8f3) issue 755 . March 4, 2024: (commit 5332f33) Work on issue 747 . February 20, 2024: (commit 2b686d5) Resolved issue 743 . February 13, 2024: (commit dab400f) Resolved issue 740 . January 30, 2024: (commit 3681ae1) Added a new get_media_report_from_view task, arising from discussion at issue 727 . January 25, 2024: Resolved issue 735 . January 24, 2024 (commit eea1165): Resolved issue 639 . January 23, 2024 (commit 60285ad): Resolved issue 730 . January 17, 2024 (commit 197e55a): Resolved issue 734 . January 15, 2024 (commit 5d0b38c): Resolved issue 733 . January 14, 2024 (commit 5dfd657): Resolved issue 731 and issue 732 . January 10, 2024 (commit 7d9aa0): Resolved issue 606 and (commit ac46541) issue 728 . 
January 5, 2024 (commit c36cc5d): Resolved issue issue 723 . January 2, 2024 (commit 248560b): Resolved issue issue 726 . December 12, 2023 (commit 864be45): Merged @ajstanley's work on PR 722 . December 11, 2023 (commit f03f97b): Merged @joecorall's work on issue issue 719 . December 10, 2023 (commit 988c69d): Resolved issue issue 687 . December 1, 2023 (commit 749265b): Resolved issue 715 , issue 716 , issue 717 , issue 718 . October 30, 2023 (commit b38d37e): Resolved issue issue 702 and issue issue 705 . October 29, 2023 (commit 3c3f7a8): Resolved issue issue 703 . October 23, 2023 (commit f41fa85): Resolved issue issue 701 . September 24, 2023 (commit 7c66389): Resolved issue issue 690 . September 20, 2023 (commit e41ece7): Some minor coding style cleanup. September 19, 2023 (commit bd3e05e): Work on issue 692 . September 13, 2023: Merged in @seth-shaw-asu's work to resolve issue 694 ; merged @ysuarez's work on issue 445 ; WIP on issue 691 and issue 692 . August 28, 2023 (commit 575b7ba): Merged @hassanelsheikha's work to resolve issue 95 (not yet documented). August 24, 2023: Resolved issues issue 682 and issue 684 . August 18, 2023 (commit f33e8df): Work in progress on issue 487 . August 15, 2023 (commit f33e8df): Work in progress on issue 487 ; updated minimum version of requests-cache in setup.py as per issue 632 . August 14, 2023 (commit c989e39): Resolved issues issue 657 and (commit 866b6c2) issue 671 . August 10, 2023 (commit 1ab0172): Resolved issue issue 664 . August 3, 2023 (commit c3fe7e1): Resolved issues issue 613 and issue 647 . August 2, 2023 (commit 4e4f14f): Preliminary work on issue 663 . August 1, 2023 (commit 18ea969): Resolved issue 648 . July 31, 2023 (commit a45a869): Resolved issue 652 . Thanks @willtp87! July 28, 2023 (commit 63b3b83): Added @noahsmith's fix in commit f50ebf2 to all task functions, and accounted for enable_http_cache: false ; Resolved (commit fce9db7) issue 654 . July 26, 2023 (commit f50ebf2): Merged @noahsmith's fix for pruning the HTTP cache ( PR 651 , work on issue 632 ). Thanks Noah! July 20, 2023 (commit 8c1995e): Merged @aOelschlager's contribution (thanks!) of an update_terms task from PR 622 , plus some additional prerequisite cleanup needed for her code to work. July 14, 2023 (commit dfa60ff): Merged @noahsmith's introduction of \"soft checks\" as described in issue 620 (thanks!). July 13, 2023 (commit 2a589f2): @noahsmith found and fixed issue 640 . July 11, 2023 (commit 411cd2d): Merged initial work on Paragraphs support (thanks @seth-shaw-asu). --check functionality and documentation forthcoming. July 10, 2023 (commit 2373149): Changes to how the CSV ID to node ID map works; (commit 52a5db) clear sqlite cache file (work on issue 632 ). July 7, 2023 (commit eae85c5): Resolved issue 633 ; resolved (commit 7511828) issue 635 . July 5, 2023 (commit b2fd24c): Resolved issue 631 . July 4, 2023 (commit 4a93ef0a): Resolved issue 443 ; resolved (commit 1f6051b) issue 629 . June 30, 2023 (commit 59f3c69): clarified --check message to user when \"log_term_creation\" config setting is set to \"false\". June 29, 2023 (commit 7d44d1c): Merged PR 625 into main branch and added some accompanying defensive logic to --check . June 28, 2023 (commit 5f4f35c): Further work on issue 607 . June 12, 2023 (commit a6404ea): Resolved issue 615 . May 29, 2023 (commit ad6c954): Resolved issue 611 ; (commit 3ce7fba) resolved issue 607 . May 27, 2023 (commit 391ee07): Updated PR template. 
May 26, 2023 (commit 3dc81c6): Resolved issue 610 ; (commit fcdeb7b) Improved wording of error/log messages when vocabulary or content type doesn't exist. May 25, 2023 (commit c93a706): Resolved issue 608 . May 24, 2023 (commit 2cc3cb7): Resolved issue 609 . May 22, 2023 (commit 66b4cd6): Merged in contribution from @hassanelsheikha enabling Workbench to update media. May 19, 2023 (commit 8e0d662): Resolved bug portion of issue 605 . May 10, 2023 (commit 13ae4c2): Resolved issue 601 . May 5, 2023 (commit 23d2941): Resolved issue 597 , (commit ab8ee21) resolved issue 367 . April 26, 2023 (commit eeade7f): Resolved issue 593 . March 26, 2023 (commit fab2501): Resolved issue 590 . March 23, 2023 (commit 478e8bb): Resolved issue 588 . March 22, 2023 (commit 1ed5f91): Resolved issue 586 . March 20, 2023 (commit b342451): Resolved issue 574 . March 13, 2023 (commit 3eb9c19): Resolved issue 584 . March 10, 2023 (commit a39bd8f): Resolved issue 580 . March 7, 2023 (commit 591dac1): Resolved issue 405 ; (commit bd5ee60) resolved issue 579 . March 6, 2023 (commit 3c19cf6): Resolved issue 576 . March 5, 2023: Fixed URL to the \" Entity Reference Views fields \" docs; resolved issue 566 (commit 19b1c2e). March 2, 2023: Created drupal_8.5_and_lower tag. Users of Drupal 8.5 and earlier must use this version of Workbench. February 28, 2023 (commit 542325f): Resolved issue 569 . February 24, 2023: Added clean_csv_values_skip config setting (commit e659616e, issue 567 ). February 22, 2023: Resolved issue 563 ; Added csv_value_templates config setting (commit ae1fcd2b, issue 566 ). February 20, 2023 (commit 96cc6ef): Resolved issue 554 ; (commit a143bab): resolved issue 556 ). February 18, 2023 (commit ffa03de): Added csv_headers config option (issue 559 ). February 16, 2023 (commit 9a8828b): Removed sample config files from workbench directory (issue 552 ). Added new config option log_term_creation (commit 51348d0, issue 558 ). February 15, 2023 (commit 309c311): Added temp_dir config option (issue 551 ). February 14, 2023 (commit d200db6): Resolved issue (issue 553 ). February 11, 2023 (commit 869bd5b): Resolved issue (issue 547 ). Added rollback_dir config option (commit 1abad16, pull request 550 ). Updated PR template (commit a32e88f). February 5, 2023 (commit 65db118): Resolved issue (issue 538 ). January 31, 2023 (commit b452450): Resolved issue (issue 536 ). January 29, 2023 (commit cff6008): Added ability to generate a contact sheet (issue 515 ). January 26, 2023 (commit 6b0c16b): Added validation in --check of parent/child position in CSV file (issue 529 ); resolved issue 531 (commit 3150b4b). January 19, 2023 (commit b97b563): Fixed bug 522 and (commit 76d8c44) bug 523 ; changed log level from ERROR to WARNING when there are missing files and the allow_missing_files config option is set to true. January 18, 2023 (commit 727145f): Added validate_parent_node_exists config option (issue 521 ). January 17, 2023 (commit a4a5008): Added better trimming of trailing slash in the host config option (issue 519 ); (commit 1763fe6) fixed bug when \"field_member_of\" contained multiple values 520 . January 15, 2023 (commit ba149d6d): Added validation of extensions for files named in the CSV file column (issue 126 ); (commit 82dd02c) added validation of CSV values for \"List (text)\" type fields 509 . January 9, 2023 (commit a3931df): Added ability to create media track files (issue 373 ); fixed some integration tests. January 6, 2023 (commit f4e4c8d): Fixed issue 502 . 
December 31, 2022: Better cleanup when using remote files - @ajstanley's fix for issue 497 (commit a0412af), resolved issue 499 (commit b8f74c8). December 28, 2022 (commit e4e6e49): Fixed bug where running Workbench using a Google Sheet or Excel file as input without first running --check caused a \"file not found\" error (issue 496 ). Thanks to @ruebot for discovering this bug. December 11, 2022 (commit 24b70fd): Added ability to export files along with CSV data (issue 492 ). December 5, 2022 (commit 0dbd459): Fixed bug in file closing when running --check during \"get_data_from_view\" tasks on Windows (issue 490 ). November 28, 2022 (commit 46cfc34): Added quick delete option for nodes and media (issue 488 ). November 24, 2022 (commit 3fe5c28): Extracted text media now have their \"field_edited_text\" field automatically populated with the contents of the specified text file (issue 407 ). November 22, 2022 (commit 74a83cf): Added more detailed logging on node, media, and file creation (issue 480 ). November 22, 2022 (commit f2a8a65): Added @DonRichards Dockerfile (PR 233 ). November 16, 2022 (commit 07a74b2): Added new config options path_to_python and path_to_workbench_script (issue 483 ). November 9, 2022 (commit 7c3e072): Fixed misspelling of \"preprocessed\" in code and temporary filenames (issue 482 ). November 1, 2022 (commit 7c3e072): Workbench now exits when run without --check and there are no records in the input CSV (issue 481 ). September 19, 2022 (commit 51c0f79): Replaced exit_on_first_missing_file_during_check configuration option with strict_check (issue 470 ). exit_on_first_missing_file_during_check will be available until Nov. 1, 2022, at which time strict_check will be the only option allowed. September 18, 2022 (commit 00f50d6): Added ability to tell Workbench to only process a subset of CSV records (issue 468 ). September 1, 2022 (commit 6aad517): All hook scripts now log their exit codes (issue 464 ). August 16, 2022 (commit 4270d13): Fixed bug that would not delete media with no files (issue 460 ). August 13, 2022 (commit 1b7b801): Added ability to run shutdown scripts (issue 459 ). August 12, 2022 (commit b821533): Provided configuration option standalone_media_url: true for sites who have Drupal's \"Standalone media URL\" option enabled (issue 466 ). August 11, 2022 (commit df0a609): Fixed bug where items in secondary task CSV were created even if they didn't have a parent in the primary CSV, or if their parent was not created (issue 458 ). They are now skipped. July 28, 2022 (commit 3d1753a): Added option to prompt user for password (issue 449 ; fixed 'version' in setup.py). July 27, 2022 (commit 029cb6d): Shifted to using Drupal's default media URIs (issue 446 ). July 26, 2022 (commit 8dcf85a): Fixed setup.py on macOS/Homebrew (issue 448 ). July 26, 2022 (commit 09e9f53): Changed license in setup.py to \"MIT\". Documentation December 3, 2024: Update docs on \" Applying CSV value templates to rows in your input CSV \"; updated docs on \" Rolling back nodes and media .\" December 1, 2024: Updated docs on \" Rolling back nodes and media \" to include new settings added in issue 855 . Updated docs on \" Field data applied to pages/children \" to include link to \" Applying CSV value templates to paged content .\" November 26, 2024: Resolved issue 853 ; added docs on new paged_content_ignore_files config setting. November 19, 2024: Updated docs on \" Creating media track files \"; updated \" Troubleshooting \" to include the new --print_config argument. 
November 12, 2024: Updated docs on \" Creating taxonomy terms \" and \" Updating taxonomy terms \" to include use of the published CSV column. November 11, 2024: Updated docs on \" Configuring media types \".newzealandpaul November 10, 2024: Updated docs on \" Rolling back nodes and media \" to include new settings, and added a dedicated section for rollbacks to the \" Configuration \" page. November 1, 2024: Updated docs on \" Adding alt text to images \" and \" Known limitations \". October 20, 2024: Updated docs on \" Creating redirects \". October 14, 2024: Added docs on populating the \" field_domain_access \" CSV column. Thanks for the docs @dara2! September 23, 2024: Updated docs on \" Checking configuration and input data \" to include new config setting check_lock_file_path . September 2, 2024: Added \" Prompting the user \". August 25, 2024: Updated \" Creating paged, compound, and collection content \" to document the new page_files_source_dir_field config setting. August 20, 2024: Added \" Processing or ignoring rows based on field values \". August 12, 2024: Added \" Overriding Workbench's default file extension to MIME type mappings \"; updated \" CSV value templates ; updated \" Taxonomy reference fields \". August 6, 2024: Updated \" Ingesting OCR (and other) files with page images \". July 16, 2024: Added \" Cross-environment deployment / Continuous Integration \". July 14, 2024: Added docs on the new protected_vocabularies config setting to \" Taxonomy reference fields \" and removed some cruft; added docs on \" Encoding of text files \". July 9, 2024: Added \" Using numbers as term names \". July 7, 2024: Added \" Using a local or remote .zip archive as input data \". July 4, 2024: Added \" Ingesting pages, their parents, and their 'grandparents' using a single CSV file \". July 3, 2024: Added \" Sharing configuration files with other applications \". June 30, 2024: Added docs on using values other than node IDs in field_member_of . June 20, 2024: Started docs on \" Creating redirects \". June 7, 2024: Updated docs on \" Exporting Islandora 7 content \". May 31, 2024: Some edits to \" Using subdirectories \" section of the docs on creating paged/compound content. May 21, 2024: Minor edit to \" Updating taxonomy terms \". May 20, 2024: Minor edit to \" Updating nodes ; updated docs to indicate that media_type configuration setting is now required for add_media and update_media tasks. April 22, 2024: Updated \" Hooks \" to document scripts/generate_iiif_manifests.py (from issue 771) and add some clarifications. April 17, 2024: Updated \" Hooks \" to be explicit about what Workbench configuration settings are available within external scripts. April 16, 2024: Updated \" Field data (Drupal and CSV) \" to clarify warning about Entity Reference Views fields; merged Rosie's changes to the docs on using Paragraphs . April 15, 2024: Resolved issue 748 ; updated docs to include new log_file_name_and_line_number config setting. April 14, 2024: Added \" Checking if nodes already exist \". Added documentation to \" Field data (Drupal and CSV) \" on configuring Views to allow using term names in Entity Reference Views fields. April 12, 2024: Added \"metadata maintenance\" section to the \" Workflows \" docs using Rosie Le Faive's excellent demonstration of round tripping metadata. April 8, 2024: Updated \" Field data (Drupal and CSV) \" to add documentation on Entity Reference Revisions fields (paragraphs). 
April 5, 2024: Updated \" Ignoring CSV rows and columns \" to describe using the new csv_rows_to_process config setting. April 2, 2024: Updated \" Choosing a task \"; updated \" Adding media to nodes \" to describe using DGI's Image Discovery module . February 21, 2024: Updated \" Development guide \". February 20, 2024: Updated \" Updating media \" to indicate that media_type is now a required configuration setting in update_media tasks. January 30, 2024: Added new docs on \" Using a Drupal View to generate a media report as CSV \". Also updated these docs to be clearer on the difference between Contextual Filters and Filter Criteria. January 24, 2024: Added new docs on \" Ingesting OCR (and other) files with page images \" and updated the \" Configuration \" page with the newly introduced config settings. January 17, 2023: Updated the \" Updating media \" docs to mention the update_mode config setting. January 14, 2024: Added promote to the \" Base fields \" docs; updated \" Updating media \". January 2, 2024: Updated the docs on allow_missing_files and perform_soft_checks . December 1, 2023: Updated the \" Updating media \" docs. November 28, 2023: Addressed issue 713 ; merged in @rosiel's https://github.com/mjordan/islandora_workbench_docs/pull/12. November 2, 2023: Updated the \" Troubleshooting \" page to include how to narrow down errors involving SSL certificates, and some additonal minor changes. October 29, 2023: Updated the docs on \" Assigning URL aliases \". September 13, 2023: Updated the docs on \" CSV preprocessor scripts \". September 1, 2023: Updated the \" Development guide \" page. August 21, 2023: Updated the \" Troubleshooting \" page to include how to eliminate Python \"InsecureRequestWarning\"s. August 16, 2023: Merged in @ysuarez's spelling fixes (issue 674 ). August 14, 2023: Update published entry in \" Base fields \" to allow media types to set their default published values. August 4, 2023: Removed published as a standalone configuration setting, and updated its entry in \" Base fields \". August 3, 2023: Documented the config settings query_csv_id_to_node_id_map_for_parents , ignore_duplicate_parent_ids , field_for_media_title , use_nid_in_media_title , use_node_title_for_media_title , use_node_title_for_remote_filename , use_nid_in_remote_filename , and field_for_remote_filename . Updated \" Using the CSV ID to node ID map \". August 2, 2023: Added mention of, and a screenshot showing, the DB Browser for SQLite to \" Using the CSV ID to node ID map \". Thanks for the tip @ajstanley! July 21, 2023: Updated \"Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file\" entry in \" Troubleshooting \". July 20, 2023: Updated \" Checking configuration and input data \" to include the new perform_soft_checks config setting. July 19, 2023: Clarified that update tasks require the content_type setting in their config files if the target Drupal content type is not islandora_object . July 18, 2023: Updates to the published entry in the \" Base fields \" documentation; added entry for perform_soft_checks to \" Configuration \" docs (note: this new setting replaces strict_check ). July 10, 2023: Updates to \" Creating paged, compound, and collection content \" to reflect changes in the CSV ID to node ID map, specifically the new ignore_existing_parent_ids config setting. June 30, 2023: Corrected entry in \" Configuration docs \" for the strict_check setting. 
June 28, 2023: Updated \" Configuration docs \" and \" Using a secondary task \" to include new query_csv_id_to_node_id_map_for_parents configuration setting. Also added a note to the \"id\" reserved column entry in the \" Field data docs \" about importance of using unique ID values. June 26, 2023: Updated the \" Configuration docs \" to include the new HTTP cache settings introduced in issue 608. June 4, 2023: Updated the \" Requirements and installation \" and \" Checking configuration and input data \" docs with instructions on calling Python explicitly on Macs using Homebrew. June 1, 2023: Updated the \" With page/child-level metadata \" section to clarify use of parent_id as per issue 595 . May 30, 2023: Updated the \" Using the CSV ID to node ID map \" section. May 29, 2023: Added the \" Using the CSV ID to node ID map \" section and a few associated updates elsewhere. May 22, 2023: Added \" Updating media \" and a few associated updates elsewhere. May 11, 2023: Updated \" Post-action hooks .\" May 8, 2023: Added \" When Workbench skips invalid CSV data .\" May 7, 2023: Added note to \" Text fields \" that Workbench will truncate CSV values for fields configured in Drupal as \"text\" data type and that have a maximum allowed length. May 5, 2023: Add \" Text fields with markup .\" May 1, 2023: Updated \" Updating nodes .\" April 26, 2023: Updated \" Exporting Islandora 7 content \"; added docs for the new mimetype_extensions config option. April 14, 2023: Updated \" Troubleshooting .\" March 28, 2023: Added \" Choosing a task .\" March 23, 2023: Updated \" Configuration \" and \" Base fields .\" March 23, 2023: Updated \" Assigning URL aliases .\" March 13, 2023: Updated \" Exporting Islandora 7 content .\" March 7, 2023: Updated \" How Workbench cleans your input data \"; updated \" Checking configuration and input data \". March 6, 2023: Added an entry for the require_entity_reference_views config setting to the \" Drupal settings \"; minor corrections and updates to \" Workbench's relationship to Drupal and Islandora \". March 4, 2023: Updated the \" Exporting Islandora 7 content \" page. March 2, 2023: Added mention of drupal_8.5_and_lower tag to \" Requirements and installation . February 28, 2023: Removed references to the iteration-utilities Python library; add new page \" Workbench's relationship to Drupal and Islandora \". February 27, 2023: Replaced \"Islandora 8\" with \"Islandora 2\". February 24, 2023: Added \" How Workbench cleans your input data \". February 22, 2023: Updated \" Preparing your data \"; added \" CSV value templates \". February 20, 2023: Updated \" Rolling back nodes and media \". February 18, 2023: Updated \" Field data (Drupal and CSV) \" to include new csv_headers setting. February 16, 2023: Updated \" Configuration \" to include new log_term_creation setting. February 15, 2023: Updated \" Configuration \" to include new temp_dir setting. February 11, 2023: Updated \" Troubleshooting \" and \" Rolling back nodes and media .\" February 5, 2023: Updated the \" Using subdirectories \" method of creating compound/paged content to explain using the new page_title_template config option. January 31, 2023: Updated \" Generating a contact sheet \"; updated \" Configuring media types \". January 30, 2023: Edits to the \" Using subdirectories \" method of creating compound/paged content to clarify the absence of the \"file\" CSV column. January 29, 2023: Added \" Generating a contact sheet \". 
January 23, 2023: Added example CSVs for primary and secondary tasks in the \" Case study \" section of the Workflows documentation. January 22, 2023: Several clarifications and corrections, including @rosiel's correction of how to use allow_missing_files and additional_files together; added some examples of planning large compound/paged content ingests . January 16, 2023: Updated \" Exporting Islandora 7 content .\" January 15, 2023: Updated \" Checking configuration and input data .\" January 9, 2023: Added docs for creating \" Media track files .\" January 8, 2023: Updated \" Known limitations \" with a work around for unsupported \"Filter by an entity reference View\" fields; added examples of valid Windows paths to \" Values in the 'file' column .\" December 29, 2022: Minor corrections to \" Known limitations \", \" Workflows \", \" Creating paged, compound, and collection content ,\" and \" Preparing your data .\" December 28, 2022: Added cross reference between \" CSV field templates \" and \" Ignoring CSV rows and columns \". December 17, 2022: Corrected URI for http://pcdm.org/use#OriginalFile on \" Generating CSV files \" and \" Configuration .\" December 11, 2022: Updated \" Generating CSV files \" and \" Output CSV settings \" Configuration docs to include new ability to export files along with CSV data. Note: the data_from_view_file_path setting in \"get_data_from_view\" tasks has been replaced with export_csv_file_path . November 28, 2022: Added \" Quick delete \" docs; added clarification to \" Configuring Drupal's media URLs \" that standalone_media_url: true must be in all config files for tasks that interact with media; added note to \" Adding media to nodes \" and \" Values in the 'file' column \" clarifying that it is not possible to override the filesystem a media's file field is configured to use. November 26, 2022: Changed documentation theme from readthedocs to material; some edits for clarity to the docs for \"file\" field values ; some edits for clarity to the docs for \" adaptive pause .\" November 24, 2022: Added note to \" Adding media to nodes \" and \" Adding multiple media \" about extracted text media; added a note about using absolute file paths in scheduled jobs to the \" Workflows \" and \" Troubleshooting \"; removed the \"required\" \u2714\ufe0f from the password configuration setting entry in the table in \" Configuration \". November 17, 2022: Added new config options path_to_python and path_to_workbench_script to \" Configuration \" docs. October 28, 2022: Updated \" Configuration \" docs to provide details on YAML (configuration file) syntax. September 19, 2022: Updated references to exit_on_first_missing_file_during_check to use strict_check . Configuration settings entry advises exit_on_first_missing_file_during_check will be removed Nov. 1, 2022. September 18, 2022: Added entry \" Ignoring CSV rows and columns .\" September 15, 2022: Added entry to \" Limitations \" page about lack of support for HTML markup. Also added a section on \"Password management\" to \" Requirements and installation \". September 8, 2022: Added documentation on \" Reducing Workbench's impact on Drupal .\" August 30, 2022: Updated \" Hooks \" docs to clarify that the HTTP response code passed to post-entity-create scripts is a string, not an integer. August 18, 2022: Updated standalone_media_url entry in the \" Configuration \" docs, and added brief entry to the \" Troubleshooting \" page about clearing Drupal's cache. 
August 13, 2022: Updated \" Configuration \" and \" Hooks \" page to describe shutdown scripts. August 11, 2022: Added text to \" Creating paged, compound, and collection content \" page to clarify what happens when a row in the secondary CSV does not have a matching row in the primary CSV. August 8, 2022: Added entry to \" Limitations \" page about support for \"Filter by an entity reference View\" fields. August 3, 2022: Added entry to \" Troubleshooting \" page about missing Microsoft Visual C++ error when installing Workbench on Windows. August 3, 2022: Updated the \" Limitations \" page with entry about Paragraphs. August 2, 2022: Added note about ownership requirements on files to \" Deleting nodes \"; was previously only on \"Deleting media\". July 28, 2022: Updated password entry in the \" Configuration \" docs to mention the new password prompt feature.","title":"Change log"},{"location":"changelog/#main-branch-no-tagrelease","text":"December 3, 2024: Resolved (commit eb5a849) issue 856 ; resolved (commit 22e5d1e) issue 857 and issue 858 . December 1, 2024: Resolved (commit f50e9c2) issue 855 . November 26, 2024: Resolved (commit d1b6ea4) issue 845 . November 19, 2024: Resolved (commit 12a81dc) issue 844 , (commit 53801bb) issue 847 , (commit f86d56d) issue 850 , (commit 9b15fb4) issue 852 , and (commit 72ccb97) issue 792 . November 12, 2024: Resolved (commit 584ca09) issue 838 . November 11, 2024: Merged (c1f6cd4) PR 837 . Thanks ajstanley! Resolved (commit c8bc120) issue 847 . November 10, 2024: Resolved (commit 12a81dc) issue 844 . October 20, 2024: Resolved (commit 4358d41) issue 832 and issue 833 . October 15, 2024: Merged (c1f6cd4) PR 828 . Thanks @newzealandpaul! October 14, 2024: Resolved (c1f6cd4) issue 796 . Thanks g011um! September 23, 2024: Resolved (commit e855926) issue 825 and (commit 538b10a) issue 826 . September 17, 2024: Resolved (commit 4f3b5e0) issue 822 . September 3, 2024: Resolved (commit c787245) issue 807 . September 2, 2024: Resolved (commit 8fe842c) issue 816 and (commit d70e8b5) issue 820 . August 26, 2024: Resolved (commit fe614563) issue 819 . August 23, 2024: Merged (commit 2e0f7f5) PR 815 and resolved (commit 3c14eac) issue 817 . August 20, 2024: Resolved (commit 525c1c9) issue 812 . August 9, 2024: Resolved (commit 8117024) issue 806 . August 6, 2024: Resolved (commit ff901d8) issue 791 and (commit ef01052) issue 808 . July 9, 2024: Resolved (commit 7f5f814) issue 337 and issue 752 . July 7, 2024: Resolved issue 798 . July 3, 2024: (commit d32acdb): Relocated \" in console output version of the 'Node for \"[title]\"' log entry to match the console output. June 30, 2024: (commit ecd5d86): Resolved issue 795 . June 24, 2024: (commit 90c112c): Resolved issue 793 . June 20, 2024: (commit 5a049bb): Resolved issue 398 . June 7, 2024: (commit 986e8ba) Merged PR 788 ; resolved (commit 7101c2a) issue 787 . May 21, 2024: (commit be71dd5) Resolved issue 782 and (commit fff678e) issue 783 . May 20, 2024: (commit 3acddfa) Resolved issue 745 , (commit 3f7b966) issue 773 , (commit f667639) issue 766 , and (commit e268f97) issue 781 . May 15, 2024: (commit e7e55ca) Merged PR 778 . April 23, 2024: (commit c197de3) Added email_log_if_errors.py script. April 22, 2024: (commit edd8870) Resolved issue 771 . April 15, 2024: (commit 19ffa9c) Resolved issue 770 . April 14, 2024: (commit 0824988) Resolved issue 749 . APril 10, 2024: (commit 13f3618) Merged Seth Shaw's work to allow using term names in Entity Reference Views fields ( issue 642 ). 
April 9, 2024: (commit f751ad8) Resolved issue 767 and (commit 253f2d6) issue 768 . April 4, 2024: (commit 0824988) Resolved issue 765 . April 2, 2024: (commit 0777318) Resolved issue 763 . March 28, 2024: (commit 76736ba) Work on issue 762 . March 27, 2024: (commit 1cf0717) Resolved issue 756 and (commit 514b8f3) issue 755 . March 4, 2024: (commit 5332f33) Work on issue 747 . February 20, 2024: (commit 2b686d5) Resolved issue 743 . February 13, 2024: (commit dab400f) Resolved issue 740 . January 30, 2024: (commit 3681ae1) Added a new get_media_report_from_view task, arising from discussion at issue 727 . January 25, 2024: Resolved issue 735 . January 24, 2024 (commit eea1165): Resolved issue 639 . January 23, 2024 (commit 60285ad): Resolved issue 730 . January 17, 2024 (commit 197e55a): Resolved issue 734 . January 15, 2024 (commit 5d0b38c): Resolved issue 733 . January 14, 2024 (commit 5dfd657): Resolved issue 731 and issue 732 . January 10, 2024 (commit 7d9aa0): Resolved issue 606 and (commit ac46541) issue 728 . January 5, 2024 (commit c36cc5d): Resolved issue issue 723 . January 2, 2024 (commit 248560b): Resolved issue issue 726 . December 12, 2023 (commit 864be45): Merged @ajstanley's work on PR 722 . December 11, 2023 (commit f03f97b): Merged @joecorall's work on issue issue 719 . December 10, 2023 (commit 988c69d): Resolved issue issue 687 . December 1, 2023 (commit 749265b): Resolved issue 715 , issue 716 , issue 717 , issue 718 . October 30, 2023 (commit b38d37e): Resolved issue issue 702 and issue issue 705 . October 29, 2023 (commit 3c3f7a8): Resolved issue issue 703 . October 23, 2023 (commit f41fa85): Resolved issue issue 701 . September 24, 2023 (commit 7c66389): Resolved issue issue 690 . September 20, 2023 (commit e41ece7): Some minor coding style cleanup. September 19, 2023 (commit bd3e05e): Work on issue 692 . September 13, 2023: Merged in @seth-shaw-asu's work to resolve issue 694 ; merged @ysuarez's work on issue 445 ; WIP on issue 691 and issue 692 . August 28, 2023 (commit 575b7ba): Merged @hassanelsheikha's work to resolve issue 95 (not yet documented). August 24, 2023: Resolved issues issue 682 and issue 684 . August 18, 2023 (commit f33e8df): Work in progress on issue 487 . August 15, 2023 (commit f33e8df): Work in progress on issue 487 ; updated minimum version of requests-cache in setup.py as per issue 632 . August 14, 2023 (commit c989e39): Resolved issues issue 657 and (commit 866b6c2) issue 671 . August 10, 2023 (commit 1ab0172): Resolved issue issue 664 . August 3, 2023 (commit c3fe7e1): Resolved issues issue 613 and issue 647 . August 2, 2023 (commit 4e4f14f): Preliminary work on issue 663 . August 1, 2023 (commit 18ea969): Resolved issue 648 . July 31, 2023 (commit a45a869): Resolved issue 652 . Thanks @willtp87! July 28, 2023 (commit 63b3b83): Added @noahsmith's fix in commit f50ebf2 to all task functions, and accounted for enable_http_cache: false ; Resolved (commit fce9db7) issue 654 . July 26, 2023 (commit f50ebf2): Merged @noahsmith's fix for pruning the HTTP cache ( PR 651 , work on issue 632 ). Thanks Noah! July 20, 2023 (commit 8c1995e): Merged @aOelschlager's contribution (thanks!) of an update_terms task from PR 622 , plus some additional prerequisite cleanup needed for her code to work. July 14, 2023 (commit dfa60ff): Merged @noahsmith's introduction of \"soft checks\" as described in issue 620 (thanks!). July 13, 2023 (commit 2a589f2): @noahsmith found and fixed issue 640 . 
July 11, 2023 (commit 411cd2d): Merged initial work on Paragraphs support (thanks @seth-shaw-asu). --check functionality and documentation forthcoming. July 10, 2023 (commit 2373149): Changes to how the CSV ID to node ID map works; (commit 52a5db) clear sqlite cache file (work on issue 632 ). July 7, 2023 (commit eae85c5): Resolved issue 633 ; resolved (commit 7511828) issue 635 . July 5, 2023 (commit b2fd24c): Resolved issue 631 . July 4, 2023 (commit 4a93ef0a): Resolved issue 443 ; resolved (commit 1f6051b) issue 629 . June 30, 2023 (commit 59f3c69): clarified --check message to user when \"log_term_creation\" config setting is set to \"false\". June 29, 2023 (commit 7d44d1c): Merged PR 625 into main branch and added some accompanying defensive logic to --check . June 28, 2023 (commit 5f4f35c): Further work on issue 607 . June 12, 2023 (commit a6404ea): Resolved issue 615 . May 29, 2023 (commit ad6c954): Resolved issue 611 ; (commit 3ce7fba) resolved issue 607 . May 27, 2023 (commit 391ee07): Updated PR template. May 26, 2023 (commit 3dc81c6): Resolved issue 610 ; (commit fcdeb7b) Improved wording of error/log messages when vocabulary or content type doesn't exist. May 25, 2023 (commit c93a706): Resolved issue 608 . May 24, 2023 (commit 2cc3cb7): Resolved issue 609 . May 22, 2023 (commit 66b4cd6): Merged in contribution from @hassanelsheikha enabling Workbench to update media. May 19, 2023 (commit 8e0d662): Resolved bug portion of issue 605 . May 10, 2023 (commit 13ae4c2): Resolved issue 601 . May 5, 2023 (commit 23d2941): Resolved issue 597 , (commit ab8ee21) resolved issue 367 . April 26, 2023 (commit eeade7f): Resolved issue 593 . March 26, 2023 (commit fab2501): Resolved issue 590 . March 23, 2023 (commit 478e8bb): Resolved issue 588 . March 22, 2023 (commit 1ed5f91): Resolved issue 586 . March 20, 2023 (commit b342451): Resolved issue 574 . March 13, 2023 (commit 3eb9c19): Resolved issue 584 . March 10, 2023 (commit a39bd8f): Resolved issue 580 . March 7, 2023 (commit 591dac1): Resolved issue 405 ; (commit bd5ee60) resolved issue 579 . March 6, 2023 (commit 3c19cf6): Resolved issue 576 . March 5, 2023: Fixed URL to the \" Entity Reference Views fields \" docs; resolved issue 566 (commit 19b1c2e). March 2, 2023: Created drupal_8.5_and_lower tag. Users of Drupal 8.5 and earlier must use this version of Workbench. February 28, 2023 (commit 542325f): Resolved issue 569 . February 24, 2023: Added clean_csv_values_skip config setting (commit e659616e, issue 567 ). February 22, 2023: Resolved issue 563 ; Added csv_value_templates config setting (commit ae1fcd2b, issue 566 ). February 20, 2023 (commit 96cc6ef): Resolved issue 554 ; (commit a143bab): resolved issue 556 ). February 18, 2023 (commit ffa03de): Added csv_headers config option (issue 559 ). February 16, 2023 (commit 9a8828b): Removed sample config files from workbench directory (issue 552 ). Added new config option log_term_creation (commit 51348d0, issue 558 ). February 15, 2023 (commit 309c311): Added temp_dir config option (issue 551 ). February 14, 2023 (commit d200db6): Resolved issue (issue 553 ). February 11, 2023 (commit 869bd5b): Resolved issue (issue 547 ). Added rollback_dir config option (commit 1abad16, pull request 550 ). Updated PR template (commit a32e88f). February 5, 2023 (commit 65db118): Resolved issue (issue 538 ). January 31, 2023 (commit b452450): Resolved issue (issue 536 ). January 29, 2023 (commit cff6008): Added ability to generate a contact sheet (issue 515 ). 
January 26, 2023 (commit 6b0c16b): Added validation in --check of parent/child position in CSV file (issue 529 ); resolved issue 531 (commit 3150b4b). January 19, 2023 (commit b97b563): Fixed bug 522 and (commit 76d8c44) bug 523 ; changed log level from ERROR to WARNING when there are missing files and the allow_missing_files config option is set to true. January 18, 2023 (commit 727145f): Added validate_parent_node_exists config option (issue 521 ). January 17, 2023 (commit a4a5008): Added better trimming of trailing slash in the host config option (issue 519 ); (commit 1763fe6) fixed bug when \"field_member_of\" contained multiple values 520 . January 15, 2023 (commit ba149d6d): Added validation of extensions for files named in the CSV file column (issue 126 ); (commit 82dd02c) added validation of CSV values for \"List (text)\" type fields 509 . January 9, 2023 (commit a3931df): Added ability to create media track files (issue 373 ); fixed some integration tests. January 6, 2023 (commit f4e4c8d): Fixed issue 502 . December 31, 2022: Better cleanup when using remote files - @ajstanley's fix for issue 497 (commit a0412af), resolved issue 499 (commit b8f74c8). December 28, 2022 (commit e4e6e49): Fixed bug where running Workbench using a Google Sheet or Excel file as input without first running --check caused a \"file not found\" error (issue 496 ). Thanks to @ruebot for discovering this bug. December 11, 2022 (commit 24b70fd): Added ability to export files along with CSV data (issue 492 ). December 5, 2022 (commit 0dbd459): Fixed bug in file closing when running --check during \"get_data_from_view\" tasks on Windows (issue 490 ). November 28, 2022 (commit 46cfc34): Added quick delete option for nodes and media (issue 488 ). November 24, 2022 (commit 3fe5c28): Extracted text media now have their \"field_edited_text\" field automatically populated with the contents of the specified text file (issue 407 ). November 22, 2022 (commit 74a83cf): Added more detailed logging on node, media, and file creation (issue 480 ). November 22, 2022 (commit f2a8a65): Added @DonRichards Dockerfile (PR 233 ). November 16, 2022 (commit 07a74b2): Added new config options path_to_python and path_to_workbench_script (issue 483 ). November 9, 2022 (commit 7c3e072): Fixed misspelling of \"preprocessed\" in code and temporary filenames (issue 482 ). November 1, 2022 (commit 7c3e072): Workbench now exits when run without --check and there are no records in the input CSV (issue 481 ). September 19, 2022 (commit 51c0f79): Replaced exit_on_first_missing_file_during_check configuration option with strict_check (issue 470 ). exit_on_first_missing_file_during_check will be available until Nov. 1, 2022, at which time strict_check will be the only option allowed. September 18, 2022 (commit 00f50d6): Added ability to tell Workbench to only process a subset of CSV records (issue 468 ). September 1, 2022 (commit 6aad517): All hook scripts now log their exit codes (issue 464 ). August 16, 2022 (commit 4270d13): Fixed bug that would not delete media with no files (issue 460 ). August 13, 2022 (commit 1b7b801): Added ability to run shutdown scripts (issue 459 ). August 12, 2022 (commit b821533): Provided configuration option standalone_media_url: true for sites who have Drupal's \"Standalone media URL\" option enabled (issue 466 ). August 11, 2022 (commit df0a609): Fixed bug where items in secondary task CSV were created even if they didn't have a parent in the primary CSV, or if their parent was not created (issue 458 ). 
They are now skipped. July 28, 2022 (commit 3d1753a): Added option to prompt user for password (issue 449 ; fixed 'version' in setup.py). July 27, 2022 (commit 029cb6d): Shifted to using Drupal's default media URIs (issue 446 ). July 26, 2022 (commit 8dcf85a): Fixed setup.py on macOS/Homebrew (issue 448 ). July 26, 2022 (commit 09e9f53): Changed license in setup.py to \"MIT\".","title":"main branch (no tag/release)"},{"location":"changelog/#documentation","text":"December 3, 2024: Updated docs on \" Applying CSV value templates to rows in your input CSV \"; updated docs on \" Rolling back nodes and media .\" December 1, 2024: Updated docs on \" Rolling back nodes and media \" to include new settings added in issue 855 . Updated docs on \" Field data applied to pages/children \" to include link to \" Applying CSV value templates to paged content .\" November 26, 2024: Resolved issue 853 ; added docs on new paged_content_ignore_files config setting. November 19, 2024: Updated docs on \" Creating media track files \"; updated \" Troubleshooting \" to include the new --print_config argument. November 12, 2024: Updated docs on \" Creating taxonomy terms \" and \" Updating taxonomy terms \" to include use of the published CSV column. November 11, 2024: Updated docs on \" Configuring media types \". November 10, 2024: Updated docs on \" Rolling back nodes and media \" to include new settings, and added a dedicated section for rollbacks to the \" Configuration \" page. November 1, 2024: Updated docs on \" Adding alt text to images \" and \" Known limitations \". October 20, 2024: Updated docs on \" Creating redirects \". October 14, 2024: Added docs on populating the \" field_domain_access \" CSV column. Thanks for the docs @dara2! September 23, 2024: Updated docs on \" Checking configuration and input data \" to include new config setting check_lock_file_path . September 2, 2024: Added \" Prompting the user \". August 25, 2024: Updated \" Creating paged, compound, and collection content \" to document the new page_files_source_dir_field config setting. August 20, 2024: Added \" Processing or ignoring rows based on field values \". August 12, 2024: Added \" Overriding Workbench's default file extension to MIME type mappings \"; updated \" CSV value templates \"; updated \" Taxonomy reference fields \". August 6, 2024: Updated \" Ingesting OCR (and other) files with page images \". July 16, 2024: Added \" Cross-environment deployment / Continuous Integration \". July 14, 2024: Added docs on the new protected_vocabularies config setting to \" Taxonomy reference fields \" and removed some cruft; added docs on \" Encoding of text files \". July 9, 2024: Added \" Using numbers as term names \". July 7, 2024: Added \" Using a local or remote .zip archive as input data \". July 4, 2024: Added \" Ingesting pages, their parents, and their 'grandparents' using a single CSV file \". July 3, 2024: Added \" Sharing configuration files with other applications \". June 30, 2024: Added docs on using values other than node IDs in field_member_of . June 20, 2024: Started docs on \" Creating redirects \". June 7, 2024: Updated docs on \" Exporting Islandora 7 content \". May 31, 2024: Some edits to \" Using subdirectories \" section of the docs on creating paged/compound content. May 21, 2024: Minor edit to \" Updating taxonomy terms \". May 20, 2024: Minor edit to \" Updating nodes \"; updated docs to indicate that media_type configuration setting is now required for add_media and update_media tasks. 
April 22, 2024: Updated \" Hooks \" to document scripts/generate_iiif_manifests.py (from issue 771) and add some clarifications. April 17, 2024: Updated \" Hooks \" to be explicit about what Workbench configuration settings are available within external scripts. April 16, 2024: Updated \" Field data (Drupal and CSV) \" to clarify warning about Entity Reference Views fields; merged Rosie's changes to the docs on using Paragraphs . April 15, 2024: Resolved issue 748 ; updated docs to include new log_file_name_and_line_number config setting. April 14, 2024: Added \" Checking if nodes already exist \". Added documentation to \" Field data (Drupal and CSV) \" on configuring Views to allow using term names in Entity Reference Views fields. April 12, 2024: Added \"metadata maintenance\" section to the \" Workflows \" docs using Rosie Le Faive's excellent demonstration of round tripping metadata. April 8, 2024: Updated \" Field data (Drupal and CSV) \" to add documentation on Entity Reference Revisions fields (paragraphs). April 5, 2024: Updated \" Ignoring CSV rows and columns \" to describe using the new csv_rows_to_process config setting. April 2, 2024: Updated \" Choosing a task \"; updated \" Adding media to nodes \" to describe using DGI's Image Discovery module . February 21, 2024: Updated \" Development guide \". February 20, 2024: Updated \" Updating media \" to indicate that media_type is now a required configuration setting in update_media tasks. January 30, 2024: Added new docs on \" Using a Drupal View to generate a media report as CSV \". Also updated these docs to be clearer on the difference between Contextual Filters and Filter Criteria. January 24, 2024: Added new docs on \" Ingesting OCR (and other) files with page images \" and updated the \" Configuration \" page with the newly introduced config settings. January 17, 2023: Updated the \" Updating media \" docs to mention the update_mode config setting. January 14, 2024: Added promote to the \" Base fields \" docs; updated \" Updating media \". January 2, 2024: Updated the docs on allow_missing_files and perform_soft_checks . December 1, 2023: Updated the \" Updating media \" docs. November 28, 2023: Addressed issue 713 ; merged in @rosiel's https://github.com/mjordan/islandora_workbench_docs/pull/12. November 2, 2023: Updated the \" Troubleshooting \" page to include how to narrow down errors involving SSL certificates, and some additonal minor changes. October 29, 2023: Updated the docs on \" Assigning URL aliases \". September 13, 2023: Updated the docs on \" CSV preprocessor scripts \". September 1, 2023: Updated the \" Development guide \" page. August 21, 2023: Updated the \" Troubleshooting \" page to include how to eliminate Python \"InsecureRequestWarning\"s. August 16, 2023: Merged in @ysuarez's spelling fixes (issue 674 ). August 14, 2023: Update published entry in \" Base fields \" to allow media types to set their default published values. August 4, 2023: Removed published as a standalone configuration setting, and updated its entry in \" Base fields \". August 3, 2023: Documented the config settings query_csv_id_to_node_id_map_for_parents , ignore_duplicate_parent_ids , field_for_media_title , use_nid_in_media_title , use_node_title_for_media_title , use_node_title_for_remote_filename , use_nid_in_remote_filename , and field_for_remote_filename . Updated \" Using the CSV ID to node ID map \". 
August 2, 2023: Added mention of, and a screenshot showing, the DB Browser for SQLite to \" Using the CSV ID to node ID map \". Thanks for the tip @ajstanley! July 21, 2023: Updated \"Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file\" entry in \" Troubleshooting \". July 20, 2023: Updated \" Checking configuration and input data \" to include the new perform_soft_checks config setting. July 19, 2023: Clarified that update tasks require the content_type setting in their config files if the target Drupal content type is not islandora_object . July 18, 2023: Updates to the published entry in the \" Base fields \" documentation; added entry for perform_soft_checks to \" Configuration \" docs (note: this new setting replaces strict_check ). July 10, 2023: Updates to \" Creating paged, compound, and collection content \" to reflect changes in the CSV ID to node ID map, specifically the new ignore_existing_parent_ids config setting. June 30, 2023: Corrected entry in \" Configuration docs \" for the strict_check setting. June 28, 2023: Updated \" Configuration docs \" and \" Using a secondary task \" to include new query_csv_id_to_node_id_map_for_parents configuration setting. Also added a note to the \"id\" reserved column entry in the \" Field data docs \" about importance of using unique ID values. June 26, 2023: Updated the \" Configuration docs \" to include the new HTTP cache settings introduced in issue 608. June 4, 2023: Updated the \" Requirements and installation \" and \" Checking configuration and input data \" docs with instructions on calling Python explicitly on Macs using Homebrew. June 1, 2023: Updated the \" With page/child-level metadata \" section to clarify use of parent_id as per issue 595 . May 30, 2023: Updated the \" Using the CSV ID to node ID map \" section. May 29, 2023: Added the \" Using the CSV ID to node ID map \" section and a few associated updates elsewhere. May 22, 2023: Added \" Updating media \" and a few associated updates elsewhere. May 11, 2023: Updated \" Post-action hooks .\" May 8, 2023: Added \" When Workbench skips invalid CSV data .\" May 7, 2023: Added note to \" Text fields \" that Workbench will truncate CSV values for fields configured in Drupal as \"text\" data type and that have a maximum allowed length. May 5, 2023: Add \" Text fields with markup .\" May 1, 2023: Updated \" Updating nodes .\" April 26, 2023: Updated \" Exporting Islandora 7 content \"; added docs for the new mimetype_extensions config option. April 14, 2023: Updated \" Troubleshooting .\" March 28, 2023: Added \" Choosing a task .\" March 23, 2023: Updated \" Configuration \" and \" Base fields .\" March 23, 2023: Updated \" Assigning URL aliases .\" March 13, 2023: Updated \" Exporting Islandora 7 content .\" March 7, 2023: Updated \" How Workbench cleans your input data \"; updated \" Checking configuration and input data \". March 6, 2023: Added an entry for the require_entity_reference_views config setting to the \" Drupal settings \"; minor corrections and updates to \" Workbench's relationship to Drupal and Islandora \". March 4, 2023: Updated the \" Exporting Islandora 7 content \" page. March 2, 2023: Added mention of drupal_8.5_and_lower tag to \" Requirements and installation . February 28, 2023: Removed references to the iteration-utilities Python library; add new page \" Workbench's relationship to Drupal and Islandora \". February 27, 2023: Replaced \"Islandora 8\" with \"Islandora 2\". 
February 24, 2023: Added \" How Workbench cleans your input data \". February 22, 2023: Updated \" Preparing your data \"; added \" CSV value templates \". February 20, 2023: Updated \" Rolling back nodes and media \". February 18, 2023: Updated \" Field data (Drupal and CSV) \" to include new csv_headers setting. February 16, 2023: Updated \" Configuration \" to include new log_term_creation setting. February 15, 2023: Updated \" Configuration \" to include new temp_dir setting. February 11, 2023: Updated \" Troubleshooting \" and \" Rolling back nodes and media .\" February 5, 2023: Updated the \" Using subdirectories \" method of creating compound/paged content to explain using the new page_title_template config option. January 31, 2023: Updated \" Generating a contact sheet \"; updated \" Configuring media types \". January 30, 2023: Edits to the \" Using subdirectories \" method of creating compound/paged content to clarify the absence of the \"file\" CSV column. January 29, 2023: Added \" Generating a contact sheet \". January 23, 2023: Added example CSVs for primary and secondary tasks in the \" Case study \" section of the Workflows documentation. January 22, 2023: Several clarifications and corrections, including @rosiel's correction of how to use allow_missing_files and additional_files together; added some examples of planning large compound/paged content ingests . January 16, 2023: Updated \" Exporting Islandora 7 content .\" January 15, 2023: Updated \" Checking configuration and input data .\" January 9, 2023: Added docs for creating \" Media track files .\" January 8, 2023: Updated \" Known limitations \" with a work around for unsupported \"Filter by an entity reference View\" fields; added examples of valid Windows paths to \" Values in the 'file' column .\" December 29, 2022: Minor corrections to \" Known limitations \", \" Workflows \", \" Creating paged, compound, and collection content ,\" and \" Preparing your data .\" December 28, 2022: Added cross reference between \" CSV field templates \" and \" Ignoring CSV rows and columns \". December 17, 2022: Corrected URI for http://pcdm.org/use#OriginalFile on \" Generating CSV files \" and \" Configuration .\" December 11, 2022: Updated \" Generating CSV files \" and \" Output CSV settings \" Configuration docs to include new ability to export files along with CSV data. Note: the data_from_view_file_path setting in \"get_data_from_view\" tasks has been replaced with export_csv_file_path . November 28, 2022: Added \" Quick delete \" docs; added clarification to \" Configuring Drupal's media URLs \" that standalone_media_url: true must be in all config files for tasks that interact with media; added note to \" Adding media to nodes \" and \" Values in the 'file' column \" clarifying that it is not possible to override the filesystem a media's file field is configured to use. November 26, 2022: Changed documentation theme from readthedocs to material; some edits for clarity to the docs for \"file\" field values ; some edits for clarity to the docs for \" adaptive pause .\" November 24, 2022: Added note to \" Adding media to nodes \" and \" Adding multiple media \" about extracted text media; added a note about using absolute file paths in scheduled jobs to the \" Workflows \" and \" Troubleshooting \"; removed the \"required\" \u2714\ufe0f from the password configuration setting entry in the table in \" Configuration \". 
November 17, 2022: Added new config options path_to_python and path_to_workbench_script to \" Configuration \" docs. October 28, 2022: Updated \" Configuration \" docs to provide details on YAML (configuration file) syntax. September 19, 2022: Updated references to exit_on_first_missing_file_during_check to use strict_check . Configuration settings entry advises exit_on_first_missing_file_during_check will be removed Nov. 1, 2022. September 18, 2022: Added entry \" Ignoring CSV rows and columns .\" September 15, 2022: Added entry to \" Limitations \" page about lack of support for HTML markup. Also added a section on \"Password management\" to \" Requirements and installation \". September 8, 2022: Added documentation on \" Reducing Workbench's impact on Drupal .\" August 30, 2022: Updated \" Hooks \" docs to clarify that the HTTP response code passed to post-entity-create scripts is a string, not an integer. August 18, 2022: Updated standalone_media_url entry in the \" Configuration \" docs, and added brief entry to the \" Troubleshooting \" page about clearing Drupal's cache. August 13, 2022: Updated \" Configuration \" and \" Hooks \" page to describe shutdown scripts. August 11, 2022: Added text to \" Creating paged, compound, and collection content \" page to clarify what happens when a row in the secondary CSV does not have a matching row in the primary CSV. August 8, 2022: Added entry to \" Limitations \" page about support for \"Filter by an entity reference View\" fields. August 3, 2022: Added entry to \" Troubleshooting \" page about missing Microsoft Visual C++ error when installing Workbench on Windows. August 3, 2022: Updated the \" Limitations \" page with entry about Paragraphs. August 2, 2022: Added note about ownership requirements on files to \" Deleting nodes \"; was previously only on \"Deleting media\". July 28, 2022: Updated password entry in the \" Configuration \" docs to mention the new password prompt feature.","title":"Documentation"},{"location":"check/","text":"Overview You should always check your configuration and input prior to creating, updating, or deleting content. You can do this by running Workbench with the --check option, e.g.: ./workbench --config config.yml --check Note If you're on Windows, you will likely need to run Workbench by explicitly invoking Python, e.g. python workbench --config config.yml --check instead of using ./workbench as illustrated above. Similarly, if you are on a Mac that has the Homebrew version of Python, you may need to run Workbench by providing the full path to the Homebrew Python interpreter, e.g., /opt/homebrew/bin/python3 workbench --config config.yml --check . If you do this, Workbench will check the following conditions and report any errors that require your attention before proceeding: Configuration file Whether your configuration file is valid YAML (i.e., no YAML syntax errors). Whether your configuration file contains all required values. Connection to Drupal Whether your Drupal has the required Workbench Integration module enabled, and that the module is up to date. Whether the host you provided will accept the username and password you provided. Input directory Whether the directory named in the input_dir configuration setting exists. Rollback files Whether the rollback config file and CSV file can be written. CSV file Whether the CSV file is encoded in either ASCII or UTF-8. Whether each row contains the same number of columns as there are column headers. Whether there are any duplicate column headers. 
Whether your CSV file contains required column headers, including the field defined as the unique ID for each record (defaults to \"id\" if the id_field key is not in your config file) Whether your CSV column headers correspond to existing Drupal field labels or machine names. Whether all Drupal fields that are configured to be required are present in the CSV file. Whether required fields in your CSV contain values (i.e., they are not blank). Whether the columns required to create paged content are present (see \"Creating paged content\" below). If creating compound/paged content using the \"With page/child-level metadata\" method, --check will tell you if any child item rows in your CSV precede their parent rows. If your config file includes csv_headers: labels , --check will tell you if it detects any duplicate field labels. Media files Whether the files named in the CSV file are present, or in the case of remote files, are accessible (but this check is skipped if allow_missing_files: true is present in your config file for \"create\" tasks). If nodes_only is true, this check is skipped. Whether files in the file CSV column have extensions that are registered with the media's file field in Drupal. Note that validation of file extensions does not yet apply to files named using the additional_files configuration or for remote files (see this issue for more info). Whether the media types configured for specific file extensions are configured on the target Drupal. Islandora Workbench will default to the 'file' media type if it can't find another more specific media type for a file, so the most likely cause for this check to fail is that the assigned media type does not exist on the target Drupal. If creating media track files , --check will tell you if your media_use_tid value (either in the media_use_tid configuration setting or in row-level values in your CSV) does not include the \"Service File\" taxonomy term. Field values Base fields If the langcode field is present in your CSV, whether values in it are valid Drupal language codes. Whether your CSV file contains a title field ( create task only) Whether values in the title field exceed Drupal's maximum length for titles of 255 characters, or whatever the value of the max_node_title_length configuration setting is. If the created field is present in your CSV file, whether the values in it are formatted correctly (like \"2020-11-15T23:49:22+00:00\") and whether the date is in the past (both of which are Drupal requirements). If the uid field is present in your CSV file, whether the user IDs in that field exist in the target Drupal. Note that this check does not inspect permissions or roles, only that the user ID exists. Whether aliases in the url_alias field in your CSV already exist, and whether they start with a leading slash ( / ). Taxonomy Whether term ID and term URIs used in CSV fields correspond to existing terms. Whether the length of new terms exceeds 255 characters, which is the maximum length for a term name. Whether the term ID (or term URI) provided for media_use_tid is a member of the \"Islandora Media Use\" vocabulary. Whether term names in your CSV require a vocabulary namespace. Typed Relation fields Whether values used in typed relation fields are in the required format Whether values need to be namespaced Whether the term IDs/term names/term URIs used in the values exist in the vocabularies configured for the field. 
\"List\" text fields Whether values in CSV fields of this Drupal field type are in the field's configured \"Allowed values list\". If using the pages from directories configuration ( paged_content_from_directories: true ): Whether page filenames contain an occurrence of the sequence separator. Whether any page directories are empty. Whether the content type identified in the content_type configuration option exists. Whether multivalued fields exceed their allowed number of values. Whether values in text-type fields exceed their configured maximum length. Whether the nodes referenced in field_member_of (if that field is present in the CSV) exist. Whether values used in geolocation fields are valid lat,long coordinates. Whether values used in EDTF fields are valid EDTF date/time values (subset of date/time values only; see documentation for more detail). Also validates whether dates are valid Gregorian calendar dates. Hook scripts Whether registered bootstrap, preprocessor, and post-action scripts exist and are executable. If Workbench detects a configuration or input data violation, it will either stop and tell you why it stopped, or (if the violation will not cause Workbench's interaction with Drupal to fail), tell you that it found an anomaly and to check the log file for more detail. Note Adding perform_soft_checks: true to you configuration file will tell --check to not stop when it encounters an error with 1) parent/child order in your input CSV, 2) file extensions in your CSV that are not registered with the Drupal configuration of their media type's file fields, or 3) invalid EDTF dates. Workbench will continue to log all of these errors, but will exit after it has checked every row in your CSV file. A successful outcome of running --check confirms that all of the conditions listed above are in place, but it does not guarantee a successful job. There are a lot of factors in play during ingest/update/delete interactions with Drupal that can't be checked in advance, most notably network stability, load on the Drupal server, or failure of an Islandora microservice. But in general --check will tell you if there's a problem that you can investigate and resolve before proceeding with your task. Typical (and recommended) Islandora Workbench usage You will probably need to run Workbench using --check a few times before you will be ready to run it without --check and commit your data to Islandora. For example, you may need to correct errors in taxonomy term IDs or names, fix errors in media filenames, or wrap values in your CSV files in quotation marks. It's also a good idea to check the Workbench log file after running --check . All warnings and errors are printed to the console, but the log file may contain additional information or detail that will help you resolve issues. Once you have used --check to detect all of the problems with your CSV data, committing it to Islandora will work very reliably. Also, it is good practice to check your log after each time you run Islandora Workbench, since it may contain information that is not printed to the console.` Prompting the user to run --check As decribed elsewhere , you can configure Workbench to prompt the user to remind them to run --check . To do so, include remind_user_to_run_check: true in your config file. If this setting is present, the user will be prompted \"Have you run --check? (y/n)\". Responding \"y\" resumes normal operation, \"n\" (or any other response) causes Workbench to exit. 
Note that this setting does not force the user to run --check , it merely asks them if they have run it. Requiring --check You can require a successful --check to have been run by including check_lock_file_path with the name (or path) of a file as its value, for example check_lock_file_path: checklock.txt . If this setting is present in your config file, and it has a file name or path as its value, when Workbench is run without --check using the same configuration file, it will look for the \"lock\" file. It compares the data in this lock file with an expected value, and if they are the same, Workbench executes normally. If they differ, Workbench logs this difference and exits. If --check has detected any errors, Workbench will not execute. This can be useful if you want to force the user to run --check , or if you are running Workbench in a scheduled or scripted environment and you want to only execute Workbench if --check has been successful.","title":"Checking configuration and input data"},{"location":"check/#overview","text":"Overview You should always check your configuration and input prior to creating, updating, or deleting content. You can do this by running Workbench with the --check option, e.g.: ./workbench --config config.yml --check Note If you're on Windows, you will likely need to run Workbench by explicitly invoking Python, e.g. python workbench --config config.yml --check instead of using ./workbench as illustrated above. Similarly, if you are on a Mac that has the Homebrew version of Python, you may need to run Workbench by providing the full path to the Homebrew Python interpreter, e.g., /opt/homebrew/bin/python3 workbench --config config.yml --check . If you do this, Workbench will check the following conditions and report any errors that require your attention before proceeding: Configuration file Whether your configuration file is valid YAML (i.e., no YAML syntax errors). Whether your configuration file contains all required values. Connection to Drupal Whether your Drupal has the required Workbench Integration module enabled, and that the module is up to date. Whether the host you provided will accept the username and password you provided. Input directory Whether the directory named in the input_dir configuration setting exists. Rollback files Whether the rollback config file and CSV file can be written. CSV file Whether the CSV file is encoded in either ASCII or UTF-8. Whether each row contains the same number of columns as there are column headers. Whether there are any duplicate column headers. Whether your CSV file contains required column headers, including the field defined as the unique ID for each record (defaults to \"id\" if the id_field key is not in your config file) Whether your CSV column headers correspond to existing Drupal field labels or machine names. Whether all Drupal fields that are configured to be required are present in the CSV file. Whether required fields in your CSV contain values (i.e., they are not blank). Whether the columns required to create paged content are present (see \"Creating paged content\" below). If creating compound/paged content using the \"With page/child-level metadata\" method, --check will tell you if any child item rows in your CSV precede their parent rows. If your config file includes csv_headers: labels , --check will tell you if it detects any duplicate field labels. 
Media files Whether the files named in the CSV file are present, or in the case of remote files, are accessible (but this check is skipped if allow_missing_files: true is present in your config file for \"create\" tasks). If nodes_only is true, this check is skipped. Whether files in the file CSV column have extensions that are registered with the media's file field in Drupal. Note that validation of file extensions does not yet apply to files named using the additional_files configuration or for remote files (see this issue for more info). Whether the media types configured for specific file extensions are configured on the target Drupal. Islandora Workbench will default to the 'file' media type if it can't find another more specific media type for a file, so the most likely cause for this check to fail is that the assigned media type does not exist on the target Drupal. If creating media track files , --check will tell you if your media_use_tid value (either in the media_use_tid configuration setting or in row-level values in your CSV) does not include the \"Service File\" taxonomy term. Field values Base fields If the langcode field is present in your CSV, whether values in it are valid Drupal language codes. Whether your CSV file contains a title field ( create task only) Whether values in the title field exceed Drupal's maximum length for titles of 255 characters, or whatever the value of the max_node_title_length configuration setting is. If the created field is present in your CSV file, whether the values in it are formatted correctly (like \"2020-11-15T23:49:22+00:00\") and whether the date is in the past (both of which are Drupal requirements). If the uid field is present in your CSV file, whether the user IDs in that field exist in the target Drupal. Note that this check does not inspect permissions or roles, only that the user ID exists. Whether aliases in the url_alias field in your CSV already exist, and whether they start with a leading slash ( / ). Taxonomy Whether term ID and term URIs used in CSV fields correspond to existing terms. Whether the length of new terms exceeds 255 characters, which is the maximum length for a term name. Whether the term ID (or term URI) provided for media_use_tid is a member of the \"Islandora Media Use\" vocabulary. Whether term names in your CSV require a vocabulary namespace. Typed Relation fields Whether values used in typed relation fields are in the required format Whether values need to be namespaced Whether the term IDs/term names/term URIs used in the values exist in the vocabularies configured for the field. \"List\" text fields Whether values in CSV fields of this Drupal field type are in the field's configured \"Allowed values list\". If using the pages from directories configuration ( paged_content_from_directories: true ): Whether page filenames contain an occurrence of the sequence separator. Whether any page directories are empty. Whether the content type identified in the content_type configuration option exists. Whether multivalued fields exceed their allowed number of values. Whether values in text-type fields exceed their configured maximum length. Whether the nodes referenced in field_member_of (if that field is present in the CSV) exist. Whether values used in geolocation fields are valid lat,long coordinates. Whether values used in EDTF fields are valid EDTF date/time values (subset of date/time values only; see documentation for more detail). Also validates whether dates are valid Gregorian calendar dates. 
Hook scripts Whether registered bootstrap, preprocessor, and post-action scripts exist and are executable. If Workbench detects a configuration or input data violation, it will either stop and tell you why it stopped, or (if the violation will not cause Workbench's interaction with Drupal to fail), tell you that it found an anomaly and to check the log file for more detail. Note Adding perform_soft_checks: true to your configuration file will tell --check to not stop when it encounters an error with 1) parent/child order in your input CSV, 2) file extensions in your CSV that are not registered with the Drupal configuration of their media type's file fields, or 3) invalid EDTF dates. Workbench will continue to log all of these errors, but will exit after it has checked every row in your CSV file. A successful outcome of running --check confirms that all of the conditions listed above are in place, but it does not guarantee a successful job. There are a lot of factors in play during ingest/update/delete interactions with Drupal that can't be checked in advance, most notably network stability, load on the Drupal server, or failure of an Islandora microservice. But in general --check will tell you if there's a problem that you can investigate and resolve before proceeding with your task.","title":"Overview"},{"location":"check/#typical-and-recommended-islandora-workbench-usage","text":"You will probably need to run Workbench using --check a few times before you will be ready to run it without --check and commit your data to Islandora. For example, you may need to correct errors in taxonomy term IDs or names, fix errors in media filenames, or wrap values in your CSV files in quotation marks. It's also a good idea to check the Workbench log file after running --check . All warnings and errors are printed to the console, but the log file may contain additional information or detail that will help you resolve issues. Once you have used --check to detect all of the problems with your CSV data, committing it to Islandora will work very reliably. Also, it is good practice to check your log after each time you run Islandora Workbench, since it may contain information that is not printed to the console.","title":"Typical (and recommended) Islandora Workbench usage"},{"location":"check/#prompting-the-user-to-run-check","text":"As described elsewhere , you can configure Workbench to prompt the user to remind them to run --check . To do so, include remind_user_to_run_check: true in your config file. If this setting is present, the user will be prompted \"Have you run --check? (y/n)\". Responding \"y\" resumes normal operation, \"n\" (or any other response) causes Workbench to exit. Note that this setting does not force the user to run --check , it merely asks them if they have run it.","title":"Prompting the user to run --check"},{"location":"check/#requiring-check","text":"You can require a successful --check to have been run by including check_lock_file_path with the name (or path) of a file as its value, for example check_lock_file_path: checklock.txt . If this setting is present in your config file, and it has a file name or path as its value, when Workbench is run without --check using the same configuration file, it will look for the \"lock\" file. It compares the data in this lock file with an expected value, and if they are the same, Workbench executes normally. If they differ, Workbench logs this difference and exits. If --check has detected any errors, Workbench will not execute. 
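As a minimal sketch, a configuration that combines the reminder prompt with a required --check might include the following (the host, credentials, and file names are illustrative placeholders, not defaults):
task: create
host: https://islandora.example.com
username: admin
password: mypassword
input_csv: metadata.csv
remind_user_to_run_check: true
check_lock_file_path: checklock.txt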
This can be useful if you want to force the user to run --check , or if you are running Workbench in a scheduled or scripted environment and you want to only execute Workbench if --check has been successful.","title":"Requiring --check"},{"location":"checking_if_nodes_exist/","text":"In create tasks, you can configure Workbench to query Drupal to determine if a node exists, using the values in a specified field (referred to as the \"lookup field\" below) in your input CSV, such as field_local_identifier . If Workbench finds a node with a matching value in that field, it will skip to the next CSV row and not create the duplicate node. This feature is useful if you create a subset of items as a test or quality assurance step before loading all items in your CSV, possibly using the csv_rows_to_process configuration setting. Another situation where this might be useful is when some of the nodes represented in your CSV failed to be created: you can fix the problems with those rows and simply rerun the entire batch without having to worry about removing the successful rows. Warning For this to work, the values in the lookup field need to be unique to each node. In other words, two or more nodes should not have the same value in this field. Another assumption Workbench makes is that the values do not contain any spaces. They can, however, contain underscores, hyphens, and colons. Creating the required View To use this feature, you first need to create a View that Workbench will query. This might seem like a lot of setup, but you can probably use the same View (and corresponding Workbench configuration, as illustrated below) over time for many create tasks. Create a new \"Content\" View that has a REST Export display (no other display types are needed). In the Format configuration, choose \"Serializer\" and under Settings, check the \"Force using fields\" box and the \"json\" format. In the Fields configuration, choose \"Content: ID\" (i.e., the node ID). Go to the Other configuration (on the right side of the View configuration) and in the Use Aggregation configuration, choose \"Yes\". If asked for \"Aggregation type\", select \"Group results together\". In the Filter criteria configuration, add the field in your content type that you want to use as the lookup field, e.g. Local Identifier. Check \"Expose this filter\". Choose \"Single filter\". In the Operator configuration, select \"Contains any word\" and in the Value field, enter the machine name of your field (e.g. field_local_identifier ) In the Filter identifier section, enter the name of the URL parameter requests to this View will use to identify the CSV values. You should use the same string that you used in Operator configuration (which is also the same as your field's machine name), e.g. field_local_identifier . In the Path settings, provide a path, e.g. local_identifier_lookup (do not add the leading / ) Assign this path Basic authentication. Access should be by Permission > View published content. In the Pager settings, choose \"Display all items\". Save your View. Here is a screenshot of an example View's configuration: If you have curl installed on your system, you can test your new REST export View: curl -v -uadmin:password -H \"Content-Type: application/json\" \"https://islandora.traefik.me/local_identifier_lookup?field_local_identifier=sfu_test_001\" (In this example, the REST export's \"Path\" is local_identifier_lookup . Immediately after the ? 
comes the filter identifier string configured above, with a value after the = from a node's Local Identifier field.) If testing with curl, change the example above to incorporate your local configuration and add a known value from the lookup field. If your test is successful, the query will return a single node ID in a structure like [{\"nid\":\"126\"}] . If the query can't find any nodes, the returned data will look like [] . If it finds more than one node, the structure will looks like [{\"nid\":\"125\"},{\"nid\":\"247\"}] . If you don't have curl, don't worry. --check will confirm that the configured View REST export URL is accessible and that the configured CSV column is in your input CSV. Configuring Workbench With the View configured and tested, you can now use this feature in Workbench by adding the following configuration block: node_exists_verification_view_endpoint: - field_local_identifier: /local_identifier_lookup In this sample configuration we tell Workbench to query a View at the path /local_identifier_lookup using the filter identifer/field name field_local_identifier . Note that you can only have a single CSV field to View path configuration here. If you include more than one, only the last one is used. Nothing special is required in the input CSV; the Workbench configuration block above is all you need to do. However, note that: the field you choose as your lookup field should be a regular metadata field (e.g. \"Local identifier\"), and not the id field required by Workbench. However, there is nothing preventing you from configuring Workbench (through the id_field setting) to use as its ID column the same field you have configured as your lookup field. the data in the CSV field can be multivalued. Workbench will represent the multiple values in a way that works with the \"Contains any word\" option in your View's Filter configuration. as noted above, for this feature to work, the CSV values in the lookup field cannot be used by more than one node and they cannot contain spaces. Your CSV can look like this: file,id,title,field_model,field_local_identifier IMG_1410.tif,01,Small boats in Havana Harbour,Image,sfu_id_001 IMG_2549.jp2,02,Manhatten Island,Image,sfu_id_002 IMG_2940.JPG,03,Looking across Burrard Inlet,Image,sfu_id_003|special_collections:9362 IMG_2958.JPG,04,Amsterdam waterfront,Image,sfu_id_004 IMG_5083.JPG,05,Alcatraz Island,Image,sfu_id_005 Before it creates a node, Workbench will use data from the CSV column specified on the left-hand side of the node_exists_verification_view_endpoint configuration to query the corresponding View endpoint. If it finds no match, it will create the node as usual; if it finds a single match, it will skip that CSV row and log that it has done so; if it finds more than one match, it will also skip the CSV row and not create a node, and it will log that it has done this.","title":"Checking if nodes already exist"},{"location":"checking_if_nodes_exist/#creating-the-required-view","text":"To use this feature, you first need to create a View that Workbench will query. This might seem like a lot of setup, but you can probably use the same View (and corresponding Workbench configuration, as illustrated below) over time for many create tasks. Create a new \"Content\" View that has a REST Export display (no other display types are needed). In the Format configuration, choose \"Serializer\" and under Settings, check the \"Force using fields\" box and the \"json\" format. 
In the Fields configuration, choose \"Content: ID\" (i.e., the node ID). Go to the Other configuration (on the right side of the View configuration) and in the Use Aggregation configuration, choose \"Yes\". If asked for \"Aggregation type\", select \"Group results together\". In the Filter criteria configuration, add the field in your content type that you want to use as the lookup field, e.g. Local Identifier. Check \"Expose this filter\". Choose \"Single filter\". In the Operator configuration, select \"Contains any word\" and in the Value field, enter the machine name of your field (e.g. field_local_identifier ) In the Filter identifier section, enter the name of the URL parameter requests to this View will use to identify the CSV values. You should use the same string that you used in Operator configuration (which is also the same as your field's machine name), e.g. field_local_identifier . In the Path settings, provide a path, e.g. local_identifier_lookup (do not add the leading / ) Assign this path Basic authentication. Access should be by Permission > View published content. In the Pager settings, choose \"Display all items\". Save your View. Here is a screenshot of an example View's configuration: If you have curl installed on your system, you can test you new REST export View: curl -v -uadmin:password -H \"Content-Type: application/json\" \"https://islandora.traefik.me/local_identifier_lookup?field_local_identifier=sfu_test_001\" (In this example, the REST export's \"Path\" is local_identifier_lookup . Immediately after the ? comes the filter identifier string configured above, with a value after the = from a node's Local Identifier field.) If testing with curl, change the example above to incorporate your local configuration and add a known value from the lookup field. If your test is successful, the query will return a single node ID in a structure like [{\"nid\":\"126\"}] . If the query can't find any nodes, the returned data will look like [] . If it finds more than one node, the structure will looks like [{\"nid\":\"125\"},{\"nid\":\"247\"}] . If you don't have curl, don't worry. --check will confirm that the configured View REST export URL is accessible and that the configured CSV column is in your input CSV.","title":"Creating the required View"},{"location":"checking_if_nodes_exist/#configuring-workbench","text":"With the View configured and tested, you can now use this feature in Workbench by adding the following configuration block: node_exists_verification_view_endpoint: - field_local_identifier: /local_identifier_lookup In this sample configuration we tell Workbench to query a View at the path /local_identifier_lookup using the filter identifer/field name field_local_identifier . Note that you can only have a single CSV field to View path configuration here. If you include more than one, only the last one is used. Nothing special is required in the input CSV; the Workbench configuration block above is all you need to do. However, note that: the field you choose as your lookup field should be a regular metadata field (e.g. \"Local identifier\"), and not the id field required by Workbench. However, there is nothing preventing you from configuring Workbench (through the id_field setting) to use as its ID column the same field you have configured as your lookup field. the data in the CSV field can be multivalued. Workbench will represent the multiple values in a way that works with the \"Contains any word\" option in your View's Filter configuration. 
as noted above, for this feature to work, the CSV values in the lookup field cannot be used by more than one node and they cannot contain spaces. Your CSV can look like this: file,id,title,field_model,field_local_identifier IMG_1410.tif,01,Small boats in Havana Harbour,Image,sfu_id_001 IMG_2549.jp2,02,Manhatten Island,Image,sfu_id_002 IMG_2940.JPG,03,Looking across Burrard Inlet,Image,sfu_id_003|special_collections:9362 IMG_2958.JPG,04,Amsterdam waterfront,Image,sfu_id_004 IMG_5083.JPG,05,Alcatraz Island,Image,sfu_id_005 Before it creates a node, Workbench will use data from the CSV column specified on the left-hand side of the node_exists_verification_view_endpoint configuration to query the corresponding View endpoint. If it finds no match, it will create the node as usual; if it finds a single match, it will skip that CSV row and log that it has done so; if it finds more than one match, it will also skip the CSV row and not create a node, and it will log that it has done this.","title":"Configuring Workbench"},{"location":"choosing_a_task/","text":"The task configuration setting defines the specific work you want Workbench to perform. This table may help you choose when to use a specific task: If you want to Then use this task Start with this documentation Create nodes from CSV and, optionally, attached media create Preparing your data , Field data (Drupal and CSV) Create basic nodes without using CSV, and attach media create_from_files Creating nodes from files Update node field data update Updating nodes Delete nodes and, optionally, their attached media delete Deleting nodes Add media to existing nodes using a list of node IDs add_media Adding media to nodes Update media field data update_media Updating media Replace files, including media track files, attached to media update_media Updating media Delete media using a list of media IDs delete_media Deleting media Delete media using a list of node IDs delete_media_by_node Deleting media Export node field data using a list of node IDs export_csv Generating CSV files Export node field data using a Drupal View get_data_from_view Generating CSV files Export a media report using a Drupal View get_media_report_from_view Generating CSV files Populate a vocabulary from CSV create_terms Creating taxonomy terms Update terms in a vocabulary from CSV update_terms Updating taxonomy terms Create URL redirects create_redirects Creating redirects","title":"Choosing a task"},{"location":"configuration/","text":"The configuration file Workbench uses a YAML configuration whose location is indicated in the --config argument. This file defines the various options it will use to create, update, or delete Islandora content. The simplest configuration file needs only the following four options: task: create host: \"http://localhost:8000\" username: admin password: islandora In this example, the task being performed is creating nodes (and optionally media). Other tasks are create_from_files , update , delete , add_media , update_media , and delete_media . Some of the configuration settings documented below are used in all tasks, while others are only used in specific tasks. Configuration settings The settings defined in a configuration file are documented in the tables below, grouped into broad functional categories for easier reference. The order of the options in the configuration file doesn't matter, and settings do not need to be grouped together in any specific way in the configuration file. 
Use of quotation marks Generally speaking, you do not need to use quotation marks around values in your configuration file. You may wrap values in quotation marks if you wish, and many examples in this documentation do that (especially the host setting), but the only values that should not be wrapped in quotation marks are those that take true or false as values because in YAML, and many other computer/markup languages, \"true\" is a string (in this case, an English word that can mean many things) and true is a reserved symbol that can mean one thing and one thing only, the boolean opposite of false (I'm sorry for this explanation, I can't describe the distinction in any other way without writing a primer on symbolic logic). For example, the following is a valid configuration file: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: true csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 But the following version is not valid, since there are quotes around \"true\" in the nodes_only setting: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: \"true\" csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 Use of spaces and other syntactical features Configuration setting names should start a new line and not have any leading spaces. The exception is illustrated in the values of the csv_field_templates setting in the above examples, where the setting's value is a list of other values. In this case the members of the list start with a dash and a space ( - ). The trailing space in these values is significant. (However, the leading space before the dash is insignificant, and is used for appearance only.) For example, this snippet is valid: csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 whereas this one is not: csv_field_templates: -field_linked_agent: relators:aut:person:Jordan, Mark -field_model: 25 Some setting values are represented in Workbench documentation using square brackets, like this one: export_csv_field_list: ['field_description', 'field_extent'] Strictly speaking, YAML lists can be represented as either a series of entries on their own lines that start with - or as entries enclosed in [ and ] . It's best to follow the examples provided throughout the Workbench documentation. Required configuration settings Setting Required Default value Description task \u2714\ufe0f One of 'create', 'create_from_files', 'update', delete', 'add_media', 'delete_media', 'update_media', 'export_csv', 'get_data_from_view', 'create_terms', or 'delete_media_by_node'. See \" Choosing a task \" for more information. host \u2714\ufe0f The hostname, including http:// or https:// of your Islandora repository, and port number if not the default 80. username \u2714\ufe0f The username used to authenticate the requests. This Drupal user should be a member of the \"Administrator\" role. If you want to create nodes that are owned by a specific Drupal user, include their numeric user ID in the uid column in your CSV. password The user's password. You can also set the password in your ISLANDORA_WORKBENCH_PASSWORD environment variable. If you do this, omit the password option in your configuration file. If a password is not available in either your configuration file or in the environment variable, Workbench will prompt for a password. 
Drupal settings Setting Required Default value Description content_type islandora_object The machine name of the Drupal node content type you are creating or updating. Required in \"create\" and \"update\" tasks. drupal_filesystem fedora:// One of 'fedora://', 'public://', or 'private://' (the wrapping quotation marks are required). Only used with Drupal 8.x - 9.1; starting with Drupal 9.2, the filesystem is automatically detected from the media's configuration. Will eventually be deprecated. allow_adding_terms false In create , update , add_media , update_media , create_terms , and update_terms tasks, determines if Workbench will add taxonomy terms if they do not exist in the target vocabulary. See more information in the \" Taxonomy reference fields \" section. Note: this setting is not required in create_terms tasks unless you are adding new terms to a taxonomy reference field on the term entries. protected_vocabularies [] (empty list) Allows you to exclude vocabularies from having new terms added via allow_adding_terms . See more information in the \" Taxonomy reference fields \" section. vocab_id \u2714\ufe0f in create_terms tasks. Identifies the vocabulary you are adding terms to in create_tersm tasks. See more information in the \" Creating taxonomy terms \" section. update_mode replace Determines if Workbench will replace , append (add to) , or delete field values during update tasks. See more information in the \" Updating nodes \" section. Also applies to update_media tasks. validate_terms_exist true If set to false, during --check Workbench will not query Drupal to determine if taxonomy terms exist. The structure of term values in CSV are still validated; this option only tells Workbench to not check for each term's existence in the target Drupal. Useful to speed up the --check process if you know terms don't exist in the target Drupal. validate_parent_node_exists true If set to false, during --check Workbench will not query Drupal to determine if nodes whose node IDs are in field_member_of exist. Useful to speed up the --check process if you know terms already exist in the target Drupal. max_node_title_length 255 Set to the number of allowed characters for node titles if your Drupal uses Node Title Length . If unsure what your the maximum length of the node titles your site allows, check the length of the \"title\" column in your Drupal database's \"node_field_data\" table. list_missing_drupal_fields false Set to true to tell Workbench to provide a list of fields that exist in your input CSV but that cannot be matched to Drupal field names (or reserved column names such as \"file\"). If false , Workbench will still check for CSV column headers that it can't match to Drupal fields, but will exit upon finding the first such field. This option produces a list of fields instead of exiting on detecting the first field. standalone_media_url false Set to true if your Drupal instance has the \"Standalone media URL\" option at /admin/config/media/media-settings checked. The Drupal default is to have this unchecked, so you only need to use this Workbench option if you have changed Drupal's default. More information is available. require_entity_reference_views true Set to false to tell Workbench to not require a View to expose the values in an entity reference field configured to use an Entity Reference View. Additional information is available here . 
entity_reference_view_endpoints A list of mappings from Drupal/CSV field names to Views REST Export endpoints used to look up term names for entity reference field configured to use an Entity Reference View. Additional information is available here . text_format_id basic_html The text format ID (machine name) to apply to all Drupal text fields that have a \"formatted\" field type. See \" Text fields with markup \" for more information. field_text_format_ids Defines a mapping between field machine names the machine names of format IDs for \"formatted\" fields. See \" Text fields with markup \" for more information. paragraph_fields Defines structure of paragraph fields in the input CSV. See \" Entity Reference Revisions fields (paragraphs) \" for more information. Input data location settings Setting Required Default value Description input_dir input_data The full or relative path to the directory containing the files and metadata CSV file. input_csv metadata.csv Path to the CSV metadata file. Can be absolute, or if just the filename is provided, will be assumed to be in the directory named in input_dir . Can also be the URL to a Google spreadsheet (see the \" Using Google Sheets as input data \" section for more information). google_sheets_csv_filename google_sheet.csv Local CSV filename for data from a Google spreadsheet. See the \" Using Google Sheets as input data \" section for more information. google_sheets_gid 0 The \"gid\" of the worksheet to use in a Google Sheet. See \" Using Google Sheets as input data \" section for more information. excel_worksheet Sheet1 If using an Excel file as your input CSV file, the name of the worksheet that the CSV data will be extracted from. input_data_zip_archives [] List of local file paths and/or URLs to .zip archives to extract into the directory defined in input_dir . See \" Using a local or remote .zip archive as input data \" for more info. delete_zip_archive_after_extraction true Tells Workbench to delete an inpu zip archive after it has been extracted. Input CSV file settings Setting Required Default value Description id_field id The name of the field in the CSV that uniquely identifies each record. delimiter , [comma] The delimiter used in the CSV file, for example, \",\" or \"\\t\" (must use double quotes with \"\\t\"). If omitted, defaults to \",\". subdelimiter | [pipe] The subdelimiter used in the CSV file to define multiple values in one field. If omitted, defaults to \"|\". Can be a string of multiple characters, e.g. \"^^^\". csv_field_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding values that are copied into the CSV input file. More detail provided in the \" CSV field templates \" section. csv_value_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding templates. More detail provided in the \" CSV value templates \" section. csv_value_templates_for_paged_content Used in create tasks only. Similar to csv_value_templates but applies to paged/child items created using the \" Using subdirectories \" method of creating paged content. More detail provided in the \" CSV value templates \" section. csv_value_templates_rand_length 5 Length of the $random_alphanumeric_string and $random_number_string variables CSV value templates. More detail provided in the \" CSV value templates \" section. allow_csv_value_templates_if_field_empty [] List of fields to populate with CSV value templates if the CSV field is empty. 
More detail provided in the \" CSV value templates \" section. ignore_csv_columns Used in the create and update tasks only. A list of CSV column headers that Workbench should ignore. For example, ignore_csv_columns: [Target Collection, Ready to publish] csv_start_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) before the designated row number. More information is available. csv_stop_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) after the designated row number. More information is available. csv_rows_to_process Used in create and update tasks. Tells Workbench to process only the rows/records in input CSV (or Google Sheet or Excel) with the specified \"id\" column values. More information is available. csv_headers names Used in \"create\", \"update\" and \"create_terms\" tasks. Set to \"labels\" to allow use of field labels (as opposed to machine names) as CSV column headers. clean_csv_values_skip [] (empty list) Used in all tasks that use CSV input files. See \" How Workbench cleans your input data \" for more information. columns_with_term_names [] (empty list) Used in all tasks that allow creation of terms on entity ingest. See \" Using numbers as term names \" for more information. Output CSV settings See \" Generating CSV files \" section for more information. Setting Required Default value Description output_csv Used in \"create\" tasks. The full or relative (to the \"workbench\" script) path to a CSV file with one record per node created by Workbench. output_csv_include_input_csv false Used in \"create\" tasks in conjunction with output_csv . Include in the output CSV all the fields (and their values) from the input CSV. export_csv_term_mode tid Used in \"export_csv\" tasks to indicate whether vocabulary term IDs or names are included in the output CSV file. Set to \"tid\" (the default) to include term IDs, or set to \"name\" to include term names. See \" Exporting field data into a CSV file \" for more information. export_csv_field_list [] (empty list) List of fields to include in exported CSV data. If empty, all fields will be included. See \" Using a Drupal View to identify content to export as CSV \" for more information. view_parameters List of URL parameter/value strings to include in requests to a View. See \" Using a Drupal View to identify content to export as CSV \" for more information. export_csv_file_path \u2714\ufe0f in get_data_from_view tasks. Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the exported CSV file. Required in the \"get_data_from_view\" task; in the \"export_csv\" task, if left empty (the default), the file will be named after the value of the input_csv with \".csv_file_with_field_values\" appended and saved in the directory identified in input_dir . export_file_directory Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the directory where files corresponding to the data in the CSV output file will be written. export_file_media_use_term_id http://pcdm.org/use#OriginalFile Used in the \"export_csv\" and \"get_data_from_view\" tasks. The term ID or URI from the Islandora Media Use vocabulary that identifies the file you want to export. Media settings Setting Required Default value Description nodes_only false Include this option in create tasks, set to true , if you want to only create nodes and not their accompanying media. 
See the \"Creating nodes but not media\" section for more information. allow_missing_files false Determines if file values that point to missing (not found) files are allowed. Used in the create and add_media tasks. If set to true, file values that point to missing files are allowed. For create tasks, a true value will result in nodes without attached media. For add_media tasks, a true value will skip adding a media for the missing file CSV value. Defaults to false (which means all file values must name files that exist at their specified locations). Note that this setting has no effect on empty file values; these are always logged, and their corresponding media are simply not created. exit_on_first_missing_file_during_check true Removed as a configuration setting November 1, 2022. Use strict_check instead. strict_check Replaced with perform_soft_checks as of commit dfa60ff (July 14, 2023). media_use_tid http://pcdm.org/use#OriginalFile The term ID for the term from the \"Islandora Media Use\" vocabulary you want to apply to the media being created in create and add_media tasks. You can provide a term URI instead of a term ID, for example \"http://pcdm.org/use#OriginalFile\" . You can specify multiple values for this setting by joining them with the subdelimiter configured in the subdelimiter setting; for example, media_use_tid: 17|18 . You can also set this at the object level by including media_use_tid in your CSV file; values there will override the value set in your configuration file. If you are \" Adding multiple media \", you define media use term IDs in a slightly different way. media_type \u2714\ufe0f in add_media and update_media tasks Overrides, for all media being created, Workbench's default definition of whether the media being created is an image, file, document, audio, or video. Used in the create , create_from_files , add_media , and update_media , tasks. More detail provided in the \" Configuring Media Types \" section. Required in all update_media and add_media tasks, not to override Workbench's defaults, but to explicitly indicate the media type being updated/added. media_types_override Overrides default media type definitions on a per file extension basis. Used in the create , add_media , and create_from_files tasks. More detail provided in the \" Configuring Media Types \" section. media_type_file_fields Defines the name of the media field that references media's file (i.e., the field on the Media type). Usually used with custom media types and accompanied by either the media_type or media_types_override option. For more information, see the \" Configuring Media Types \" section. mimetype_extensions Overrides Workbench's default mappings between MIME types and file extensions. Usually used with remote files where the remote web server returns a MIME type that is not standard. For more information, see the \" Configuring Media Types \" section. extensions_to_mimetypes Overrides Workbench's default mappings between file extension and MIME types. For more information, see the \" Configuring Media Types \" section. delete_media_with_nodes true When a node is deleted using a delete task, by default, all if its media are automatically deleted. Set this option to false to not delete all of a node's media (you do not generally want to keep the media without the node). use_node_title_for_media_title true If set to true (default), name media the same as the parent node's title value. Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. 
use_nid_in_media_title false If set to true , assigns a name to the media following the pattern {node_id}-Original File . Set to true to use the parent node's node ID as the media title. Applies to both create and add_media tasks. field_for_media_title Identifies a CSV field name (machine name, not human readable) that will be used as the media title in create tasks. For example, field_for_media_title: id . Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_node_title_for_remote_filename false Set to true to use a version of the parent node's title as the filename for a remote (http[s]) file. Replaces all non-alphanumeric characters with an underscore ( _ ). Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) use_nid_in_media_filename setting. field_for_remote_filename Identifies a CSV field name (machine name, not human readable) that will be used as the filename for a remote (http[s]) file. For example, field_for_remote_filename: id . Truncates the value of the field to 255 characters. If the field is empty in the CSV, the CSV ID field value will be used. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) field_for_media_filename setting. delete_tmp_upload false For remote files, if set to true , the temporary copy of the remote file is deleted after it is used to create media. If false , the copy will remain in the location defined in your temp_dir setting. If the file cannot be deleted (e.g. a virus scanner is scanning it), it will remain and an error message will be added to the log file. additional_files Maps a set of CSV field names to media use terms IDs to create additional media (additional to the media created from the file named in the \"file\" column, that is) in create and add_media tasks. See \" Adding multiple media \" for more information. fixity_algorithm None Checksum/hash algorithm to use during transmission fixity checking. Must be one of \"md5\", \"sha1\", or \"sha256\". See \" Fixity checking \" for more information. validate_fixity_during_check false Perform checksum validation during --check . See \" Fixity checking \" for more information. delete_media_by_node_media_use_tids [] (empty list) During delete_media_by_node tasks, allows you to specify which media to delete. Only media with the listed terms IDs from the Islandora Media Use vocabulary will be deleted. By default (an empty list), all media are deleted. See \" Deleting Media \" for more information. Islandora model settings Setting Required Default value Description model [singular] Used in the create_from_files task only. Defines the term ID from the the \"Islandora Models\" vocabulary for all nodes created using this task. Note: one of model or models is required. More detail provided in the \" Creating nodes from files \" section. models [plural] Used in the create_from_files task only. Provides a mapping between file extensions and terms in the \"Islandora Models\" vocabulary. Note: one of model or models is required. More detail provided in the Creating nodes from files \" section. Paged and compound content settings See the section \" Creating paged content \" for more information. Setting Required Default value Description paged_content_from_directories false Set to true if you are using the \" Using subdirectories \" method of creating paged content. 
page_files_source_dir_field id [or whatever is defined as your id column using the id_field configuration setting] Set to directory if your input CSV contains a \"directory\" column that names each row's page, if are using the \" Using subdirectories \" method of creating paged content. paged_content_sequence_separator - [hyphen] The character used to separate the page sequence number from the rest of the filename. Used when creating paged content with the \" Using subdirectories \" method. Note: this configuration option was originally misspelled \"paged_content_sequence_seprator\". paged_content_page_model_tid Required if paged_content_from_directories is true. The the term ID from the Islandora Models (or its URI) taxonomy to assign to pages. paged_content_image_file_extension If the subdirectories containing your page image files also contain other files (that are not page images), you need to use this setting to tell Workbench which files to create pages from. Common values include tif , tiff , and jpg . paged_content_additional_page_media A mapping of Media Use term IDs (or URIs) to file extensions telling Workbench which Media Use term to apply to media created from additional files such as OCR text files. paged_content_page_display_hints The term ID from the Islandora Display taxonomy to assign to pages. If not included, defaults to the value of the field_display_hints in the parent's record in the CSV file. paged_content_page_content_type Set to the machine name of the Drupal node content type for pages created using the \" Using subdirectories \" method if it is different than the content type of the parent (which is specified in the content_type setting). page_title_template '$parent_title, page $weight' Template used to generate the titles of pages/members in the \" Using subdirectories \" method. paged_content_ignore_files [\"Thumbs.db\"] List of filenames that you want Workbench to ignore when it scans directories to create page- or child-level nodes. See \" Using subdirectories \" for more inforation. Logging settings See the \" Logging \" section for more information. Setting Required Default value Description log_file_path workbench.log The path to the log file, absolute or relative to the directory Workbench is run from. log_file_mode a [append] Set to \"w\" to overwrite the log file, if it exists. log_request_url false Whether or not to log the request URL (and its method). Useful for debugging. log_json false Whether or not to log the raw request JSON POSTed, PUT, or PATCHed to Drupal. Useful for debugging. log_headers false Whether or not to log the raw HTTP headers used in all requests. Useful for debugging. log_response_status_code false Whether or not to log the HTTP response code. Useful for debugging. log_response_time false Whether or not to log the response time of each request that is slower than the average response time for the last 20 HTTP requests Workbench makes to the Drupal server. Useful for debugging. log_response_body false Whether or not to log the raw HTTP response body. Useful for debugging. log_file_name_and_line_number false Whether or not to add the filename and line number that created the log entry. Useful for debugging. log_term_creation true Whether or not to log the creation of new terms during \"create\" and \"update\" tasks (does not apply to the \"create_terms\" task). --check will still report that terms in the CSV do not exist in their respective vocabularies. 
HTTP settings Setting Required Default value Description user_agent Islandora Workbench String to use as the User-Agent header in HTTP requests. allow_redirects true Whether or not to allow Islandora Workbench to respond to HTTP redirects. secure_ssl_only true Whether or not to require valid SSL certificates. Set to false if you want to ignore SSL certificates. enable_http_cache true Whether or not to enable Workbench's client-side request cache. Set to false if you want to disable the cache during troubleshooting, etc. http_cache_storage memory The backend storage type for the client-side cache. Set to sqlite if you are getting out of memory errors while running Islandora Workbench. http_cache_storage_expire_after 1200 Length of the client-side cache lifespan (in seconds). Reduce this number if you are using the sqlite storage backend and the database is using too much disk space. Note that reducing the cache lifespan will result in increased load on your Drupal server. Rollback configuration and CSV file settings See \" Rolling back \" for more information. Setting Required Default value Description rollback_dir Value of input_dir setting Absolute path to the directory where you want your \"rollback.csv\" file to be written. rollback_config_file_path rollback.yml Relative (to workbench) or absolute path to write the rollback config YAML file to. rollback_csv_file_path [input_dir/rollback.csv] Relative (to workbench) or absolute path to write the rollback CSV file to. timestamp_rollback false Set to true to add a timestamp to the \"rollback.yml\" and corresponding \"rollback.csv\" generated in \"create\" and \"create_from_files\" tasks. rollback_config_filename_template Defines a template that will be used to create the rollback configuration file. The two placeholders availalble in this template are $config_filename and $input_csv_filename . rollback_csv_filename_template Defines a template that will be used to create the rollback CSV file. The two placeholders availalble in this template are $config_filename and $input_csv_filename . rollback_file_comments Defines a list of lines to be added to both the rollback config and CSV file as comments. include_password_in_rollback_config_file false Set to true to include the value of the password configuration setting in your rollback config YAML file. Miscellaneous settings Setting Required Default value Description perform_soft_checks false If set to true, --check will not exit when it encounters an error with parent/child order, file extension registration with Drupal media file fields, missing files named in the files CSV column, or EDTF validation. Instead, it will log any errors it encounters and exit after it has checked all rows in the CSV input file. Note: this setting replaces strict_check as of commit dfa60ff (July 14, 2023). temp_dir Value of the temporary directory defined by your system as defined by Python's gettempdir() function. Relative or absolute path to the directory where you want Workbench's temporary files to be written. These include the \".preprocessed\" version of the your input CSV, remote files temporarily downloaded to create media, and the CSV ID to node ID map database. sqlite_db_filename [temp_dir]/workbench_temp_data.db Name of the SQLite database filename used to store session data. pause Defines the number of seconds to pause between all 'POST', 'PUT', 'PATCH', 'DELETE' requests to Drupal. 
Include it in your configuration to lessen the impact of Islandora Workbench on your site during large jobs, for example pause: 1.5. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause Defines the number of seconds to pause between each REST request to Drupal. Works like \"pause\" but only takes effect when the Drupal server's response to the most recent request is slower (determined by the \"adaptive_pause_threshold\" value) than the average response time for the last 20 requests. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause_threshold 2 A weighting of the response time for the most recent request, relative to the average response times of the last 20 requests. This weighting determines how much slower the Drupal server's response to the most recent Workbench request must be in order for adaptive pausing to take effect for the next request. For example, if set to \"1\", adaptive pausing will happen when the response time is equal to the average of the last 20 response times; if set to \"2\", adaptive pausing will take effect if the last requests's response time is double the average. progress_bar false Show a progress bar when running Workbench instead of row-by-row output. bootstrap List of absolute paths to one or more scripts that execute prior to Workbench connecting to Drupal. More information is available in the \" Hooks \" documentation. shutdown List of absolute paths to one or more scripts that execute after Workbench connecting to Drupal. More information is available in the \" Hooks \" documentation. preprocessors List of absolute paths to one or more scripts that are applied to CSV values prior to the values being ingested into Drupal. More information is available in the \" Hooks \" documentation. node_post_create List of absolute paths to one or more scripts that execute after a node is created. More information is available in the \" Hooks \" documentation. node_post_update List of absolute paths to one or more scripts that execute after a node is updated. More information is available in the \" Hooks \" documentation. media_post_create List of absolute paths to one or more scripts that execute after a media is created. More information is available in the \" Hooks \" documentation. drupal_8 Deprecated. path_to_python python Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the python interpreter. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". path_to_workbench_script workbench Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the python interpreter. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". contact_sheet_output_dir contact_sheet_output Used in create tasks to specify the name of the directory where contact sheet output is written. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. contact_sheet_css_path assets/contact_sheet/contact-sheet.css Used in create tasks to specify the path of the CSS stylesheet to use in contact sheets. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. 
node_exists_verification_view_endpoint Used in create tasks to tell Workbench to check whether a node with a matching value in your input CSV already exists. See \" Checking if nodes already exist \" for more information. redirect_status_code 301 Used in create_redirect tasks to set the HTTP response code in redirects. See \" Prompting the user \" for more information. remind_user_to_run_check false Prompt the user to confirm whether they have run Workbench with --check . See \" Prompting the user \" for more information. user_prompts [] (empty list) List of custom prompts to present to the user. See \" Prompting the user \" for more information. check_lock_file_path Defines a \"lock file\" that contains data generated by a successfule --check . See \" Checking configuration and input data \" for more information. secondary_tasks A list of configuration file names that are executed as secondary tasks after the primary task to create compound objects. See \" Using a secondary task \" for more information. csv_id_to_node_id_map_path [temp_dir]/csv_id_to_node_id_map.db Name of the SQLite database filename used to store CSV row ID to node ID mappings for nodes created during create and create_from_files tasks. If you want to store your map database outside of a temporary directory, use an absolute path to the database file for this setting. If you want to disable population of this database, set this config setting to false . See \" Generating CSV files \" for more information. query_csv_id_to_node_id_map_for_parents false Queries the CSV ID to node ID map when creating compound content. Set to true if you want to use parent IDs from previous Workbench sessions. Note: this setting is automatically set to true in secondary task config files. See \" Creating parent/child relationships across Workbench sessions for more information.\" ignore_duplicate_parent_ids true Tells Workbench to ignore entries in the CSV ID to node ID map that have the same parent ID value. Set to false if you want Workbench to warn you that there are duplicate parent IDs in your CSV ID to node ID map. See \" Creating parent/child relationships across Workbench sessions for more information.\" When you run Islandora Workbench with the --check argument, it will verify that all configuration options required for the current task are present, and if they aren't tell you so. Note Islandora Workbench automatically converts any configuration keys to lowercase, e.g., Task is equivalent to task . Validating the syntax of the configuration file When you run Workbench, it confirms that your configuration file is valid YAML. This is a syntax check only, not a content check. If the file is valid YAML, Workbench then goes on to perform a long list of application-specific checks . If this syntax check fails, some detail about the problem will be displayed to the user. The same information plus the entire Python stack trace is also logged to a file named \"workbench.log\" in the directory Islandora Workbench is run from. This file name is Workbench's default log file name, but in this case (validating the config file's YAML syntax), that file name is used regardless of the log file location defined in the configuration's log_file_path option. The reason the error is logged in the default location instead of the value in the configuration file (if one is present) is that the configuration file isn't valid YAML and therefore can't be parsed. 
Example configuration files These examples provide inline annotations explaining why the settings are included in the configuration file. Blank rows/lines are included for readability. Create nodes only, no media task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to create nodes with no media. # Also, this tells --check to skip all validation of \"file\" locations. # Other media settings, like \"media_use_tid\", are also ignored. nodes_only: true Use a custom log file location task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to write its log file to the location specified # instead of the default \"workbench.log\" within the directory Workbench is run from. log_file_path: /home/mark/workbench_log.txt Include some CSV field templates task: create host: \"http://localhost:8000\" username: admin password: islandora # The values in this list of field templates are applied to every row in the # input CSV file before the CSV file is used to populate Drupal fields. The # field templates are also applied during the \"--check\" in order to validate # the values of the fields. csv_field_templates: - field_member_of: 205 - field_model: 25 Use a Google Sheet as input CSV task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit # You only need to specify the google_sheets_gid option if the worksheet in the Google Sheet # is not the default one. google_sheets_gid: 1867618389 Create nodes and media from files (no input CSV file) task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # The files to create the nodes from are in this directory. input_dir: /tmp/sample_files # This tells Workbench to write a CSV file containing node IDs of the # created nodes, plus the field names used in the target content type # (\"islandora_object\" by default). output_csv: /tmp/sample_files.csv # All nodes should get the \"Model\" value corresponding to this URI. model: 'https://schema.org/DigitalDocument' Create taxonomy terms task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary Ignore some columns in your input CSV file task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: input.csv # This tells Workbench to ignore the 'date_generated' and 'batch_id' # columns in the input.csv file. ignore_csv_columns: ['date_generated', 'batch_id'] Generating sample Islandora content task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # This directory must match the on defined in the script's 'dest_dir' variable. input_dir: /tmp/autogen_input media_use_tid: 17 output_csv: /tmp/my_sample_content_csv.csv model: http://purl.org/coar/resource_type/c_c513 # This is the script that generates the sample content. 
bootstrap: - \"/home/mark/Documents/hacking/workbench/generate_image_files.py\" Running a post-action script ask: create host: \"http://localhost:8000\" username: admin password: islandora node_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # node_post_update: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # media_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] Create nodes and media from files task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora input_dir: path/to/files models: - 'http://purl.org/coar/resource_type/c_1843': ['zip', 'tar', ''] - 'https://schema.org/DigitalDocument': ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 'http://purl.org/coar/resource_type/c_c513': ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 'http://purl.org/coar/resource_type/c_18cc': ['mp3', 'wav', 'aac'] - 'http://purl.org/coar/resource_type/c_12ce': ['mp4']","title":"Configuration"},{"location":"configuration/#the-configuration-file","text":"Workbench uses a YAML configuration whose location is indicated in the --config argument. This file defines the various options it will use to create, update, or delete Islandora content. The simplest configuration file needs only the following four options: task: create host: \"http://localhost:8000\" username: admin password: islandora In this example, the task being performed is creating nodes (and optionally media). Other tasks are create_from_files , update , delete , add_media , update_media , and delete_media . Some of the configuration settings documented below are used in all tasks, while others are only used in specific tasks.","title":"The configuration file"},{"location":"configuration/#configuration-settings","text":"The settings defined in a configuration file are documented in the tables below, grouped into broad functional categories for easier reference. The order of the options in the configuration file doesn't matter, and settings do not need to be grouped together in any specific way in the configuration file.","title":"Configuration settings"},{"location":"configuration/#use-of-quotation-marks","text":"Generally speaking, you do not need to use quotation marks around values in your configuration file. You may wrap values in quotation marks if you wish, and many examples in this documentation do that (especially the host setting), but the only values that should not be wrapped in quotation marks are those that take true or false as values because in YAML, and many other computer/markup languages, \"true\" is a string (in this case, an English word that can mean many things) and true is a reserved symbol that can mean one thing and one thing only, the boolean opposite of false (I'm sorry for this explanation, I can't describe the distinction in any other way without writing a primer on symbolic logic). 
For example, the following is a valid configuration file: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: true csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 But the following version is not valid, since there are quotes around \"true\" in the nodes_only setting: task: create host: http://localhost:8000 username: admin password: islandora nodes_only: \"true\" csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25","title":"Use of quotation marks"},{"location":"configuration/#use-of-spaces-and-other-syntactical-features","text":"Configuration setting names should start a new line and not have any leading spaces. The exception is illustrated in the values of the csv_field_templates setting in the above examples, where the setting's value is a list of other values. In this case the members of the list start with a dash and a space ( - ). The trailing space in these values is significant. (However, the leading space before the dash is insignificant, and is used for appearance only.) For example, this snippet is valid: csv_field_templates: - field_linked_agent: relators:aut:person:Jordan, Mark - field_model: 25 whereas this one is not: csv_field_templates: -field_linked_agent: relators:aut:person:Jordan, Mark -field_model: 25 Some setting values are represented in Workbench documentation using square brackets, like this one: export_csv_field_list: ['field_description', 'field_extent'] Strictly speaking, YAML lists can be represented as either a series of entries on their own lines that start with - or as entries enclosed in [ and ] . It's best to follow the examples provided throughout the Workbench documentation.","title":"Use of spaces and other syntactical features"},{"location":"configuration/#required-configuration-settings","text":"Setting Required Default value Description task \u2714\ufe0f One of 'create', 'create_from_files', 'update', delete', 'add_media', 'delete_media', 'update_media', 'export_csv', 'get_data_from_view', 'create_terms', or 'delete_media_by_node'. See \" Choosing a task \" for more information. host \u2714\ufe0f The hostname, including http:// or https:// of your Islandora repository, and port number if not the default 80. username \u2714\ufe0f The username used to authenticate the requests. This Drupal user should be a member of the \"Administrator\" role. If you want to create nodes that are owned by a specific Drupal user, include their numeric user ID in the uid column in your CSV. password The user's password. You can also set the password in your ISLANDORA_WORKBENCH_PASSWORD environment variable. If you do this, omit the password option in your configuration file. If a password is not available in either your configuration file or in the environment variable, Workbench will prompt for a password.","title":"Required configuration settings"},{"location":"configuration/#drupal-settings","text":"Setting Required Default value Description content_type islandora_object The machine name of the Drupal node content type you are creating or updating. Required in \"create\" and \"update\" tasks. drupal_filesystem fedora:// One of 'fedora://', 'public://', or 'private://' (the wrapping quotation marks are required). Only used with Drupal 8.x - 9.1; starting with Drupal 9.2, the filesystem is automatically detected from the media's configuration. Will eventually be deprecated. 
allow_adding_terms false In create , update , add_media , update_media , create_terms , and update_terms tasks, determines if Workbench will add taxonomy terms if they do not exist in the target vocabulary. See more information in the \" Taxonomy reference fields \" section. Note: this setting is not required in create_terms tasks unless you are adding new terms to a taxonomy reference field on the term entries. protected_vocabularies [] (empty list) Allows you to exclude vocabularies from having new terms added via allow_adding_terms . See more information in the \" Taxonomy reference fields \" section. vocab_id \u2714\ufe0f in create_terms tasks. Identifies the vocabulary you are adding terms to in create_tersm tasks. See more information in the \" Creating taxonomy terms \" section. update_mode replace Determines if Workbench will replace , append (add to) , or delete field values during update tasks. See more information in the \" Updating nodes \" section. Also applies to update_media tasks. validate_terms_exist true If set to false, during --check Workbench will not query Drupal to determine if taxonomy terms exist. The structure of term values in CSV are still validated; this option only tells Workbench to not check for each term's existence in the target Drupal. Useful to speed up the --check process if you know terms don't exist in the target Drupal. validate_parent_node_exists true If set to false, during --check Workbench will not query Drupal to determine if nodes whose node IDs are in field_member_of exist. Useful to speed up the --check process if you know terms already exist in the target Drupal. max_node_title_length 255 Set to the number of allowed characters for node titles if your Drupal uses Node Title Length . If unsure what your the maximum length of the node titles your site allows, check the length of the \"title\" column in your Drupal database's \"node_field_data\" table. list_missing_drupal_fields false Set to true to tell Workbench to provide a list of fields that exist in your input CSV but that cannot be matched to Drupal field names (or reserved column names such as \"file\"). If false , Workbench will still check for CSV column headers that it can't match to Drupal fields, but will exit upon finding the first such field. This option produces a list of fields instead of exiting on detecting the first field. standalone_media_url false Set to true if your Drupal instance has the \"Standalone media URL\" option at /admin/config/media/media-settings checked. The Drupal default is to have this unchecked, so you only need to use this Workbench option if you have changed Drupal's default. More information is available. require_entity_reference_views true Set to false to tell Workbench to not require a View to expose the values in an entity reference field configured to use an Entity Reference View. Additional information is available here . entity_reference_view_endpoints A list of mappings from Drupal/CSV field names to Views REST Export endpoints used to look up term names for entity reference field configured to use an Entity Reference View. Additional information is available here . text_format_id basic_html The text format ID (machine name) to apply to all Drupal text fields that have a \"formatted\" field type. See \" Text fields with markup \" for more information. field_text_format_ids Defines a mapping between field machine names the machine names of format IDs for \"formatted\" fields. See \" Text fields with markup \" for more information. 
paragraph_fields Defines the structure of paragraph fields in the input CSV. See \" Entity Reference Revisions fields (paragraphs) \" for more information.","title":"Drupal settings"},{"location":"configuration/#input-data-location-settings","text":"Setting Required Default value Description input_dir input_data The full or relative path to the directory containing the files and metadata CSV file. input_csv metadata.csv Path to the CSV metadata file. Can be absolute, or if just the filename is provided, will be assumed to be in the directory named in input_dir . Can also be the URL to a Google spreadsheet (see the \" Using Google Sheets as input data \" section for more information). google_sheets_csv_filename google_sheet.csv Local CSV filename for data from a Google spreadsheet. See the \" Using Google Sheets as input data \" section for more information. google_sheets_gid 0 The \"gid\" of the worksheet to use in a Google Sheet. See the \" Using Google Sheets as input data \" section for more information. excel_worksheet Sheet1 If using an Excel file as your input CSV file, the name of the worksheet that the CSV data will be extracted from. input_data_zip_archives [] List of local file paths and/or URLs to .zip archives to extract into the directory defined in input_dir . See \" Using a local or remote .zip archive as input data \" for more info. delete_zip_archive_after_extraction true Tells Workbench to delete an input .zip archive after it has been extracted.","title":"Input data location settings"},{"location":"configuration/#input-csv-file-settings","text":"Setting Required Default value Description id_field id The name of the field in the CSV that uniquely identifies each record. delimiter , [comma] The delimiter used in the CSV file, for example, \",\" or \"\\t\" (must use double quotes with \"\\t\"). If omitted, defaults to \",\". subdelimiter | [pipe] The subdelimiter used in the CSV file to define multiple values in one field. If omitted, defaults to \"|\". Can be a string of multiple characters, e.g. \"^^^\". csv_field_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding values that are copied into the CSV input file. More detail provided in the \" CSV field templates \" section. csv_value_templates Used in the create and update tasks only. A list of Drupal field machine names and corresponding templates. More detail provided in the \" CSV value templates \" section. csv_value_templates_for_paged_content Used in create tasks only. Similar to csv_value_templates but applies to paged/child items created using the \" Using subdirectories \" method of creating paged content. More detail provided in the \" CSV value templates \" section. csv_value_templates_rand_length 5 Length of the $random_alphanumeric_string and $random_number_string variables in CSV value templates. More detail provided in the \" CSV value templates \" section. allow_csv_value_templates_if_field_empty [] List of fields to populate with CSV value templates if the CSV field is empty. More detail provided in the \" CSV value templates \" section. ignore_csv_columns Used in the create and update tasks only. A list of CSV column headers that Workbench should ignore. For example, ignore_csv_columns: [Target Collection, Ready to publish] csv_start_row Used in create and update tasks. Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) before the designated row number. More information is available. csv_stop_row Used in create and update tasks. 
Tells Workbench to ignore all rows/records in input CSV (or Google Sheet or Excel) after the designated row number. More information is available. csv_rows_to_process Used in create and update tasks. Tells Workbench to process only the rows/records in input CSV (or Google Sheet or Excel) with the specified \"id\" column values. More information is available. csv_headers names Used in \"create\", \"update\" and \"create_terms\" tasks. Set to \"labels\" to allow use of field labels (as opposed to machine names) as CSV column headers. clean_csv_values_skip [] (empty list) Used in all tasks that use CSV input files. See \" How Workbench cleans your input data \" for more information. columns_with_term_names [] (empty list) Used in all tasks that allow creation of terms on entity ingest. See \" Using numbers as term names \" for more information.","title":"Input CSV file settings"},{"location":"configuration/#output-csv-settings","text":"See \" Generating CSV files \" section for more information. Setting Required Default value Description output_csv Used in \"create\" tasks. The full or relative (to the \"workbench\" script) path to a CSV file with one record per node created by Workbench. output_csv_include_input_csv false Used in \"create\" tasks in conjunction with output_csv . Include in the output CSV all the fields (and their values) from the input CSV. export_csv_term_mode tid Used in \"export_csv\" tasks to indicate whether vocabulary term IDs or names are included in the output CSV file. Set to \"tid\" (the default) to include term IDs, or set to \"name\" to include term names. See \" Exporting field data into a CSV file \" for more information. export_csv_field_list [] (empty list) List of fields to include in exported CSV data. If empty, all fields will be included. See \" Using a Drupal View to identify content to export as CSV \" for more information. view_parameters List of URL parameter/value strings to include in requests to a View. See \" Using a Drupal View to identify content to export as CSV \" for more information. export_csv_file_path \u2714\ufe0f in get_data_from_view tasks. Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the exported CSV file. Required in the \"get_data_from_view\" task; in the \"export_csv\" task, if left empty (the default), the file will be named after the value of the input_csv with \".csv_file_with_field_values\" appended and saved in the directory identified in input_dir . export_file_directory Used in the \"export_csv\" and \"get_data_from_view\" tasks. The path to the directory where files corresponding to the data in the CSV output file will be written. export_file_media_use_term_id http://pcdm.org/use#OriginalFile Used in the \"export_csv\" and \"get_data_from_view\" tasks. The term ID or URI from the Islandora Media Use vocabulary that identifies the file you want to export.","title":"Output CSV settings"},{"location":"configuration/#media-settings","text":"Setting Required Default value Description nodes_only false Include this option in create tasks, set to true , if you want to only create nodes and not their accompanying media. See the \"Creating nodes but not media\" section for more information. allow_missing_files false Determines if file values that point to missing (not found) files are allowed. Used in the create and add_media tasks. If set to true, file values that point to missing files are allowed. For create tasks, a true value will result in nodes without attached media. 
For add_media tasks, a true value will skip adding a media for the missing file CSV value. Defaults to false (which means all file values must name files that exist at their specified locations). Note that this setting has no effect on empty file values; these are always logged, and their corresponding media are simply not created. exit_on_first_missing_file_during_check true Removed as a configuration setting November 1, 2022. Use strict_check instead. strict_check Replaced with perform_soft_checks as of commit dfa60ff (July 14, 2023). media_use_tid http://pcdm.org/use#OriginalFile The term ID for the term from the \"Islandora Media Use\" vocabulary you want to apply to the media being created in create and add_media tasks. You can provide a term URI instead of a term ID, for example \"http://pcdm.org/use#OriginalFile\" . You can specify multiple values for this setting by joining them with the subdelimiter configured in the subdelimiter setting; for example, media_use_tid: 17|18 . You can also set this at the object level by including media_use_tid in your CSV file; values there will override the value set in your configuration file. If you are \" Adding multiple media \", you define media use term IDs in a slightly different way. media_type \u2714\ufe0f in add_media and update_media tasks Overrides, for all media being created, Workbench's default definition of whether the media being created is an image, file, document, audio, or video. Used in the create , create_from_files , add_media , and update_media tasks. More detail provided in the \" Configuring Media Types \" section. Required in all update_media and add_media tasks, not to override Workbench's defaults, but to explicitly indicate the media type being updated/added. media_types_override Overrides default media type definitions on a per file extension basis. Used in the create , add_media , and create_from_files tasks. More detail provided in the \" Configuring Media Types \" section. media_type_file_fields Defines the name of the media field that references the media's file (i.e., the field on the Media type). Usually used with custom media types and accompanied by either the media_type or media_types_override option. For more information, see the \" Configuring Media Types \" section. mimetype_extensions Overrides Workbench's default mappings between MIME types and file extensions. Usually used with remote files where the remote web server returns a MIME type that is not standard. For more information, see the \" Configuring Media Types \" section. extensions_to_mimetypes Overrides Workbench's default mappings between file extensions and MIME types. For more information, see the \" Configuring Media Types \" section. delete_media_with_nodes true When a node is deleted using a delete task, by default, all of its media are automatically deleted. Set this option to false if you do not want to delete all of a node's media (you do not generally want to keep the media without the node). use_node_title_for_media_title true If set to true (default), names the media the same as the parent node's title value. Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_nid_in_media_title false If set to true , assigns a name to the media following the pattern {node_id}-Original File , i.e., uses the parent node's node ID in the media title. Applies to both create and add_media tasks. 
field_for_media_title Identifies a CSV field name (machine name, not human readable) that will be used as the media title in create tasks. For example, field_for_media_title: id . Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. use_node_title_for_remote_filename false Set to true to use a version of the parent node's title as the filename for a remote (http[s]) file. Replaces all non-alphanumeric characters with an underscore ( _ ). Truncates the value of the field to 255 characters. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) use_nid_in_media_filename setting. field_for_remote_filename Identifies a CSV field name (machine name, not human readable) that will be used as the filename for a remote (http[s]) file. For example, field_for_remote_filename: id . Truncates the value of the field to 255 characters. If the field is empty in the CSV, the CSV ID field value will be used. Applies to both create and add_media tasks. Note: this setting replaces (the previously undocumented) field_for_media_filename setting. delete_tmp_upload false For remote files, if set to true , the temporary copy of the remote file is deleted after it is used to create media. If false , the copy will remain in the location defined in your temp_dir setting. If the file cannot be deleted (e.g. a virus scanner is scanning it), it will remain and an error message will be added to the log file. additional_files Maps a set of CSV field names to media use term IDs to create additional media (additional to the media created from the file named in the \"file\" column, that is) in create and add_media tasks. See \" Adding multiple media \" for more information. fixity_algorithm None Checksum/hash algorithm to use during transmission fixity checking. Must be one of \"md5\", \"sha1\", or \"sha256\". See \" Fixity checking \" for more information. validate_fixity_during_check false Perform checksum validation during --check . See \" Fixity checking \" for more information. delete_media_by_node_media_use_tids [] (empty list) During delete_media_by_node tasks, allows you to specify which media to delete. Only media with the listed term IDs from the Islandora Media Use vocabulary will be deleted. By default (an empty list), all media are deleted. See \" Deleting Media \" for more information.","title":"Media settings"},{"location":"configuration/#islandora-model-settings","text":"Setting Required Default value Description model [singular] Used in the create_from_files task only. Defines the term ID from the \"Islandora Models\" vocabulary for all nodes created using this task. Note: one of model or models is required. More detail provided in the \" Creating nodes from files \" section. models [plural] Used in the create_from_files task only. Provides a mapping between file extensions and terms in the \"Islandora Models\" vocabulary. Note: one of model or models is required. More detail provided in the \" Creating nodes from files \" section.","title":"Islandora model settings"},{"location":"configuration/#paged-and-compound-content-settings","text":"See the section \" Creating paged content \" for more information. Setting Required Default value Description paged_content_from_directories false Set to true if you are using the \" Using subdirectories \" method of creating paged content. 
page_files_source_dir_field id [or whatever is defined as your id column using the id_field configuration setting] Set to directory if your input CSV contains a \"directory\" column that names each row's page, if you are using the \" Using subdirectories \" method of creating paged content. paged_content_sequence_separator - [hyphen] The character used to separate the page sequence number from the rest of the filename. Used when creating paged content with the \" Using subdirectories \" method. Note: this configuration option was originally misspelled \"paged_content_sequence_seprator\". paged_content_page_model_tid Required if paged_content_from_directories is true. The term ID (or its URI) from the Islandora Models taxonomy to assign to pages. paged_content_image_file_extension If the subdirectories containing your page image files also contain other files (that are not page images), you need to use this setting to tell Workbench which files to create pages from. Common values include tif , tiff , and jpg . paged_content_additional_page_media A mapping of Media Use term IDs (or URIs) to file extensions telling Workbench which Media Use term to apply to media created from additional files such as OCR text files. paged_content_page_display_hints The term ID from the Islandora Display taxonomy to assign to pages. If not included, defaults to the value of the field_display_hints in the parent's record in the CSV file. paged_content_page_content_type Set to the machine name of the Drupal node content type for pages created using the \" Using subdirectories \" method if it is different than the content type of the parent (which is specified in the content_type setting). page_title_template '$parent_title, page $weight' Template used to generate the titles of pages/members in the \" Using subdirectories \" method. paged_content_ignore_files [\"Thumbs.db\"] List of filenames that you want Workbench to ignore when it scans directories to create page- or child-level nodes. See \" Using subdirectories \" for more information.","title":"Paged and compound content settings"},{"location":"configuration/#logging-settings","text":"See the \" Logging \" section for more information. Setting Required Default value Description log_file_path workbench.log The path to the log file, absolute or relative to the directory Workbench is run from. log_file_mode a [append] Set to \"w\" to overwrite the log file, if it exists. log_request_url false Whether or not to log the request URL (and its method). Useful for debugging. log_json false Whether or not to log the raw request JSON POSTed, PUT, or PATCHed to Drupal. Useful for debugging. log_headers false Whether or not to log the raw HTTP headers used in all requests. Useful for debugging. log_response_status_code false Whether or not to log the HTTP response code. Useful for debugging. log_response_time false Whether or not to log the response time of each request that is slower than the average response time for the last 20 HTTP requests Workbench makes to the Drupal server. Useful for debugging. log_response_body false Whether or not to log the raw HTTP response body. Useful for debugging. log_file_name_and_line_number false Whether or not to add the filename and line number that created the log entry. Useful for debugging. log_term_creation true Whether or not to log the creation of new terms during \"create\" and \"update\" tasks (does not apply to the \"create_terms\" task). 
--check will still report that terms in the CSV do not exist in their respective vocabularies.","title":"Logging settings"},{"location":"configuration/#http-settings","text":"Setting Required Default value Description user_agent Islandora Workbench String to use as the User-Agent header in HTTP requests. allow_redirects true Whether or not to allow Islandora Workbench to respond to HTTP redirects. secure_ssl_only true Whether or not to require valid SSL certificates. Set to false if you want to ignore SSL certificates. enable_http_cache true Whether or not to enable Workbench's client-side request cache. Set to false if you want to disable the cache during troubleshooting, etc. http_cache_storage memory The backend storage type for the client-side cache. Set to sqlite if you are getting out of memory errors while running Islandora Workbench. http_cache_storage_expire_after 1200 Length of the client-side cache lifespan (in seconds). Reduce this number if you are using the sqlite storage backend and the database is using too much disk space. Note that reducing the cache lifespan will result in increased load on your Drupal server.","title":"HTTP settings"},{"location":"configuration/#rollback-configuration-and-csv-file-settings","text":"See \" Rolling back \" for more information. Setting Required Default value Description rollback_dir Value of input_dir setting Absolute path to the directory where you want your \"rollback.csv\" file to be written. rollback_config_file_path rollback.yml Relative (to workbench) or absolute path to write the rollback config YAML file to. rollback_csv_file_path [input_dir/rollback.csv] Relative (to workbench) or absolute path to write the rollback CSV file to. timestamp_rollback false Set to true to add a timestamp to the \"rollback.yml\" and corresponding \"rollback.csv\" generated in \"create\" and \"create_from_files\" tasks. rollback_config_filename_template Defines a template that will be used to create the rollback configuration file. The two placeholders available in this template are $config_filename and $input_csv_filename . rollback_csv_filename_template Defines a template that will be used to create the rollback CSV file. The two placeholders available in this template are $config_filename and $input_csv_filename . rollback_file_comments Defines a list of lines to be added to both the rollback config and CSV file as comments. include_password_in_rollback_config_file false Set to true to include the value of the password configuration setting in your rollback config YAML file.","title":"Rollback configuration and CSV file settings"},{"location":"configuration/#miscellaneous-settings","text":"Setting Required Default value Description perform_soft_checks false If set to true, --check will not exit when it encounters an error with parent/child order, file extension registration with Drupal media file fields, missing files named in the files CSV column, or EDTF validation. Instead, it will log any errors it encounters and exit after it has checked all rows in the CSV input file. Note: this setting replaces strict_check as of commit dfa60ff (July 14, 2023). temp_dir Value of the temporary directory defined by your system (as returned by Python's gettempdir() function). Relative or absolute path to the directory where you want Workbench's temporary files to be written. These include the \".preprocessed\" version of your input CSV, remote files temporarily downloaded to create media, and the CSV ID to node ID map database. 
sqlite_db_filename [temp_dir]/workbench_temp_data.db Filename of the SQLite database used to store session data. pause Defines the number of seconds to pause between all 'POST', 'PUT', 'PATCH', 'DELETE' requests to Drupal. Include it in your configuration to lessen the impact of Islandora Workbench on your site during large jobs, for example pause: 1.5. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause Defines the number of seconds to pause between each REST request to Drupal. Works like \"pause\" but only takes effect when the Drupal server's response to the most recent request is slower (determined by the \"adaptive_pause_threshold\" value) than the average response time for the last 20 requests. More information is available in the \" Reducing Workbench's impact on Drupal \" documentation. adaptive_pause_threshold 2 A weighting of the response time for the most recent request, relative to the average response times of the last 20 requests. This weighting determines how much slower the Drupal server's response to the most recent Workbench request must be in order for adaptive pausing to take effect for the next request. For example, if set to \"1\", adaptive pausing will happen when the response time is equal to the average of the last 20 response times; if set to \"2\", adaptive pausing will take effect if the last request's response time is double the average. progress_bar false Show a progress bar when running Workbench instead of row-by-row output. bootstrap List of absolute paths to one or more scripts that execute prior to Workbench connecting to Drupal. More information is available in the \" Hooks \" documentation. shutdown List of absolute paths to one or more scripts that execute after Workbench has finished interacting with Drupal. More information is available in the \" Hooks \" documentation. preprocessors List of absolute paths to one or more scripts that are applied to CSV values prior to the values being ingested into Drupal. More information is available in the \" Hooks \" documentation. node_post_create List of absolute paths to one or more scripts that execute after a node is created. More information is available in the \" Hooks \" documentation. node_post_update List of absolute paths to one or more scripts that execute after a node is updated. More information is available in the \" Hooks \" documentation. media_post_create List of absolute paths to one or more scripts that execute after a media is created. More information is available in the \" Hooks \" documentation. drupal_8 Deprecated. path_to_python python Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the python interpreter. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". path_to_workbench_script workbench Used in create tasks that also use the secondary_tasks option. Tells Workbench the path to the workbench script. For details on when to use this option, refer to the end of the \"Secondary Tasks\" section of \" Creating paged, compound, and collection content \". contact_sheet_output_dir contact_sheet_output Used in create tasks to specify the name of the directory where contact sheet output is written. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. 
contact_sheet_css_path assets/contact_sheet/contact-sheet.css Used in create tasks to specify the path of the CSS stylesheet to use in contact sheets. Can be relative (to the Workbench directory) or absolute. See \" Generating a contact sheet \" for more information. node_exists_verification_view_endpoint Used in create tasks to tell Workbench to check whether a node with a matching value in your input CSV already exists. See \" Checking if nodes already exist \" for more information. redirect_status_code 301 Used in create_redirect tasks to set the HTTP response code in redirects. See \" Creating URL redirects \" for more information. remind_user_to_run_check false Prompts the user to confirm whether they have run Workbench with --check . See \" Prompting the user \" for more information. user_prompts [] (empty list) List of custom prompts to present to the user. See \" Prompting the user \" for more information. check_lock_file_path Defines a \"lock file\" that contains data generated by a successful --check . See \" Checking configuration and input data \" for more information. secondary_tasks A list of configuration file names that are executed as secondary tasks after the primary task to create compound objects. See \" Using a secondary task \" for more information. csv_id_to_node_id_map_path [temp_dir]/csv_id_to_node_id_map.db Filename of the SQLite database used to store CSV row ID to node ID mappings for nodes created during create and create_from_files tasks. If you want to store your map database outside of a temporary directory, use an absolute path to the database file for this setting. If you want to disable population of this database, set this config setting to false . See \" Generating CSV files \" for more information. query_csv_id_to_node_id_map_for_parents false Queries the CSV ID to node ID map when creating compound content. Set to true if you want to use parent IDs from previous Workbench sessions. Note: this setting is automatically set to true in secondary task config files. See \" Creating parent/child relationships across Workbench sessions \" for more information. ignore_duplicate_parent_ids true Tells Workbench to ignore entries in the CSV ID to node ID map that have the same parent ID value. Set to false if you want Workbench to warn you that there are duplicate parent IDs in your CSV ID to node ID map. See \" Creating parent/child relationships across Workbench sessions \" for more information. When you run Islandora Workbench with the --check argument, it will verify that all configuration options required for the current task are present, and if they aren't, tell you so. Note Islandora Workbench automatically converts any configuration keys to lowercase, e.g., Task is equivalent to task .","title":"Miscellaneous settings"},{"location":"configuration/#validating-the-syntax-of-the-configuration-file","text":"When you run Workbench, it confirms that your configuration file is valid YAML. This is a syntax check only, not a content check. If the file is valid YAML, Workbench then goes on to perform a long list of application-specific checks . If this syntax check fails, some detail about the problem will be displayed to the user. The same information plus the entire Python stack trace is also logged to a file named \"workbench.log\" in the directory Islandora Workbench is run from. 
This file name is Workbench's default log file name, but in this case (validating the config file's YAML syntax), that file name is used regardless of the log file location defined in the configuration's log_file_path option. The reason the error is logged in the default location instead of the value in the configuration file (if one is present) is that the configuration file isn't valid YAML and therefore can't be parsed.","title":"Validating the syntax of the configuration file"},{"location":"configuration/#example-configuration-files","text":"These examples provide inline annotations explaining why the settings are included in the configuration file. Blank rows/lines are included for readability.","title":"Example configuration files"},{"location":"configuration/#create-nodes-only-no-media","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to create nodes with no media. # Also, this tells --check to skip all validation of \"file\" locations. # Other media settings, like \"media_use_tid\", are also ignored. nodes_only: true","title":"Create nodes only, no media"},{"location":"configuration/#use-a-custom-log-file-location","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora # This setting tells Workbench to write its log file to the location specified # instead of the default \"workbench.log\" within the directory Workbench is run from. log_file_path: /home/mark/workbench_log.txt","title":"Use a custom log file location"},{"location":"configuration/#include-some-csv-field-templates","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora # The values in this list of field templates are applied to every row in the # input CSV file before the CSV file is used to populate Drupal fields. The # field templates are also applied during the \"--check\" in order to validate # the values of the fields. csv_field_templates: - field_member_of: 205 - field_model: 25","title":"Include some CSV field templates"},{"location":"configuration/#use-a-google-sheet-as-input-csv","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit' # You only need to specify the google_sheets_gid option if the worksheet in the Google Sheet # is not the default one. google_sheets_gid: 1867618389","title":"Use a Google Sheet as input CSV"},{"location":"configuration/#create-nodes-and-media-from-files-no-input-csv-file","text":"task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # The files to create the nodes from are in this directory. input_dir: /tmp/sample_files # This tells Workbench to write a CSV file containing node IDs of the # created nodes, plus the field names used in the target content type # (\"islandora_object\" by default). output_csv: /tmp/sample_files.csv # All nodes should get the \"Model\" value corresponding to this URI. 
model: 'https://schema.org/DigitalDocument'","title":"Create nodes and media from files (no input CSV file)"},{"location":"configuration/#create-taxonomy-terms","text":"task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary","title":"Create taxonomy terms"},{"location":"configuration/#ignore-some-columns-in-your-input-csv-file","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: input.csv # This tells Workbench to ignore the 'date_generated' and 'batch_id' # columns in the input.csv file. ignore_csv_columns: ['date_generated', 'batch_id']","title":"Ignore some columns in your input CSV file"},{"location":"configuration/#generating-sample-islandora-content","text":"task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora # This directory must match the one defined in the script's 'dest_dir' variable. input_dir: /tmp/autogen_input media_use_tid: 17 output_csv: /tmp/my_sample_content_csv.csv model: http://purl.org/coar/resource_type/c_c513 # This is the script that generates the sample content. bootstrap: - \"/home/mark/Documents/hacking/workbench/generate_image_files.py\"","title":"Generating sample Islandora content"},{"location":"configuration/#running-a-post-action-script","text":"task: create host: \"http://localhost:8000\" username: admin password: islandora node_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # node_post_update: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py'] # media_post_create: ['/home/mark/hacking/islandora_workbench/scripts/entity_post_task_example.py']","title":"Running a post-action script"},{"location":"configuration/#create-nodes-and-media-from-files","text":"task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora input_dir: path/to/files models: - 'http://purl.org/coar/resource_type/c_1843': ['zip', 'tar', ''] - 'https://schema.org/DigitalDocument': ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 'http://purl.org/coar/resource_type/c_c513': ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 'http://purl.org/coar/resource_type/c_18cc': ['mp3', 'wav', 'aac'] - 'http://purl.org/coar/resource_type/c_12ce': ['mp4']","title":"Create nodes and media from files"},{"location":"configuration_settings_on_the_command_line/","text":"You can override the following settings in configuration files by including them as command-line arguments to Workbench: input_dir input_csv google_sheets_gid log_file_path rollback_csv_file_path rollback_config_file_path csv_start_row csv_stop_row In all cases, you need to add -- to conform with Python's command-line argument syntax. 
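Any of the settings listed above can be combined and overridden in this way. As a hypothetical illustration (the row numbers and the example.yml filename here are placeholders, not taken from a real job), overriding csv_start_row and csv_stop_row together to process only rows 10 through 20 of your input CSV would look like: python workbench --config example.yml --csv_start_row 10 --csv_stop_row 20 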
For example, if you want to specify an input CSV file different from the one registered in your configuration file, include --input_csv as a command-line argument to Workbench: python workbench --config example.yml --input_csv alternate.csv To specify a log file path, include --log_file_path as an argument: python workbench --config example.yml --check --log_file_path custom_log_during_check.log","title":"Specifying configuration settings on the command line"},{"location":"contact_sheet/","text":"During --check in create tasks, you can generate a contact sheet from your input CSV by adding the command-line option --contactsheet , like this: ./workbench --config somecreatetask.yml --check --contactsheet If you use this option, Workbench will tell you the location of your contact sheet, e.g. Contact sheet is at /home/mark/hacking/islandora_workbench/contact_sheet_output/contact_sheet.htm , which you can then open in your web browser. Contact sheets look like this: The icons (courtesy of icons8 ) indicate if the item in the CSV is an: image video audio PDF miscellaneous binary file compound item There is also an icon that indicates that there is no file in the input CSV for the item. Fields with long values are indicated by a link formatted like \"[...]\". Placing your pointer over this link will show a popup that contains the full value. Also, in fields that have multiple values, the subdelimiter defined in your configuration file is replaced by a small square (\u25a1), to improve readability. Compound items in the input CSV sheet are represented by a \"compound\" icon and a link to a separate contact sheet containing the item's members: The members contact sheet looks like this: You can define where your contact sheet is written to using the contact_sheet_output_dir configuration setting. If the directory doesn't exist, Workbench will create it. All the HTML, CSS, and image files that are part of the contact sheet(s) are written to this directory, so if you want to share it with someone (for quality assurance purposes, for example), you can zip up the directory and its contents and give them a copy. All the files required to render the contact sheet are in this directory: You can also define the path to an alternative CSS stylesheet to use for your contact sheets. The default CSS file is assets/contact_sheet/contact-sheet.css within the Workbench directory. To specify your own CSS file, include its relative (to the Workbench directory) or absolute path in the contact_sheet_css_path config setting. If you use an alternative CSS file, you should base yours on a copy of the default one, since the layout of the contact sheets depends heavily on the default CSS.","title":"Generating a contact sheet"},{"location":"creating_nodes_from_files/","text":"If you want to ingest some files without a metadata CSV, you can do so using the create_from_files task. A common application of this ability is in automated workflows where Islandora objects are created from files saved to a watch folder , and metadata is added later. Nodes created using this task have only the following properties/fields populated: Content type: this is defined in the configuration file, using the content_type setting. Title: this is derived from the filename minus the extension. Spaces are allowed in the filenames. Published: published by default, or overridden in the configuration file using the published setting. Model: defined in the configuration file using either the model or models setting. 
The media attached to the nodes is the file, with its type (image, document, audio, video, file) assigned by the media_types_override configuration setting and its Media Use tag defined in the media_use_tid setting. The configuration options for the create_from_files task are the same as the options used in the create task (with one exception: input_csv is not required). The only option specific to this task is models , which is a mapping from term IDs (or term URIs) in the \"Islandora Models\" vocabulary to file extensions. Note that either the models or model configuration option is required in the create_from_files task. Use models when your nodes will have different Islandora Model values. Here is a sample configuration file for this task: task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora output_csv: /tmp/output.csv models: - 23: ['zip', 'tar', ''] - 27: ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 25: ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 22: ['mp3', 'wav', 'aac'] - 26: ['mp4'] Using model is convenient when all of the objects you are creating are the same Islandora Model: task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora output_csv: /tmp/output.csv model: 25 You can also use the URIs assigned to terms in the Islandora Models vocabulary, for example: task: create_from_files host: \"http://localhost:8000\" username: admin password: islandora output_csv: /tmp/output.csv models: - 'http://purl.org/coar/resource_type/c_1843': ['zip', 'tar', ''] - 'https://schema.org/DigitalDocument': ['pdf', 'doc', 'docx', 'ppt', 'pptx'] - 'http://purl.org/coar/resource_type/c_c513': ['tif', 'tiff', 'jp2', 'png', 'gif', 'jpg', 'jpeg'] - 'http://purl.org/coar/resource_type/c_18cc': ['mp3', 'wav', 'aac'] - 'http://purl.org/coar/resource_type/c_12ce': ['mp4'] Note In the workflow described at the beginning of this section, you might want to include the output_csv option in the configuration file, since the resulting CSV file can be populated with metadata later and used in an update task to add it to the stub nodes.","title":"Creating nodes from files"},{"location":"creating_taxonomy_terms/","text":"Islandora Workbench lets you create vocabulary terms from CSV files. This ability is separate from creating vocabulary terms while creating the nodes in a create task, as described in the \" Field data (Drupal and CSV) \" documentation. You should create vocabulary terms using the options described here if any of these situations applies to you: you are working with a vocabulary that has fields in addition to term name you are working with a vocabulary that is hierarchical you want terms to exist before you create nodes using a create task. If you want to create terms during a create task, and if the terms you are creating don't have any additional fields or hierarchical relationships to other terms, then you don't need to use the task described here. You can create terms as described in the \"Taxonomy reference fields\" section of \" Field data (Drupal and CSV) \". The configuration and input CSV files To add terms to a vocabulary, you use a create_terms task. A typical configuration file looks like this: task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary The vocab_id config option is required. It contains the machine name of the vocabulary you are adding the terms to. 
The CSV file identified in the input_csv option has one required column, term_name , which contains each term's name: term_name Automobiles Sports cars SUVs Jaguar Porche Land Rover Note Unlike input CSV files used during create tasks, input CSV files for create_terms tasks do not have an \"id\" column. Instead, term_name is the column whose values are the unique identifier for each term. Workbench assumes that term names are unique within a vocabulary. If the terms in the term_name column aren't unique, Workbench only creates the term the first time it encounters it in the CSV file. Two reserved but optional columns, weight and description , are described next. A third reserved column header, parent , is described in the \"Hierarchical vocabularies\" section. You can also add columns that correspond to a vocabulary's field names, just like you do when you assemble your CSV for create tasks, as described in the \"Vocabularies with custom fields\" section below. Term weight, description, and published columns Three other reserved CSV column headers are weight , description , and published . All Drupal taxonomy terms have these fields, but populating them is optional. weight is used to sort the terms in the vocabulary overview page in relation to their parent term (or the vocabulary root if a term has no parent). Values in the weight field are integers. The lower the weight, the earlier the term sorts. For example, a value of \"0\" (zero) sorts the term at the top in relation to its parent, and a value of \"100\" sorts the term much lower. description is, as the name suggests, a field that contains a description of the term. published takes a \"0\" or a \"1\" (like it does in create tasks for nodes) to indicate whether or not the term is published. If absent, defaults to \"1\" (published). Note If you are going to use published in your create_terms tasks, you need to remove the \"Published\" filter from the \"Term from term name\" View. If you do not add weight values, Drupal sorts the terms in the vocabulary alphabetically. Vocabularies with custom fields Example column headers in a CSV file for use in create_terms tasks that has two additional fields, \"field_example\" and \"field_second_example\", in addition to the optional \"description\" column, would look like this: term_name,field_example,field_second_example,description Here is a sample CSV input file with headers for description and field_external_uri fields, and two records for terms named \"Program file\" and \"Data set\": term_name,description,field_external_uri Program file,A program file is executable source code or a binary executable file.,http://id.loc.gov/vocabulary/mfiletype/program Data set,\"A data set is raw, often tabular, data.\",https://www.wikidata.org/wiki/Q1172284 Optional fields don't need to be included in your CSV if you are not populating them, but fields that are configured as required in the vocabulary settings do need to be present, and populated (just like required fields on content types in create tasks). Running --check on a create_terms task will detect any required fields that are missing from your input CSV file. Hierarchical vocabularies If you want to create a vocabulary that is hierarchical, like this: you can add a parent column to your CSV and for each row, include the term name of the term you want as the parent. 
For example, the above sample vocabulary was created using this CSV input file: term_name,parent Automobiles, Sports cars,Automobiles SUVs,Automobiles Jaguar,Sports cars Porche,Sports cars Land Rover,SUVs One important aspect of creating a hierarchical vocabulary is that all parents must exist before their children are added. That means that within your CSV file, the rows for terms used as parents should be placed earlier in the file than the rows for their children. If a term is named as a parent but doesn't exist yet because it came after the child term in the CSV, Workbench will create the child term and write a warning in the log indicating that the parent didn't exist at the time of creating the child. In these cases, you can manually assign a parent to the terms using Drupal's taxonomy administration tools. You can include the parent column in your CSV along with Drupal field names. Workbench will not only create the hierarchy, it will also add the field data to the terms: term_name,parent,description,field_external_uri Automobiles,,, Sports cars,Automobiles,\"Sports cars focus on performance, handling, and driver experience.\",https://en.wikipedia.org/wiki/Sports_car SUVs,Automobiles,\"SUVs, or Sports Utility Vehicles, are the most popular type of automobile.\",https://en.wikipedia.org/wiki/Sport_utility_vehicle Jaguar,Sports cars,, Porche,Sports cars,, Land Rover,SUVs,,","title":"Creating taxonomy terms"},{"location":"creating_taxonomy_terms/#the-configuration-and-input-csv-files","text":"To add terms to a vocabulary, you use a create_terms task. A typical configuration file looks like this: task: create_terms host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_term_data.csv vocab_id: myvocabulary The vocab_id config option is required. It contains the machine name of the vocabulary you are adding the terms to. The CSV file identified in the input_csv option has one required column, term_name , which contains each term's name: term_name Automobiles Sports cars SUVs Jaguar Porche Land Rover Note Unlike input CSV files used during create tasks, input CSV files for create_terms tasks do not have an \"id\" column. Instead, term_name is the column whose values are the unique identifier for each term. Workbench assumes that term names are unique within a vocabulary. If the terms in the term_name column aren't unique, Workbench only creates the term the first time it encounters it in the CSV file. Two reserved but optional columns, weight , and description , are described next. A third reserved column header, parent is described in the \"Hierarchical vocabularies\" section. You can also add columns that correspond to a vocabulary's field names, just like you do when you assemble your CSV for create tasks, as described in the \"Vocabularies with custom fields\" section below.","title":"The configuration and input CSV files"},{"location":"creating_taxonomy_terms/#term-weight-description-and-published-columns","text":"Three other reserved CSV column headers are weight , description , and published . All Drupal taxonomy terms have these two fields but populating them is optional. weight is used to sort the terms in the vocabulary overview page in relation to their parent term (or the vocabulary root if a term has no parent). Values in the weight field are integers. The lower the weight, the earlier the term sorts. For example, a value of \"0\" (zero) sorts the term at the top in relation to its parent, and a value of \"100\" sorts the term much lower. 
description is, as the name suggests, a field that contains a description of the term. published takes a \"0\" or a \"1\" (like it does in create tasks for nodes) to indicate that the term is published or not. If absent, defaults to \"1\" (published). Note If you are going to use published in your create_terms tasks, you need to remove the \"Published\" filter from the \"Term from term name\" View. If you do not add weight values, Drupal sorts the terms in the vocabulary alphabetically.","title":"Term weight, description, and published columns"},{"location":"creating_taxonomy_terms/#vocabularies-with-custom-fields","text":"Example column headers in a CSV file for use in create_terms tasks that has two additional fields, \"field_example\" and \"field_second_example\", in addition to the optional \"description\" column, would look like this: term_name,field_example,field_second_example,description Here is a sample CSV input file with headers for description and field_external_uri fields, and two records for terms named \"Program file\" and \"Data set\": term_name,description,field_external_uri Program file,A program file is executable source code or a binary executable file.,http://id.loc.gov/vocabulary/mfiletype/program Data set,\"A data set is raw, often tabular, data.\",https://www.wikidata.org/wiki/Q1172284 Optional fields don't need to be included in our CSV if you are not populating them, but fields that are configured as required in the vocabulary settings do need to be present, and populated (just like required fields on content types in create tasks). Running --check on a create_terms task will detect any required fields that are missing from your input CSV file.","title":"Vocabularies with custom fields"},{"location":"creating_taxonomy_terms/#hierarchical-vocabularies","text":"If you want to create a vocabulary that is hierarchical, like this: you can add a parent column to your CSV and for each row, include the term name of the term you want as the parent. For example, the above sample vocabulary was created using this CSV input file: term_name,parent Automobiles, Sports cars,Automobiles SUVs,Automobiles Jaguar,Sports cars Porche,Sports cars Land Rover,SUVs One important aspect of creating a hierarchical vocabulary is that all parents must exist before their children are added. That means that within your CSV file, the rows for terms used as parents should be placed earlier in the file than the rows for their children. If a term is named as a parent but doesn't exist yet because it came after the child term in the CSV, Workbench will create the child term and write a warning in the log indicating that the parent didn't exist at the time of creating the child. In these cases, you can manually assign a parent to the terms using Drupal's taxonomy administration tools. You can include the parent column in your CSV along with Drupal field names. 
Workbench will not only create the hierarchy, it will also add the field data to the terms: term_name,parent,description,field_external_uri Automobiles,,, Sports cars,Automobiles,\"Sports cars focus on performance, handling, and driver experience.\",https://en.wikipedia.org/wiki/Sports_car SUVs,Automobiles,\"SUVs, or Sports Utility Vehicles, are the most popular type of automobile.\",https://en.wikipedia.org/wiki/Sport_utility_vehicle Jaguar,Sports cars,, Porche,Sports cars,, Land Rover,SUVs,,","title":"Hierarchical vocabularies"},{"location":"csv_value_templates/","text":"Note This section describes using CSV value templates in your configuration file. For information on CSV field templates, see the \" CSV field templates \". For information on CSV file templates, see the \" CSV file templates \" section. Applying CSV value templates to rows in your input CSV In create and update tasks, you can configure templates that are applied to all the values in a CSV column (or to each subvalue if you have multiple values in a single field) as if the templated text were present in the values in your CSV file. The templates are configured in the csv_value_templates option. An example looks like this: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value For each value in the named CSV field (in this example, field_linked_agent ), Workbench will apply the template text (in this example, relators:aut:person: ) to the CSV value, represented in the template as $csv_value . An input CSV file that looks like this: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"Jordan, Mark|Cantelo, M.\" will have the template relators:aut:person:$csv_value applied to so it is converted to: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"relators:aut:person:Jordan, Mark|relators:aut:person:Cantelo, M.\" This example configuration defines only one field/template pair, but csv_value_templates can define multiple field/template pairs, for example: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value - field_subject: subject:$csv_value - field_local_identifier: SFU-$uuid_string The following template variables are available: $csv_value : The verbatim string value of the field. $file : The verbatim value of the file column in the row. $filename_without_extension : The filename portion only (with no leading directory path or file extension) in the file column in the row. $weight : The value of the field_weight column in the CSV, or for paged content created using the \" Using subdirectories \" option, the sequence number embedded in the page's filenme. $random_alphanumeric_string : A randomly generated string containing numbers and mixed-case letters. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $random_number_string : A randomly generated string containing only numbers. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $uuid_string : A valid version 4 UUID. A few things to note about CSV value templates: The variable can be anywhere in your template (beginning, middle, or end). You can only define a single template for a given field in csv_value_templates , but you can include multiple variables in a single template. 
If multiple variables are present in a template, they are applied in the order listed above. If a CSV field contains multiple subvalues, the same template is applied to all subvalues in the field (as illustrated above). Values in the templated CSV output are validated against Drupal's configuration in the same way that values present in your CSV are validated. By default, CSV value templates won't be applied to empty fields. However, if you want a template to be applied to a field if that field is empty, you can include the allow_csv_value_templates_if_field_empty setting in your config file defining a list of column names. For example, allow_csv_value_templates_if_field_empty: [field_identifier] will apply any templates defined for field_identifier in your csv_value_templates setting, even if field_identifier is empty in your input CSV; for example, the following will apply the template defined in the above example configuration even if the named fields are empty: allow_csv_value_templates_if_field_empty: ['field_local_identifier', 'field_subject'] Applying CSV value templates to paged content Paged content (or as sometimes referred to, children) created using the \" Using subdirectories \" method do not have their own rows in input CSV files. Any fields that are configured to be \"required\" in the parent and child's content type are copied from the parent's CSV row and applied to all that parent's pages/children. If you want to add non-required field data to pages/children, However, you can use CSV value templates to do that. In this case: the CSV row that is used as the source of $csv_value is the page's (or child's) parent row; in other words, the value of $csv_value is inherited from a page/child's parent row the $file variable is the name of the page/child's filename (and $filename_without_extension is derived from this value) the $weight variable is taken from the page/child's sequence indicator, e.g. a filename of page-002.jpg would result in a $weight value of \"2\". If you want to apply CSV field templates to page/child items using this technique, register the templates in the csv_value_templates_for_paged_content config setting. The structure of the field-to-template mappings is identical to those used in the csv_value_templates setting as illustrated above. For example: csv_value_templates_for_paged_content: - field_linked_agent: relators:aut:person:$csv_value - field_edtf_date_issued: $csv_value - field_local_identifier: $csv_value-$weight Even though this section documents how to apply templates with variables, you can also apply \"templates\" to pages/child items that are complete values, that do not use variables. For example, if you want to add the term \"Newspapers\" to the field_genre field in each page in a newspaper issue, you can register that string as your \"template\", e.g. csv_value_templates_for_paged_content: - field_genre: Newspapers","title":"CSV value templates"},{"location":"csv_value_templates/#applying-csv-value-templates-to-rows-in-your-input-csv","text":"In create and update tasks, you can configure templates that are applied to all the values in a CSV column (or to each subvalue if you have multiple values in a single field) as if the templated text were present in the values in your CSV file. The templates are configured in the csv_value_templates option. 
An example looks like this: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value For each value in the named CSV field (in this example, field_linked_agent ), Workbench will apply the template text (in this example, relators:aut:person: ) to the CSV value, represented in the template as $csv_value . An input CSV file that looks like this: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"Jordan, Mark|Cantelo, M.\" will have the template relators:aut:person:$csv_value applied to so it is converted to: file,id,title,field_model,field_linked_agent IMG_2940.JPG,03,Looking across Burrard Inlet,25,\"relators:aut:person:Jordan, Mark|relators:aut:person:Cantelo, M.\" This example configuration defines only one field/template pair, but csv_value_templates can define multiple field/template pairs, for example: csv_value_templates: - field_linked_agent: relators:aut:person:$csv_value - field_subject: subject:$csv_value - field_local_identifier: SFU-$uuid_string The following template variables are available: $csv_value : The verbatim string value of the field. $file : The verbatim value of the file column in the row. $filename_without_extension : The filename portion only (with no leading directory path or file extension) in the file column in the row. $weight : The value of the field_weight column in the CSV, or for paged content created using the \" Using subdirectories \" option, the sequence number embedded in the page's filenme. $random_alphanumeric_string : A randomly generated string containing numbers and mixed-case letters. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $random_number_string : A randomly generated string containing only numbers. This string's default length is 5 characters, but this can be overridden by including csv_value_templates_rand_length in your configuration file, e.g. csv_value_templates_rand_length: 10 . $uuid_string : A valid version 4 UUID. A few things to note about CSV value templates: The variable can be anywhere in your template (beginning, middle, or end). You can only define a single template for a given field in csv_value_templates , but you can include multiple variables in a single template. If multiple variables are present in a template, they are applied in the order listed above. If a CSV field contains multiple subvalues, the same template is applied to all subvalues in the field (as illustrated above). Values in the templated CSV output are validated against Drupal's configuration in the same way that values present in your CSV are validated. By default, CSV value templates won't be applied to empty fields. However, if you want a template to be applied to a field if that field is empty, you can include the allow_csv_value_templates_if_field_empty setting in your config file defining a list of column names. 
For example, allow_csv_value_templates_if_field_empty: [field_identifier] will apply any templates defined for field_identifier in your csv_value_templates setting, even if field_identifier is empty in your input CSV; for example, the following will apply the template defined in the above example configuration even if the named fields are empty: allow_csv_value_templates_if_field_empty: ['field_local_identifier', 'field_subject']","title":"Applying CSV value templates to rows in your input CSV"},{"location":"csv_value_templates/#applying-csv-value-templates-to-paged-content","text":"Paged content (also referred to as children) created using the \" Using subdirectories \" method does not have its own rows in input CSV files. Any fields that are configured to be \"required\" in the parent and child's content type are copied from the parent's CSV row and applied to all that parent's pages/children. However, if you want to add non-required field data to pages/children, you can use CSV value templates to do that. In this case: the CSV row that is used as the source of $csv_value is the page's (or child's) parent row; in other words, the value of $csv_value is inherited from a page/child's parent row the $file variable is the page/child's filename (and $filename_without_extension is derived from this value) the $weight variable is taken from the page/child's sequence indicator, e.g. a filename of page-002.jpg would result in a $weight value of \"2\". If you want to apply CSV value templates to page/child items using this technique, register the templates in the csv_value_templates_for_paged_content config setting. The structure of the field-to-template mappings is identical to that used in the csv_value_templates setting, as illustrated above. For example: csv_value_templates_for_paged_content: - field_linked_agent: relators:aut:person:$csv_value - field_edtf_date_issued: $csv_value - field_local_identifier: $csv_value-$weight Even though this section documents how to apply templates with variables, you can also apply \"templates\" to page/child items that are complete values and do not use variables. For example, if you want to add the term \"Newspapers\" to the field_genre field in each page in a newspaper issue, you can register that string as your \"template\", e.g. csv_value_templates_for_paged_content: - field_genre: Newspapers","title":"Applying CSV value templates to paged content"},{"location":"deleting_media/","text":"Deleting media using media IDs Note Drupal does not allow a user to delete or modify media files unless the user originally created the file (or is its owner). This means that if you created a media using \"user1\" in your Workbench configuration file, only \"user1\" can delete or modify those files. For delete_media tasks, the value of username will need to be the same as the username used to create the media. If the username defined in a delete_media task is not the same as the Drupal user who owns the files, Drupal will return a 403 response, which you will see in your Workbench logs. You can delete media and their associated files by providing a CSV file with a media_id column that contains the Drupal IDs of media you want to delete. 
For example, your CSV file could look like this: media_id 100 103 104 The config file looks like this (note the task option is 'delete_media'): task: delete_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: delete_media.csv Deleting media using node IDs If you want to delete media from specific nodes without having to know the media IDs as described above, you can use the delete_media_by_node task. This task takes a list of node IDs as input, like this: node_id 345 367 246 The configuration file for this task looks like this: task: delete_media_by_node host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data input_csv: delete_node_media.csv This configuration will delete all media attached to nodes 345, 367, and 246. By default, all media attached to the specified nodes are deleted. A \"delete_media_by_node\" configuration file can include a delete_media_by_node_media_use_tids option that lets you specify a list of Islandora Media Use term IDs that a media must have to be deleted: delete_media_by_node_media_use_tids: [17, 1] Before using this option, consult your Islandora's Islandora Media Use vocabulary page at /admin/structure/taxonomy/manage/islandora_media_use/overview to get the term IDs you need to use.","title":"Deleting media"},{"location":"deleting_media/#deleting-media-using-media-ids","text":"Note Drupal does not allow a user to delete or modify media files unless the user originally created the file (or is its owner). This means that if you created a media using \"user1\" in your Workbench configuration file, only \"user1\" can delete or modify those files. For delete_media tasks, the value of username will need to be the same as the username used to create the media. If the username defined in a delete_media task is not the same as the Drupal user who owns the files, Drupal will return a 403 response, which you will see in your Workbench logs. You can delete media and their associated files by providing a CSV file with a media_id column that contains the Drupal IDs of media you want to delete. For example, your CSV file could look like this: media_id 100 103 104 The config file looks like this (note the task option is 'delete_media'): task: delete_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: delete_media.csv","title":"Deleting media using media IDs"},{"location":"deleting_media/#deleting-media-using-node-ids","text":"If you want to delete media from specific nodes without having to know the media IDs as described above, you can use the delete_media_by_node task. This task takes a list of node IDs as input, like this: node_id 345 367 246 The configuration file for this task looks like this: task: delete_media_by_node host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data input_csv: delete_node_media.csv This configuration will delete all media attached to nodes 345, 367, and 246. By default, all media attached to the specified nodes are deleted. 
A \"delete_media_by_node\" configuration file can include a delete_media_by_node_media_use_tids option that lets you specify a list of Islandora Media Use term IDs that a media must have to be deleted: delete_media_by_node_media_use_tids: [17, 1] Before using this option, consult your Islandora's Islandora Media Use vocabulary page at /admin/structure/taxonomy/manage/islandora_media_use/overview to get the term IDs you need to use.","title":"Deleting media using node IDs"},{"location":"deleting_nodes/","text":"You can delete nodes by providing a CSV file that contains a single column, node_id , like this: node_id 95 96 200 Values in the node_id column can be numeric node IDs (as illustrated above), full URLs, or full URL aliases. The config file for update operations looks like this (note the task option is 'delete'): task: delete host: \"http://localhost:8000\" username: admin password: islandora input_csv: delete.csv Note that when you delete nodes using this method, all media associated with the nodes are also deleted, unless the delete_media_with_nodes configuration option is set to false (it defaults to true ). Typical output produced by a delete task looks like this: Node http://localhost:8000/node/89 deleted. + Media http://localhost:8000/media/329 deleted. + Media http://localhost:8000/media/331 deleted. + Media http://localhost:8000/media/335 deleted. Note that taxonomy terms created with new nodes are not removed when you delete the nodes. Note Drupal does not allow a user to delete or modify media files unless the user originally created (or is the owner) of the file. This means that if you created a media using \"user1\" in your Workbench configuration file, only \"user1\" can delete or modify those files. For delete tasks, the value of username will need to be the same as the username used to create the original media attached to nodes. If the username defined in a delete task is not the same as the Drupal user who owns the files, Drupal will return a 403 response, which you will see in your Workbench logs.","title":"Deleting nodes"},{"location":"development_guide/","text":"This documentation is aimed at developers who want to contribute to Islandora Workbench. General Bug reports, improvements, feature requests, and PRs are welcome. Before you open a pull request, please open an issue so we can discuss your idea . In cases where the PR introduces a potentially disruptive change to Workbench, we usually want to start a discussion about its impact on users in the #islandoraworkbench Slack channel. When you open a PR, you will be asked to complete the Workbench PR template . All code must be formatted using Black . You can automatically style your code using Black in your IDE of choice . Where applicable, unit and integration tests to accompany your code are very appreciated. Tests in Workbench fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . workbench_utils.py provides a lot of utility functions. Before writing a new utility function, please ensure that something similar to what you want to write doesn't already exist. Running tests While developing code for Islandora Workbench, you should run tests frequently to ensure that your code hasn't introduced any regression errors. Workbench is a fairly large and complex application, and has many configuration settings. Even minor changes in the code can break things. 
To run unit tests that do not require a live Islandora instance: Unit tests in tests/unit_tests.py (run with python tests/unit_tests.py ) Unit tests for Workbench's Drupal fields handlers in tests/field_tests.py (run with python tests/field_tests.py ) Note that these tests are run automatically as GitHub Actions when you push to the Islandora Workbench repo or when a merge request is merged. To run integration tests that require a live Islandora instance running at https://islandora.traefik.me/ : tests/islandora_tests.py , tests/islandora_tests_check.py , tests/islandora_tests_hooks.py , and tests/islandora_tests_paged_content.py can be run with python tests/islandora_tests.py , etc. The Islandora Starter Site deployed with ISLE is the recommended way to deploy the Islandora used in these tests. Integration tests remove all nodes and media added during the tests, unless a test fails. Taxonomy terms created by tests are not removed. Some integration and field tests output text that begins with \"Error:\". This is normal; it's the text that Workbench outputs when it finds something wrong (which is probably what the test is testing). Successful test runs (whether they test for success or failure) will exit with \"OK\". If you can figure out how to suppress this output, please visit this issue . If you want to run the tests within a specific class, include the class name as an argument like this: python tests/unit_tests.py TestCompareStings You can also specify multiple test classes within a single test file: python tests/islandora_tests.py TestMyNewTest TestMyOtherNewTest Writing tests Islandora Workbench's tests are written using the Python built-in module unittest , and as explained above, fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . unittest groups tests into classes. A single test file can contain one or more test classes. Within each test class, you can put one or more test methods. As shown in the second example below, two special methods, setUp() and tearDown() , are reserved for setup and cleanup tasks, respectively, within each class. If you are new to using unittest , this is a good tutorial. A simple unit test Islandora Workbench unit tests are much like unit tests in any Python application. The sample test below, from tests/unit_tests.py , tests the validate_latlong_value() method from the workbench_utils.py module. Since workbench_utils.validate_latlong_value() doesn't interact with Islandora, https://islandora.traefik.me/ doesn't need to be available to run this unit test. class TestValidateLatlongValue(unittest.TestCase): def test_validate_good_latlong_values(self): values = ['+90.0, -127.554334', '90.0, -127.554334', '-90,-180', '+50.25,-117.8', '+48.43333,-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertTrue(res) def test_validate_bad_latlong_values(self): values = ['+90.1 -100.111', '045, 180', '+5025,-117.8', '-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertFalse(res) This is a fairly standard Python unit test - we define a list of valid lat/long pairs and run them through the workbench_utils.validate_latlong_value() method expecting it to return True for each value, and then we define a list of bad lat/long pairs and run them through the method expecting it to return False for each value. 
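If you are adding a unit test of your own, the same pattern can be reused. The skeleton below is an illustration only: validate_my_value() is a hypothetical name standing in for whichever workbench_utils function you are actually testing, and the sample values are placeholders.

import unittest

import workbench_utils


class TestValidateMyValue(unittest.TestCase):
    # Illustrative skeleton only. Replace validate_my_value() (a hypothetical
    # name) with the real workbench_utils function you are testing, and replace
    # the sample values with realistic good and bad input.

    def test_good_values(self):
        for value in ['good value 1', 'good value 2']:
            # Each valid input should be accepted.
            self.assertTrue(workbench_utils.validate_my_value(value))

    def test_bad_values(self):
        for value in ['bad value 1', 'bad value 2']:
            # Each invalid input should be rejected.
            self.assertFalse(workbench_utils.validate_my_value(value))


if __name__ == '__main__':
    unittest.main()

Run it from the workbench directory so that workbench_utils is importable.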
A simple integration test The two sample integration tests provided below are copied from islandora_tests.py . The first sample integration test, TestCreate , looks like this (with line numbers added for easy reference). Configuration and CSV files used by this test are in tests/assets/create_test : 1. class TestCreate(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. 8. def test_create(self): 9. self.nids = list() 10. create_output = subprocess.check_output(self.create_cmd) 11. create_output = create_output.decode().strip() 12. create_lines = create_output.splitlines() 13. for line in create_lines: 14. if 'created at' in line: 15. nid = line.rsplit('/', 1)[-1] 16. nid = nid.strip('.') 17. self.nids.append(nid) 18. 19. self.assertEqual(len(self.nids), 2) 20. 21. def tearDown(self): 22. for nid in self.nids: 23. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 24. quick_delete_output = subprocess.check_output(quick_delete_cmd) 25. 26. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'rollback.csv') 27. if os.path.exists(self.rollback_file_path): 28. os.remove(self.rollback_file_path) 29. 30. self.preprocessed_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'metadata.csv.preprocessed') 31. if os.path.exists(self.preprocessed_file_path): 32. os.remove(self.preprocessed_file_path) As you can see, this test runs Workbench using the config file create.yml (line 10), which lives at tests/assets/create_test/create.yml , relative to the workbench directory. A tricky aspect of using real config files in tests is that all paths mentioned in the config file must be relative to the workbench directory. This create.yml defines the input_dir setting to be tests/assets/create_test : task: create host: https://islandora.traefik.me username: admin password: password input_dir: \"tests/assets/create_test\" media_type: image allow_missing_files: True The test's setUp() method prepares the file paths, etc. and within the test's only test method, test_create() , runs Workbench using Python's subprocess.check_output() method, grabs the node IDs from the output from the \"created at\" strings emitted by Workbench (lines 14-17), adds them to a list, and then counts the number of members in that list. If the number of nodes created matches the expected number, the test passes. Since this test creates some nodes, we use the test class's tearDown() method to put the target Drupal back into as close a state as we started with as possible. tearDown() basically takes the list of node IDs created in test_create() and runs Workbench with the --quick_delete_node option. It then removes any temporary files created during the test. A more complex integration test Since Workbench is essentially a specialized REST client, writing integration tests that require interaction with Drupal can get a bit complex. But, the overall pattern is: Create some entities (nodes, media, taxonomy terms). Confirm that they were created in the expected way (doing this usually involves keeping track of any node IDs needed to run tests or to clean up, and in some cases involves parsing out values from raw JSON returned by Drupal). 
Clean up by deleting any Drupal entities created during the tests and also any temporary local files. An integration test that checks data in the node JSON is TestUpdateWithMaxNodeTitleLength() . Files that accompany this test are in tests/assets/max_node_title_length_test . Here is a copy of the test's code: 1. class TestUpdateWithMaxNodeTitleLength(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. self.nids = list() 8. 9. self.update_csv_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update_max_node_title_length.csv') 10. self.update_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update.yml') 11. self.update_cmd = [\"./workbench\", \"--config\", self.update_config_file_path] 12. 13. self.temp_dir = tempfile.gettempdir() 14. 15. def test_create(self): 16. create_output = subprocess.check_output(self.create_cmd) 17. self.create_output = create_output.decode().strip() 18. 19. create_lines = self.create_output.splitlines() 20. for line in create_lines: 21. if 'created at' in line: 22. nid = line.rsplit('/', 1)[-1] 23. nid = nid.strip('.') 24. self.nids.append(nid) 25. 26. self.assertEqual(len(self.nids), 6) 27. 28. # Write out an update CSV file using the node IDs in self.nids. 29. update_csv_file_rows = list() 30. test_titles = ['This title is 37 chars___________long', 31. 'This title is 39 chars_____________long', 32. 'This title is 29 _ chars long', 33. 'This title is 42 chars________________long', 34. 'This title is 44 chars__________________long', 35. 'This title is 28 chars long.'] 36. update_csv_file_rows.append('node_id,title') 37. i = 0 38. while i <= 5: 39. update_csv_file_rows.append(f'{self.nids[i]},{test_titles[i]}') 40. i = i + 1 41. with open(self.update_csv_file_path, mode='wt') as update_csv_file: 42. update_csv_file.write('\\n'.join(update_csv_file_rows)) 43. 44. # Run the update command. 45. check_output = subprocess.check_output(self.update_cmd) 46. 47. # Fetch each node in self.nids and check to see if its title is <= 30 chars long. All should be. 48. for nid_to_update in self.nids: 49. node_url = 'https://islandora.traefik.me/node/' + str(self.nids[0]) + '?_format=json' 50. node_response = requests.get(node_url) 51. node = json.loads(node_response.text) 52. updated_title = str(node['title'][0]['value']) 53. self.assertLessEqual(len(updated_title), 30, '') 54. 55. def tearDown(self): 56. for nid in self.nids: 57. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 58. quick_delete_output = subprocess.check_output(quick_delete_cmd) 59. 60. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'rollback.csv') 61. if os.path.exists(self.rollback_file_path): 62. os.remove(self.rollback_file_path) 63. 64. self.preprocessed_file_path = os.path.join(self.temp_dir, 'create_max_node_title_length.csv.preprocessed') 65. if os.path.exists(self.preprocessed_file_path): 66. os.remove(self.preprocessed_file_path) 67. 68. # Update test: 1) delete the update CSV file, 2) delete the update .preprocessed file. 69. if os.path.exists(self.update_csv_file_path): 70. os.remove(self.update_csv_file_path) 71. 72. 
self.preprocessed_update_file_path = os.path.join(self.temp_dir, 'update_max_node_title_length.csv.preprocessed') 73. if os.path.exists(self.preprocessed_update_file_path): 74. os.remove(self.preprocessed_update_file_path) This test: (line 16) creates some nodes that will be updated within the same test class (i.e. in line 45) (lines 28-42) writes out a temporary CSV file, which will be used as the input_csv file in a subsequent update task, containing the new node IDs plus some titles that are longer than the max_node_title_length: 30 setting in the assets/max_node_title_length_test/update.yml file (line 45) runs self.update_cmd to execute the update task (lines 47-53) fetches the title values for each of the updated nodes and tests the length of each title string to confirm that it does not exceed the maximum allowed length of 30 characters. tearDown() removes all nodes created by the test and removes all temporary local files. Adding a new Drupal field type Overview of how Workbench handles field types Workbench and Drupal exchange field data represented as JSON, via Drupal's REST API. The specific JSON structure depends on the field type (text, entity reference, geolocation, etc.). Handling the details of these structures when adding new field data during create tasks, updating existing field data during update tasks, etc. is performed by code in the workbench_fields.py module. Each distinct field structure has its own class in that file, and each of the classes has the methods create() , update() , dedupe_values() , remove_invalid_values() , and serialize() . The create() and update() methods convert CSV field data in Workbench input files to Python dictionaries, which are subsequently converted into JSON for pushing up to Drupal. The serialize() method reverses this conversion, taking the JSON field data fetched from Drupal and converting it into a dictionary, and from that, into CSV data. dedupe_values() and remove_invalid_values() are utility methods that do what their names suggest. Currently, Workbench supports the following field types: \"simple\" fields for strings (for string or text fields) integers binary values 1 and 0 existing Drupal-generated entity IDs entity reference fields entity reference revision fields typed relation fields link fields geolocation fields authority link fields media track fields Eventually, classes for new Drupal field types will need to be added to Workbench as the community adopts more field types provided by Drupal contrib modules or creates new field types specific to Islandora. Note If new field types are added to workbench_fields.py, corresponding logic must be added to functions in other Workbench modules (e.g. workbench_utils, workbench) that create, update, or export Drupal entities. Those places are commented in the code with either: \"Assemble Drupal field structures from CSV data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\" or \"Assemble CSV output Drupal field data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\" Field class methods The most complex aspect of handling field data is cardinality, or in other words, whether a given field's configuration setting \"Allowed number of values\" allows for a single value, a maximum number of values (for example, a maximum of 3 values), or an unlimited number of values. 
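Before looking at the real code, here is a deliberately simplified sketch (not Workbench's actual implementation) of what respecting a cardinality limit means in practice when a delimited CSV cell is converted into Drupal field values:

# Simplified illustration only; Workbench's real logic lives in workbench_fields.py
# and also handles update modes, formatted text, deduplication, and logging.
def cap_to_cardinality(csv_cell, subdelimiter, cardinality):
    # Split a multivalued CSV cell into its subvalues.
    subvalues = csv_cell.split(subdelimiter)
    # In Drupal, a cardinality of -1 means 'unlimited'.
    if cardinality > 0 and len(subvalues) > cardinality:
        # Workbench logs this condition via log_field_cardinality_violation().
        subvalues = subvalues[:cardinality]
    return [{'value': v} for v in subvalues]

# A field limited to 2 values receives 3 subvalues; the third is dropped.
print(cap_to_cardinality('first|second|third', '|', 2))
# [{'value': 'first'}, {'value': 'second'}]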
Each field type's cardinality configuration complicates creating and updating field instances because Workbench must deal with situations where input CSV data for a field contains more values than are allowed, or when the user wants to append a value to an existing field instance rather than replace existing values, potentially exceeding the field's configured cardinality. Drupal's REST interface is very strict about cardinality, and if Workbench tries to push up a field's JSON that violates the field's cardinality, the HTTP request fails and returns a 422 Unprocessable Content response for the node containing the malformed field data. To prevent this from happening, code within field type classes needs to contain logic to account for the three different types of cardinality (cardinality of 1, cardinality of more than 1 but with a maximum, and unlimited) and for the specific JSON structure created from the field data in the input CSV. When Workbench detects that the number of field instances in the CSV data surpasses the field's maximum configured cardinality, it will truncate the incoming data and log that it did so via the log_field_cardinality_violation() utility function. However, truncating data is a fallback behavior; --check will report instances of this cardinality violation, providing you with an opportunity to correct the CSV data before committing it to Drupal. To illustrate this complexity, let's look at the update() method within the SimpleField class, which handles field types that have the Python structure [{\"value\": value}] or, for \"formatted\" text, [{\"value\": value, \"format\": text_format}] . Note Note that the example structure in the preceding paragraph shows a single value for that field. It's a list, but a list containing a single dictionary. If there were two values in a field, the structure would be a list containing two dictionaries, like [{\"value\": value}, {\"value\": value}] . If the field contained three values, the structure would be [{\"value\": value}, {\"value\": value}, {\"value\": value}] . Lines 47-167 in the sample update() method apply when the field is configured to have a limited cardinality, either 1 or a specific number higher than 1. Within that range of lines, 49-113 apply if the update_mode configuration setting is \"append\", and lines 115-167 apply if the update_mode setting is \"replace\". Lines 169-255 apply when the field's cardinality is unlimited. Within that range of lines, 171-214 apply if the update_mode is \"append\", and lines 215-255 apply if it is \"replace\". An update_mode setting of \"delete\" simply removes all values from the field, in lines 28-30. 1. def update( 2. self, config, field_definitions, entity, row, field_name, entity_field_values 3. ): 4. \"\"\"Note: this method appends incoming CSV values to existing values, replaces existing field 5. values with incoming values, or deletes all values from fields, depending on whether 6. config['update_mode'] is 'append', 'replace', or 'delete'. It does not replace individual 7. values within fields. 8. \"\"\" 9. \"\"\"Parameters 10. ---------- 11. config : dict 12. The configuration settings defined by workbench_config.get_config(). 13. field_definitions : dict 14. The field definitions object defined by get_field_definitions(). 15. entity : dict 16. The dict that will be POSTed to Drupal as JSON. 17. row : OrderedDict. 18. The current CSV record. 19. field_name : string 20. The Drupal fieldname/CSV column header. 21. entity_field_values : list 22. 
List of dictionaries containing existing value(s) for field_name in the entity being updated. 23. Returns 24. ------- 25. dictionary 26. A dictionary representing the entity that is PATCHed to Drupal as JSON. 27. \"\"\" 28. if config[\"update_mode\"] == \"delete\": 29. entity[field_name] = [] 30. return entity 31. 32. if row[field_name] is None: 33. return entity 34. 35. if field_name in config[\"field_text_format_ids\"]: 36. text_format = config[\"field_text_format_ids\"][field_name] 37. else: 38. text_format = config[\"text_format_id\"] 39. 40. if config[\"task\"] == \"update_terms\": 41. entity_id_field = \"term_id\" 42. if config[\"task\"] == \"update\": 43. entity_id_field = \"node_id\" 44. if config[\"task\"] == \"update_media\": 45. entity_id_field = \"media_id\" 46. 47. # Cardinality has a limit. 48. if field_definitions[field_name][\"cardinality\"] > 0: 49. if config[\"update_mode\"] == \"append\": 50. if config[\"subdelimiter\"] in row[field_name]: 51. subvalues = row[field_name].split(config[\"subdelimiter\"]) 52. subvalues = self.remove_invalid_values( 53. config, field_definitions, field_name, subvalues 54. ) 55. for subvalue in subvalues: 56. subvalue = truncate_csv_value( 57. field_name, 58. row[entity_id_field], 59. field_definitions[field_name], 60. subvalue, 61. ) 62. if ( 63. \"formatted_text\" in field_definitions[field_name] 64. and field_definitions[field_name][\"formatted_text\"] is True 65. ): 66. entity[field_name].append( 67. {\"value\": subvalue, \"format\": text_format} 68. ) 69. else: 70. entity[field_name].append({\"value\": subvalue}) 71. entity[field_name] = self.dedupe_values(entity[field_name]) 72. if len(entity[field_name]) > int( 73. field_definitions[field_name][\"cardinality\"] 74. ): 75. log_field_cardinality_violation( 76. field_name, 77. row[entity_id_field], 78. field_definitions[field_name][\"cardinality\"], 79. ) 80. entity[field_name] = entity[field_name][ 81. : field_definitions[field_name][\"cardinality\"] 82. ] 83. else: 84. row[field_name] = self.remove_invalid_values( 85. config, field_definitions, field_name, row[field_name] 86. ) 87. row[field_name] = truncate_csv_value( 88. field_name, 89. row[entity_id_field], 90. field_definitions[field_name], 91. row[field_name], 92. ) 93. if ( 94. \"formatted_text\" in field_definitions[field_name] 95. and field_definitions[field_name][\"formatted_text\"] is True 96. ): 97. entity[field_name].append( 98. {\"value\": row[field_name], \"format\": text_format} 99. ) 100. else: 101. entity[field_name].append({\"value\": row[field_name]}) 102. entity[field_name] = self.dedupe_values(entity[field_name]) 103. if len(entity[field_name]) > int( 104. field_definitions[field_name][\"cardinality\"] 105. ): 106. log_field_cardinality_violation( 107. field_name, 108. row[entity_id_field], 109. field_definitions[field_name][\"cardinality\"], 110. ) 111. entity[field_name] = entity[field_name][ 112. : field_definitions[field_name][\"cardinality\"] 113. ] 114. 115. if config[\"update_mode\"] == \"replace\": 116. if config[\"subdelimiter\"] in row[field_name]: 117. field_values = [] 118. subvalues = row[field_name].split(config[\"subdelimiter\"]) 119. subvalues = self.remove_invalid_values( 120. config, field_definitions, field_name, subvalues 121. ) 122. subvalues = self.dedupe_values(subvalues) 123. if len(subvalues) > int( 124. field_definitions[field_name][\"cardinality\"] 125. ): 126. log_field_cardinality_violation( 127. field_name, 128. row[entity_id_field], 129. 
field_definitions[field_name][\"cardinality\"], 130. ) 131. subvalues = subvalues[ 132. : field_definitions[field_name][\"cardinality\"] 133. ] 134. for subvalue in subvalues: 135. subvalue = truncate_csv_value( 136. field_name, 137. row[entity_id_field], 138. field_definitions[field_name], 139. subvalue, 140. ) 141. if ( 142. \"formatted_text\" in field_definitions[field_name] 143. and field_definitions[field_name][\"formatted_text\"] is True 144. ): 145. field_values.append( 146. {\"value\": subvalue, \"format\": text_format} 147. ) 148. else: 149. field_values.append({\"value\": subvalue}) 150. field_values = self.dedupe_values(field_values) 151. entity[field_name] = field_values 152. else: 153. row[field_name] = truncate_csv_value( 154. field_name, 155. row[entity_id_field], 156. field_definitions[field_name], 157. row[field_name], 158. ) 159. if ( 160. \"formatted_text\" in field_definitions[field_name] 161. and field_definitions[field_name][\"formatted_text\"] is True 162. ): 163. entity[field_name] = [ 164. {\"value\": row[field_name], \"format\": text_format} 165. ] 166. else: 167. entity[field_name] = [{\"value\": row[field_name]}] 168. 169. # Cardinality is unlimited. 170. else: 171. if config[\"update_mode\"] == \"append\": 172. if config[\"subdelimiter\"] in row[field_name]: 173. field_values = [] 174. subvalues = row[field_name].split(config[\"subdelimiter\"]) 175. subvalues = self.remove_invalid_values( 176. config, field_definitions, field_name, subvalues 177. ) 178. for subvalue in subvalues: 179. subvalue = truncate_csv_value( 180. field_name, 181. row[entity_id_field], 182. field_definitions[field_name], 183. subvalue, 184. ) 185. if ( 186. \"formatted_text\" in field_definitions[field_name] 187. and field_definitions[field_name][\"formatted_text\"] is True 188. ): 189. field_values.append( 190. {\"value\": subvalue, \"format\": text_format} 191. ) 192. else: 193. field_values.append({\"value\": subvalue}) 194. entity[field_name] = entity_field_values + field_values 195. entity[field_name] = self.dedupe_values(entity[field_name]) 196. else: 197. row[field_name] = truncate_csv_value( 198. field_name, 199. row[entity_id_field], 200. field_definitions[field_name], 201. row[field_name], 202. ) 203. if ( 204. \"formatted_text\" in field_definitions[field_name] 205. and field_definitions[field_name][\"formatted_text\"] is True 206. ): 207. entity[field_name] = entity_field_values + [ 208. {\"value\": row[field_name], \"format\": text_format} 209. ] 210. else: 211. entity[field_name] = entity_field_values + [ 212. {\"value\": row[field_name]} 213. ] 214. entity[field_name] = self.dedupe_values(entity[field_name]) 215. if config[\"update_mode\"] == \"replace\": 216. if config[\"subdelimiter\"] in row[field_name]: 217. field_values = [] 218. subvalues = row[field_name].split(config[\"subdelimiter\"]) 219. subvalues = self.remove_invalid_values( 220. config, field_definitions, field_name, subvalues 221. ) 222. for subvalue in subvalues: 223. subvalue = truncate_csv_value( 224. field_name, 225. row[entity_id_field], 226. field_definitions[field_name], 227. subvalue, 228. ) 229. if ( 230. \"formatted_text\" in field_definitions[field_name] 231. and field_definitions[field_name][\"formatted_text\"] is True 232. ): 233. field_values.append( 234. {\"value\": subvalue, \"format\": text_format} 235. ) 236. else: 237. field_values.append({\"value\": subvalue}) 238. entity[field_name] = field_values 239. entity[field_name] = self.dedupe_values(entity[field_name]) 240. else: 241. 
row[field_name] = truncate_csv_value( 242. field_name, 243. row[entity_id_field], 244. field_definitions[field_name], 245. row[field_name], 246. ) 247. if ( 248. \"formatted_text\" in field_definitions[field_name] 249. and field_definitions[field_name][\"formatted_text\"] is True 250. ): 251. entity[field_name] = [ 252. {\"value\": row[field_name], \"format\": text_format} 253. ] 254. else: 255. entity[field_name] = [{\"value\": row[field_name]}] 256. 257. return entity Each field type has its own structure. Within the field classes, the field structure is represented in Python dictionaries and converted to JSON when pushed up to Drupal as part of REST requests. These Python dictionaries are converted to JSON automatically as part of the HTTP request to Drupal (you do not do this within the field classes) so we'll focus only on the Python dictionary structure here. Some examples are: SimpleField fields have the Python structure {\"value\": value} or, for \"formatted\" text, {\"value\": value, \"format\": text_format} GeolocationField fields have the Python structure {\"lat\": lat_value, \"lon\": lon_value} LinkField fields have the python structure {\"uri\": uri, \"title\": title} EntityReferenceField fields have the Python structure {\"target_id\": id, \"target_type\": target_type} TypedRelationField fields have the Python structure {\"target_id\": id, \"rel_type\": rel_type:rel_value, \"target_type\": target_type} the value of the rel_type key is the relator type (e.g. MARC relators) and the relator value (e.g. 'art') joined with a colon, e.g. relators:art AuthorityLinkField fields have the Python structure {\"source\": source, \"uri\": uri, \"title\": title} MediaTrackField fields have the Python structure {\"label\": label, \"kind\": kind, \"srclang\": lang, \"file_path\": path} To add a support for a new field type, you will need to figure out the field type's JSON structure and convert that structure into the Python dictionary equivalent in the new field class methods. The best way to inspect a field type's JSON structure is to view the JSON representation of a node that contains instances of the field by tacking ?_format=json to the end of the node's URL. Once you have an example of the field type's JSON, you will need to write the necessary class methods to convert between Workbench CSV data and the field's JSON structure as applicable in all of the field class methods, making sure you account for the field's configured cardinality, account for the update mode within the update() method, etc. Writing field classes is one aspect of Workbench development that demonstrates the value of unit tests. Without writing unit tests to accompany the development of these field classes, you will lose your mind. tests/field_tests.py contains over 80 tests in more than 5,000 lines of test code for a good reason. Islandora Workbench Integration Drupal module Islandora Workbench Integration is a Drupal module that allows Islandora Workbench to communicate with Drupal efficiently and reliably. It enables some Views and REST endpoints that Workbench expects, and also provides a few custom REST endpoints (see the module's README for details). Generally speaking, the only situation where the Integration module will need to be updated (apart from requirements imposed by new versions of Drupal) is if we add a new feature to Workbench that requires a specific View or a specific REST endpoint to be enabled and configured in the target Drupal. 
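When a new Workbench feature depends on such a server-side change, the client-side code can guard itself by comparing the installed Integration module's version against the minimum version the feature needs. The sketch below is a simplified, self-contained illustration of that comparison only; Workbench's own helpers for this are check_integration_module_version() and convert_semver_to_number() (described next), whose exact signatures may differ.

# Simplified illustration of version gating; not Workbench's actual helper code.
def semver_to_tuple(version):
    # Convert a version string like '1.1.3' into a tuple that compares correctly.
    return tuple(int(part) for part in version.split('.'))

def integration_module_supports_feature(installed_version, minimum_version='1.1.3'):
    # True if the installed Integration module meets the feature's minimum version.
    return semver_to_tuple(installed_version) >= semver_to_tuple(minimum_version)

# Example: only execute the new code path when the target site's module is new enough.
if integration_module_supports_feature('1.2.0'):
    pass  # Run the new Workbench feature here.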
If a change is required in the Integration module, it is very important to communicate this to Workbench users, since if the Integration module is not updated to align with the change in Workbench, the new feature won't work. A defensive coding strategy for ensuring that changes in the client-side Workbench code that depend on changes in the server-side Integration module will work is to invoke, within the Workbench code, the check_integration_module_version() function to check the Integration module's version number, and to use conditional logic that executes the new Workbench code only if the Integration module's version number meets or exceeds a version number string defined in that section of the Workbench code (e.g., the new Workbench feature requires version 1.1.3 of the Integration module). Under the hood, this function queries /islandora_workbench_integration/version on the target Drupal to get the Integration module's version number, although as a developer all you need to do is invoke the check_integration_module_version() function and inspect its return value. In this example, your code would compare the Integration module's version number with 1.1.3 (possibly using the convert_semver_to_number() utility function) and include logic so that it only executes if the minimum Integration module version is met. Note Some Views used by Islandora Workbench are defined by users and not by the Islandora Workbench Integration module. Specifically, Views described in the \" Generating CSV files \" documentation are created by users.","title":"Development guide"},{"location":"development_guide/#general","text":"Bug reports, improvements, feature requests, and PRs are welcome. Before you open a pull request, please open an issue so we can discuss your idea . In cases where the PR introduces a potentially disruptive change to Workbench, we usually want to start a discussion about its impact on users in the #islandoraworkbench Slack channel. When you open a PR, you will be asked to complete the Workbench PR template . All code must be formatted using Black . You can automatically style your code using Black in your IDE of choice . Where applicable, unit and integration tests to accompany your code are very much appreciated. Tests in Workbench fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . workbench_utils.py provides a lot of utility functions. Before writing a new utility function, please ensure that something similar to what you want to write doesn't already exist.","title":"General"},{"location":"development_guide/#running-tests","text":"While developing code for Islandora Workbench, you should run tests frequently to ensure that your code hasn't introduced any regression errors. Workbench is a fairly large and complex application, and has many configuration settings. Even minor changes in the code can break things. To run unit tests that do not require a live Islandora instance: Unit tests in tests/unit_tests.py (run with python tests/unit_tests.py ) Unit tests for Workbench's Drupal fields handlers in tests/field_tests.py (run with python tests/field_tests.py ) Note that these tests are run automatically as GitHub Actions when you push to the Islandora Workbench repo or when a merge request is merged. 
To run integration tests that require a live Islandora instance running at https://islandora.traefik.me/ tests/islandora_tests.py , tests/islandora_tests_check.py , tests/islandora_tests_hooks.py , and tests/islandora_tests_paged_content.py can be run with python tests/islandora_tests.py , etc. The Islandora Starter Site deployed with ISLE is recommended way to deploy the Islandora used in these tests. Integration tests remove all nodes and media added during the tests, unless a test fails. Taxonomy terms created by tests are not removed. Some integration and field tests output text that beings with \"Error:.\" This is normal, it's the text that Workbench outputs when it finds something wrong (which is probably what the test is testing). Successful test (whether they test for success or failure) runs will exit with \"OK\". If you can figure out how to suppress this output, please visit this issue . If you want to run the tests within a specific class, include the class name as an argument like this: python tests/unit_tests.py TestCompareStings You can also specify multiple test classes within a single test file: python tests/islandora_tests.py TestMyNewTest TestMyOtherNewTest","title":"Running tests"},{"location":"development_guide/#writing-tests","text":"Islandora Workbench's tests are written using the Python built-in module unittest , and as explained above, fall into two categories: Unit tests that do not require a live Islandora instance. Integration tests that require a live Islandora instance running at https://islandora.traefik.me/ . unittest groups tests into classes. A single test file can contain one or more test classes. Within each test class, you can put one or more test methods. As shown in the second example below, two reserved methods, setUp() and tearDown() , are reserved for setup and cleanup tasks, respectively, within each class. If you are new to using unittest , this is a good tutorial.","title":"Writing tests"},{"location":"development_guide/#a-simple-unit-test","text":"Islandora Workbench unit tests are much like unit tests in any Python application. The sample test below, from tests/unit_tests.py , tests the validate_latlong_value() method from the workbench_utils.py module. Since workbench_utils.validate_latlong_value() doesn't interact with Islandora, https://islandora.traefik.me/ doesn't need to be available to run this unit test. class TestValidateLatlongValue(unittest.TestCase): def test_validate_good_latlong_values(self): values = ['+90.0, -127.554334', '90.0, -127.554334', '-90,-180', '+50.25,-117.8', '+48.43333,-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertTrue(res) def test_validate_bad_latlong_values(self): values = ['+90.1 -100.111', '045, 180', '+5025,-117.8', '-123.36667'] for value in values: res = workbench_utils.validate_latlong_value(value) self.assertFalse(res) This is a fairly standard Python unit test - we define a list of valid lat/long pairs and run them through the workbench_utils.validate_latlong_value() method expecting it to return True for each value, and then we define a list of bad lat/long pairs and run them through the method expecting it to return False for each value.","title":"A simple unit test"},{"location":"development_guide/#a-simple-integration-test","text":"The two sample integration tests provided below are copied from islandora_tests.py . The first sample integration test, TestCreate , looks like this (with line numbers added for easy reference). 
Configuration and CSV files used by this test are in tests/assets/create_test : 1. class TestCreate(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. 8. def test_create(self): 9. self.nids = list() 10. create_output = subprocess.check_output(self.create_cmd) 11. create_output = create_output.decode().strip() 12. create_lines = create_output.splitlines() 13. for line in create_lines: 14. if 'created at' in line: 15. nid = line.rsplit('/', 1)[-1] 16. nid = nid.strip('.') 17. self.nids.append(nid) 18. 19. self.assertEqual(len(self.nids), 2) 20. 21. def tearDown(self): 22. for nid in self.nids: 23. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 24. quick_delete_output = subprocess.check_output(quick_delete_cmd) 25. 26. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'rollback.csv') 27. if os.path.exists(self.rollback_file_path): 28. os.remove(self.rollback_file_path) 29. 30. self.preprocessed_file_path = os.path.join(self.current_dir, 'assets', 'create_test', 'metadata.csv.preprocessed') 31. if os.path.exists(self.preprocessed_file_path): 32. os.remove(self.preprocessed_file_path) As you can see, this test runs Workbench using the config file create.yml (line 10), which lives at tests/assets/create_test/create.yml , relative to the workbench directory. A tricky aspect of using real config files in tests is that all paths mentioned in the config file must be relative to the workbench directory. This create.yml defines the input_dir setting to be tests/assets/create_test : task: create host: https://islandora.traefik.me username: admin password: password input_dir: \"tests/assets/create_test\" media_type: image allow_missing_files: True The test's setUp() method prepares the file paths, etc. and within the test's only test method, test_create() , runs Workbench using Python's subprocess.check_output() method, grabs the node IDs from the output from the \"created at\" strings emitted by Workbench (lines 14-17), adds them to a list, and then counts the number of members in that list. If the number of nodes created matches the expected number, the test passes. Since this test creates some nodes, we use the test class's tearDown() method to put the target Drupal back into as close a state as we started with as possible. tearDown() basically takes the list of node IDs created in test_create() and runs Workbench with the --quick_delete_node option. It then removes any temporary files created during the test.","title":"A simple integration test"},{"location":"development_guide/#a-more-complex-integration-test","text":"Since Workbench is essentially a specialized REST client, writing integration tests that require interaction with Drupal can get a bit complex. But, the overall pattern is: Create some entities (nodes, media, taxonomy terms). Confirm that they were created in the expected way (doing this usually involves keeping track of any node IDs needed to run tests or to clean up, and in some cases involves parsing out values from raw JSON returned by Drupal). Clean up by deleting any Drupal entities created during the tests and also any temporary local files. 
An integration test that checks data in the node JSON is TestUpdateWithMaxNodeTitleLength() . Files that accompany this test are in tests/assets/max_node_title_length_test . Here is a copy of the test's code: 1. class TestUpdateWithMaxNodeTitleLength(unittest.TestCase): 2. 3. def setUp(self): 4. self.current_dir = os.path.dirname(os.path.abspath(__file__)) 5. self.create_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'create.yml') 6. self.create_cmd = [\"./workbench\", \"--config\", self.create_config_file_path] 7. self.nids = list() 8. 9. self.update_csv_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update_max_node_title_length.csv') 10. self.update_config_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'update.yml') 11. self.update_cmd = [\"./workbench\", \"--config\", self.update_config_file_path] 12. 13. self.temp_dir = tempfile.gettempdir() 14. 15. def test_create(self): 16. create_output = subprocess.check_output(self.create_cmd) 17. self.create_output = create_output.decode().strip() 18. 19. create_lines = self.create_output.splitlines() 20. for line in create_lines: 21. if 'created at' in line: 22. nid = line.rsplit('/', 1)[-1] 23. nid = nid.strip('.') 24. self.nids.append(nid) 25. 26. self.assertEqual(len(self.nids), 6) 27. 28. # Write out an update CSV file using the node IDs in self.nids. 29. update_csv_file_rows = list() 30. test_titles = ['This title is 37 chars___________long', 31. 'This title is 39 chars_____________long', 32. 'This title is 29 _ chars long', 33. 'This title is 42 chars________________long', 34. 'This title is 44 chars__________________long', 35. 'This title is 28 chars long.'] 36. update_csv_file_rows.append('node_id,title') 37. i = 0 38. while i <= 5: 39. update_csv_file_rows.append(f'{self.nids[i]},{test_titles[i]}') 40. i = i + 1 41. with open(self.update_csv_file_path, mode='wt') as update_csv_file: 42. update_csv_file.write('\\n'.join(update_csv_file_rows)) 43. 44. # Run the update command. 45. check_output = subprocess.check_output(self.update_cmd) 46. 47. # Fetch each node in self.nids and check to see if its title is <= 30 chars long. All should be. 48. for nid_to_update in self.nids: 49. node_url = 'https://islandora.traefik.me/node/' + str(self.nids[0]) + '?_format=json' 50. node_response = requests.get(node_url) 51. node = json.loads(node_response.text) 52. updated_title = str(node['title'][0]['value']) 53. self.assertLessEqual(len(updated_title), 30, '') 54. 55. def tearDown(self): 56. for nid in self.nids: 57. quick_delete_cmd = [\"./workbench\", \"--config\", self.create_config_file_path, '--quick_delete_node', 'https://islandora.traefik.me/node/' + nid] 58. quick_delete_output = subprocess.check_output(quick_delete_cmd) 59. 60. self.rollback_file_path = os.path.join(self.current_dir, 'assets', 'max_node_title_length_test', 'rollback.csv') 61. if os.path.exists(self.rollback_file_path): 62. os.remove(self.rollback_file_path) 63. 64. self.preprocessed_file_path = os.path.join(self.temp_dir, 'create_max_node_title_length.csv.preprocessed') 65. if os.path.exists(self.preprocessed_file_path): 66. os.remove(self.preprocessed_file_path) 67. 68. # Update test: 1) delete the update CSV file, 2) delete the update .preprocessed file. 69. if os.path.exists(self.update_csv_file_path): 70. os.remove(self.update_csv_file_path) 71. 72. 
self.preprocessed_update_file_path = os.path.join(self.temp_dir, 'update_max_node_title_length.csv.preprocessed') 73. if os.path.exists(self.preprocessed_update_file_path): 74. os.remove(self.preprocessed_update_file_path) This test: (line 16) creates some nodes that will be updated within the same test class (i.e. in line 45) (lines 28-42) writes out a temporary CSV file which will be used as the input_csv file in a subsequent update task containing the new node IDs plus some titles that are longer than max_node_title_length: 30 setting in the assets/max_node_title_length_test/update.yml file (line 45) runs self.update_cmd to execute the update task (lines 47-53) fetches the title values for each of the updated nodes and tests the length of each title string to confirm that it does not exceed the maximum allowed length of 30 characters. tearDown() removes all nodes created by the test and removes all temporary local files.","title":"A more complex integration test"},{"location":"development_guide/#adding-a-new-drupal-field-type","text":"","title":"Adding a new Drupal field type"},{"location":"development_guide/#overview-of-how-workbench-handles-field-types","text":"Workbench and Drupal exchange field data represented as JSON, via Drupal's REST API. The specific JSON structure depends on the field type (text, entity reference, geolocation, etc.). Handling the details of these structures when adding new field data during create tasks, updating existing field data during update tasks, etc. is performed by code in the workbench_fields.py module. Each distinct field structure has its own class in that file, and each of the classes has the methods create() , update() , dedupe_values() , remove_invalid_values() , and serialize() . The create() and update() methods convert CSV field data in Workbench input files to Python dictionaries, which are subsequently converted into JSON for pushing up to Drupal. The serialize() method reverses this conversion, taking the JSON field data fetched from Drupal and converting it into a dictionary, and from that, into CSV data. dedupe_values() and remove_invalid_values() are utility methods that do what their names suggest. Currently, Workbench supports the following field types: \"simple\" fields for strings (for string or text fields) integers binary values 1 and 0 existing Drupal-generated entity IDs entity reference fields entity reference revision fields typed relation fields link fields geolocation fields authority link fields media track fields Eventually, classes for new Drupal field types will need to be added to Workbench as the community adopts more field types provided by Drupal contrib modules or creates new field types specific to Islandora. Note If new field types are added to workbench_utils.py, corresponding logic must be added to functions in other Workbench modules (e.g. workbench_utils, workbench) that create, update, or export Drupal entities. Those places are commented in the code with either: \"Assemble Drupal field structures from CSV data. If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\" or \"Assemble CSV output Drupal field data. 
If new field types are added to workbench_fields.py, they need to be registered in the following if/elif/else block.\"","title":"Overview of how Workbench handles field types"},{"location":"development_guide/#field-class-methods","text":"The most complex aspect of handling field data is cardinality, or in other words, whether a given field's configuration setting \"Allowed number of values\" allows for a single value, a maximum number of values (for example, a maximum of 3 values), or an unlimited number of values. Each field type's cardinality configuration complicates creating and updating field instances because Workbench must deal with situations where input CSV data for a field contains more values than are allowed, or when the user wants to append a value to an existing field instance rather than replace existing values, potentially exceeding the field's configured cardinality. Drupal's REST interface is very strict about cardinality, and if Workbench tries to push up a field's JSON that violates the field's cardinality, the HTTP request fails and returns a 422 Unprocessable Content response for the node containing the malformed field data. To prevent this from happening, code within field type classes needs to contain logic to account for the three different types of cardinality (cardinality of 1, cardinality of more than 1 but with a maximum, and unlimited) and for the specific JSON structure created from the field data in the input CSV. When Workbench detects that the number of field instances in the CSV data surpasses the field's maximum configured cardinality, it will truncate the incoming data and log that it did so via the log_field_cardinality_violation() utility function. However, truncating data is a fallback behavior; --check will report intances of this cardinality violation, providing you with an opportunity to correct the CSV data before committing it to Drupal. To illustrate this complexity, let's look at the update() method within the SimpleField class, which handles field types that have the Python structure [{\"value\": value}] or, for \"formatted\" text, [{\"value\": value, \"format\": text_format}] . Note Note that the example structure in the preceding paragraph shows a single value for that field. It's a list, but a list containing a single dictionary. If there were two values in a field, the structure would be a list containing two dictionaries, like [{\"value\": value}, {\"value\": value}] . If the field contained three values, the structure would be [{\"value\": value}, {\"value\": value}, {\"value\": value}] Lines 47-167 in the sample update() method apply when the field is configured to have a limited cardinality, either 1 or a specific number higher than 1. Within that range of lines, 49-113 apply if the update_mode configuration setting is \"append\", and lines 115-167 apply if the update_mode setting is \"replace\". Lines 169-255 apply when the field's cardinality is unlimited. Within that range of lines, 171-214 apply if the update_mode is \"append\", and lines 215-255 apply if it is \"replace\". An update_mode setting of \"delete\" simply removes all values from the field, in lines 28-30. 1. def update( 2. self, config, field_definitions, entity, row, field_name, entity_field_values 3. ): 4. \"\"\"Note: this method appends incoming CSV values to existing values, replaces existing field 5. values with incoming values, or deletes all values from fields, depending on whether 6. config['update_mode'] is 'append', 'replace', or 'delete'. 
It does not replace individual 7. values within fields. 8. \"\"\" 9. \"\"\"Parameters 10. ---------- 11. config : dict 12. The configuration settings defined by workbench_config.get_config(). 13. field_definitions : dict 14. The field definitions object defined by get_field_definitions(). 15. entity : dict 16. The dict that will be POSTed to Drupal as JSON. 17. row : OrderedDict. 18. The current CSV record. 19. field_name : string 20. The Drupal fieldname/CSV column header. 21. entity_field_values : list 22. List of dictionaries containing existing value(s) for field_name in the entity being updated. 23. Returns 24. ------- 25. dictionary 26. A dictionary representing the entity that is PATCHed to Drupal as JSON. 27. \"\"\" 28. if config[\"update_mode\"] == \"delete\": 29. entity[field_name] = [] 30. return entity 31. 32. if row[field_name] is None: 33. return entity 34. 35. if field_name in config[\"field_text_format_ids\"]: 36. text_format = config[\"field_text_format_ids\"][field_name] 37. else: 38. text_format = config[\"text_format_id\"] 39. 40. if config[\"task\"] == \"update_terms\": 41. entity_id_field = \"term_id\" 42. if config[\"task\"] == \"update\": 43. entity_id_field = \"node_id\" 44. if config[\"task\"] == \"update_media\": 45. entity_id_field = \"media_id\" 46. 47. # Cardinality has a limit. 48. if field_definitions[field_name][\"cardinality\"] > 0: 49. if config[\"update_mode\"] == \"append\": 50. if config[\"subdelimiter\"] in row[field_name]: 51. subvalues = row[field_name].split(config[\"subdelimiter\"]) 52. subvalues = self.remove_invalid_values( 53. config, field_definitions, field_name, subvalues 54. ) 55. for subvalue in subvalues: 56. subvalue = truncate_csv_value( 57. field_name, 58. row[entity_id_field], 59. field_definitions[field_name], 60. subvalue, 61. ) 62. if ( 63. \"formatted_text\" in field_definitions[field_name] 64. and field_definitions[field_name][\"formatted_text\"] is True 65. ): 66. entity[field_name].append( 67. {\"value\": subvalue, \"format\": text_format} 68. ) 69. else: 70. entity[field_name].append({\"value\": subvalue}) 71. entity[field_name] = self.dedupe_values(entity[field_name]) 72. if len(entity[field_name]) > int( 73. field_definitions[field_name][\"cardinality\"] 74. ): 75. log_field_cardinality_violation( 76. field_name, 77. row[entity_id_field], 78. field_definitions[field_name][\"cardinality\"], 79. ) 80. entity[field_name] = entity[field_name][ 81. : field_definitions[field_name][\"cardinality\"] 82. ] 83. else: 84. row[field_name] = self.remove_invalid_values( 85. config, field_definitions, field_name, row[field_name] 86. ) 87. row[field_name] = truncate_csv_value( 88. field_name, 89. row[entity_id_field], 90. field_definitions[field_name], 91. row[field_name], 92. ) 93. if ( 94. \"formatted_text\" in field_definitions[field_name] 95. and field_definitions[field_name][\"formatted_text\"] is True 96. ): 97. entity[field_name].append( 98. {\"value\": row[field_name], \"format\": text_format} 99. ) 100. else: 101. entity[field_name].append({\"value\": row[field_name]}) 102. entity[field_name] = self.dedupe_values(entity[field_name]) 103. if len(entity[field_name]) > int( 104. field_definitions[field_name][\"cardinality\"] 105. ): 106. log_field_cardinality_violation( 107. field_name, 108. row[entity_id_field], 109. field_definitions[field_name][\"cardinality\"], 110. ) 111. entity[field_name] = entity[field_name][ 112. : field_definitions[field_name][\"cardinality\"] 113. ] 114. 115. if config[\"update_mode\"] == \"replace\": 116. 
if config[\"subdelimiter\"] in row[field_name]: 117. field_values = [] 118. subvalues = row[field_name].split(config[\"subdelimiter\"]) 119. subvalues = self.remove_invalid_values( 120. config, field_definitions, field_name, subvalues 121. ) 122. subvalues = self.dedupe_values(subvalues) 123. if len(subvalues) > int( 124. field_definitions[field_name][\"cardinality\"] 125. ): 126. log_field_cardinality_violation( 127. field_name, 128. row[entity_id_field], 129. field_definitions[field_name][\"cardinality\"], 130. ) 131. subvalues = subvalues[ 132. : field_definitions[field_name][\"cardinality\"] 133. ] 134. for subvalue in subvalues: 135. subvalue = truncate_csv_value( 136. field_name, 137. row[entity_id_field], 138. field_definitions[field_name], 139. subvalue, 140. ) 141. if ( 142. \"formatted_text\" in field_definitions[field_name] 143. and field_definitions[field_name][\"formatted_text\"] is True 144. ): 145. field_values.append( 146. {\"value\": subvalue, \"format\": text_format} 147. ) 148. else: 149. field_values.append({\"value\": subvalue}) 150. field_values = self.dedupe_values(field_values) 151. entity[field_name] = field_values 152. else: 153. row[field_name] = truncate_csv_value( 154. field_name, 155. row[entity_id_field], 156. field_definitions[field_name], 157. row[field_name], 158. ) 159. if ( 160. \"formatted_text\" in field_definitions[field_name] 161. and field_definitions[field_name][\"formatted_text\"] is True 162. ): 163. entity[field_name] = [ 164. {\"value\": row[field_name], \"format\": text_format} 165. ] 166. else: 167. entity[field_name] = [{\"value\": row[field_name]}] 168. 169. # Cardinality is unlimited. 170. else: 171. if config[\"update_mode\"] == \"append\": 172. if config[\"subdelimiter\"] in row[field_name]: 173. field_values = [] 174. subvalues = row[field_name].split(config[\"subdelimiter\"]) 175. subvalues = self.remove_invalid_values( 176. config, field_definitions, field_name, subvalues 177. ) 178. for subvalue in subvalues: 179. subvalue = truncate_csv_value( 180. field_name, 181. row[entity_id_field], 182. field_definitions[field_name], 183. subvalue, 184. ) 185. if ( 186. \"formatted_text\" in field_definitions[field_name] 187. and field_definitions[field_name][\"formatted_text\"] is True 188. ): 189. field_values.append( 190. {\"value\": subvalue, \"format\": text_format} 191. ) 192. else: 193. field_values.append({\"value\": subvalue}) 194. entity[field_name] = entity_field_values + field_values 195. entity[field_name] = self.dedupe_values(entity[field_name]) 196. else: 197. row[field_name] = truncate_csv_value( 198. field_name, 199. row[entity_id_field], 200. field_definitions[field_name], 201. row[field_name], 202. ) 203. if ( 204. \"formatted_text\" in field_definitions[field_name] 205. and field_definitions[field_name][\"formatted_text\"] is True 206. ): 207. entity[field_name] = entity_field_values + [ 208. {\"value\": row[field_name], \"format\": text_format} 209. ] 210. else: 211. entity[field_name] = entity_field_values + [ 212. {\"value\": row[field_name]} 213. ] 214. entity[field_name] = self.dedupe_values(entity[field_name]) 215. if config[\"update_mode\"] == \"replace\": 216. if config[\"subdelimiter\"] in row[field_name]: 217. field_values = [] 218. subvalues = row[field_name].split(config[\"subdelimiter\"]) 219. subvalues = self.remove_invalid_values( 220. config, field_definitions, field_name, subvalues 221. ) 222. for subvalue in subvalues: 223. subvalue = truncate_csv_value( 224. field_name, 225. row[entity_id_field], 226. 
field_definitions[field_name], 227. subvalue, 228. ) 229. if ( 230. \"formatted_text\" in field_definitions[field_name] 231. and field_definitions[field_name][\"formatted_text\"] is True 232. ): 233. field_values.append( 234. {\"value\": subvalue, \"format\": text_format} 235. ) 236. else: 237. field_values.append({\"value\": subvalue}) 238. entity[field_name] = field_values 239. entity[field_name] = self.dedupe_values(entity[field_name]) 240. else: 241. row[field_name] = truncate_csv_value( 242. field_name, 243. row[entity_id_field], 244. field_definitions[field_name], 245. row[field_name], 246. ) 247. if ( 248. \"formatted_text\" in field_definitions[field_name] 249. and field_definitions[field_name][\"formatted_text\"] is True 250. ): 251. entity[field_name] = [ 252. {\"value\": row[field_name], \"format\": text_format} 253. ] 254. else: 255. entity[field_name] = [{\"value\": row[field_name]}] 256. 257. return entity Each field type has its own structure. Within the field classes, the field structure is represented in Python dictionaries and converted to JSON when pushed up to Drupal as part of REST requests. These Python dictionaries are converted to JSON automatically as part of the HTTP request to Drupal (you do not do this within the field classes) so we'll focus only on the Python dictionary structure here. Some examples are: SimpleField fields have the Python structure {\"value\": value} or, for \"formatted\" text, {\"value\": value, \"format\": text_format} GeolocationField fields have the Python structure {\"lat\": lat_value, \"lon\": lon_value} LinkField fields have the python structure {\"uri\": uri, \"title\": title} EntityReferenceField fields have the Python structure {\"target_id\": id, \"target_type\": target_type} TypedRelationField fields have the Python structure {\"target_id\": id, \"rel_type\": rel_type:rel_value, \"target_type\": target_type} the value of the rel_type key is the relator type (e.g. MARC relators) and the relator value (e.g. 'art') joined with a colon, e.g. relators:art AuthorityLinkField fields have the Python structure {\"source\": source, \"uri\": uri, \"title\": title} MediaTrackField fields have the Python structure {\"label\": label, \"kind\": kind, \"srclang\": lang, \"file_path\": path} To add a support for a new field type, you will need to figure out the field type's JSON structure and convert that structure into the Python dictionary equivalent in the new field class methods. The best way to inspect a field type's JSON structure is to view the JSON representation of a node that contains instances of the field by tacking ?_format=json to the end of the node's URL. Once you have an example of the field type's JSON, you will need to write the necessary class methods to convert between Workbench CSV data and the field's JSON structure as applicable in all of the field class methods, making sure you account for the field's configured cardinality, account for the update mode within the update() method, etc. Writing field classes is one aspect of Workbench development that demonstrates the value of unit tests. Without writing unit tests to accompany the development of these field classes, you will lose your mind. 
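To make that CSV-to-dictionary conversion more concrete, here is a rough, self-contained sketch of what a class for a hypothetical new field type could look like. It is not taken from workbench_fields.py: the field type (a made-up "dimensions" field holding a width and a height), the simplified method signatures, and the absence of cardinality and update_mode handling are all illustrative assumptions, and a real class would also implement update(), dedupe_values(), and remove_invalid_values() as described above.

# Illustrative sketch only; not the actual Workbench field class API.
class HypotheticalDimensionsField:
    # Converts CSV values like '640x480' to and from [{'width': 640, 'height': 480}].

    def create(self, config, field_definitions, entity, row, field_name):
        # Multi-valued CSV cells use the configured subdelimiter, e.g. '640x480|800x600'.
        subvalues = row[field_name].split(config['subdelimiter'])
        entity[field_name] = []
        for subvalue in subvalues:
            width, height = subvalue.split('x')
            entity[field_name].append({'width': int(width), 'height': int(height)})
        return entity

    def serialize(self, config, field_definitions, field_name, field_data):
        # Reverse the conversion: turn the structure fetched from Drupal back into a CSV cell.
        return config['subdelimiter'].join(str(v['width']) + 'x' + str(v['height']) for v in field_data)

A sketch like this is exactly the kind of code those unit tests need to exercise.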
tests/field_tests.py contains over 80 tests in more than 5,000 lines of test code for a good reason.","title":"Field class methods"},{"location":"development_guide/#islandora-workbench-integration-drupal-module","text":"Islandora Workbench Integration is a Drupal module that allows Islandora Workbench to communicate with Drupal efficiently and reliably. It enables some Views and REST endpoints that Workbench expects, and also provides a few custom REST endpoints (see the module's README for details). Generally speaking, the only situation where the Integration module will need to be updated (apart from requirements imposed by new versions of Drupal) is if we add a new feature to Workbench that requires a specific View or a specific REST endpoint to be enabled and configured in the target Drupal. If a change is required in the Integration module, it is very important to communicate this to Workbench users, since if the Integration module is not updated to align with the change in Workbench, the new feature won't work. A defensive coding strategy to ensure that changes in the client-side Workbench code that depend on changes in the server-side Integration module will work is, within the Workbench code, invoke the check_integration_module_version() function to check the Integration module's version number and use conditional logic to execute the new Workbench code only if the Integration module's version number meets or exceeds a version number string defined in that section of the Workbench code (e.g., the new Workbench feature requires version 1.1.3 of the Integration module). Under the hood, this function queries /islandora_workbench_integration/version on the target Drupal to get the Integration module's version number, although as a developer all you need to do is invoke the check_integration_module_version() function and inspect its return value. In this example, your code would compare the Integration module's version number with 1.1.3 (possibly using the convert_semver_to_number() utility function) and include logic so that it only executes if the minimum Integration module version is met. Note Some Views used by Islandora Workbench are defined by users and not by the Islandora Workbench Integration module. Specifically, Views described in the \" Generating CSV files \" documentation are created by users.","title":"Islandora Workbench Integration Drupal module"},{"location":"drupal_and_workbench/","text":"This page highlights the most important Drupal and Islandora features relevant to the use of Workbench. Its audience is managers of Islandora repositories who want a primer on how Drupal, Islandora, and Workbench relate to each other. The Workbench-specific ideas introduced here are documented in detail elsewhere on this site. This page is not intended to be a replacement for the official Islandora documentation , which provides comprehensive and detailed information about how Islandora is structured, and about how to install, configure, and use it. Help improve this page! Your feedback on the usefulness of this page is very important! Join the #islandoraworkbench channel in the Islandora Slack , or leave a comment on this Github issue . Why would I want to use Islandora Workbench? Islandora Workbench lets you manage content in an Islandora repository at scale. 
Islandora provides web forms for creating and editing content on an item-by-item basis, but if you want to load a large number of items into an Islandora repository (or update or delete content in large numbers), you need a batch-oriented tool like Workbench. Simply put, Islandora Workbench enables you to get batches of content into an Islandora repository, and also update or delete content in large batches. How do I use Islandora Workbench? Islandora Workbench provides the ability to perform a set of \" tasks \". The focus of this page is the create task, but other tasks Workbench enables include update , delete , and add_media . To use Islandora Workbench to create new content, you need to assemble a CSV file containing metadata describing your content, and arrange the accompanying image, video, PDF, and other files in specific ways so that Workbench knows where to find them. Here is a very simple sample Workbench CSV file: file,id,title,field_model,field_description,date_generated,quality control by IMG_1410.tif,01,Small boats in Havana Harbour,Image,Taken on vacation in Cuba.,2021-02-12,MJ IMG_2549.jp2,02,Manhatten Island,Image,\"Taken from the ferry from downtown New York to Highlands, NJ.\",2021-02-12,MJ IMG_2940.JPG,03,Looking across Burrard Inlet,Image,View from Deep Cove to Burnaby Mountain.,2021-02-18,SP IMG_2958.JPG,04,Amsterdam waterfront,Image,Amsterdam waterfront on an overcast day.,2021-02-12,MJ IMG_5083.JPG,05,Alcatraz Island,Image,\"Taken from Fisherman's Wharf, San Francisco.\",2021-02-18,SP Then, you need to create a configuration file to tell Workbench the URL of your Islandora, which Drupal account credentials to use, and the location of your CSV file. You can customize many other aspects of Islandora Workbench by including various settings in your configuration file. Here is a very simple configuration file: task: create host: http://localhost:8000 username: admin password: islandora input_csv: input.csv content_type: islandora_object Relevance to using Workbench It is very important to run --check before you commit to having Workbench add content to your Drupal. Doing so lets Workbench find issues with your configuration and input CSV and files. See the \" Checking configuration and input data \" documentation for a complete list of the things --check looks for. When you have all these things ready, you tell Workbench to \"check\" your input data and configuration: ./workbench --config test.yml --check Workbench will provide a summary of what passed the check and what needs to be fixed. When your checks are complete, you use Workbench to push your content into your Islandora repository: ./workbench --config test.yml You can use a raw CSV file, a Google Sheet , or an Excel file as input, and your image, PDF, video, and other files can be stored locally , or at URLs on the web. Content types, fields, and nodes Below are the Drupal and Islandora concepts that will help you use Workbench effectively. Content types Relevance to using Workbench Generally speaking, Islandora Workbench can only work with a single content type at a time. You define this content type in the content_type configuration setting. Drupal categorizes what people see as \"pages\" on a Drupal website into content types. By default, Drupal provides \"Article\" and \"Basic Page\" content types, but site administrators can create custom content types. You can see the content types configured on your Drupal by logging in as an admin user and visiting /admin/structure/types . 
Or, you can navigate to the list of your site's content types by clicking on the Structure menu item, then the Content Types entry: Islandora, by default, creates a content type called a \"Repository Item\". But, many Islandora sites use additional content types, such as \"Collection\". To find the machine name of the content type you want to use with Workbench, visit the content type's configuration page. The machine name will be the last segment of the URL. In the following example, it's islandora_object : Fields Relevance to using Workbench The columns in your CSV file correspond to fields in your Islandora content type. The main structural difference between content types in Drupal is that each content type is configured to use a unique set of fields. A field in Drupal is the same as a \"field\" in metadata - it is a container for an individual piece of data. For example, all content types have a \"title\" field (although it might be labeled differently) to hold the page's title. Islandora's default content type, the Repository Item, uses metadata-oriented fields like \"Copyright date\", \"Extent\", \"Resource type\", and \"Subject\". Fields have two properties which you need to be familiar with in order to use Islandora Workbench: machine name type To help explain how these two properties work, we will use the following screenshot showing a few of the default fields in the \"Repository item\" content type: Relevance to using Workbench In most cases you can use fields' human-readable labels as column headers in your CSV file, but within Islandora Workbench configuration files, you must use field machine names. A field has a human-readable label, such as \"Copyright date\", but that label can change or can be translated, and, more significantly, doesn't need to be unique within a Drupal website. Drupal assigns each field a machine name that is more reliable for software to use than human-readable labels. These field machine names are all lower case, use underscores instead of spaces, and are guaranteed by Drupal to be unique within a content type. In the screenshot above, you can see the machine names in the middle column (you might need to zoom in!). For example, the machine name for the \"Copyright date\" field is field_copyright_date . A field's \"type\" determines the structure of the data it can hold. Some common field types used in Islandora are \"Text\" (and its derivatives \"Text (plain)\" and \"Text (plain, long)\"), \"Entity Reference\", \"Typed Relation\", \"EDTF\", and \"Link\". These field types are explained in the \" Field Data (CSV and Drupal) \" documentation, but the important point here is that they are all represented differently in your Workbench CSV. For example: EDTF fields take dates in the Library of Congress' Extended Date/Time Format (an example CSV entry is 1964/2008 ) Entity reference fields are used for taxonomy terms (an example entry is cats:Tabby , where \"cats\" is the name of the taxonomy and \"Tabby\" is the term) Typed relation fields are used for taxonomy entries that contain additional data indicating what \"type\" they are, such as using MARC relators to indicate the relationship of the taxonomy term to the item being described. An example typed relation CSV value is relators:aut:Jordan, Mark , where \"relators:aut\" indicates the term \"Jordan, Mark\" uses the MARC relator \"aut\", which stands for \"author\". Link fields take two pieces of information, a URL and the link text, like http://acme.com%%Acme Products Inc. 
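Putting those per-type examples together, a single row of a Workbench CSV that mixes these field types might look like the following. The column names are only illustrative; the machine names in your own content type will differ:

title,field_edtf_date,field_subject,field_linked_agent,field_related_websites
Small boats in Havana Harbour,1964/2008,cats:Tabby,\"relators:aut:Jordan, Mark\",http://acme.com%%Acme Products Inc.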
Relevance to using Workbench Drupal fields can be configured to have multiple values. Another important aspect of Drupal fields is their cardinality, or in other words, how many individual values they are configured to have. This is similar to the \"repeatability\" of fields in metadata schemas. Some fields are configured to hold only a single value, others to hold a a maximum number of values (three, for example), and others can hold an unlimited number of values. You can find each field's cardinality in its \"Field settings\" tab. Here is an example showing a field with unlimited cardinality: Drupal enforces cardinality very strictly. For this reason, if your CSV file contains more values for a field than the field's configuration allows, Workbench will truncate the number of values to match the maximum number allowed for the field. If it does this, it will leave an entry in its log so you know that it didn't add all the values in your CSV data. See the Islandora documentation for additional information about Drupal fields. Nodes Relevance to using Workbench In Islandora, a node is a metadata description - a grouping of data, contained in fields, that describe an item. Each row in your input CSV contains the field data that is used to create a node. Think of a \"node\" as a specific page in a Drupal website. Every node has a content type (e.g. \"Article\" or \"Repository Item\") containing content in the fields defined by its content type. It has a URL in the Drupal website, like https://mysite.org/node/3798 . The \"3798\" at the end of the URL is the node ID (also known as the \"nid\") and uniquely identifies the node within its website. In Islandora, a node is less like a \"web page\" and more like a \"catalogue record\" since Islandora-specific content types generally contain a lot of metadata-oriented fields rather than long discursive text like a blog would have. In create tasks, each row in your input CSV will create a single node. Islandora Workbench uses the node ID column in your CSV for some operations, for example updating nodes or adding media to nodes. Content in Islandora can be hierarchical. For example, collections contain items, newspapers contain issues which in turn contain pages, and compound items can contain a top-level \"parent\" node and many \"child\" nodes. Islandora defines a specific field, field_member_of (or in English, \"Member Of\") that contains the node ID of another node's parent. If this field is empty in a node, it has no parent; if this field contains a value, the node with that node ID is the first node's parent. Islandora Workbench provides several ways for you to create hierarchical content. If you want to learn more about how Drupal nodes work, consult the Islandora documentation . Taxonomies Relevance to using Workbench Drupal's taxonomy system lets you create local authority lists for names, subjects, genre terms, and other types of data. One of Drupal's most powerful features is its support for structured taxonomies (sometimes referred to as \"vocabularies\"). These can be used to maintain local authority lists of personal and corporate names, subjects, and other concepts, just like in other library/archives/museum tools. Islandora Workbench lets you create taxonomy terms in advance of the nodes they are attached to, or at the same time as the nodes. Also, within your CSV file, you can use term IDs, term URIs, or term names. 
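For instance, any one of the following values could appear in a taxonomy-backed CSV column. The column name field_subject is illustrative, and the three rows show the alternative forms of reference (term ID, vocabulary-prefixed term name, term URI), reusing example values from elsewhere on this site rather than terms from a single real vocabulary:

field_subject
589
cats:Tabby
http://pcdm.org/use#OriginalFile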
You can use term names both when you are creating new terms on the fly, or if you are assigning existing terms to newly created nodes. Drupal assigns each term an ID, much like it assigns each node an ID. These are called \"term IDs\" (or \"tids\"). Like node IDs, they are unique within a single Drupal instance but they are not unique across Drupal instances. Islandora uses several specific taxonomies extensively as part of its data model. These include Islandora Models (which determines how derivatives are generated for example) and Islandora Media Use (which indicates if a file is an \"Original file\" or a \"Service file\", for example). The taxonomies created by Islandora, such as Islandora Models and Islandora Media Use, include Linked Data URIs in the taxonomy term entries. These URIs are useful because they uniquely and reliably identify taxonomy terms across Drupal instances. For example, the taxonomy term with the Linked Data URI http://pcdm.org/use#OriginalFile is the same in two Drupal instances even if the term ID for the term is 589 in one instance and 23 in the other, or if the name of the term is in different languages. If you create your own taxonomies, you can also assign each term a Linked Data URI. Media Relevance to using Workbench By default, the file you upload using Islandora Workbench will be assigned the \"Original file\" media use term. Islandora will then automatically generate derivatives, such as thumbnails and extracted text where applicable, from that file and create additional media. However, you can use Workbench to upload additional files or pregenerated derivatives by assigning them other media use terms. Media in Islandora are the image, video, audio, PDF, and other content files that are attached to nodes. Together, a node and its attached media make up a resource or item. Media have types. Standard media types defined by Islandora are: Audio Document Extracted text FITS Technical Metadata File Image Remote video Video In general when using Workbench you don't need to worry about assigning a type to a file. Workbench infers a media's type from the file extensions, but you can override this if necessary. Media are also assigned terms from the Islandora Media Use vocabulary. These terms, combined with the media type, determine how the files are displayed to end users and how and what types of derivatives Islandora generates. They can also be useful in exporting content from Islandora and in digital preservation workflows (for example). A selection of terms from this vocabulary is: Original file Intermediate file Preservation Master File Service file Thumbnail image Transcript Extracted text This is an example of a video media showing how the media use terms are applied: The Islandora documentation provides additional information on media . Views Relevance to using Workbench You usually don't need to know anything about Views when using Islandora Workbench, but you can use Workbench to export CSV data from Drupal via a View. Views are another powerful Drupal feature that Islandora uses extensively. A View is a Drupal configuration that generates a list of things managed by Drupal, most commonly nodes. As a Workbench user, you will probably only use a View if you want to export data from Islandora via a get_data_from_view Workbench task. Behind the scenes, Workbench depends on a Drupal module called Islandora Workbench Integration that creates a number of custom Views that Workbench uses to interact with Drupal. 
So even though you might only use Views directly when exporting CSV data from Islandora, behind the scenes Workbench is getting information from Drupal constantly using a set of custom Views. REST Relevance to using Workbench As a Workbench user, you don't need to know anything about REST, but if you encounter a problem using Workbench and reach out for help, you might be asked to provide your log file, which will likely contain some raw REST data. REST is the protocol that Workbench uses to interact with Drupal. Fear not: as a user of Workbench, you don't need to know anything about REST - it's Workbench's job to shield you from REST's complexity. However, if things go wrong, Workbench will include in its log file some details about the particular REST request that didn't work (such as HTTP response codes and raw JSON). If you reach out for help , you might be asked to provide your Workbench log file to aid in troubleshooting. It's normal to see the raw data used in REST communication between Workbench and Drupal in the log.","title":"Workbench's relationship to Drupal and Islandora"},{"location":"drupal_and_workbench/#why-would-i-want-to-use-islandora-workbench","text":"Islandora Workbench lets you manage content in an Islandora repository at scale. Islandora provides web forms for creating and editing content on an item-by-item basis, but if you want to load a large number of items into an Islandora repository (or update or delete content in large numbers), you need a batch-oriented tool like Workbench. Simply put, Islandora Workbench enables you to get batches of content into an Islandora repository, and also update or delete content in large batches.","title":"Why would I want to use Islandora Workbench?"},{"location":"drupal_and_workbench/#how-do-i-use-islandora-workbench","text":"Islandora Workbench provides the ability to perform a set of \" tasks \". The focus of this page is the create task, but other tasks Workbench enables include update , delete , and add_media . To use Islandora Workbench to create new content, you need to assemble a CSV file containing metadata describing your content, and arrange the accompanying image, video, PDF, and other files in specific ways so that Workbench knows where to find them. Here is a very simple sample Workbench CSV file: file,id,title,field_model,field_description,date_generated,quality control by IMG_1410.tif,01,Small boats in Havana Harbour,Image,Taken on vacation in Cuba.,2021-02-12,MJ IMG_2549.jp2,02,Manhatten Island,Image,\"Taken from the ferry from downtown New York to Highlands, NJ.\",2021-02-12,MJ IMG_2940.JPG,03,Looking across Burrard Inlet,Image,View from Deep Cove to Burnaby Mountain.,2021-02-18,SP IMG_2958.JPG,04,Amsterdam waterfront,Image,Amsterdam waterfront on an overcast day.,2021-02-12,MJ IMG_5083.JPG,05,Alcatraz Island,Image,\"Taken from Fisherman's Wharf, San Francisco.\",2021-02-18,SP Then, you need to create a configuration file to tell Workbench the URL of your Islandora, which Drupal account credentials to use, and the location of your CSV file. You can customize many other aspects of Islandora Workbench by including various settings in your configuration file. Here is a very simple configuration file: task: create host: http://localhost:8000 username: admin password: islandora input_csv: input.csv content_type: islandora_object Relevance to using Workbench It is very important to run --check before you commit to having Workbench add content to your Drupal. 
Doing so lets Workbench find issues with your configuration and input CSV and files. See the \" Checking configuration and input data \" documentation for a complete list of the things --check looks for. When you have all these things ready, you tell Workbench to \"check\" your input data and configuration: ./workbench --config test.yml --check Workbench will provide a summary of what passed the check and what needs to be fixed. When your checks are complete, you use Workbench to push your content into your Islandora repository: ./workbench --config test.yml You can use a raw CSV file, a Google Sheet , or an Excel file as input, and your image, PDF, video, and other files can be stored locally , or at URLs on the web.","title":"How do I use Islandora Workbench?"},{"location":"drupal_and_workbench/#content-types-fields-and-nodes","text":"Below are the Drupal and Islandora concepts that will help you use Workbench effectively.","title":"Content types, fields, and nodes"},{"location":"drupal_and_workbench/#content-types","text":"Relevance to using Workbench Generally speaking, Islandora Workbench can only work with a single content type at a time. You define this content type in the content_type configuration setting. Drupal categorizes what people see as \"pages\" on a Drupal website into content types. By default, Drupal provides \"Article\" and \"Basic Page\" content types, but site administrators can create custom content types. You can see the content types configured on your Drupal by logging in as an admin user and visiting /admin/structure/types . Or, you can navigate to the list of your site's content types by clicking on the Structure menu item, then the Content Types entry: Islandora, by default, creates a content type called a \"Repository Item\". But, many Islandora sites use additional content types, such as \"Collection\". To find the machine name of the content type you want to use with Workbench, visit the content type's configuration page. The machine name will be the last segment of the URL. In the following example, it's islandora_object :","title":"Content types"},{"location":"drupal_and_workbench/#fields","text":"Relevance to using Workbench The columns in your CSV file correspond to fields in your Islandora content type. The main structural difference between content types in Drupal is that each content type is configured to use a unique set of fields. A field in Drupal the same as a \"field\" in metadata - it is a container for an individual piece of data. For example, all content types have a \"title\" field (although it might be labeled differently) to hold the page's title. Islandora's default content type, the Repository Item, uses metadata-oriented fields like \"Copyright date\", \"Extent\", \"Resource type\", and \"Subject\". Fields have two properties which you need to be familiar with in order to use Islandora Workbench: machine name type To help explain how these two properties work, we will use the following screenshot showing a few of the default fields in the \"Repository item\" content type: Relevance to using Workbench In most cases you can use a fields' human-readable labels as column headers in your CSV file, but within Islandora Workbench configuration files, you must use field machine names. A field has a human-readable label, such as \"Copyright date\", but that label can change or can be translated, and, more significantly, doesn't need to be unique within a Drupal website. 
Drupal assigns each field a machine name that is more reliable for software to use than human-readable labels. These field machine names are all lower case, use underscores instead of spaces, and are guaranteed by Drupal to be unique within a content type. In the screenshot above, you can see the machine names in the middle column (you might need to zoom in!). For example, the machine name for the \"Copyright date\" field is field_copyright_date . A field's \"type\" determines the structure of the data it can hold. Some common field types used in Islandora are \"Text\" (and its derivatives \"Text (plain)\" and \"Text (plain, long)\"), \"Entity Reference\", \"Typed Relation\", \"EDTF\", and \"Link\". These field types are explained in the \" Field Data (CSV and Drupal) \" documentation, but the important point here is that they are all represented differently in your Workbench CSV. For example: EDTF fields take dates in the Library of Congress' Extended Date/Time Format (an example CSV entry is 1964/2008 ) Entity reference fields are used for taxonomy terms (an example entry is cats:Tabby , where \"cats\" is the name of the taxonomy and \"Tabby\" is the term) Typed relation fields are used for taxonomy entries that contain additional data indicating what \"type\" they are, such as using MARC relators to indicate the relationship of the taxonomy term to the item being described. An example typed relation CSV value is relators:aut:Jordan, Mark , where \"relators:aut\" indicates the term \"Jordan, Mark\" uses the MARC relator \"aut\", which stands for \"author\". Link fields take two pieces of information, a URL and the link text, like http://acme.com%%Acme Products Inc. Relevance to using Workbench Drupal fields can be configured to have multiple values. Another important aspect of Drupal fields is their cardinality, or in other words, how many individual values they are configured to have. This is similar to the \"repeatability\" of fields in metadata schemas. Some fields are configured to hold only a single value, others to hold a a maximum number of values (three, for example), and others can hold an unlimited number of values. You can find each field's cardinality in its \"Field settings\" tab. Here is an example showing a field with unlimited cardinality: Drupal enforces cardinality very strictly. For this reason, if your CSV file contains more values for a field than the field's configuration allows, Workbench will truncate the number of values to match the maximum number allowed for the field. If it does this, it will leave an entry in its log so you know that it didn't add all the values in your CSV data. See the Islandora documentation for additional information about Drupal fields.","title":"Fields"},{"location":"drupal_and_workbench/#nodes","text":"Relevance to using Workbench In Islandora, a node is a metadata description - a grouping of data, contained in fields, that describe an item. Each row in your input CSV contains the field data that is used to create a node. Think of a \"node\" as a specific page in a Drupal website. Every node has a content type (e.g. \"Article\" or \"Repository Item\") containing content in the fields defined by its content type. It has a URL in the Drupal website, like https://mysite.org/node/3798 . The \"3798\" at the end of the URL is the node ID (also known as the \"nid\") and uniquely identifies the node within its website. 
In Islandora, a node is less like a \"web page\" and more like a \"catalogue record\" since Islandora-specific content types generally contain a lot of metadata-oriented fields rather than long discursive text like a blog would have. In create tasks, each row in your input CSV will create a single node. Islandora Workbench uses the node ID column in your CSV for some operations, for example updating nodes or adding media to nodes. Content in Islandora can be hierarchical. For example, collections contain items, newspapers contain issues which in turn contain pages, and compound items can contain a top-level \"parent\" node and many \"child\" nodes. Islandora defines a specific field, field_member_of (or in English, \"Member Of\") that contains the node ID of another node's parent. If this field is empty in a node, it has no parent; if this field contains a value, the node with that node ID is the first node's parent. Islandora Workbench provides several ways for you to create hierarchical content. If you want to learn more about how Drupal nodes work, consult the Islandora documentation .","title":"Nodes"},{"location":"drupal_and_workbench/#taxonomies","text":"Relevance to using Workbench Drupal's taxonomy system lets you create local authority lists for names, subjects, genre terms, and other types of data. One of Drupal's most powerful features is its support for structured taxonomies (sometimes referred to as \"vocabularies\"). These can be used to maintain local authority lists of personal and corporate names, subjects, and other concepts, just like in other library/archives/museum tools. Islandora Workbench lets you create taxonomy terms in advance of the nodes they are attached to, or at the same time as the nodes. Also, within your CSV file, you can use term IDs, term URIs, or term names. You can use term names both when you are creating new terms on the fly, or if you are assigning existing terms to newly created nodes. Drupal assigns each term an ID, much like it assigns each node an ID. These are called \"term IDs\" (or \"tids\"). Like node IDs, they are unique within a single Drupal instance but they are not unique across Drupal instances. Islandora uses several specific taxonomies extensively as part of its data model. These include Islandora Models (which determines how derivatives are generated for example) and Islandora Media Use (which indicates if a file is an \"Original file\" or a \"Service file\", for example). The taxonomies created by Islandora, such as Islandora Models and Islandora Media Use, include Linked Data URIs in the taxonomy term entries. These URIs are useful because they uniquely and reliably identify taxonomy terms across Drupal instances. For example, the taxonomy term with the Linked Data URI http://pcdm.org/use#OriginalFile is the same in two Drupal instances even if the term ID for the term is 589 in one instance and 23 in the other, or if the name of the term is in different languages. If you create your own taxonomies, you can also assign each term a Linked Data URI.","title":"Taxonomies"},{"location":"drupal_and_workbench/#media","text":"Relevance to using Workbench By default, the file you upload using Islandora Workbench will be assigned the \"Original file\" media use term. Islandora will then automatically generate derivatives, such as thumbnails and extracted text where applicable, from that file and create additional media. 
However, you can use Workbench to upload additional files or pregenerated derivatives by assigning them other media use terms. Media in Islandora are the image, video, audio, PDF, and other content files that are attached to nodes. Together, a node and its attached media make up a resource or item. Media have types. Standard media types defined by Islandora are: Audio Document Extracted text FITS Technical Metadata File Image Remote video Video In general when using Workbench you don't need to worry about assigning a type to a file. Workbench infers a media's type from the file extensions, but you can override this if necessary. Media are also assigned terms from the Islandora Media Use vocabulary. These terms, combined with the media type, determine how the files are displayed to end users and how and what types of derivatives Islandora generates. They can also be useful in exporting content from Islandora and in digital preservation workflows (for example). A selection of terms from this vocabulary is: Original file Intermediate file Preservation Master File Service file Thumbnail image Transcript Extracted text This is an example of a video media showing how the media use terms are applied: The Islandora documentation provides additional information on media .","title":"Media"},{"location":"drupal_and_workbench/#views","text":"Relevance to using Workbench You usually don't need to know anything about Views when using Islandora Workbench, but you can use Workbench to export CSV data from Drupal via a View. Views are another powerful Drupal feature that Islandora uses extensively. A View is a Drupal configuration that generates a list of things managed by Drupal, most commonly nodes. As a Workbench user, you will probably only use a View if you want to export data from Islandora via a get_data_from_view Workbench task. Behind the scenes, Workbench depends on a Drupal module called Islandora Workbench Integration that creates a number of custom Views that Workbench uses to interact with Drupal. So even though you might only use Views directly when exporting CSV data from Islandora, behind the scenes Workbench is getting information from Drupal constantly using a set of custom Views.","title":"Views"},{"location":"drupal_and_workbench/#rest","text":"Relevance to using Workbench As a Workbench user, you don't need to know anything about REST, but if you encounter a problem using Workbench and reach out for help, you might be asked to provide your log file, which will likely contain some raw REST data. REST is the protocol that Workbench uses to interact with Drupal. Fear not: as a user of Workbench, you don't need to know anything about REST - it's Workbench's job to shield you from REST's complexity. However, if things go wrong, Workbench will include in its log file some details about the particular REST request that didn't work (such as HTTP response codes and raw JSON). If you reach out for help , you might be asked to provide your Workbench log file to aid in troubleshooting. It's normal to see the raw data used in REST communication between Workbench and Drupal in the log.","title":"REST"},{"location":"exporting_islandora_7_content/","text":"Overview Islandora Workbench's main purpose is to load batches of content into an Islandora 2 repository. However, loading content can also be the last step in migrating from Islandora 7 to Islandora 2. As noted in the \" Workflows \" documentation, Workbench can be used in the \"load\" phase of a typical extract, transform, load (ETL) process. 
Workbench comes with a standalone script, get_islandora_7_content.py , that can be used to extract (a.k.a. \"export\") metadata and OBJ datastreams from an Islandora 7 instance. This data can form the basis for Workbench input data. To run the script, change into the Workbench \"i7Import\" directory and run: python3 get_islandora_7_content.py --config The script uses a number of configuration variables, all of which come with sensible defaults. Any of the following parameters can be changed in the user-supplied config file. Parameter Default Value Description solr_base_url http://localhost:8080/solr URL of your source Islandora 7.x Solr instance. islandora_base_url http://localhost:8000 URL of your source Islandora instance. csv_output_path islandora7_metadata.csv Path to the CSV file containing values from Solr. obj_directory /tmp/objs Path to the directory where datastream files will be saved. log_file_path islandora_content.log Path to the log file. fetch_files true Whether or not to fetch and save the datastream files from the source Islandora 7.x instance. get_file_uri false Whether or not to write datastream file URLs to the CSV file instead of fetching the files. One of `get_file_uri` or `fetch_files` can be set to `true`, but not both. field_pattern mods_.*(_s|_ms)$ A regular expression pattern to matching Solr fields to include in the CSV. field_pattern_do_not_want (marcrelator|isSequenceNumberOf) A regular expression pattern to matching Solr fields to not include in the CSV. standard_fields ['PID', 'RELS_EXT_hasModel_uri_s', 'RELS_EXT_isMemberOfCollection_uri_ms', 'RELS_EXT_isMemberOf_uri_ms', 'RELS_EXT_isConstituentOf_uri_ms', 'RELS_EXT_isPageOf_uri_ms'] List of fields to Solr fields to include in the CSV not matched by the regular expression in `field_pattern`. id_field PID The Solr field that uniquely identifies each object in the source Islandora 7.x instance. id_start_number 1 The number to use as the first Workbench ID within the CSV file. datastreams ['OBJ', 'PDF'] List of datastream IDs to fetch from the source Islandora 7.x instance. namespace * The namespace of objects you want to export from the source Islandora 7.x instance. collection PID of a single collection limiting the objects to fetch from the source Islandora 7.x instance. Only matches objects that have the specified collection as their immediate parent. For recursive collection membership, add `ancestors_ms` as a `solr_filter`, as documented below. Note: the colon in the collection PID must be escaped with a backslash (`\\`), e.g., `cartoons\\:collection`. content_model PID of a single content model limiting the objects to fetch from the source Islandora 7.x instance. Note: the colon in the content model PID must be escaped with a backslash (`\\`), e.g., `islandora\\:sp_large_image_cmodel`. solr_filters key:value pairs to add as filters to the Solr query. See examples below. pids_to_skip List of PIDs to not export, e.g. `pids_to_skip: [\"foo:234\", \"bar:7890\"]`. Useful if you are aware of problems with specifis Islandora 7 source objects and you don't want those objects to crash out the script. debug false Print debug information to the console. deep_debug false Print additional debug information to the console. secure_ssl_only true Whether or not to require valid SSL certificates. Set to `false` if you want to ignore SSL certificates. Analyzing your Islandora 7 Solr index In order to use the configuration options outlined above, you will need to know what fields are in your Islandora 7 Solr index. 
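Before digging into Solr fields, here is what a minimal configuration file for the export script might look like. This is a sketch that assumes the script accepts a YAML-style configuration file like Workbench's own; every setting shown comes from the table above, and the values are defaults or purely illustrative:

solr_base_url: http://localhost:8080/solr
islandora_base_url: http://localhost:8000
csv_output_path: islandora7_metadata.csv
obj_directory: /tmp/objs
fetch_files: true
field_pattern: mods_.*(_s|_ms)$
field_pattern_do_not_want: (marcrelator|isSequenceNumberOf)

The two field_pattern settings in particular depend on knowing what is actually in your index.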
Not all Islandoras are indexed in the same way, and since most Solr field names are derived from MODS or other XML datastream element names, there is enough variability in Solr fieldnames across Islandora 7 instances to make reliably predicting Solr fieldnames impractical. Two ways you can see the specific fields in your Solr index are 1) using the Islandora Metadata Extras module, and 2) issuing raw requests to your Islandora 7's Solr, both to get Solr content for sample objects and to get a list of all fields in your index. Using the Islandora Metadata Extras module The Islandora Metadata Extras module provides a \"Solr Metadata\" tab in each object's Manage menu that shows the raw Solr document for the object. The Solr field names and the values for the current object are easy to identify. Here are two screenshots showing the top section of the output and a sample from the middle of the output (broken into top and middle excerpts for illustration purposes here since the entire Solr document is very long): Top: Middle: Fetching sample Solr documents If you can't or don't want to install the Islandora Metadata Extras module, you can query Solr directly to get a sample document. To get the entire Solr document for an object in JSON format, issue the following request to your Solr, replacing km\\:10571 with the PID of your object. The -o option in the curl command (for \"output\") tells curl to save the response to the named file: curl -o km_10571.json \"http://localhost:8080/solr/select?q=PID:km\\:10571&wt=json\" The resulting JSON file will look like this . To get the Solr XML document for a specific object, remove the wt parameter from the request URL: curl -o km_10571.xml \"http://localhost:8080/solr/select?q=PID:km\\:10571\" The resulting file will look like this . In the XML version, the Solr fieldnames are the values of the \"name\" attribute of each element. Warning Solr documents for individual objects will not necessarily contain all the fieldnames in your index. In general, empty fields in the source XML (e.g. MODS) are not added to a Solr document. This means that inspecting individual Solr documents may not reveal all of the Solr fields you want to include in your Workbench CSV. You should get samples from a number of objects that you think will represent all of the Solr fields you are interested in. Fetching a CSV list of all Solr fieldnames To get a 1-row CSV file containing all of the fieldnames in your Solr index, issue the following request: curl -o allfields.csv \"http://localhost:8080/solr/select?q=*:*&wt=csv&rows=0&fl=*\" Unlike the queries for individual Islandora 7 objects' Solr documents, the results of this query will contain all the fields in your index. Configuring which Solr fields to include in the CSV As we can see from the examples above, Islandora 7's Solr schema contains a lot of fields, mirroring the richness of MODS (or other XML-based metadata) and the Fedora 3.x RELS-EXT properties. By default, this script fetches all the fields in the Islandora 7 Solr index, which will invariably be many, many more fields than you will normally want in the output CSV. You will need to tell the script which fields to exclude and which to include. The script takes the following approach to providing control over what fields end up in the CSV data it generates: It fetches a list of all fieldnames used in the Solr index. 
It then matches each fieldname against the regular expression pattern defined in the script's field_pattern variable, and if the match is successful, includes the fieldname in the CSV. For example, field_pattern = 'mods_.*(_s|_ms)$' will match every Solr field that starts with \"mods_\" and ends with either \"_s\" or \"_ms\". Next, it matches each remaining fieldname against the regular expression patterns defined in the script's field_pattern_do_not_want variable, and if the match is successful, removes the fieldname from the CSV. For example, field_pattern_do_not_want = '(marcrelator|isSequenceNumberOf)' will remove all fieldnames that contain either the string \"marcrelator\" or \"isSequenceNumberOf\". Note that the regular expression used in this configuration variable is not a negative pattern; in other words, if a fieldname matches this pattern, it is excluded from the field list. Finally, it adds to the start of the remaining list of fieldnames every Solr fieldname defined in the standard_fields configuration variable. This configuration variable provides a mechanism to ensure than any fields that are not included in step 2 are present in the generated CSV file. Warning You will always want at least the Solr fields \"PID\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_hasModel_uri_s\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_isConstituentOf_uri_ms\", and \"RELS_EXT_isPageOf_uri_ms\" in your standard_fields configuration variable since these fields contain information about objects' relationships to each other. Even with a well-configured set of pattern variables, the column headers are ugly, and there are a lot of them. Here is a sample from a minimal Islandora 7.x: file,PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isConstituentOf_uri_ms,RELS_EXT_isPageOf_uri_ms,mods_recordInfo_recordOrigin_ms,mods_name_personal_author_ms,mods_abstract_s,mods_name_aut_role_roleTerm_code_s,mods_name_personal_author_s,mods_typeOfResource_s,mods_subject_geographic_ms,mods_identifier_local_ms,mods_genre_ms,mods_name_photographer_role_roleTerm_code_s,mods_physicalDescription_form_all_ms,mods_physicalDescription_extent_ms,mods_subject_topic_ms,mods_name_namePart_s,mods_physicalDescription_form_authority_marcform_ms,mods_name_pht_s,mods_identifier_uuid_ms,mods_language_languageTerm_code_s,mods_physicalDescription_form_s,mods_accessCondition_use_and_reproduction_s,mods_name_personal_role_roleTerm_text_s,mods_name__role_roleTerm_code_ms,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_ms,mods_name_aut_s,mods_originInfo_encoding_iso8601_dateIssued_s,mods_originInfo_dateIssued_ms,mods_name_photographer_namePart_s,mods_name_pht_role_roleTerm_text_ms,mods_identifier_all_ms,mods_name_namePart_ms,mods_subject_geographic_s,mods_originInfo_publisher_ms,mods_subject_descendants_all_ms,mods_titleInfo_title_all_ms,mods_name_photographer_role_roleTerm_text_ms,mods_name_role_roleTerm_text_s,mods_titleInfo_title_ms,mods_name_photographer_s,mods_originInfo_place_placeTerm_text_s,mods_name_role_roleTerm_code_ms,mods_name_pht_role_roleTerm_code_s,mods_name_pht_namePart_s,mods_name_pht_namePart_ms,mods_name_role_roleTerm_code_s,mods_genre_all_ms,mods_physicalDescription_form_authority_marcform_s,mods_name_pht_role_roleTerm_code_ms,mods_extension_display_date_ms,mods_name_photographer_namePart_ms,mods_genre_authority_bgtchm_ms,mods_name_personal_role_roleTerm_text_ms,mods_name_pht_ms,mods_name_photographer_role_roleTerm_text_s,mods_language_languageTerm_code_ms,mods_origin
Info_place_placeTerm_text_ms,mods_titleInfo_title_s,mods_identifier_uuid_s,mods_language_languageTerm_code_authority_iso639-2b_s,mods_genre_s,mods_name_aut_role_roleTerm_code_ms,mods_typeOfResource_ms,mods_originInfo_encoding_iso8601_dateIssued_ms,mods_name_personal_author_role_roleTerm_text_ms,mods_abstract_ms,mods_language_languageTerm_text_s,mods_genre_authority_bgtchm_s,mods_language_languageTerm_s,mods_language_languageTerm_ms,mods_subject_topic_s,mods_name_photographer_ms,mods_name_pht_role_roleTerm_text_s,mods_recordInfo_recordOrigin_s,mods_name_aut_ms,mods_originInfo_publisher_s,mods_identifier_local_s,mods_language_languageTerm_text_ms,mods_physicalDescription_extent_s,mods_language_languageTerm_code_authority_iso639-2b_ms,mods_name__role_roleTerm_code_s,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_s,mods_name_photographer_role_roleTerm_code_ms,mods_name_role_roleTerm_text_ms,mods_name_personal_author_role_roleTerm_text_s,mods_accessCondition_use_and_reproduction_ms,mods_physicalDescription_form_ms,sequence The script-generated solr request may not in most cases be useful or even workable. You may need to experiment with the field_pattern and field_pattern_do_not_want configuration settings to reduce the number of Solr fields. Adding filters to your Solr query to limit the objects fetched from the source Islandora The namespace , collection , content_model , and solr_filters options documented above allow you to scope the set of objects exported from the source Islandora 7.x instance. The first three take simple, single values. The last option allows you to add arbitrary filters to the query sent to Solr in the form of key:value pairs, like this: solr_filters: - ancestors_ms: 'some\\:collection' - fgs_state_s: 'Active' Warning Colons within PIDs used in filters must be escaped with a backslash ( \\ ), e.g., some\\:collection . Putting your Solr request in a file You have the option of providing their own solr query in a text file and pointing to the file using the --metadata_solr_request option when running the script: python3 get_islandora_7_content.py --config --metadata_solr_request The contents of the file must contain a full HTTP request to Solr, e.g.: http://localhost:8080/solr/select?q=PID:*&wt=csv&rows=1000000&fl=PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isMemberOf_uri_ms,mods_originInfo_encoding_iso8601_dateIssued_mdt The advantage of putting your Solr request in its own file is that you have complete control over the Solr query. Requests to Solr in this file should always include the \"wt=csv\" and \"rows=1000000\" parameters (the \"rows\" parameter should have a value that is greater than the number of objects in your repository, otherwise Solr won't return all objects). Using the CSV as input for Workbench The CSV file generated by this script will almost certainly contain many more columns than you will want to ingest into Islandora 2. You will probably want to delete columns you don't need, combine the contents of several columns into one, and edit the contents of others. As we can see from the example above, the column headings in the CSV are Solr fieldnames ( RELS_EXT_hasModel_uri_s , mods_titleInfo_title_ms , etc.). You will need to replace those column headers with the equivalent fields as defined in your Drupal 9 content type . 
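One low-tech way to do that renaming is with a short Python script. The sketch below is an illustration only, not part of Workbench; the output filename and the Solr-to-Drupal header mapping are hypothetical and need to be adapted to your own content type (the input filename is the script's default csv_output_path).

import csv

# Illustrative only: rename Solr column headers to Drupal field machine names.
# The mapping and the output filename below are hypothetical.
header_map = {
    'PID': 'id',
    'mods_titleInfo_title_ms': 'title',
    'mods_abstract_ms': 'field_description',
    'sequence': 'field_weight',
}

with open('islandora7_metadata.csv', newline='') as source, open('workbench_input.csv', 'w', newline='') as destination:
    reader = csv.reader(source)
    writer = csv.writer(destination)
    headers = next(reader)
    # Headers not in the mapping are passed through unchanged for later cleanup.
    writer.writerow([header_map.get(h, h) for h in headers])
    for row in reader:
        writer.writerow(row)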
In addition, the metadata stored in Islandora 7's Solr index does not in many cases have the structure Workbench requires, so the data in the CSV file will need to be edited before it can be used by Workbench to create nodes. The content of Islandora 7 Solr fields is derived from MODS (or other) XML elements, and, with the exception of text-type fields, will not necessarily map cleanly to Drupal fields' data types. In other words, to use the CSV data generated by get_islandora_7_content.py , you will need to do some work to prepare it (or \"transform\" it, to use ETL language) to use it as input for Workbench. However, the script adds three columns to the CSV file that do not use Solr fieldnames and whose contents you should not edit but that you may need to rename: file , PID , and sequence : Do not edit name Rename 'sequence' column (e.g., or contents of to 'field_weight') but do not 'file' column. edit its contents. | / Every other column | | / Rename 'PID' to the | will need to be | | | value of your 'id_field' | deleted, renamed, | | | setting but do not edit | or its content | | | column contents. | edited. | v v v v --------------------------------------------------------------------- file,PID,RELS_EXT_hasModel_uri_s,[...],mods_typeOfResource_s,sequence First, the required Workbench column \"file\" is added to the beginning of each CSV row, and is populated with the filename of the OBJ datastream. This filename is based on the object's PID, with the the : replaced with an underscore, and has an extension determined by the OBJ datastream's MIME type. Second, \"PID\" is the Islandora 7.x PID of each object in the CSV file. This column header can be changed to \"id\" or whatever you have defined in your Workbench configuration file's id_field setting. Alternatively, you can set the value of id_field to PID and not rename that CSV column. Third, a \"sequence\" column is added at the end of each CSV row. This is where the get_islandora_7_content.py script stores the sequence number of each child object/page in relation to its parent . If an Islandora 7.x object has a property in its RELS-EXT datastream islandora:sSequenceNumberOfxxx (where \"xxx\" is the object's parent), the value of that property is added to the \"sequence\" column at the end of each row in the CSV. For paged content, this value is taken from the islandora:isSequenceNumber RELS-EXT property. These values are ready for use in the \"field_weight\" Drupal field created by the Islandora Defaults module; you can simply rename the \"sequence\" column header to \" field_weight\" when you use the CSV as input for Islandora Workbench. Note that you don't need to configure the script to include fields that contain \"isSequenceNumberOf\" or \"isSequenceNumber\" in your CSV; in fact, because there are so many of them in a typical Islandora 7 Solr index, you will want to exclude them using the field_pattern_do_not_want configuration variable. Excluding them is safe, since the script fetches the sequence information separately from the other CSV data. A fourth column in your Workbench CSV, field_member_of , is not added automatically. It contains the PID of the parent object, whether it is a collection, the top level object (parent) in a compound object, book object that has pages, etc. 
If an Islandora object has a value that should be in the field_member_of column, it will be in one or more (usually just one) of the following columns in the CSV created by the get_islandora_7_content.py script: RELS_EXT_isMemberOfCollection RELS_EXT_isPageOf RELS_EXT_isSequenceNumberOfXXX (where XXX is the PID of the parent object) RELS_EXT_isConstituentOf All of these columns will likely be present in your CSV, but it is possible that some may not be, for example if your Islandora 7 repository did not have a module enabled that uses one of those RELS-EXT properties. Note In general, the CSV that you need to end up with to ingest content into Islandora 2 using Workbench needs to have a structure similar to that described in the \" With page/child-level metadata \" section of the Workbench documentation for \"Creating paged, compound, and collection content.\" You should review the points in the \"Some important things to note\" section of that documentation. Note Solr escapes commas in its exported CSV with backslashes ( \\ ). Should look for and replace these escaped commas (e.g. \\, ) with regular commas before using the CSV with Workbench.","title":"Exporting Islandora 7 content"},{"location":"exporting_islandora_7_content/#overview","text":"Islandora Workbench's main purpose is to load batches of content into an Islandora 2 repository. However, loading content can also be the last step in migrating from Islandora 7 to Islandora 2. As noted in the \" Workflows \" documentation, Workbench can be used in the \"load\" phase of a typical extract, transform, load (ETL) process. Workbench comes with a standalone script, get_islandora_7_content.py , that can be used to extract (a.k.a. \"export\") metadata and OBJ datastreams from an Islandora 7 instance. This data can form the basis for Workbench input data. To run the script, change into the Workbench \"i7Import\" directory and run: python3 get_islandora_7_content.py --config The script uses a number of configuration variables, all of which come with sensible defaults. Any of the following parameters can be changed in the user-supplied config file. Parameter Default Value Description solr_base_url http://localhost:8080/solr URL of your source Islandora 7.x Solr instance. islandora_base_url http://localhost:8000 URL of your source Islandora instance. csv_output_path islandora7_metadata.csv Path to the CSV file containing values from Solr. obj_directory /tmp/objs Path to the directory where datastream files will be saved. log_file_path islandora_content.log Path to the log file. fetch_files true Whether or not to fetch and save the datastream files from the source Islandora 7.x instance. get_file_uri false Whether or not to write datastream file URLs to the CSV file instead of fetching the files. One of `get_file_uri` or `fetch_files` can be set to `true`, but not both. field_pattern mods_.*(_s|_ms)$ A regular expression pattern to matching Solr fields to include in the CSV. field_pattern_do_not_want (marcrelator|isSequenceNumberOf) A regular expression pattern to matching Solr fields to not include in the CSV. standard_fields ['PID', 'RELS_EXT_hasModel_uri_s', 'RELS_EXT_isMemberOfCollection_uri_ms', 'RELS_EXT_isMemberOf_uri_ms', 'RELS_EXT_isConstituentOf_uri_ms', 'RELS_EXT_isPageOf_uri_ms'] List of fields to Solr fields to include in the CSV not matched by the regular expression in `field_pattern`. id_field PID The Solr field that uniquely identifies each object in the source Islandora 7.x instance. 
id_start_number 1 The number to use as the first Workbench ID within the CSV file. datastreams ['OBJ', 'PDF'] List of datastream IDs to fetch from the source Islandora 7.x instance. namespace * The namespace of objects you want to export from the source Islandora 7.x instance. collection PID of a single collection limiting the objects to fetch from the source Islandora 7.x instance. Only matches objects that have the specified collection as their immediate parent. For recursive collection membership, add `ancestors_ms` as a `solr_filter`, as documented below. Note: the colon in the collection PID must be escaped with a backslash (`\\`), e.g., `cartoons\\:collection`. content_model PID of a single content model limiting the objects to fetch from the source Islandora 7.x instance. Note: the colon in the content model PID must be escaped with a backslash (`\\`), e.g., `islandora\\:sp_large_image_cmodel`. solr_filters key:value pairs to add as filters to the Solr query. See examples below. pids_to_skip List of PIDs to not export, e.g. `pids_to_skip: [\"foo:234\", \"bar:7890\"]`. Useful if you are aware of problems with specifis Islandora 7 source objects and you don't want those objects to crash out the script. debug false Print debug information to the console. deep_debug false Print additional debug information to the console. secure_ssl_only true Whether or not to require valid SSL certificates. Set to `false` if you want to ignore SSL certificates.","title":"Overview"},{"location":"exporting_islandora_7_content/#analyzing-your-islandora-7-solr-index","text":"In order to use the configuration options outlined above, you will need to know what fields are in your Islandora 7 Solr index. Not all Islandoras are indexed in the same way, and since most Solr field names are derived from MODS or other XML datastream element names, there is enough variability in Solr fieldnames across Islandora 7 instances to make reliably predicting Solr fieldnames impractical. Two ways you can see the specific fields in your Solr index are 1) using the Islandora Metadata Extras module, and 2) issuing raw requests to your Islandora 7's Solr, both to get Solr content for sample objects and to get a list of all fields in your index.","title":"Analyzing your Islandora 7 Solr index"},{"location":"exporting_islandora_7_content/#using-the-islandora-metadata-extras-module","text":"The Islandora Metadata Extras module provides a \"Solr Metadata\" tab in each object's Manage menu that shows the raw Solr document for the object. The Solr field names and the values for the current object are easy to identify. Here are two screenshots showing the top section of the output and a sample from the middle of the output (broken into top and middle excerpts for illustration purposes here since the entire Solr document is very long): Top: Middle:","title":"Using the Islandora Metadata Extras module"},{"location":"exporting_islandora_7_content/#fetching-sample-solr-documents","text":"If you can't or don't want to install the Islandora Metadata Extras module, you can query Solr directly to get a sample document. To get the entire Solr document for an object in JSON format, issue the following request to your Solr, replacing km\\:10571 with the PID of your object. The -o option in the curl command (for \"output\") tells curl to save the response to the named file: curl -o km_10571.json \"http://localhost:8080/solr/select?q=PID:km\\:10571&wt=json\" The resulting JSON file will look like this . 
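If you prefer Python to curl, the same request can be made with the requests library. The snippet below is a sketch that assumes the standard Solr JSON response structure (a "response" object containing a "docs" list) and simply prints the fieldnames present in the sample object's document; the Solr URL and PID are the same example values used above.

import requests

# Sketch: fetch one object's Solr document and list the fieldnames it contains.
# Assumes the standard Solr JSON response structure (response -> docs).
solr_base_url = 'http://localhost:8080/solr'
params = {'q': r'PID:km\:10571', 'wt': 'json'}
response = requests.get(solr_base_url + '/select', params=params)
doc = response.json()['response']['docs'][0]
for fieldname in sorted(doc.keys()):
    print(fieldname)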
To get Solr XML document for a specific object, remove the wt parameter from the request URL: curl -o km_10571.xml \"http://localhost:8080/solr/select?q=PID:km\\:10571\" The resulting file will look like this . In the XML version, the Solr fieldnames are the values of the \"name\" attribute of each element. Warning Solr documents for individual objects will not necessarily contain all the fieldnames in your index. In general, empty fields in the source XML (e.g. MODS) are not added to a Solr document. This means that even though inspecting individual Solr documents may not reveal all of the Solr fields you want to include in your Workbench CSV. You should get samples from a number of objects that you think will represent all of the Solr fields you are interested in.","title":"Fetching sample Solr documents"},{"location":"exporting_islandora_7_content/#fetching-a-csv-list-of-all-solr-fieldnames","text":"To get a 1-row CSV file containing all of the fieldnames in your Solr index, issue the following request: curl -o allfields.csv \"http://localhost:8080/solr/select?q=*:*&wt=csv&rows=0&fl=*\" Unlike the queries for individual Islandora 7 objects' Solr documents, the results of this query will contain all the fields in your index.","title":"Fetching a CSV list of all Solr fieldnames"},{"location":"exporting_islandora_7_content/#configuring-which-solr-fields-to-include-in-the-csv","text":"As we can see from the examples above, Islandora 7's Solr schema contains a lot of fields, mirroring the richness of MODS (or other XML-based metadata) and the Fedora 3.x RELS-EXT properties. By default, this script fetches all the fields in the Islandora 7's Solr's index, which will invariably be many, many more fields that you will normally want in the output CSV. You will need to tell the script which fields to exclude and which to include. The script takes the following approach to providing control over what fields end up in the CSV data it generates: It fetches a list of all fieldnames used in the Solr index. It then matches each fieldname against the regular expression pattern defined in the script's field_pattern variable, and if the match is successful, includes the fieldname in the CSV. For example, field_pattern = 'mods_.*(_s|_ms)$' will match every Solr field that starts with \"mods_\" and ends with either \"_s\" or \"_ms\". Next, it matches each remaining fieldname against the regular expression patterns defined in the script's field_pattern_do_not_want variable, and if the match is successful, removes the fieldname from the CSV. For example, field_pattern_do_not_want = '(marcrelator|isSequenceNumberOf)' will remove all fieldnames that contain either the string \"marcrelator\" or \"isSequenceNumberOf\". Note that the regular expression used in this configuration variable is not a negative pattern; in other words, if a fieldname matches this pattern, it is excluded from the field list. Finally, it adds to the start of the remaining list of fieldnames every Solr fieldname defined in the standard_fields configuration variable. This configuration variable provides a mechanism to ensure than any fields that are not included in step 2 are present in the generated CSV file. 
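Expressed as Python, the selection logic described above amounts to something like the following sketch. This is an illustration of the logic only, not the actual code in get_islandora_7_content.py; the pattern values and the standard_fields list are the documented defaults.

import re

# Illustration of the field-selection logic described above; not the script's actual code.
field_pattern = 'mods_.*(_s|_ms)$'
field_pattern_do_not_want = '(marcrelator|isSequenceNumberOf)'
standard_fields = ['PID', 'RELS_EXT_hasModel_uri_s', 'RELS_EXT_isMemberOfCollection_uri_ms', 'RELS_EXT_isMemberOf_uri_ms', 'RELS_EXT_isConstituentOf_uri_ms', 'RELS_EXT_isPageOf_uri_ms']

def select_fields(all_solr_fieldnames):
    # Step 2: keep only fieldnames that match field_pattern.
    fields = [f for f in all_solr_fieldnames if re.match(field_pattern, f)]
    # Step 3: drop fieldnames that match field_pattern_do_not_want.
    fields = [f for f in fields if not re.search(field_pattern_do_not_want, f)]
    # Step 4: put the standard fields at the start of the remaining list.
    return standard_fields + fields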
Warning You will always want at least the Solr fields \"PID\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_hasModel_uri_s\", \"RELS_EXT_isMemberOfCollection_uri_ms\", \"RELS_EXT_isConstituentOf_uri_ms\", and \"RELS_EXT_isPageOf_uri_ms\" in your standard_fields configuration variable since these fields contain information about objects' relationships to each other. Even with a well-configured set of pattern variables, the column headers are ugly, and there are a lot of them. Here is a sample from a minimal Islandora 7.x: file,PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isConstituentOf_uri_ms,RELS_EXT_isPageOf_uri_ms,mods_recordInfo_recordOrigin_ms,mods_name_personal_author_ms,mods_abstract_s,mods_name_aut_role_roleTerm_code_s,mods_name_personal_author_s,mods_typeOfResource_s,mods_subject_geographic_ms,mods_identifier_local_ms,mods_genre_ms,mods_name_photographer_role_roleTerm_code_s,mods_physicalDescription_form_all_ms,mods_physicalDescription_extent_ms,mods_subject_topic_ms,mods_name_namePart_s,mods_physicalDescription_form_authority_marcform_ms,mods_name_pht_s,mods_identifier_uuid_ms,mods_language_languageTerm_code_s,mods_physicalDescription_form_s,mods_accessCondition_use_and_reproduction_s,mods_name_personal_role_roleTerm_text_s,mods_name__role_roleTerm_code_ms,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_ms,mods_name_aut_s,mods_originInfo_encoding_iso8601_dateIssued_s,mods_originInfo_dateIssued_ms,mods_name_photographer_namePart_s,mods_name_pht_role_roleTerm_text_ms,mods_identifier_all_ms,mods_name_namePart_ms,mods_subject_geographic_s,mods_originInfo_publisher_ms,mods_subject_descendants_all_ms,mods_titleInfo_title_all_ms,mods_name_photographer_role_roleTerm_text_ms,mods_name_role_roleTerm_text_s,mods_titleInfo_title_ms,mods_name_photographer_s,mods_originInfo_place_placeTerm_text_s,mods_name_role_roleTerm_code_ms,mods_name_pht_role_roleTerm_code_s,mods_name_pht_namePart_s,mods_name_pht_namePart_ms,mods_name_role_roleTerm_code_s,mods_genre_all_ms,mods_physicalDescription_form_authority_marcform_s,mods_name_pht_role_roleTerm_code_ms,mods_extension_display_date_ms,mods_name_photographer_namePart_ms,mods_genre_authority_bgtchm_ms,mods_name_personal_role_roleTerm_text_ms,mods_name_pht_ms,mods_name_photographer_role_roleTerm_text_s,mods_language_languageTerm_code_ms,mods_originInfo_place_placeTerm_text_ms,mods_titleInfo_title_s,mods_identifier_uuid_s,mods_language_languageTerm_code_authority_iso639-2b_s,mods_genre_s,mods_name_aut_role_roleTerm_code_ms,mods_typeOfResource_ms,mods_originInfo_encoding_iso8601_dateIssued_ms,mods_name_personal_author_role_roleTerm_text_ms,mods_abstract_ms,mods_language_languageTerm_text_s,mods_genre_authority_bgtchm_s,mods_language_languageTerm_s,mods_language_languageTerm_ms,mods_subject_topic_s,mods_name_photographer_ms,mods_name_pht_role_roleTerm_text_s,mods_recordInfo_recordOrigin_s,mods_name_aut_ms,mods_originInfo_publisher_s,mods_identifier_local_s,mods_language_languageTerm_text_ms,mods_physicalDescription_extent_s,mods_language_languageTerm_code_authority_iso639-2b_ms,mods_name__role_roleTerm_code_s,mods_originInfo_encoding_w3cdtf_keyDate_yes_dateIssued_s,mods_name_photographer_role_roleTerm_code_ms,mods_name_role_roleTerm_text_ms,mods_name_personal_author_role_roleTerm_text_s,mods_accessCondition_use_and_reproduction_ms,mods_physicalDescription_form_ms,sequence The script-generated solr request may not in most cases be useful or even workable. 
You may need to experiment with the field_pattern and field_pattern_do_not_want configuration settings to reduce the number of Solr fields.","title":"Configuring which Solr fields to include in the CSV"},{"location":"exporting_islandora_7_content/#adding-filters-to-your-solr-query-to-limit-the-objects-fetched-from-the-source-islandora","text":"The namespace , collection , content_model , and solr_filters options documented above allow you to scope the set of objects exported from the source Islandora 7.x instance. The first three take simple, single values. The last option allows you to add arbitrary filters to the query sent to Solr in the form of key:value pairs, like this: solr_filters: - ancestors_ms: 'some\\:collection' - fgs_state_s: 'Active' Warning Colons within PIDs used in filters must be escaped with a backslash ( \\ ), e.g., some\\:collection .","title":"Adding filters to your Solr query to limit the objects fetched from the source Islandora"},{"location":"exporting_islandora_7_content/#putting-your-solr-request-in-a-file","text":"You have the option of providing their own solr query in a text file and pointing to the file using the --metadata_solr_request option when running the script: python3 get_islandora_7_content.py --config --metadata_solr_request The contents of the file must contain a full HTTP request to Solr, e.g.: http://localhost:8080/solr/select?q=PID:*&wt=csv&rows=1000000&fl=PID,RELS_EXT_hasModel_uri_s,RELS_EXT_isMemberOfCollection_uri_ms,RELS_EXT_isMemberOf_uri_ms,mods_originInfo_encoding_iso8601_dateIssued_mdt The advantage of putting your Solr request in its own file is that you have complete control over the Solr query. Requests to Solr in this file should always include the \"wt=csv\" and \"rows=1000000\" parameters (the \"rows\" parameter should have a value that is greater than the number of objects in your repository, otherwise Solr won't return all objects).","title":"Putting your Solr request in a file"},{"location":"exporting_islandora_7_content/#using-the-csv-as-input-for-workbench","text":"The CSV file generated by this script will almost certainly contain many more columns than you will want to ingest into Islandora 2. You will probably want to delete columns you don't need, combine the contents of several columns into one, and edit the contents of others. As we can see from the example above, the column headings in the CSV are Solr fieldnames ( RELS_EXT_hasModel_uri_s , mods_titleInfo_title_ms , etc.). You will need to replace those column headers with the equivalent fields as defined in your Drupal 9 content type . In addition, the metadata stored in Islandora 7's Solr index does not in many cases have the structure Workbench requires, so the data in the CSV file will need to be edited before it can be used by Workbench to create nodes. The content of Islandora 7 Solr fields is derived from MODS (or other) XML elements, and, with the exception of text-type fields, will not necessarily map cleanly to Drupal fields' data types. In other words, to use the CSV data generated by get_islandora_7_content.py , you will need to do some work to prepare it (or \"transform\" it, to use ETL language) to use it as input for Workbench. However, the script adds three columns to the CSV file that do not use Solr fieldnames and whose contents you should not edit but that you may need to rename: file , PID , and sequence : Do not edit name Rename 'sequence' column (e.g., or contents of to 'field_weight') but do not 'file' column. edit its contents. 
| / Every other column | | / Rename 'PID' to the | will need to be | | | value of your 'id_field' | deleted, renamed, | | | setting but do not edit | or its content | | | column contents. | edited. | v v v v --------------------------------------------------------------------- file,PID,RELS_EXT_hasModel_uri_s,[...],mods_typeOfResource_s,sequence First, the required Workbench column \"file\" is added to the beginning of each CSV row, and is populated with the filename of the OBJ datastream. This filename is based on the object's PID, with the the : replaced with an underscore, and has an extension determined by the OBJ datastream's MIME type. Second, \"PID\" is the Islandora 7.x PID of each object in the CSV file. This column header can be changed to \"id\" or whatever you have defined in your Workbench configuration file's id_field setting. Alternatively, you can set the value of id_field to PID and not rename that CSV column. Third, a \"sequence\" column is added at the end of each CSV row. This is where the get_islandora_7_content.py script stores the sequence number of each child object/page in relation to its parent . If an Islandora 7.x object has a property in its RELS-EXT datastream islandora:sSequenceNumberOfxxx (where \"xxx\" is the object's parent), the value of that property is added to the \"sequence\" column at the end of each row in the CSV. For paged content, this value is taken from the islandora:isSequenceNumber RELS-EXT property. These values are ready for use in the \"field_weight\" Drupal field created by the Islandora Defaults module; you can simply rename the \"sequence\" column header to \" field_weight\" when you use the CSV as input for Islandora Workbench. Note that you don't need to configure the script to include fields that contain \"isSequenceNumberOf\" or \"isSequenceNumber\" in your CSV; in fact, because there are so many of them in a typical Islandora 7 Solr index, you will want to exclude them using the field_pattern_do_not_want configuration variable. Excluding them is safe, since the script fetches the sequence information separately from the other CSV data. A fourth column in your Workbench CSV, field_member_of , is not added automatically. It contains the PID of the parent object, whether it is a collection, the top level object (parent) in a compound object, book object that has pages, etc. If an Islandora object has a value that should be in the field_member_of column, it will be in one or more (usually just one) of the following columns in the CSV created by the get_islandora_7_content.py script: RELS_EXT_isMemberOfCollection RELS_EXT_isPageOf RELS_EXT_isSequenceNumberOfXXX (where XXX is the PID of the parent object) RELS_EXT_isConstituentOf All of these columns will likely be present in your CSV, but it is possible that some may not be, for example if your Islandora 7 repository did not have a module enabled that uses one of those RELS-EXT properties. Note In general, the CSV that you need to end up with to ingest content into Islandora 2 using Workbench needs to have a structure similar to that described in the \" With page/child-level metadata \" section of the Workbench documentation for \"Creating paged, compound, and collection content.\" You should review the points in the \"Some important things to note\" section of that documentation. Note Solr escapes commas in its exported CSV with backslashes ( \\ ). Should look for and replace these escaped commas (e.g. 
\\, ) with regular commas before using the CSV with Workbench.","title":"Using the CSV as input for Workbench"},{"location":"field_templates/","text":"Note This section describes using CSV field templates in your configuration file. For information on CSV value templates, see \" CSV value templates \". For information on CSV file templates, see the \" CSV file templates \" section. In create and update tasks, you can configure field templates that are applied to each node as if the fields were present in your CSV file. The templates are configured in the csv_field_templates option. An example looks like this: csv_field_templates: - field_rights: \"The author of this work dedicates any and all copyright interest to the public domain.\" - field_member_of: 205 - field_model: 25 - field_tags: 231|257 Values in CSV field templates are structured the same as field values in your CSV (e.g., in the example above, field_tags is multivalued), and are validated against Drupal's configuration in the same way that values present in your CSV are validated. If a column with the field name used in a template is present in the CSV file, Workbench ignores the template and uses the data in the CSV file. If a column listed in the ignore_csv_columns setting , the value from the template is used.","title":"CSV field templates"},{"location":"fields/","text":"Workbench uses a CSV file to populate Islandora objects' metadata. This file contains the field values that is to be added to new or existing nodes, and some additional reserved columns specific to Workbench. Data in this CSV file can be: strings (for string or text fields) like Using Islandora Workbench for Fun and Profit integers like 7281 the binary values 1 or 0 Existing Drupal-generated entity IDs (term IDs for taxonomy terms or node IDs for collections and parents), which are integers like 10 or 3549 Workbench-specific structured strings for typed relation (e.g., relators:art:30 ), link fields (e.g., https://acme.net%%Acme Products ), geolocation fields (e.g., \"49.16667,-123.93333\" ), and authority link data (e.g., viaf%%http://viaf.org/viaf/10646807%%VIAF Record ) Note As is standard with CSV data, values do not need to be wrapped in double quotation marks ( \" ) unless they contain an instance of the delimiter character (e.g., a comma) or line breaks. Spreadsheet applications such as Google Sheets, LibreOffice Calc, and Excel will output valid CSV data. If you are using a spreadsheet application, it will take care of wrapping the CSV values in double quotation marks when they are necessary - you do not need to wrap the field values yourself. Reserved CSV columns The following CSV columns are used for specific purposes and in some cases are required in your CSV file, depending on the task you are performing (see below for specific cases). Data in them does not directly populate Drupal content-type fields. CSV field name Task(s) Note id create This CSV field is used by Workbench for internal purposes, and is not added to the metadata of your Islandora objects. Therefore, it doesn't need to have any relationship to the item described in the rest of the fields in the CSV file. You can configure this CSV field name to be something other than id by using the id_field option in your configuration file. Note that if the specified field contains multiple values, (e.g. 0001|spec-86389 ), the entire field value will be used as the internal Workbench identifier. 
Also note that it is important to use values in this column that are unique across input CSV files if you plan to create parent/child relationships accross Workbench sessions . parent_id create When creating paged or compound content, this column identifies the parent of the item described in the current row. For information on how to use this field, see \" With page/child-level metadata .\" Note that this field can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. node_id update, delete, add_media, export_csv, delete_media_by_node The ID of the node you are updating, deleting, or adding media to. Full URLs (including URL aliases) are also allowed in this CSV field. file create, add_media See detail in \"Values in the 'file' column\", below. media_use_tid create, add_media Tells Workbench which terms from the Islandora Media Use vocabulary to assign to media created in create and add_media tasks. This can be set for all new media in the configuration file; only include it in your CSV if you want row-level control over this value. More detail is available in the \" Configuration \" docs for media_use_tid . url_alias create, update See detail in \" Assigning URL aliases \". image_alt_text create See detail in \" Adding alt text to images \". checksum create See detail in \" Fixity checking \". term_name create_terms See detail in \" Creating taxonomy terms \". Values in the \"file\" column Values in the reserved file CSV field contain the location of files that are used to create Drupal Media. By default, Workbench pushes up to Drupal only one file, and creates only one resulting media per CSV record. However, it is possible to push up multiple files per CSV record (and create all of their corresponding media). File locations in the file field can be relative to the directory named in input_dir , absolute paths, or URLs. Examples of each: relative to directory named in the input_dir configuration setting: myfile.png absolute: /tmp/data/myfile.png . On Windows, you can use values like c:\\users\\mjordan\\files\\myfile.png or \\\\some.windows.file.share.org\\share_name\\files\\myfile.png . URL: http://example.com/files/myfile.png Things to note about file values in general: Relative, absolute, and URL file locations can exist within the same CSV file, or even within the same CSV value. By default, if the file value for a row is empty, Workbench will log the empty value, both in and outside of --check . file values that point to files that don't exist will result in Workbench logging the missing file and then exiting, unless allow_missing_files: true is present in your config file. Adding perform_soft_checks will also tell Workbench to not error out when the value in the file column can't be found. If you want do not want to create media for any of the rows in your CSV file, include nodes_only: true in your configuration file. More detail is available . file values that contain non-ASCII characters are normalized to their ASCII equivalents. See this issue for more information. The Drupal filesystem where files are stored is determined by each media type's file field configuration. It is not possible to override that configuration. 
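Workbench's --check already validates file locations for you, but if you want a rough standalone sanity check of the "file" column before running Workbench, something like the following sketch works. It is only an illustration: the CSV filename and input_dir value are hypothetical, and it assumes your ID column uses the default name "id".

import csv
import os
import requests

# Rough standalone sanity check of the "file" column; --check does this (and more) for you.
# The filenames, input_dir value, and "id" column name below are assumptions.
input_dir = 'input_data'
with open('metadata.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        location = (row.get('file') or '').strip()
        if location == '':
            print('Row ' + row['id'] + ': empty file value.')
        elif location.startswith(('http://', 'https://')):
            # A HEAD request is usually enough to see whether the URL resolves.
            if requests.head(location, allow_redirects=True).status_code >= 400:
                print('Row ' + row['id'] + ': URL not retrievable: ' + location)
        else:
            path = location if os.path.isabs(location) else os.path.join(input_dir, location)
            if not os.path.exists(path):
                print('Row ' + row['id'] + ': file not found: ' + path)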
Things to note about URLs as file values: Workbench downloads files identified by URLs and saves them in the directory named in temp_dir before processing them further; within this directory, each file is saved in a subdirectory named after the value in the row's id_field field. It does not delete the files from these locations after they have been ingested into Islandora unless the delete_tmp_upload configuration option is set to true . Files identified by URLs must be accessible to the Workbench script, which means they must not require a username/password; however, they can be protected by a firewall, etc. as long as the computer running Workbench is allowed to retrieve the files without authenticating. Currently Workbench requires that the URLs point directly to a file or a service that generates a file, and not a wrapper page or other indirect route to the file. Required columns A small number of columns are required in your CSV, depending on the task you are performing: Task Required in CSV Note create id See detail in \"Reserved CSV fields\", above. title The node title. file Empty values in the file field are allowed if allow_missing_files is present in your configuration file, in which case a node will be created but it will have no attached media. update node_id The node ID of an existing node you are updating. delete node_id The node ID of an existing node you are deleting. add_media node_id The node ID of an existing node you are attaching media to. file Must contain a filename, file path, or URL. allow_missing_files only works with the create task. If a required field is missing from your CSV, --check will tell you. Columns you want Workbench to ignore In some cases you might want to include columns in your CSV that you want Workbench to ignore. More information on this option is available in the \"Sharing the input CSV with other applications\" section of the Workflows documentation. CSV fields that contain Drupal field data These are of two types of Drupal fields, base fields and content-type specific fields. Base fields Base fields are basic node properties, shared by all content types. The base fields you can include in your CSV file are: title : This field is required for all rows in your CSV for the create task. Optional for the 'update' task. Drupal limits the title's length to 255 characters,unless the Node Title Length contrib module is installed. If that module is installed, you can set the maximum allowed title length using the max_node_title_length configuration setting. langcode : The language of the node. Optional. If included, use one of Drupal's language codes as values (common values are 'en', 'fr', and 'es'; the entire list can be seen here . If absent, Drupal sets the value to the default value for your content type. uid : The Drupal user ID to assign to the node and media created with the node. Optional. Only available in create tasks. If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). created : The timestamp to use in the node's \"created\" attribute and in the \"created\" attribute of the media created with the node. Optional, but if present, it must be in format 2020-11-15T23:49:22+00:00 (the +00:00 is the difference to Greenwich time/GMT). 
If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). published : Whether or not the node (and all accompanying media) is published. If present in add_media tasks, will override parent node's published value. Values in this field are either 1 (for published) or 0 (for unpublished). The default value for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . promote : Whether or not the node is promoted to the site's front page. 1 (for promoted) or 0 (for not promoted). The default vaue for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . All base fields other than uid can be included in both create and update tasks. Content type-specific fields These fields correspond directly to fields configured in Drupal nodes, and data you provide in them populates their equivalent field in Drupal entities. The column headings in the CSV file must match machine names of fields that exist in the target node content type. Fields' machine names are visible within the \"Manage fields\" section of each content type's configuration, here circled in red: These field names, plus the fields indicated in the \"Reserved CSV fields\" section above, are the column headers in your CSV file, like this: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Note If content-type field values apply to all of the rows in your CSV file, you can avoid including them in the CSV and instead use \" CSV field templates \". Using field labels as CSV column headers By default, Workbench requires that column headers in your CSV file use the machine name of Drupal fields. However, in \"create\", \"update\", and \"create_terms\" tasks, you can use the field labels if you include csv_headers: labels in your configuration file. If you do this, you can use CSV file like this: file,id,title,Model,Description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Some things to note about using field labels in your CSV: if the content type (or vocabulary) that you are populating uses the same label for multiple fields, you won't be able to use labels as your CSV column headers. --check will tell you if there are any duplicate field labels. Spaces in field labels are OK, e.g. Country of Publication . 
Spelling, capitalization, punctuation, etc. in CSV column headers must match the field labels exactly. If any field labels contain the character you are using as the CSV delimiter (defined in the delimiter config setting), you will need to wrap the column header in quotation marks, e.g. \"Height, length, weight\" . Single and multi-valued fields Drupal allows for fields to have a single value, a specific maximum number of values, or unlimited number of values. In the CSV input file, each Drupal field corresponds to a single CSV field. In other words, the CSV column names must be unique, even if a Drupal field allows multiple values. Populating multivalued fields is explained below. Single-valued fields In your CSV file, single-valued fields simply contain the value, which, depending on the field type, can be a string or an integer. For example, using the fields defined by the Islandora Defaults module for the \"Repository Item\" content type, your CSV file could look like this: file,title,id,field_model,field_description,field_rights,field_extent,field_access_terms,field_member_of myfile.jpg,My nice image,obj_00001,24,\"A fine image, yes?\",Do whatever you want with it.,There's only one image.,27,45 In this example, the term ID for the tag you want to assign in field_access_terms is 27, and the node ID of the collection you want to add the object to (in field_member_of ) is 45. Multivalued fields For multivalued fields, you separate the values within a field with a pipe ( | ), like this: file,title,field_something IMG_1410.tif,Small boats in Havana Harbour,One subvalue|Another subvalue IMG_2549.jp2,Manhatten Island,first subvalue|second subvalue|third subvalue This works for string fields as well as taxonomy reference fields, e.g.: file,title,field_my_multivalued_taxonomy_field IMG_1410.tif,Small boats in Havana Harbour,35|46 IMG_2549.jp2,Manhatten Island,34|56|28 Drupal strictly enforces the maximum number of values allowed in a field. If the number of values in your CSV file for a field exceed a field's configured maximum number of fields, Workbench will only populate the field to the field's configured limit. The subdelimiter character defaults to a pipe ( | ) but can be set in your config file using the subdelimiter configuration setting. Note Workbench will remove duplicate values in CSV fields. For example, if you accidentally use first subvalue|second subvalue|second subvalue in your CSV, Workbench will filter out the superfluous second subvalue . This applies to both create and update tasks, and within update tasks, replacing values and appending values to existing ones. Workbench deduplicates CVS values silently: it doesn't log the fact that it is doing it. Drupal field types The following types of Drupal fields can be populated from data in your input CSV file: text (plain, plain long, etc.) fields integer fields boolean fields, with values 1 or 0 EDTF date fields entity reference (taxonomy and linked node) fields typed relation (taxonomy) fields link fields geolocation fields Drupal is very strict about not accepting malformed data. Therefore, Islandora Workbench needs to provide data to Drupal that is consistent with field types (string, taxonomy reference, EDTF, etc.) we are populating. This applies not only to Drupal's base fields (as we saw above) but to all fields. A field's type is indicated in the same place as its machine name, within the \"Manage fields\" section of each content type's configuration. 
The field types are circled in red in the screen shot below: Below are guidelines for preparing CSV data that is compatible with common field types configured in Islandora repositories. Text fields Generally speaking, any Drupal field where the user enters free text into a node add/edit form is configured to be one of the Drupal \"Text\" field types. Islandora Workbench supports non-Latin characters in CSV, provided the CSV file is encoded as UTF-8. For example, the following non-Latin text will be added as expected to Drupal fields: \u4e00\u4e5d\u4e8c\u56db\u5e74\u516d\u6708\u5341\u4e8c\u65e5 (Traditional Chinese) \u0938\u0930\u0915\u093e\u0930\u0940 \u0926\u0938\u094d\u0924\u093e\u0935\u0947\u095b, \u0905\u0916\u092c\u093e\u0930\u094b\u0902 \u092e\u0947\u0902 \u091b\u092a\u0947 \u0932\u0947\u0916, \u0905\u0915\u093e\u0926\u092e\u093f\u0915 \u0915\u093f\u0924\u093e\u092c\u0947\u0902 (Hindi) \u140a\u1455\u1405\u14ef\u1585 \u14c4\u14c7, \u1405\u14c4\u1585\u1450\u1466 \u14c2\u1432\u1466 (Inuktitut) However, if all of your characters are Latin (basically, the characters found on a standard US keyboard) your CSV file can be encoded as ASCII. Some things to note about Drupal text fields: Some specialized forms of text fields, such as EDTF, enforce or prohibit the presence of specific types of characters (see below for EDTF's requirements). Islandora Workbench populates Drupal text fields verbatim with the content provided in the CSV file, with these exceptions . Plus, if a text value in your CSV is longer than its field's maximum configured length, Workbench will truncate the text (see the next point and warning below). Text fields may be configured to have a maximum length. Running Workbench with --check will produce a warning (both shown to the user and written to the Workbench log) if any of the values in your CSV file are longer than their field's configured maximum length. Warning If the CSV value for text field exceeds its configured maximum length, Workbench truncates the value to the maximum length before populating the Drupal field, leaving a log message indicating that it has done so. Text fields with markup Drupal text fields that are configured to contain \"formatted\" text (for example, text with line breaks or HTML markup) will have one of the available text formats, such as \"Full HTML\" or \"Basic HTML\", applied to them. Workbench treats these fields these fields the same as if they are populated using the node add/edit form, but you will have to tell Workbench, in your configuration file, which text format to apply to them. When you populate these fields using the node add/edit form, you need to select a text format within the WYSIWYG editor: When populating these fields using Workbench, you can configure which text format to use either 1) for all Drupal \"formatted\" text fields or 2) using a per-field configuration. 1) To configure the text format to use for all \"formatted\" text fields, include the text_format_id setting in your configuration file, indicating the ID of the text format to use, e.g., text_format_id: full_html . The default value for this setting is basic_html . 2) To configure text formats on a per-field basis, include the field_text_format_ids (plural) setting in your configuration file, along with a field machine name-to-format ID mapping, like this: field_text_format_ids: - field_description_long: full_html - field_abstract: restricted_html If you use both settings in your configuration file, field_text_format_ids takes precedence. 
You only need to configure text formats per field to override the global setting. Note Workbench has no way of knowing what text formats are configured in the target Drupal, and has no way of validating that the text format ID you use in your configuration file exists. However, if you use a text format ID that is invalid, Drupal will not allow nodes to be created or updated and will leave error messages in your Workbench log that contain text like Unprocessable Entity: validation failed.\\nfield_description_long.0.format: The value you selected is not a valid choice. By default, Drupal comes configured with three text formats, full_html , basic_html , and restricted_html . If you create your own text format at admin/config/content/formats , you can use its ID in the Workbench configuration settings described above. If you want to include line breaks in your CSV, they must be physical line breaks. \\n and other escaped line break characters are not recognized by Drupal's \"Convert line breaks into HTML (i.e.
<br> and <p>
    )\" text filter. CSV data containing physical line breaks must be wrapped in quotation marks, like this: id,file,title,field_model,field_description_long 01,,I am a title,Image,\"Line breaks are awesome.\" Taxonomy reference fields Note In the list of a content type's fields, as pictured above, Drupal uses \"Entity reference\" for all types of entity reference fields, of which Taxonomy references are one. The other most common kind of entity reference field is a node reference field. Islandora Workbench lets you assign both existing and new taxonomy terms to nodes. Creating new terms on demand during node creation reduces the need to prepopulate your vocabularies prior to creating nodes. In CSV columns for taxonomy fields, you can use either term IDs (integers) or term names (strings). You can even mix IDs and names in the same field: file,title,field_my_multivalued_taxonomy_field img001.png,Picture of cats and yarn,Cats|46 img002.png,Picture of dogs and sticks,Dogs|Sticks img003.png,Picture of yarn and needles,\"Yarn, Balls of|Knitting needles\" By default, if you use a term name in your CSV data that doesn't match a term name that exists in the referenced taxonomy, Workbench will detect this when you use --check , warn you, and exit. This strict default is intended to prevent users from accidentally adding unwanted terms through data entry error. Terms can be from any level in a vocabulary's hierarchy. In other words, if you have a vocabulary whose structure looks like this: you can use the terms IDs or labels for \"Automobiles\", \"Sports cars\", or \"Land Rover\" in your CSV. The term name (or ID) is all you need; no indication of the term's place in the hierarchy is required. If you add allow_adding_terms: true to your configuration file for any of the entity \"create\" or \"update\" tasks, Workbench will create the new term the first time it is used in the CSV file following these rules: If multiple records in your CSV contain the same new term name in the same field, the term is only created once. When Workbench checks to see if the term with the new name exists in the target vocabulary, it queries Drupal for the new term name, looking for an exact match against an existing term in the specified vocabulary. Therefore it is important that term names used in your CSV are identical to existing term names. The query to find existing term names follows these two rules: Leading and trailing whitespace on term names is ignored. Internal whitespace is significant. Case is ignored. Note that Drupal does not distinguish between diacritics. For example, it does not distinguish between \"Chylek\" and \"Ch\u00fdlek\". If the term name you provide in the CSV file does not match an existing term name in its vocabulary, the term name from the CSV data is used to create a new term. If it does match, Workbench populates the field in your nodes with a reference to the matching term. allow_adding_terms applies to all vocabularies. In general, you do not want to add new terms to vocabularies used by Islandora for system functions such as Islandora Models and Islandora Media Use. In order to exclude vocabularies from being added to, you can register vocabulary machine names in the protected_vocabularies setting, like this: protected_vocabularies: - islandora_model - islandora_display - islandora_media_use Adding new terms has some constraints: Terms created in this way do not have any external URIs, other fields, or if they are hierarchical. 
If you want your terms that have any of these features, you will need to either create the terms manually, through a create_terms task, or using a third-party module like Taxonomy Import prior to using their term names in an input CSV. Workbench cannot distinguish between identical term names within the same vocabulary. This means you cannot create two different terms that have the same term name (for example, two terms in the Person vocabulary that are identical but refer to two different people). The workaround for this is to create one of the terms before using Workbench and use the term ID instead of the term string. If the same term name exists multiple times in the same vocabulary (again using the example of two Person terms that describe two different people) you should be aware that when you use these identical term names within the same vocabulary in your CSV, Workbench will always choose the first one it encounters when it converts from term names to term IDs while populating your nodes. The workaround for this is to use the term ID for one (or both) of the identical terms, or to use URIs for one (or both) of the identical terms. --check will identify any new terms that exceed Drupal's maximum allowed length for term names, 255 characters. If a term name is longer than 255 characters, Workbench will truncate it at that length, log that it has done so, and create the term. Taxonomy terms created with new nodes are not removed when you delete the nodes. Using term names in multi-vocabulary fields While most node taxonomy fields reference only a single vocabulary, Drupal does allow fields to reference multiple vocabularies. This ability poses a problem when we use term names instead of term IDs in our CSV files: in a multi-vocabulary field, Workbench can't be sure which term name belongs in which of the multiple vocabularies referenced by that field. This applies to both existing terms and to new terms we want to add when creating node content. To avoid this problem, we need to tell Workbench which of the multiple vocabularies each term name should (or does) belong to. We do this by namespacing terms with the applicable vocabulary ID. For example, let's imagine we have a node field whose name is field_sample_tags , and this field references two vocabularies, \"cats\" and \"dogs\". To use the terms Tuxedo , Tabby , German Shepherd in the CSV when adding new nodes, we need to namespace them with vocabulary IDs like this: field_sample_tags cats:Tabby cats:Tuxedo dogs:German Shepherd If you want to use multiple terms in a single field, you would namespace them all: cats:Tuxedo|cats:Misbehaving|dogs:German Shepherd To find the vocabulary ID (referred to above as the \"namespace\") to use, visit the list of your site's vocabularies at admin/structure/taxonomy : Hover your pointer over the \"List terms\" button for each vocabulary to reveal the URL to its overview page. The ID for the vocabulary is the string between \"manage\" and \"overview\" in the URL. For example, in the URL admin/structure/taxonomy/manage/person/overview , the vocabulary ID is \"person\". This is the namespace you need to use to indicate which vocabulary to add new terms to. 
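To make the convention concrete, here is how a namespaced, multi-valued field value from the example above breaks down into vocabulary ID and term name pairs. This is an illustration only, not Workbench's own parsing code.

# Illustration only; this is not Workbench's own parsing code.
value = 'cats:Tuxedo|cats:Misbehaving|dogs:German Shepherd'
for subvalue in value.split('|'):
    vocabulary_id, term_name = subvalue.split(':', 1)
    print(vocabulary_id, '->', term_name)
# Prints:
# cats -> Tuxedo
# cats -> Misbehaving
# dogs -> German Shepherd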
CSV values containing term names that have commas ( , ) in multi-valued, multi-vocabulary fields need to be wrapped in quotation marks (like any CSV value containing a comma), and in addition, the need to specify the namespace within each of the subvalues: \"tags:gum, Bubble|tags:candy, Hard\" Using these conventions, Workbench will be certain which vocabulary the term names belong to. Workbench will remind you during its --check operation that you need to namespace terms. It determines 1) if the field references multiple vocabularies, and then checks to see 2) if the field's values in the CSV are term IDs or term names. If you use term names in multi-vocabulary fields, and the term names aren't namespaced, Workbench will warn you: Error: Term names in multi-vocabulary CSV field \"field_tags\" require a vocabulary namespace; value \"Dogs\" in row 4 does not have one. Note that since : is a special character when you use term names in multi-vocabulary CSV fields, you can't add a namespaced term that itself contains a : . You need to add it manually to Drupal and then use its term ID (or name, or URI) in your CSV file. Using term URIs instead of term IDs Islandora Workbench lets you use URIs assigned to terms instead of term IDs. You can use a term URI in the media_use_tid configuration option (for example, \"http://pcdm.org/use#OriginalFile\" ) and in taxonomy fields in your metadata CSV file: field_model https://schema.org/DigitalDocument http://purl.org/coar/resource_type/c_18cc During --check , Workbench will validate that URIs correspond to existing taxonomy terms. Using term URIs has some constraints: You cannot create a new term by providing a URI like you can by providing a term name. If the same URI is registered with more than one term, Workbench will choose one and write a warning to the log indicating which term it chose and which terms the URI is registered with. However, --check will detect that a URI is registered with more than one term and warn you. At that point you can edit your CSV file to use the correct term ID rather than the URI. Using numbers as term names If you want to use a term name like \"1990\" in your CSV, you need to tell Workbench to not interpret that term name as a term ID. To do this, add a list of CSV columns to your config file using the columns_with_term_names config setting that will only contain term names (and not IDs or URIs, as explained next): columns_with_term_names: - field_subject - field_tags If you register a column name here, it can contain only terms names. Any term ID or URIs will be interpreted as term names. Note that this is only necessary if your term names are comprised entirely of integers. If they contain decimals (like \"1990.3\"), Workbench will not interpret them as term IDs and you will not need to tell Workbench to do otherwise. Entity Reference Views fields Islandora Workbench fully supports taxonomy reference fields that use the \"Default\" reference type, but only partially supports \"Views: Filter by an entity reference View\" taxonomy reference fields. To populate this type of entity reference in \"create\" and \"update\" tasks, you have two options. Warning Regardless of whether you use term IDs or term names in your CSV, Workbench will not validate values in \"Views: Filter by an entity reference View\" taxonomy reference fields. Term IDs or term names that are not in the referenced View will result in the node not being created or updated (Drupal will return a 422 response). 
However, if allow_adding_terms is set to true , terms that are not in the referenced vocabulary will be added to the vocabulary if your CSV data contains a vocabulary ID/namespace in the form vocabid:newterm . The terms will be added regardless of whether they are within the referenced View. Therefore, for this type of Drupal field, you should not include vocabulary IDs/namespaces in your CSV data for that field. Further work on supporting this type of field is being tracked in this Github issue . First, if your input CSV contains only term IDs (not term names or URIs) in this type of column, you can include require_entity_reference_views: false in your configuration file. Alternatively, if you prefer to use term names instead of term IDs in the CSV column for this type of field, you will need to create a special Display for the View that is referenced from that field. To do this: In the View that is referenced, duplicate the view display as a REST Export display. When making the changes to the resulting REST Export display, be careful to modify that display only (and not \"All displays\") by always choosing \"This rest_export (override)\" during every change. Format: Serializer Settings: json Path: some_path (do not include the leading / ) Authentication: Basic Auth Access restrictions: Role > View published content (the default; \"administrator vocabularies and terms\" is needed for other endpoints used by Workbench, but this view doesn't require this.) Filter Criteria: Add \"Taxonomy term / name\" from the list of fields Expose this filter Choose the \"Is equal to\" operator Leave the Value field empty In the Field identifier field, enter \"name\" Your configuration for the new \"Taxonomy term: name\" filter should look like this: Then in your Workbench configuration file, using the entity_reference_view_endpoints setting, provide a mapping between columns in your CSV file and the applicable Views REST Export display path value (configured in step 3 above). In this example, we define three field/path mappings: entity_reference_view_endpoints: - field_linked_agent: /taxonomy/linked_agents - field_language: /language/lookup - field_subject: /taxonomy/subjects During \"create\" and \"update\" tasks, --check will tell you if the View REST Export display path is accessible. Typed Relation fields Typed relation fields contain information about the relationship (or \"relation\") between a taxonomy term and the node it is attached to. For example, a term from the Person vocabulary, \"Jordan, Mark\", can be an author, illustrator, or editor of the book described in the node. In this example, \"author\", \"illustrator\", and \"editor\" are the typed relations. Note Although Islandora supports Typed Relation fields that allow adding relations to other nodes, currently Workbench only supports adding relations to taxonomies. If you need support for adding Typed Relations to other entities, please leave a comment on this issue . The Controlled Access Terms module allows the relations to be sets of terms from external authority lists (for example, the MARC Relators list maintained by the Library of Congress). Within a Typed Relation field's configuration, the configured relations look like this: In this screenshot, \"relators\" is a namespace for the MARC Relators authority list, the codes \"acp\", \"adi\", etc. are the codes for each relator, and \"Art copyist\", \"Art director\", etc. are the human-readable labels for each relator.
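For reference, the relation pairs shown in that screenshot are typically entered in the Typed Relation field's \"Available Relations\" settings one pair per line, as code|label pairs. The lines below are a hedged sketch only; the exact syntax depends on your version of the Controlled Access Terms module:
relators:acp|Art copyist
relators:adi|Art director
relators:aut|Author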
Within the edit form of a node that has a Typed Relation field, the user interface adds a select list of the relation (the target taxonomy term here is \"Jordan, Mark (30))\", like this: To be able to populate Typed Relation fields using CSV data with the three pieces of required data (authority list, relation type, target term), Islandora Workbench supports CSV values that contain the corresponding namespace, relator code, and taxonomy term ID, each separated by a colon ( : ), like this: relators:art:30 In this example CSV value, relators is the namespace that the relation type art is from (the Library of Congress Relators vocabulary), and the target taxonomy term ID is 30 . Note Note that the structure required for typed relation values in the CSV file is not the same as the structure of the relations configuration depicted in the screenshot of the \"Available Relations\" list above. A second option for populating Typed Relation fields is to use taxonomy term names (as opposed to term IDs) as targets: \"relators:art:Jordan, Mark\" Warning In the next few paragraphs, the word \"namespace\" is used to describe two different kinds of namespaces - first, a vocabulary ID in the local Drupal and second, an ID for the external authority list of relators, for example by the Library of Congress. As we saw in the \"Using term names in multi-vocabulary fields\" section above, if the field that we are populating references multiple vocabularies, we need to tell Drupal which vocabulary we are referring to with a local vocabulary namespace. To add a local vocabulary namespace to Typed Relation field CSV structure, we prepend it to the term name, like this (note the addition of \"person\"): \"relators:art:person:Jordan, Mark\" (In this example, relators is the external authority lists's namespace, and person is the local Drupal vocabulary namespace, prepended to the taxonomy term name, \"Jordan, Mark\".) If this seems confusing and abstruse, don't worry. Running --check will tell you that you need to add the Drupal vocabulary namespace to values in specific CSV columns. The final option for populating Typed Relation field is to use HTTP URIs as typed relation targets: relators:art:http://markjordan.net If you want to include multiple typed relation values in a single field of your CSV file (such as in \"field_linked_agent\"), separate the three-part values with the same subdelimiter character you use in other fields, e.g. ( | ) (or whatever you have configured as your subdelimiter ): relators:art:30|relators:art:45 or \"relators:art:person:Jordan, Mark|relators:art:45\" Adding new typed relation targets Islandora Workbench allows you to add new typed relation targets while creating and updating nodes. These targets are taxonomy terms. Your configuration file must include the allow_adding_terms: true option to add new targets. In general, adding new typed relation targets is just like adding new taxonomy terms as described above in the \"Taxonomy relation fields\" section. An example of a CSV value that adds a new target term is: \"relators:art:person:Jordan, Mark\" You can also add multiple new targets: \"relators:art:person:Annez, Melissa|relators:art:person:Jordan, Mark\" Note that: For multi-vocabulary fields, new typed relator targets must be accompanied by a vocabulary namespace ( person in the above examples). You cannot add new relators (e.g. relators:foo ) in your CSV file, only new target terms. Note Adding the typed relation namespace, relators, and vocabulary names is a major hassle. 
If this information is the same for all values (in all rows) in your field_linked_agent column (or any other typed relation field), you can use CSV value templates to reduce the tedium. EDTF fields Running Islandora Workbench with --check will validate Extended Date/Time Format (EDTF) Specification dates (Levels 0, 1, and 2) in EDTF fields. Some common examples include: Type Examples Date 1976-04-23 1976-04 Qualified date 1976? 1976-04~ 1976-04-24% Date and time 1985-04-12T23:20:30 Interval 1964/2008 2004-06/2006-08 2004-06-04/2006-08-01 2004-06/2006-08-01 Set [1667,1668,1670..1672] [1672..1682] [1672,1673] [..1672] [1672..] Subvalues in multivalued CSV fields are validated separately, e.g. if your CSV value is 2004-06/2006-08|2007-01/2007-04 , 2004-06/2006-08 and 2007-01/2007-04 are validated separately. Note EDTF supports a very wide range of specific and general dates, and in some cases, valid dates can look counterintuitive. For example, \"2001-34\" is valid (it's Sub-Year Grouping meaning 2nd quarter of 2001). Link fields The link field type stores URLs (e.g. https://acme.com ) and link text in separate data elements. To add or update fields of this type, Workbench needs to provide the URL and link text in the structure Drupal expects. To accomplish this within a single CSV field, we separate the URL and link text pairs in CSV values with double percent signs ( %% ), like this: field_related_websites http://acme.com%%Acme Products Inc. You can include multiple pairs of URL/link text pairs in one CSV field if you separate them with the subdelimiter character: field_related_websites http://acme.com%%Acme Products Inc.|http://diy-first-aid.net%%DIY First Aid The URL is required, but the link text is not. If you don't have or want any link text, omit it and the double quotation marks: field_related_websites http://acme.com field_related_websites http://acme.com|http://diy-first-aid.net%%DIY First Aid Authority link fields The authority link field type stores abbreviations for authority sources (i.e., external controlled vocabularies such as national name authorities), authority URIs (e.g. http://viaf.org/viaf/153525475 ) and link text in separate data elements. Authority link fields are most commonly used on taxonomy terms, but can be used on nodes as well. To add or update fields of this type, Workbench needs to provide the authority source abbreviation, URI and link text in the structure Drupal expects. To accomplish this within a single CSV field, we separate the three parts in CSV values with double percent signs ( %% ), like this: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record You can include multiple triplets of source abbreviation/URL/link text in one CSV field if you separate them with the subdelimiter character: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record|other%%https://github.com/mjordan%%Github The authority source abbreviation and the URI are required, but the link text is not. If you don't have or want any link text, omit it: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807 field_authority_vocabs viaf%%http://viaf.org/viaf/10646807|other%%https://github.com/mjordan%%Github Geolocation fields The Geolocation field type, managed by the Geolocation Field contrib module, stores latitude and longitude coordinates in separate data elements. To add or update fields of this type, Workbench needs to provide the latitude and longitude data in these separate elements. 
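Behind the scenes, each geolocation value that Workbench sends to Drupal is a structure with separate latitude and longitude properties, roughly like the JSON sketch below (the lat / lng property names are those used by recent versions of the Geolocation Field module and are an assumption here; Workbench builds this structure for you from the CSV value):
\"field_coordinates\": [{\"lat\": 49.16667, \"lng\": -123.93333}]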
To simplify entering geocoordinates in the CSV file, Workbench allows geocoordinates to be in lat,long format, i.e., the latitude coordinate followed by a comma followed by the longitude coordinate. When Workbench reads your CSV file, it will split data on the comma into the required lat and long parts. An example of a single geocoordinate in a field would be: field_coordinates \"49.16667,-123.93333\" You can include multiple pairs of geocoordinates in one CSV field if you separate them with the subdelimiter character: field_coordinates \"49.16667,-123.93333|49.25,-124.8\" Note that: Geocoordinate values in your CSV need to be wrapped in double quotation marks, unless the delimiter key in your configuration file is set to something other than a comma. If you are entering geocoordinates into a spreadsheet, you may need to escape leading + and - signs since they will make the spreadsheet application think you are entering a formula. You can work around this by escaping the + and - with a backslash ( \\ ), e.g., 49.16667,-123.93333 should be \\+49.16667,-123.93333 , and 49.16667,-123.93333|49.25,-124.8 should be \\+49.16667,-123.93333|\\+49.25,-124.8 . Workbench will strip the leading \\ before it populates the Drupal fields. Excel: leading + and - need to be escaped Google Sheets: only + needs to be escaped LibreOffice Calc: neither + nor - needs to be escaped Paragraphs (Entity Reference Revisions fields) Entity Reference Revisions fields are similar to Drupal's core Entity Reference fields used for taxonomy terms, but are intended for entities that are not meant to be referenced outside the context of the item that references them. For Islandora sites, this is used for Paragraph entities. In order to populate paragraph entities using Workbench in create and update tasks, you need to enable and configure the REST endpoints for paragraphs. To do this, ensure the REST UI module is enabled, then go to Configuration/Web Services/REST Resources ( /admin/config/services/rest ) and enable \"Paragraph\". Then edit the settings for Paragraph to the following: Granularity = Method GET formats: jsonld, json authentication: jwt_auth, basic_auth, cookie POST formats: json authentication: jwt_auth, basic_auth, cookie DELETE formats: json authentication: jwt_auth, basic_auth, cookie PATCH formats: json authentication: jwt_auth, basic_auth, cookie Note Paragraphs are locked down from REST updates by default. To add new and update paragraph values you must enable the paragraphs_type_permissions submodule and ensure the Drupal user in your configuration file has sufficient privileges granted at /admin/people/permissions/module/paragraphs_type_permissions . Paragraphs are handy for Islandora as they provide flexibility in the creation of more complex metadata (such as complex titles, typed notes, or typed identifiers) by adding Drupal fields to paragraph entities, unlike the Typed Relationship field which hard-codes properties. However, this flexibility makes creating a Workbench import more complicated and, as such, requires additional configuration. For example, suppose you have a \"Full Title\" field ( field_full_title ) on your Islandora Object content type referencing a paragraph type called \"Complex Title\" ( complex_title ) that contains \"main title\" ( field_main_title ) and \"subtitle\" ( field_subtitle ) text fields.
The input CSV would look like: field_full_title My Title: A Subtitle|Alternate Title In this example we have two title values, \"My Title: A Subtitle\" (where \"My Title\" is the main title and \" A Subtitle\" is the subtitle) and \"Alternate Title\" (which only has a main title). To map these CSV values to our paragraph fields, we need to add the following to our configuration file: paragraph_fields: node: field_full_title: type: complex_title field_order: - field_main_title - field_subtitle field_delimiter: ':' subdelimiter: '|' This configuration defines the paragraph field on the node ( field_full_title ) and its child fields ( field_main_title and field_subtitle ), which occur within the paragraph in the order they are named in the field_order property. Within the data in the CSV column, the values corresponding to the order of those fields are separated by the character defined in the field_delimiter property. subdelimiter here is the same as the subdelimiter configuration setting used in non-paragraph multi-valued fields; in this example it overrides the default or globally configured value. We use a colon for the field delimiter in this example as it is often used in titles to denote subtitles. Note that in the above example, the space before \"A\" in the subtitle will be preserved. Whether or not you want a space there in your data will depend on how you display the Full Title field. Warning Note that Workbench assumes all fields within a paragraph are single-valued. When using Workbench to update paragraphs using update_mode: replace , any null values for fields within the paragraph (such as the null subtitle in the second \"Alternate Title\" instance in the example) will null out existing field values. However, considering each paragraph as a whole field value, Workbench behaves the same as for all other fields - update_mode: replace will replace all paragraph entities with the ones in the CSV, but if the CSV does not contain any values for this field then the field will be left as is. Values in the \"field_member_of\" column The field_member_of column can take a node ID, a full URL to a node, or a URL alias. For instance, all of these refer to the same node and can be used in field_member_of : 648 (node ID) https://islandora.traefik.me/node/648 (full URL) https://islandora.traefik.me/mycollection (full URL using an alias) /mycollection (URL alias) If you use any of these types of values other than the bare node ID, Workbench will look up the node ID based on the URL or alias. Values in the \"field_domain_access\" column The Domain Access module, part of the Domain suite of modules, creates a required, multivalued field with the machine name field_domain_access that controls, at the node level, which domains the node shows up in. When populating this field in your Workbench CSV, replace the periods in domain names with _ . For example, if the domains you want to allow a node to show up in are test1.testing.edu and test2.testing.edu , the values in your field_domain_access field look like this: test1_testing_edu test1_testing_edu test1_testing_edu|test2_testing_edu test2_testing_edu","title":"Field data (Drupal and CSV)"},{"location":"fields/#reserved-csv-columns","text":"The following CSV columns are used for specific purposes and in some cases are required in your CSV file, depending on the task you are performing (see below for specific cases). Data in them does not directly populate Drupal content-type fields. 
CSV field name Task(s) Note id create This CSV field is used by Workbench for internal purposes, and is not added to the metadata of your Islandora objects. Therefore, it doesn't need to have any relationship to the item described in the rest of the fields in the CSV file. You can configure this CSV field name to be something other than id by using the id_field option in your configuration file. Note that if the specified field contains multiple values, (e.g. 0001|spec-86389 ), the entire field value will be used as the internal Workbench identifier. Also note that it is important to use values in this column that are unique across input CSV files if you plan to create parent/child relationships accross Workbench sessions . parent_id create When creating paged or compound content, this column identifies the parent of the item described in the current row. For information on how to use this field, see \" With page/child-level metadata .\" Note that this field can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. node_id update, delete, add_media, export_csv, delete_media_by_node The ID of the node you are updating, deleting, or adding media to. Full URLs (including URL aliases) are also allowed in this CSV field. file create, add_media See detail in \"Values in the 'file' column\", below. media_use_tid create, add_media Tells Workbench which terms from the Islandora Media Use vocabulary to assign to media created in create and add_media tasks. This can be set for all new media in the configuration file; only include it in your CSV if you want row-level control over this value. More detail is available in the \" Configuration \" docs for media_use_tid . url_alias create, update See detail in \" Assigning URL aliases \". image_alt_text create See detail in \" Adding alt text to images \". checksum create See detail in \" Fixity checking \". term_name create_terms See detail in \" Creating taxonomy terms \".","title":"Reserved CSV columns"},{"location":"fields/#values-in-the-file-column","text":"Values in the reserved file CSV field contain the location of files that are used to create Drupal Media. By default, Workbench pushes up to Drupal only one file, and creates only one resulting media per CSV record. However, it is possible to push up multiple files per CSV record (and create all of their corresponding media). File locations in the file field can be relative to the directory named in input_dir , absolute paths, or URLs. Examples of each: relative to directory named in the input_dir configuration setting: myfile.png absolute: /tmp/data/myfile.png . On Windows, you can use values like c:\\users\\mjordan\\files\\myfile.png or \\\\some.windows.file.share.org\\share_name\\files\\myfile.png . URL: http://example.com/files/myfile.png Things to note about file values in general: Relative, absolute, and URL file locations can exist within the same CSV file, or even within the same CSV value. By default, if the file value for a row is empty, Workbench will log the empty value, both in and outside of --check . file values that point to files that don't exist will result in Workbench logging the missing file and then exiting, unless allow_missing_files: true is present in your config file. Adding perform_soft_checks will also tell Workbench to not error out when the value in the file column can't be found. 
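For example, a create task configuration that tolerates rows with missing or empty file values might include settings like the following sketch (only the relevant keys are shown; the usual connection and input settings are omitted):
task: create
allow_missing_files: true
perform_soft_checks: true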
If you want do not want to create media for any of the rows in your CSV file, include nodes_only: true in your configuration file. More detail is available . file values that contain non-ASCII characters are normalized to their ASCII equivalents. See this issue for more information. The Drupal filesystem where files are stored is determined by each media type's file field configuration. It is not possible to override that configuration. Things to note about URLs as file values: Workbench downloads files identified by URLs and saves them in the directory named in temp_dir before processing them further; within this directory, each file is saved in a subdirectory named after the value in the row's id_field field. It does not delete the files from these locations after they have been ingested into Islandora unless the delete_tmp_upload configuration option is set to true . Files identified by URLs must be accessible to the Workbench script, which means they must not require a username/password; however, they can be protected by a firewall, etc. as long as the computer running Workbench is allowed to retrieve the files without authenticating. Currently Workbench requires that the URLs point directly to a file or a service that generates a file, and not a wrapper page or other indirect route to the file.","title":"Values in the \"file\" column"},{"location":"fields/#required-columns","text":"A small number of columns are required in your CSV, depending on the task you are performing: Task Required in CSV Note create id See detail in \"Reserved CSV fields\", above. title The node title. file Empty values in the file field are allowed if allow_missing_files is present in your configuration file, in which case a node will be created but it will have no attached media. update node_id The node ID of an existing node you are updating. delete node_id The node ID of an existing node you are deleting. add_media node_id The node ID of an existing node you are attaching media to. file Must contain a filename, file path, or URL. allow_missing_files only works with the create task. If a required field is missing from your CSV, --check will tell you.","title":"Required columns"},{"location":"fields/#columns-you-want-workbench-to-ignore","text":"In some cases you might want to include columns in your CSV that you want Workbench to ignore. More information on this option is available in the \"Sharing the input CSV with other applications\" section of the Workflows documentation.","title":"Columns you want Workbench to ignore"},{"location":"fields/#csv-fields-that-contain-drupal-field-data","text":"These are of two types of Drupal fields, base fields and content-type specific fields.","title":"CSV fields that contain Drupal field data"},{"location":"fields/#base-fields","text":"Base fields are basic node properties, shared by all content types. The base fields you can include in your CSV file are: title : This field is required for all rows in your CSV for the create task. Optional for the 'update' task. Drupal limits the title's length to 255 characters,unless the Node Title Length contrib module is installed. If that module is installed, you can set the maximum allowed title length using the max_node_title_length configuration setting. langcode : The language of the node. Optional. If included, use one of Drupal's language codes as values (common values are 'en', 'fr', and 'es'; the entire list can be seen here . If absent, Drupal sets the value to the default value for your content type. 
uid : The Drupal user ID to assign to the node and media created with the node. Optional. Only available in create tasks. If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). created : The timestamp to use in the node's \"created\" attribute and in the \"created\" attribute of the media created with the node. Optional, but if present, it must be in format 2020-11-15T23:49:22+00:00 (the +00:00 is the difference to Greenwich time/GMT). If you are creating paged/compound objects from directories, this value is applied to the parent's children (if you are creating them using the page/child-level metadata method, these fields must be in your CSV metadata). published : Whether or not the node (and all accompanying media) is published. If present in add_media tasks, will override parent node's published value. Values in this field are either 1 (for published) or 0 (for unpublished). The default value for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . promote : Whether or not the node is promoted to the site's front page. 1 (for promoted) or 0 (for not promoted). The default vaue for this field is defined within each Drupal content type's (and media type's) configuration, and may be determined by contrib modules such as Workflow . All base fields other than uid can be included in both create and update tasks.","title":"Base fields"},{"location":"fields/#content-type-specific-fields","text":"These fields correspond directly to fields configured in Drupal nodes, and data you provide in them populates their equivalent field in Drupal entities. The column headings in the CSV file must match machine names of fields that exist in the target node content type. Fields' machine names are visible within the \"Manage fields\" section of each content type's configuration, here circled in red: These field names, plus the fields indicated in the \"Reserved CSV fields\" section above, are the column headers in your CSV file, like this: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Note If content-type field values apply to all of the rows in your CSV file, you can avoid including them in the CSV and instead use \" CSV field templates \".","title":"Content type-specific fields"},{"location":"fields/#using-field-labels-as-csv-column-headers","text":"By default, Workbench requires that column headers in your CSV file use the machine name of Drupal fields. However, in \"create\", \"update\", and \"create_terms\" tasks, you can use the field labels if you include csv_headers: labels in your configuration file. If you do this, you can use CSV file like this: file,id,title,Model,Description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 
IMG_2549.jp2,02,Manhatten Island,25,\"Taken from the ferry from downtown New York to Highlands, NJ. Weather was windy.\" IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. Simon Fraser University is visible on the top of the mountain in the distance. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Some things to note about using field labels in your CSV: if the content type (or vocabulary) that you are populating uses the same label for multiple fields, you won't be able to use labels as your CSV column headers. --check will tell you if there are any duplicate field labels. Spaces in field labels are OK, e.g. Country of Publication . Spelling, capitalization, punctuation, etc. in CSV column headers must match the field labels exactly. If any field labels contain the character you are using as the CSV delimiter (defined in the delimiter config setting), you will need to wrap the column header in quotation marks, e.g. \"Height, length, weight\" .","title":"Using field labels as CSV column headers"},{"location":"fields/#single-and-multi-valued-fields","text":"Drupal allows for fields to have a single value, a specific maximum number of values, or unlimited number of values. In the CSV input file, each Drupal field corresponds to a single CSV field. In other words, the CSV column names must be unique, even if a Drupal field allows multiple values. Populating multivalued fields is explained below.","title":"Single and multi-valued fields"},{"location":"fields/#single-valued-fields","text":"In your CSV file, single-valued fields simply contain the value, which, depending on the field type, can be a string or an integer. For example, using the fields defined by the Islandora Defaults module for the \"Repository Item\" content type, your CSV file could look like this: file,title,id,field_model,field_description,field_rights,field_extent,field_access_terms,field_member_of myfile.jpg,My nice image,obj_00001,24,\"A fine image, yes?\",Do whatever you want with it.,There's only one image.,27,45 In this example, the term ID for the tag you want to assign in field_access_terms is 27, and the node ID of the collection you want to add the object to (in field_member_of ) is 45.","title":"Single-valued fields"},{"location":"fields/#multivalued-fields","text":"For multivalued fields, you separate the values within a field with a pipe ( | ), like this: file,title,field_something IMG_1410.tif,Small boats in Havana Harbour,One subvalue|Another subvalue IMG_2549.jp2,Manhatten Island,first subvalue|second subvalue|third subvalue This works for string fields as well as taxonomy reference fields, e.g.: file,title,field_my_multivalued_taxonomy_field IMG_1410.tif,Small boats in Havana Harbour,35|46 IMG_2549.jp2,Manhatten Island,34|56|28 Drupal strictly enforces the maximum number of values allowed in a field. If the number of values in your CSV file for a field exceed a field's configured maximum number of fields, Workbench will only populate the field to the field's configured limit. The subdelimiter character defaults to a pipe ( | ) but can be set in your config file using the subdelimiter configuration setting. Note Workbench will remove duplicate values in CSV fields. For example, if you accidentally use first subvalue|second subvalue|second subvalue in your CSV, Workbench will filter out the superfluous second subvalue . 
This applies to both create and update tasks, and within update tasks, replacing values and appending values to existing ones. Workbench deduplicates CVS values silently: it doesn't log the fact that it is doing it.","title":"Multivalued fields"},{"location":"fields/#drupal-field-types","text":"The following types of Drupal fields can be populated from data in your input CSV file: text (plain, plain long, etc.) fields integer fields boolean fields, with values 1 or 0 EDTF date fields entity reference (taxonomy and linked node) fields typed relation (taxonomy) fields link fields geolocation fields Drupal is very strict about not accepting malformed data. Therefore, Islandora Workbench needs to provide data to Drupal that is consistent with field types (string, taxonomy reference, EDTF, etc.) we are populating. This applies not only to Drupal's base fields (as we saw above) but to all fields. A field's type is indicated in the same place as its machine name, within the \"Manage fields\" section of each content type's configuration. The field types are circled in red in the screen shot below: Below are guidelines for preparing CSV data that is compatible with common field types configured in Islandora repositories.","title":"Drupal field types"},{"location":"fields/#text-fields","text":"Generally speaking, any Drupal field where the user enters free text into a node add/edit form is configured to be one of the Drupal \"Text\" field types. Islandora Workbench supports non-Latin characters in CSV, provided the CSV file is encoded as UTF-8. For example, the following non-Latin text will be added as expected to Drupal fields: \u4e00\u4e5d\u4e8c\u56db\u5e74\u516d\u6708\u5341\u4e8c\u65e5 (Traditional Chinese) \u0938\u0930\u0915\u093e\u0930\u0940 \u0926\u0938\u094d\u0924\u093e\u0935\u0947\u095b, \u0905\u0916\u092c\u093e\u0930\u094b\u0902 \u092e\u0947\u0902 \u091b\u092a\u0947 \u0932\u0947\u0916, \u0905\u0915\u093e\u0926\u092e\u093f\u0915 \u0915\u093f\u0924\u093e\u092c\u0947\u0902 (Hindi) \u140a\u1455\u1405\u14ef\u1585 \u14c4\u14c7, \u1405\u14c4\u1585\u1450\u1466 \u14c2\u1432\u1466 (Inuktitut) However, if all of your characters are Latin (basically, the characters found on a standard US keyboard) your CSV file can be encoded as ASCII. Some things to note about Drupal text fields: Some specialized forms of text fields, such as EDTF, enforce or prohibit the presence of specific types of characters (see below for EDTF's requirements). Islandora Workbench populates Drupal text fields verbatim with the content provided in the CSV file, with these exceptions . Plus, if a text value in your CSV is longer than its field's maximum configured length, Workbench will truncate the text (see the next point and warning below). Text fields may be configured to have a maximum length. Running Workbench with --check will produce a warning (both shown to the user and written to the Workbench log) if any of the values in your CSV file are longer than their field's configured maximum length. Warning If the CSV value for text field exceeds its configured maximum length, Workbench truncates the value to the maximum length before populating the Drupal field, leaving a log message indicating that it has done so.","title":"Text fields"},{"location":"fields/#text-fields-with-markup","text":"Drupal text fields that are configured to contain \"formatted\" text (for example, text with line breaks or HTML markup) will have one of the available text formats, such as \"Full HTML\" or \"Basic HTML\", applied to them. 
Workbench treats these fields these fields the same as if they are populated using the node add/edit form, but you will have to tell Workbench, in your configuration file, which text format to apply to them. When you populate these fields using the node add/edit form, you need to select a text format within the WYSIWYG editor: When populating these fields using Workbench, you can configure which text format to use either 1) for all Drupal \"formatted\" text fields or 2) using a per-field configuration. 1) To configure the text format to use for all \"formatted\" text fields, include the text_format_id setting in your configuration file, indicating the ID of the text format to use, e.g., text_format_id: full_html . The default value for this setting is basic_html . 2) To configure text formats on a per-field basis, include the field_text_format_ids (plural) setting in your configuration file, along with a field machine name-to-format ID mapping, like this: field_text_format_ids: - field_description_long: full_html - field_abstract: restricted_html If you use both settings in your configuration file, field_text_format_ids takes precedence. You only need to configure text formats per field to override the global setting. Note Workbench has no way of knowing what text formats are configured in the target Drupal, and has no way of validating that the text format ID you use in your configuration file exists. However, if you use a text format ID that is invalid, Drupal will not allow nodes to be created or updated and will leave error messages in your Workbench log that contain text like Unprocessable Entity: validation failed.\\nfield_description_long.0.format: The value you selected is not a valid choice. By default, Drupal comes configured with three text formats, full_html , basic_html , and restricted_html . If you create your own text format at admin/config/content/formats , you can use its ID in the Workbench configuration settings described above. If you want to include line breaks in your CSV, they must be physical line breaks. \\n and other escaped line break characters are not recognized by Drupal's \"Convert line breaks into HTML (i.e.
<br> and <p>
    )\" text filter. CSV data containing physical line breaks must be wrapped in quotation marks, like this: id,file,title,field_model,field_description_long 01,,I am a title,Image,\"Line breaks are awesome.\"","title":"Text fields with markup"},{"location":"fields/#taxonomy-reference-fields","text":"Note In the list of a content type's fields, as pictured above, Drupal uses \"Entity reference\" for all types of entity reference fields, of which Taxonomy references are one. The other most common kind of entity reference field is a node reference field. Islandora Workbench lets you assign both existing and new taxonomy terms to nodes. Creating new terms on demand during node creation reduces the need to prepopulate your vocabularies prior to creating nodes. In CSV columns for taxonomy fields, you can use either term IDs (integers) or term names (strings). You can even mix IDs and names in the same field: file,title,field_my_multivalued_taxonomy_field img001.png,Picture of cats and yarn,Cats|46 img002.png,Picture of dogs and sticks,Dogs|Sticks img003.png,Picture of yarn and needles,\"Yarn, Balls of|Knitting needles\" By default, if you use a term name in your CSV data that doesn't match a term name that exists in the referenced taxonomy, Workbench will detect this when you use --check , warn you, and exit. This strict default is intended to prevent users from accidentally adding unwanted terms through data entry error. Terms can be from any level in a vocabulary's hierarchy. In other words, if you have a vocabulary whose structure looks like this: you can use the terms IDs or labels for \"Automobiles\", \"Sports cars\", or \"Land Rover\" in your CSV. The term name (or ID) is all you need; no indication of the term's place in the hierarchy is required. If you add allow_adding_terms: true to your configuration file for any of the entity \"create\" or \"update\" tasks, Workbench will create the new term the first time it is used in the CSV file following these rules: If multiple records in your CSV contain the same new term name in the same field, the term is only created once. When Workbench checks to see if the term with the new name exists in the target vocabulary, it queries Drupal for the new term name, looking for an exact match against an existing term in the specified vocabulary. Therefore it is important that term names used in your CSV are identical to existing term names. The query to find existing term names follows these two rules: Leading and trailing whitespace on term names is ignored. Internal whitespace is significant. Case is ignored. Note that Drupal does not distinguish between diacritics. For example, it does not distinguish between \"Chylek\" and \"Ch\u00fdlek\". If the term name you provide in the CSV file does not match an existing term name in its vocabulary, the term name from the CSV data is used to create a new term. If it does match, Workbench populates the field in your nodes with a reference to the matching term. allow_adding_terms applies to all vocabularies. In general, you do not want to add new terms to vocabularies used by Islandora for system functions such as Islandora Models and Islandora Media Use. 
In order to exclude vocabularies from being added to, you can register vocabulary machine names in the protected_vocabularies setting, like this: protected_vocabularies: - islandora_model - islandora_display - islandora_media_use Adding new terms has some constraints: Terms created in this way do not have any external URIs, other fields, or if they are hierarchical. If you want your terms that have any of these features, you will need to either create the terms manually, through a create_terms task, or using a third-party module like Taxonomy Import prior to using their term names in an input CSV. Workbench cannot distinguish between identical term names within the same vocabulary. This means you cannot create two different terms that have the same term name (for example, two terms in the Person vocabulary that are identical but refer to two different people). The workaround for this is to create one of the terms before using Workbench and use the term ID instead of the term string. If the same term name exists multiple times in the same vocabulary (again using the example of two Person terms that describe two different people) you should be aware that when you use these identical term names within the same vocabulary in your CSV, Workbench will always choose the first one it encounters when it converts from term names to term IDs while populating your nodes. The workaround for this is to use the term ID for one (or both) of the identical terms, or to use URIs for one (or both) of the identical terms. --check will identify any new terms that exceed Drupal's maximum allowed length for term names, 255 characters. If a term name is longer than 255 characters, Workbench will truncate it at that length, log that it has done so, and create the term. Taxonomy terms created with new nodes are not removed when you delete the nodes.","title":"Taxonomy reference fields"},{"location":"fields/#using-term-names-in-multi-vocabulary-fields","text":"While most node taxonomy fields reference only a single vocabulary, Drupal does allow fields to reference multiple vocabularies. This ability poses a problem when we use term names instead of term IDs in our CSV files: in a multi-vocabulary field, Workbench can't be sure which term name belongs in which of the multiple vocabularies referenced by that field. This applies to both existing terms and to new terms we want to add when creating node content. To avoid this problem, we need to tell Workbench which of the multiple vocabularies each term name should (or does) belong to. We do this by namespacing terms with the applicable vocabulary ID. For example, let's imagine we have a node field whose name is field_sample_tags , and this field references two vocabularies, \"cats\" and \"dogs\". To use the terms Tuxedo , Tabby , German Shepherd in the CSV when adding new nodes, we need to namespace them with vocabulary IDs like this: field_sample_tags cats:Tabby cats:Tuxedo dogs:German Shepherd If you want to use multiple terms in a single field, you would namespace them all: cats:Tuxedo|cats:Misbehaving|dogs:German Shepherd To find the vocabulary ID (referred to above as the \"namespace\") to use, visit the list of your site's vocabularies at admin/structure/taxonomy : Hover your pointer over the \"List terms\" button for each vocabulary to reveal the URL to its overview page. The ID for the vocabulary is the string between \"manage\" and \"overview\" in the URL. For example, in the URL admin/structure/taxonomy/manage/person/overview , the vocabulary ID is \"person\". 
This is the namespace you need to use to indicate which vocabulary to add new terms to. CSV values containing term names that have commas ( , ) in multi-valued, multi-vocabulary fields need to be wrapped in quotation marks (like any CSV value containing a comma), and in addition, the need to specify the namespace within each of the subvalues: \"tags:gum, Bubble|tags:candy, Hard\" Using these conventions, Workbench will be certain which vocabulary the term names belong to. Workbench will remind you during its --check operation that you need to namespace terms. It determines 1) if the field references multiple vocabularies, and then checks to see 2) if the field's values in the CSV are term IDs or term names. If you use term names in multi-vocabulary fields, and the term names aren't namespaced, Workbench will warn you: Error: Term names in multi-vocabulary CSV field \"field_tags\" require a vocabulary namespace; value \"Dogs\" in row 4 does not have one. Note that since : is a special character when you use term names in multi-vocabulary CSV fields, you can't add a namespaced term that itself contains a : . You need to add it manually to Drupal and then use its term ID (or name, or URI) in your CSV file.","title":"Using term names in multi-vocabulary fields"},{"location":"fields/#using-term-uris-instead-of-term-ids","text":"Islandora Workbench lets you use URIs assigned to terms instead of term IDs. You can use a term URI in the media_use_tid configuration option (for example, \"http://pcdm.org/use#OriginalFile\" ) and in taxonomy fields in your metadata CSV file: field_model https://schema.org/DigitalDocument http://purl.org/coar/resource_type/c_18cc During --check , Workbench will validate that URIs correspond to existing taxonomy terms. Using term URIs has some constraints: You cannot create a new term by providing a URI like you can by providing a term name. If the same URI is registered with more than one term, Workbench will choose one and write a warning to the log indicating which term it chose and which terms the URI is registered with. However, --check will detect that a URI is registered with more than one term and warn you. At that point you can edit your CSV file to use the correct term ID rather than the URI.","title":"Using term URIs instead of term IDs"},{"location":"fields/#using-numbers-as-term-names","text":"If you want to use a term name like \"1990\" in your CSV, you need to tell Workbench to not interpret that term name as a term ID. To do this, add a list of CSV columns to your config file using the columns_with_term_names config setting that will only contain term names (and not IDs or URIs, as explained next): columns_with_term_names: - field_subject - field_tags If you register a column name here, it can contain only terms names. Any term ID or URIs will be interpreted as term names. Note that this is only necessary if your term names are comprised entirely of integers. If they contain decimals (like \"1990.3\"), Workbench will not interpret them as term IDs and you will not need to tell Workbench to do otherwise.","title":"Using numbers as term names"},{"location":"fields/#entity-reference-views-fields","text":"Islandora Workbench fully supports taxonomy reference fields that use the \"Default\" reference type, but only partially supports \"Views: Filter by an entity reference View\" taxonomy reference fields. To populate this type of entity reference in \"create\" and \"update\" tasks, you have two options. 
Warning Regardless of whether you use term IDs or term names in your CSV, Workbench will not validate values in \"Views: Filter by an entity reference View\" taxonomy reference fields. Term IDs or term names that are not in the referenced View will result in the node not being created or updated (Drupal will return a 422 response). However, if allow_adding_terms is set to true , terms that are not in the referenced vocabulary will be added to the vocabulary if your CSV data contains a vocabulary ID/namespace in the form vocabid:newterm . The terms will be added regardless of whether they are within the referenced View. Therefore, for this type of Drupal field, you should not include vocabulary IDs/namespaces in your CSV data for that field. Further work on supporting this type of field is being tracked in this Github issue . First, if your input CSV contains only term IDs in this type of column, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Alternatively, if you prefer to use term names instead of term IDs in CSV column for this type of field, you will need to create a special Display for the View that is refererenced from that field. To do this: In the View that is referenced, duplicate the view display as a REST Export display. When making the changes to the resulting REST Export display, be careful to modify that display only (and not \"All displays\") by alaways choosing \"This rest_export (override)\" during every change. Format: Serializer Settings: json Path: some_path (do not include the leading / ) Authentication: Basic Auth Access restrictions: Role > View published content (the default; \"administrator vocabularies and terms\" is needed for other endpoints used by Workbench, but this view doesn't require this.) Filter Criteria: Add \"Taxonomy term / name\" from the list of fields Expose this filter Choose the \"Is equal to\" operator Leave the Value field empty In the Field identifier field, enter \"name\" Your configuration for the new \"Taxonomy term: name\" filter should look like this: Then in your Workbench configuration file, using the entity_reference_view_endpoints setting, provide a mapping between columns in your CSV file and the applicable Views REST Export display path value (configured in step 3 above). In this example, we define three field/path mappings: entity_reference_view_endpoints: - field_linked_agent: /taxonomy/linked_agents - field_language: /language/lookup - field_subject: /taxonomy/subjects During \"create\" and \"update\" tasks, --check will tell you if the View REST Export display path is accesible.","title":"Entity Reference Views fields"},{"location":"fields/#typed-relation-fields","text":"Typed relation fields contain information about the relationship (or \"relation\") between a taxonomy term and the node it is attached to. For example, a term from the Person vocabulary, \"Jordan, Mark\", can be an author, illustrator, or editor of the book described in the node. In this example, \"author\", \"illustrator\", and \"editor\" are the typed relations. Note Although Islandora supports Typed Relation fields that allow adding relations to other nodes, currently Workbench only supports adding relations to taxonomies. If you need support for adding Typed Relations to other entities, please leave a comment on this issue . 
The Controlled Access Terms module allows the relations to be sets of terms from external authority lists (for example like the MARC Relators list maintained by the Library of Congress). Within a Typed Relation field's configuration, the configured relations look like this: In this screenshot, \"relators\" is a namespace for the MARC Relators authority list, the codes \"acp\", \"adi\", etc are the codes for each relator, and \"Art copyist\", \"Art director\", etc. are the human-readable label for each relator. Within the edit form of a node that has a Typed Relation field, the user interface adds a select list of the relation (the target taxonomy term here is \"Jordan, Mark (30))\", like this: To be able to populate Typed Relation fields using CSV data with the three pieces of required data (authority list, relation type, target term), Islandora Workbench supports CSV values that contain the corresponding namespace, relator code, and taxonomy term ID, each separated by a colon ( : ), like this: relators:art:30 In this example CSV value, relators is the namespace that the relation type art is from (the Library of Congress Relators vocabulary), and the target taxonomy term ID is 30 . Note Note that the structure required for typed relation values in the CSV file is not the same as the structure of the relations configuration depicted in the screenshot of the \"Available Relations\" list above. A second option for populating Typed Relation fields is to use taxonomy term names (as opposed to term IDs) as targets: \"relators:art:Jordan, Mark\" Warning In the next few paragraphs, the word \"namespace\" is used to describe two different kinds of namespaces - first, a vocabulary ID in the local Drupal and second, an ID for the external authority list of relators, for example by the Library of Congress. As we saw in the \"Using term names in multi-vocabulary fields\" section above, if the field that we are populating references multiple vocabularies, we need to tell Drupal which vocabulary we are referring to with a local vocabulary namespace. To add a local vocabulary namespace to Typed Relation field CSV structure, we prepend it to the term name, like this (note the addition of \"person\"): \"relators:art:person:Jordan, Mark\" (In this example, relators is the external authority lists's namespace, and person is the local Drupal vocabulary namespace, prepended to the taxonomy term name, \"Jordan, Mark\".) If this seems confusing and abstruse, don't worry. Running --check will tell you that you need to add the Drupal vocabulary namespace to values in specific CSV columns. The final option for populating Typed Relation field is to use HTTP URIs as typed relation targets: relators:art:http://markjordan.net If you want to include multiple typed relation values in a single field of your CSV file (such as in \"field_linked_agent\"), separate the three-part values with the same subdelimiter character you use in other fields, e.g. ( | ) (or whatever you have configured as your subdelimiter ): relators:art:30|relators:art:45 or \"relators:art:person:Jordan, Mark|relators:art:45\"","title":"Typed Relation fields"},{"location":"fields/#adding-new-typed-relation-targets","text":"Islandora Workbench allows you to add new typed relation targets while creating and updating nodes. These targets are taxonomy terms. Your configuration file must include the allow_adding_terms: true option to add new targets. 
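For example, a minimal create task configuration that permits new target terms might look like this sketch (the host, username, password, and input_csv values are placeholders to replace with your own):
task: create
host: https://islandora.example.com
username: admin
password: secret
input_csv: input_data/metadata.csv
allow_adding_terms: true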
In general, adding new typed relation targets is just like adding new taxonomy terms as described above in the \"Taxonomy relation fields\" section. An example of a CSV value that adds a new target term is: \"relators:art:person:Jordan, Mark\" You can also add multiple new targets: \"relators:art:person:Annez, Melissa|relators:art:person:Jordan, Mark\" Note that: For multi-vocabulary fields, new typed relator targets must be accompanied by a vocabulary namespace ( person in the above examples). You cannot add new relators (e.g. relators:foo ) in your CSV file, only new target terms. Note Adding the typed relation namespace, relators, and vocabulary names is a major hassle. If this information is the same for all values (in all rows) in your field_linked_agent column (or any other typed relation field), you can use CSV value templates to reduce the tedium.","title":"Adding new typed relation targets"},{"location":"fields/#edtf-fields","text":"Running Islandora Workbench with --check will validate Extended Date/Time Format (EDTF) Specification dates (Levels 0, 1, and 2) in EDTF fields. Some common examples include: Type Examples Date 1976-04-23 1976-04 Qualified date 1976? 1976-04~ 1976-04-24% Date and time 1985-04-12T23:20:30 Interval 1964/2008 2004-06/2006-08 2004-06-04/2006-08-01 2004-06/2006-08-01 Set [1667,1668,1670..1672] [1672..1682] [1672,1673] [..1672] [1672..] Subvalues in multivalued CSV fields are validated separately, e.g. if your CSV value is 2004-06/2006-08|2007-01/2007-04 , 2004-06/2006-08 and 2007-01/2007-04 are validated separately. Note EDTF supports a very wide range of specific and general dates, and in some cases, valid dates can look counterintuitive. For example, \"2001-34\" is valid (it's Sub-Year Grouping meaning 2nd quarter of 2001).","title":"EDTF fields"},{"location":"fields/#link-fields","text":"The link field type stores URLs (e.g. https://acme.com ) and link text in separate data elements. To add or update fields of this type, Workbench needs to provide the URL and link text in the structure Drupal expects. To accomplish this within a single CSV field, we separate the URL and link text pairs in CSV values with double percent signs ( %% ), like this: field_related_websites http://acme.com%%Acme Products Inc. You can include multiple pairs of URL/link text pairs in one CSV field if you separate them with the subdelimiter character: field_related_websites http://acme.com%%Acme Products Inc.|http://diy-first-aid.net%%DIY First Aid The URL is required, but the link text is not. If you don't have or want any link text, omit it and the double quotation marks: field_related_websites http://acme.com field_related_websites http://acme.com|http://diy-first-aid.net%%DIY First Aid","title":"Link fields"},{"location":"fields/#authority-link-fields","text":"The authority link field type stores abbreviations for authority sources (i.e., external controlled vocabularies such as national name authorities), authority URIs (e.g. http://viaf.org/viaf/153525475 ) and link text in separate data elements. Authority link fields are most commonly used on taxonomy terms, but can be used on nodes as well. To add or update fields of this type, Workbench needs to provide the authority source abbreviation, URI and link text in the structure Drupal expects. 
To accomplish this within a single CSV field, we separate the three parts in CSV values with double percent signs ( %% ), like this: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record You can include multiple triplets of source abbreviation/URL/link text in one CSV field if you separate them with the subdelimiter character: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807%%VIAF Record|other%%https://github.com/mjordan%%Github The authority source abbreviation and the URI are required, but the link text is not. If you don't have or want any link text, omit it: field_authority_vocabs viaf%%http://viaf.org/viaf/10646807 field_authority_vocabs viaf%%http://viaf.org/viaf/10646807|other%%https://github.com/mjordan%%Github","title":"Authority link fields"},{"location":"fields/#geolocation-fields","text":"The Geolocation field type, managed by the Geolocation Field contrib module, stores latitude and longitude coordinates in separate data elements. To add or update fields of this type, Workbench needs to provide the latitude and longitude data in these separate elements. To simplify entering geocoordinates in the CSV file, Workbench allows geocoordinates to be in lat,long format, i.e., the latitude coordinate followed by a comma followed by the longitude coordinate. When Workbench reads your CSV file, it will split data on the comma into the required lat and long parts. An example of a single geocoordinate in a field would be: field_coordinates \"49.16667,-123.93333\" You can include multiple pairs of geocoordinates in one CSV field if you separate them with the subdelimiter character: field_coordinates \"49.16667,-123.93333|49.25,-124.8\" Note that: Geocoordinate values in your CSV need to be wrapped in double quotation marks, unless the delimiter key in your configuration file is set to something other than a comma. If you are entering geocoordinates into a spreadsheet, you may need to escape leading + and - signs since they will make the spreadsheet application think you are entering a formula. You can work around this by escaping the + and - with a backslash ( \\ ), e.g., 49.16667,-123.93333 should be \\+49.16667,-123.93333 , and 49.16667,-123.93333|49.25,-124.8 should be \\+49.16667,-123.93333|\\+49.25,-124.8 . Workbench will strip the leading \\ before it populates the Drupal fields. Excel: leading + and - need to be escaped Google Sheets: only + needs to be escaped LibreOffice Calc: neither + nor - needs to be escaped","title":"Geolocation fields"},{"location":"fields/#paragraphs-entity-reference-revisions-fields","text":"Entity Reference Revisions fields are similar to Drupal's core Entity Reference fields used for taxonomy terms, but are used for entities that are not intended to be referenced outside the context of the item that references them. For Islandora sites, this is used for Paragraph entities. In order to populate paragraph entities using Workbench in create and update tasks, you need to enable and configure the REST endpoints for paragraphs. To do this, ensure the REST UI module is enabled, then go to Configuration/Web Services/REST Resources ( /admin/config/services/rest ) and enable \"Paragraph\".
Then edit the settings for Paragraph to the following: Granularity = Method GET formats: jsonld, json authentication: jwt_auth, basic_auth, cookie POST formats: json authentication: jwt_auth, basic_auth, cookie DELETE formats: json authentication: jwt_auth, basic_auth, cookie PATCH formats: json authentication: jwt_auth, basic_auth, cookie Note Paragraphs are locked down from REST updates by default. To add new paragraph values and update existing ones, you must enable the paragraphs_type_permissions submodule and ensure the Drupal user in your configuration file has sufficient privileges granted at /admin/people/permissions/module/paragraphs_type_permissions . Paragraphs are handy for Islandora as they provide flexibility in the creation of more complex metadata (such as complex titles, typed notes, or typed identifiers) by adding Drupal fields to paragraph entities, unlike the Typed Relation field, which hard-codes properties. However, this flexibility makes creating a Workbench import more complicated and, as such, requires additional configuration. For example, suppose you have a \"Full Title\" field ( field_full_title ) on your Islandora Object content type referencing a paragraph type called \"Complex Title\" ( complex_title ) that contains \"main title\" ( field_main_title ) and \"subtitle\" ( field_subtitle ) text fields. The input CSV would look like: field_full_title My Title: A Subtitle|Alternate Title In this example we have two title values, \"My Title: A Subtitle\" (where \"My Title\" is the main title and \" A Subtitle\" is the subtitle) and \"Alternate Title\" (which only has a main title). To map these CSV values to our paragraph fields, we need to add the following to our configuration file: paragraph_fields: node: field_full_title: type: complex_title field_order: - field_main_title - field_subtitle field_delimiter: ':' subdelimiter: '|' This configuration defines the paragraph field on the node ( field_full_title ) and its child fields ( field_main_title and field_subtitle ), which occur within the paragraph in the order they are named in the field_order property. Within the data in the CSV column, the values corresponding to the order of those fields are separated by the character defined in the field_delimiter property. subdelimiter here is the same as the subdelimiter configuration setting used in non-paragraph multi-valued fields; in this example it overrides the default or globally configured value. We use a colon for the field delimiter in this example as it is often used in titles to denote subtitles. Note that in the above example, the space before \"A\" in the subtitle will be preserved. Whether or not you want a space there in your data will depend on how you display the Full Title field. Warning Note that Workbench assumes all fields within a paragraph are single-valued. When using Workbench to update paragraphs using update_mode: replace , any null values for fields within the paragraph (such as the null subtitle in the second \"Alternate Title\" instance in the example) will null out existing field values.
However, considering each paragraph as a whole field value, Workbench behaves the same as for all other fields - update_mode: replace will replace all paragraph entities with the ones in the CSV, but if the CSV does not contain any values for this field then the field will be left as is.","title":"Paragraphs (Entity Reference Revisions fields)"},{"location":"fields/#values-in-the-field_member_of-column","text":"The field_member_of column can take a node ID, a full URL to a node, or a URL alias. For instance, all of these refer to the same node and can be used in field_member_of : 648 (node ID) https://islandora.traefik.me/node/648 (full URL) https://islandora.traefik.me/mycollection (full URL using an alias) /mycollection (URL alias) If you use any of these types of values other than the bare node ID, Workbench will look up the node ID based on the URL or alias.","title":"Values in the \"field_member_of\" column"},{"location":"fields/#values-in-the-field_domain_access-column","text":"The Domain Access module, part of the Domain suite of modules, creates a required, multivalued field with the machine name field_domain_access that controls, at the node level, which domains the node shows up in. When populating this field in your Workbench CSV, replace the periods in domain names with _ . For example, if the domains you want to allow a node to show up in are test1.testing.edu and test2.testing.edu , the values in your field_domain_access field look like this: test1_testing_edu test1_testing_edu test1_testing_edu|test2_testing_edu test2_testing_edu","title":"Values in the \"field_domain_access\" column"},{"location":"fixity/","text":"Transmission fixity Islandora Workbench enables transmission fixity validation, which means it can detect when files are not ingested into Islandora intact, in other words, that the files became corrupted during ingest. It does this by generating a checksum (a.k.a. \"hash\") for each file before it is ingested, and then after the file is ingested, Workbench asks Drupal for a checksum on the file generated using the same hash algorithm. If the two checksums are identical, Workbench has confirmed that the file was not corrupted during the ingest process. If they are not identical, the file became corrupted. This functionality is available within create and add_media tasks. Only files named in the file CSV column are checked. To enable this feature, include the fixity_algorithm option in your create or add_media configuration file, specifying one of \"md5\", \"sha1\", or \"sha256\" hash algorithms. For example, to use the \"md5\" algorithm, include the following in your config file: fixity_algorithm: md5 Comparing checksums to known values Comparison to known checksum values can be done both during the transmission fixity check, above, and during Workbench's --check phase, as described below. If you want Workbench to compare the checksum it generates for a file to a known checksum value (for example, one generated by a platform you are migrating from, or during some other phase of your migration workflow), include a checksum column in your create or add_media CSV input file. No further configuration other than indicating the fixity_algorithm as described above is necessary. If the checksum column is present, Workbench will compare the hash it generates with the value in that column and report matches and mismatches. Note that the checksum in your CSV must have been generated using the same algorithm specified in your fixity_algorithm configuration setting. 
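As an illustrative sketch (the file names and hash values below are invented, not taken from this documentation), an input CSV that provides pregenerated md5 checksums for comparison might look like: id,title,file,checksum 0001,First item,first.tif,9e107d9d372bb6826bd81d3542a419d6 0002,Second item,second.tif,e4d909c290d0fb1ca068ffaddf22cbd0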
Validating checksums during --check If you have pregenerated checksum values for your files (as described in the \"Comparing checksums to known values\" section, above), you can tell Workbench to compare those checksums with checksums during its --check phase. To do this, include the following options in your create or add_media configuration file: fixity_algorithm: md5 validate_fixity_during_check: true You must also include both a file and checksum column in your input CSV, and ensure that the checksums in the CSV column were generated using the algorithm named in the fixity_algorithm setting. Results of the checks are written to the log file. Some things to note: Fixity checking is currently only available to files named in the file CSV column, and not in any \" additional files \" columns. For the purposes of fixity verification, md5 is sufficient. Using it is also faster than either sha1 or sha256. However, you will need to use sha1 or sha256 if your pregenerated checksums were created using those algorithms. If you are including pregenerated checksum values in your CSV file (in the checksum column), the checksums must have been generated using the same has algorithm indicated in your fixity_algorithm configuration setting: \"md5\", \"sha1\", or \"sha256\". If the existing checksums were generated using a different algorithm, all of your checksum comparisons will fail. Workbench logs the outcome of all checksum comparisons, whether they result in matches or mismatches. If there is a mismatch, Workbench will continue to ingest the file and create the accompanying media. For this reason, it is prudent to perform your checksum validation during the --check phase. If any comparisons fail, you have an opportunity to replace the files before committing to ingesting them into Drupal. Validation during --check happens entirely on the computer running Workbench. During --check , Workbench does not query Drupal for the purposes of checksum validation, since the files haven't yet been ingested into Islandora at that point. Fixity checking slows Workbench down (and also Drupal if you perform transmission fixity checks) to a certain extent, especially when files are large. This is unavoidable since calculating a file's checksum requires reading it into memory.","title":"Fixity checking"},{"location":"fixity/#transmission-fixity","text":"Islandora Workbench enables transmission fixity validation, which means it can detect when files are not ingested into Islandora intact, in other words, that the files became corrupted during ingest. It does this by generating a checksum (a.k.a. \"hash\") for each file before it is ingested, and then after the file is ingested, Workbench asks Drupal for a checksum on the file generated using the same hash algorithm. If the two checksums are identical, Workbench has confirmed that the file was not corrupted during the ingest process. If they are not identical, the file became corrupted. This functionality is available within create and add_media tasks. Only files named in the file CSV column are checked. To enable this feature, include the fixity_algorithm option in your create or add_media configuration file, specifying one of \"md5\", \"sha1\", or \"sha256\" hash algorithms. 
For example, to use the \"md5\" algorithm, include the following in your config file: fixity_algorithm: md5","title":"Transmission fixity"},{"location":"fixity/#comparing-checksums-to-known-values","text":"Comparison to known checksum values can be done both during the transmission fixity check, above, and during Workbench's --check phase, as described below. If you want Workbench to compare the checksum it generates for a file to a known checksum value (for example, one generated by a platform you are migrating from, or during some other phase of your migration workflow), include a checksum column in your create or add_media CSV input file. No further configuration other than indicating the fixity_algorithm as described above is necessary. If the checksum column is present, Workbench will compare the hash it generates with the value in that column and report matches and mismatches. Note that the checksum in your CSV must have been generated using the same algorithm specified in your fixity_algorithm configuration setting.","title":"Comparing checksums to known values"},{"location":"fixity/#validating-checksums-during-check","text":"If you have pregenerated checksum values for your files (as described in the \"Comparing checksums to known values\" section, above), you can tell Workbench to compare those checksums with checksums during its --check phase. To do this, include the following options in your create or add_media configuration file: fixity_algorithm: md5 validate_fixity_during_check: true You must also include both a file and checksum column in your input CSV, and ensure that the checksums in the CSV column were generated using the algorithm named in the fixity_algorithm setting. Results of the checks are written to the log file. Some things to note: Fixity checking is currently only available to files named in the file CSV column, and not in any \" additional files \" columns. For the purposes of fixity verification, md5 is sufficient. Using it is also faster than either sha1 or sha256. However, you will need to use sha1 or sha256 if your pregenerated checksums were created using those algorithms. If you are including pregenerated checksum values in your CSV file (in the checksum column), the checksums must have been generated using the same has algorithm indicated in your fixity_algorithm configuration setting: \"md5\", \"sha1\", or \"sha256\". If the existing checksums were generated using a different algorithm, all of your checksum comparisons will fail. Workbench logs the outcome of all checksum comparisons, whether they result in matches or mismatches. If there is a mismatch, Workbench will continue to ingest the file and create the accompanying media. For this reason, it is prudent to perform your checksum validation during the --check phase. If any comparisons fail, you have an opportunity to replace the files before committing to ingesting them into Drupal. Validation during --check happens entirely on the computer running Workbench. During --check , Workbench does not query Drupal for the purposes of checksum validation, since the files haven't yet been ingested into Islandora at that point. Fixity checking slows Workbench down (and also Drupal if you perform transmission fixity checks) to a certain extent, especially when files are large. 
This is unavoidable since calculating a file's checksum requires reading it into memory.","title":"Validating checksums during --check"},{"location":"generating_csv_files/","text":"Generating CSV Files Islandora Workbench can generate several different CSV files you might find useful. CSV file templates Note This section describes creating CSV file templates. For information on CSV field templates, see the \" Using CSV field templates \" section. You can generate a template CSV file by running Workbench with the --get_csv_template argument: ./workbench --config config.yml --get_csv_template With this option, Workbench will fetch the field definitions for the content type named in your configuration's content_type option and save a CSV file with a column for each of the content type's fields. You can then populate this template with values you want to use in a create or update task. The template file is saved in the directory indicated in your configuration's input_dir option, using the filename defined in input_csv with .csv_file_template appended. The template also contains three additional rows: human-readable label whether or not the field is required in your CSV for create tasks sample data number of values allowed (either a specific maximum number or 'unlimited') the name of the section in the documentation covering the field type Here is a screenshot of this CSV file template loaded into a spreadsheet application: Note that the first column, and all the rows other than the field machine names, should be deleted before you use a populated version of this CSV file in a create task. Also, you can remove any columns you do not intend on populating during the create task: CSV file containing a row for every newly created node In some situations, you may want to create stub nodes that only have a small subset of fields, and then populate the remaining fields later. To facilitate this type of workflow, Workbench provides an option to generate a simple CSV file containing a record for every node created during a create task. This file can then be used later in update tasks to add additional metadata or in add_media tasks to add media. You tell Workbench to generate this file by including the optional output_csv setting in your create task configuration file. If this setting is present, Workbench will write a CSV file at the specified location containing one record per node created. This CSV file contains the following fields: id (or whatever column is specified in your id_field setting): the value in your input CSV file's ID field node_id : the node ID for the newly created node uuid : the new node's UUID status : true if the node is published, False if it is unpublished title : the node's title The file will also contain empty columns corresponding to all of the fields in the target content type. An example, generated from a 2-record input CSV file, looks like this (only left-most part of the spreadsheet shown): This CSV file is suitable as a template for subsequent update tasks, since it already contains the node_id s for all the stub nodes plus column headers for all of the fields in those nodes. You can remove from the template any columns you do not want to include in your update task. You can also use the node IDs in this file as the basis for later add_media tasks; all you will need to do is delete the other columns and add a file column containing the new nodes' corresponding filenames. 
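For example (the node IDs and file names here are hypothetical), an add_media input CSV derived from that output CSV could be as small as: node_id,file 152,152-original.tif 153,153-original.tif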
If you want to include in your output CSV all of the fields (and their values) from the input CSV, add output_csv_include_input_csv: true to your configuration file. This option is useful if you want a CSV that contains the node ID and a field such as field_identifier or other fields that contain local identifiers, DOIs, file paths, etc. If you use this option, all the fields from the input CSV are added to the output CSV; you cannot configure which fields are included. CSV file containing field data for existing nodes The export_csv task generates a CSV file that contains one row for each node identified in the input CSV file. The cells of the CSV are populated with data that is consistent with the structures that Workbench uses in update tasks. Using this CSV file, you can: see in one place all of the field values for nodes, which might be useful during quality assurance after a create task modify the data and use it as input for an update task using the update_mode: replace configuration option. The CSV file contains two of the extra rows included in the CSV file template, described above (specifically, the human-readable field label and number of values allowed), and the left-most \"REMOVE THIS COLUMN (KEEP THIS ROW)\" column. To use the file as input for an update task, simply delete the extraneous column and rows. A sample configuration file for an export_csv task is: task: export_csv host: \"http://localhost:8000\" username: admin password: islandora input_csv: nodes_to_export.csv export_csv_term_mode: name content_type: my_custom_content_type # If export_csv_field_list is not present, all fields will be exported. export_csv_field_list: ['title', 'field_description'] # Specifying the output path is optional; see below for more information. export_csv_file_path: output.csv The file identified by input_csv has only one column, \"node_id\": node_id 7653 7732 7653 Some things to note: The output includes data from nodes only, not media. Unless a file path is specified in the export_csv_file_path configuration option, the output CSV file name is the name of the input CSV file (containing node IDs) with \".csv_file_with_field_values\" appended. For example, if your export_csv configuration file defines the input_csv as \"my_export_nodes.csv\", the CSV file created by the task will be named \"my_export_nodes.csv.csv_file_with_field_values\". The file is saved in the directory identified by the input_dir configuration option. You can include either vocabulary term IDs or term names (with accompanying vocabulary namespaces) in the CSV. By default, term IDs are included; to include term names instead, include export_csv_term_mode: name in your configuration file. A single export_csv job can only export nodes that have the content type identified in your Workbench configuration. By default, this is \"islandora_object\". If you include node IDs in your input file for nodes that have a different content type, Workbench will skip exporting their data and log the fact that it has done so. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. Warning Using the export_csv_term_mode: name option will slow down the export, since Workbench must query Drupal to get the name of each term. The more taxonomy or typed relation fields in your content type, the slower the export will be with export_csv_term_mode set to \"name\".
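Assuming the sample configuration above were saved as a file named export.yml (a hypothetical file name), the task is run like any other Workbench task, e.g. ./workbench --config export.yml --check and then, once --check passes, ./workbench --config export.yml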
Using a Drupal View to identify content to export as CSV You can use a new or existing View to tell Workbench what nodes to export into CSV. This is done using a get_data_from_view task. A sample configuration file looks like this: task: get_data_from_view host: \"http://localhost:8000/\" view_path: '/daily_nodes_created_test' username: admin password: islandora content_type: published_work export_csv_file_path: /tmp/islandora_export.csv # If export_csv_field_list is not present, all fields will be exported. # node_id and title are always included. export_csv_field_list: ['field_description', 'field_extent'] # 'view_parameters' is optional, and used only if your View uses Contextual Filters. # In this setting you identify any URL parameters configured as Contextual Filters # for the View. Note that values in the 'view_parameters' configuration setting # are literal parameter=value strings that include the =, not YAML key: value # pairs used elsewhere in the Workbench configuration file. view_parameters: - 'date=20231202' The view_path setting should contain the value of the \"Path\" option in the Views configuration page's \"Path settings\" section. The export_csv_file_path is the location where you want your CSV file saved. In the View configuration page: Add a \"REST export\" display. Under \"Format\" > \"Serializer\" > \"Settings\", choose \"json\". In the View \"Fields\" settings, leave \"The selected style or row format does not use fields\" as is (see explanation below). Under \"Path\", add a path where your REST export will be accessible to Workbench. As noted above, this value is also what you should use in the view_path setting in your Workbench configuration file. Under \"Pager\" > \"Items to display\", choose \"Paged output, mini pager\". In \"Pager options\" choose 10 items to display. Under \"Path settings\" > \"Access\", choose \"Permission\" and \"View published content\". Under \"Authentication\", choose \"basic_auth\" and \"cookie\". Here is a screenshot illustrating these settings: To test your REST export, in your browser, join your Drupal hostname and the \"Path\" defined in your View configuration. Using the values in the configuration file above, that would be http://localhost:8000/daily_nodes_created_test . You should see raw JSON (or formatted JSON if your browser renders JSON to be human readable) that lists the nodes in your View. Warning If your View includes nodes that you do not want to be seen by anonymous users, or if it contains unpublished nodes, adjust the access permissions settings appropriately, and ensure that the user identified in your Workbench configuration file has sufficient permissions. You can optionally configure your View to use a single Contextual Filter, and expose that Contextual Filter to use one or more query parameters. This way, you can include each query parameter's name and its value in your configuration file using Workbench's view_parameters config setting, as illustrated in the sample configuration file above. The configuration in the View's Contextual Filters for this type of parameter looks like this: By adding a Contextual Filter to your View display, you can control what nodes end up in the output CSV by including the value you want to filter on in your Workbench configuration's view_parameters setting.
In the screenshot of the \"Created date\" Contextual Filter shown here, the query parameter is date , so you include that parameter in your view_parameters list in your configuration file along with the value you want to assign to the parameter (separated by an = sign), e.g.: view_parameters: - 'date=20231202' will set the value of the date query parameter in the \"Created date\" Contextual Filter to \"20231202\". Some things to note: Note that the values you include in view_parameters apply only to your View's Contextual Filter. Any \"Filter Criteria\" you include in the main part of your View configuration also take effect. In other words, both \"Filter Criteria\" and \"Contextual Filters\" determine what nodes end up in your output CSV file. You can only include a single Contextual Filter in your View, but it can have multiple query parameters. REST export Views displays don't use fields in the same way that other Views displays do. In fact, Drupal says within the Views user interface that for REST export displays, \"The selected style or row format does not use fields.\" Instead, these displays export the entire node in JSON format. Workbench iterates through all fields on the node JSON that start with field_ and includes those fields, plus node_id and title , in the output CSV. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. Only content from nodes that have the content type identified in the content_type configuration setting will be written to the CSV file. If you want to export term names instead of term IDs, include export_csv_term_mode: name in your configuration file. The warning about this option slowing down the export applies to this task and the export_csv task. Using a Drupal View to generate a media report as CSV You can get a report of which media a set of nodes has using a View. This report is generated using a get_media_report_from_view task, and the View configuration it uses is the same as the View configuration described above (in fact, you can use the same View with both get_data_from_view and get_media_report_from_view tasks). A sample Workbench configuration file looks like: task: get_media_report_from_view host: \"http://localhost:8000/\" view_path: daily_nodes_created username: admin password: islandora export_csv_file_path: /tmp/media_report.csv # view_paramters is optinal, and used only if your View uses Contextual Filters. view_parameters: - 'date=20231201' The output contains colums for Node ID, Title, Content Type, Islandora Model, and Media. For each node in the View, the Media column contains the media use terms for all media attached to the node separated by semicolons, and sorted alphabetically: Exporting image, video, etc. files along with CSV data In export_csv and get_data_from_view tasks, you can optionally export media files. To do this, add the following settings to your configuration file: export_file_directory : Required. This is the path to the directory where Workbench will save the exported files. export_file_media_use_term_id : Optional. This setting tells Workbench which Islandora Media Use term to use to identify the file to export. Defaults to http://pcdm.org/use#OriginalFile (for Original File). Can be either a term ID or a term URI. Note that currently only a single file per node can be exported, and that files need to be accessible to the anonymous Drupal user to be exported. 
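As a hedged sketch (the output directory is a placeholder), the file-export settings added to an export_csv or get_data_from_view configuration file might look like: export_file_directory: /tmp/exported_files export_file_media_use_term_id: 'http://pcdm.org/use#OriginalFile'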
Using the CSV ID to node ID map By default, Workbench maintains a simple database that maps values in your CSV's ID column (or whatever column you define in the id_field config setting) to node IDs created in create tasks. Note You do not need to install anything extra for Workbench to create this database. Workbench provides a utility script, manage_csv_to_node_id_map.py (described below), for exporting and pruning the data. You only need to install the sqlite3 client or a third-party utility if you want to access the database in ways that surpass the capabilities of the manage_csv_to_node_id_map.py script. A useful third-party tool for viewing and modifying SQLite databases is DB Browser for SQLite . Here is a sample screenshot illustrating the CSV to node ID map database table in DB Browser for SQLite (CSV ID and node ID are the two right-most columns): Workbench optionally uses this database to determine the node ID of parent nodes when creating paged and compound content, so, for example, you can use parent_id values in your input CSV that refer to parents created in earlier Workbench sessions. But, you may find other uses for this data. Since it is stored in an SQLite database, it can be queried using SQL, or can be dumped into a CSV file using the dump_id_map.py script provided in Workbench's scripts directory. Note In create_from_files tasks, which don't use an input CSV file, the filename is recorded instead of an \"id\". One configuration setting applies to this feature, csv_id_to_node_id_map_path . By default, its value is [your temporary directory]/csv_id_to_node_id_map.db (see the temp_dir config setting's documentation for more information on where that directory is). This default can be overridden in your config file. If you want to disable population of this database completely, set csv_id_to_node_id_map_path to false . Warning Some systems clear out their temporary directories on restart. You may want to define the absolute path to your ID map database in the csv_id_to_node_id_map_path configuration setting so it is stored in a location that will not get deleted on system restart. The SQLite database contains one table, \"csv_id_to_node_id_map\". This table has six columns: timestamp : the current timestamp in yyyy-mm-dd hh:mm:ss format or a truncated version of that format config_file : the name of the Workbench configuration file active when the row was added parent_csv_id : if the node was created along with its parent, the parent's CSV ID parent_node_id : if the node was created along with its parent, the parent's node ID csv_id : the value in the node's CSV ID field (or in create_from_files tasks, which don't use an input CSV file, the filename) node_id : the node's Drupal node ID If you don't want to query the database directly, you can use scripts/manage_csv_to_node_id_map.py to: Export the contents of the entire database to a CSV file. To do this, in the Workbench directory, run the script, specifying the path to the database file and the path to the CSV output: python scripts/manage_csv_to_node_id_map.py --db_path /tmp/csv_id_to_node_id_map.db --csv_path /tmp/output.csv Export the rows that have duplicate (i.e., identical) CSV ID values, or duplicate values in any specific field. To limit the rows that are dumped to those with duplicate values in a specific database field, add the --nonunique argument and the name of the field, e.g., --nonunique csv_id . The resulting CSV will only contain those entries from your database.
Delete entries from the database that have a specific value in their config_file column. To do this, provide the script with the --remove_entries_with_config_files argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_with_config_files create.yml . You can also specify a comma-separated list of config file names (for example --remove_entries_with_config_files create.yml,create_testing.yml ) to remove all entries with those config file names with one command. Delete entries from the database created before a specific date. To do this, provide the script with the --remove_entries_before argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_before \"2023-05-29 19:17\" . Delete entries from the database created after a specific date. To do this, provide the script with the --remove_entries_after argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_after \"2023-05-29 19:17\" . The value of the --remove_entries_before and --remove_entries_after arguments is a date string that can take the form yyyy-mm-dd hh:mm:ss or any truncated version of that format, e.g. yyyy-mm-dd hh:mm , yyyy-mm-dd hh , or yyyy-mm-dd . Any rows in the database table that have a timestamp value that matches the date value will be deleted from the database. Note that if your timestamp value has a space in it, you need to wrap it in quotation marks as illustrated above; if you don't, the script will delete all the entries matching the part of the timestamp value before the space, in other words, all entries from that day.","title":"Generating CSV files"},{"location":"generating_csv_files/#generating-csv-files","text":"Islandora Workbench can generate several different CSV files you might find useful.","title":"Generating CSV Files"},{"location":"generating_csv_files/#csv-file-templates","text":"Note This section describes creating CSV file templates. For information on CSV field templates, see the \" Using CSV field templates \" section. You can generate a template CSV file by running Workbench with the --get_csv_template argument: ./workbench --config config.yml --get_csv_template With this option, Workbench will fetch the field definitions for the content type named in your configuration's content_type option and save a CSV file with a column for each of the content type's fields. You can then populate this template with values you want to use in a create or update task. The template file is saved in the directory indicated in your configuration's input_dir option, using the filename defined in input_csv with .csv_file_template appended. The template also contains five additional rows: human-readable label whether or not the field is required in your CSV for create tasks sample data number of values allowed (either a specific maximum number or 'unlimited') the name of the section in the documentation covering the field type Here is a screenshot of this CSV file template loaded into a spreadsheet application: Note that the first column, and all the rows other than the field machine names, should be deleted before you use a populated version of this CSV file in a create task.
Also, you can remove any columns you do not intend on populating during the create task:","title":"CSV file templates"},{"location":"generating_csv_files/#csv-file-containing-a-row-for-every-newly-created-node","text":"In some situations, you may want to create stub nodes that only have a small subset of fields, and then populate the remaining fields later. To facilitate this type of workflow, Workbench provides an option to generate a simple CSV file containing a record for every node created during a create task. This file can then be used later in update tasks to add additional metadata or in add_media tasks to add media. You tell Workbench to generate this file by including the optional output_csv setting in your create task configuration file. If this setting is present, Workbench will write a CSV file at the specified location containing one record per node created. This CSV file contains the following fields: id (or whatever column is specified in your id_field setting): the value in your input CSV file's ID field node_id : the node ID for the newly created node uuid : the new node's UUID status : true if the node is published, False if it is unpublished title : the node's title The file will also contain empty columns corresponding to all of the fields in the target content type. An example, generated from a 2-record input CSV file, looks like this (only left-most part of the spreadsheet shown): This CSV file is suitable as a template for subsequent update tasks, since it already contains the node_id s for all the stub nodes plus column headers for all of the fields in those nodes. You can remove from the template any columns you do not want to include in your update task. You can also use the node IDs in this file as the basis for later add_media tasks; all you will need to do is delete the other columns and add a file column containing the new nodes' corresponding filenames. If you want to include in your output CSV all of the fields (and their values) from the input CSV, add output_csv_include_input_csv: true to your configuration file. This option is useful if you want a CSV that contains the node ID and a field such as field_identifier or other fields that contain local identifiers, DOIs, file paths, etc. If you use this option, all the fields from the input CSV are added to the output CSV; you cannot configure which fields are included.","title":"CSV file containing a row for every newly created node"},{"location":"generating_csv_files/#csv-file-containing-field-data-for-existing-nodes","text":"The export_csv task generates a CSV file that contains one row for each node identified in the input CSV file. The cells of the CSV are populated with data that is consistent with the structures that Workbench uses in update tasks. Using this CSV file, you can: see in one place all of the field values for nodes, which might be useful during quality assurance after a create task modify the data and use it as input for an update task using the update_mode: replace configuration option. The CSV file contains two of the extra rows included in the CSV file template, described above (specifically, the human-readable field label and number of values allowed), and the left-most \"REMOVE THIS COLUMN (KEEP THIS ROW)\" column. To use the file as input for an update task, simply delete the extraneous column and rows. 
A sample configuration file for an export_csv task is: task: export_csv host: \"http://localhost:8000\" username: admin password: islandora input_csv: nodes_to_export.csv export_csv_term_mode: name content_type: my_custom_content_type # If export_csv_field_list is not present, all fields will be exported. export_csv_field_list: ['title', 'field_description'] # Specifying the output path is optional; see below for more information. export_csv_file_path: output.csv The file identified by input_file has only one column, \"node_id\": node_id 7653 7732 7653 Some things to note: The output includes data from nodes only, not media. Unless a file path is specified in the export_csv_file_path configuration option, the output CSV file name is the name of the input CSV file (containing node IDs) with \".csv_file_with_field_values\" appended. For example, if you export_csv configuration file defines the input_csv as \"my_export_nodes.csv\", the CSV file created by the task will be named \"my_export_nodes.csv.csv_file_with_field_values\". The file is saved in the directory identified by the input_dir configuration option. You can include either vocabulary term IDs or term names (with accompanying vocabulary namespaces) in the CSV. By default, term IDs are included; to include term names instead, include export_csv_term_mode: name in you configuration file. A single export_csv job can only export nodes that have the content type identified in your Workbench configuration. By default, this is \"islandora_object\". If you include node IDs in your input file for nodes that have a different content type, Workbench will skip exporting their data and log the fact that it has done so. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. Warning Using the export_csv_term_mode: name option will slow down the export, since Workbench must query Drupal to get the name of each term. The more taxonomy or typed relation fields in your content type, the slower the export will be with export_csv_term_mode set to \"name\".","title":"CSV file containing field data for existing nodes"},{"location":"generating_csv_files/#using-a-drupal-view-to-identify-content-to-export-as-csv","text":"You can use a new or existing View to tell Workbench what nodes to export into CSV. This is done using a get_data_from_view task. A sample configuration file looks like this: task: get_data_from_view host: \"http://localhost:8000/\" view_path: '/daily_nodes_created_test' username: admin password: islandora content_type: pubished_work export_csv_file_path: /tmp/islandora_export.csv # If export_csv_field_list is not present, all fields will be exported. # node_id and title are always included. export_csv_field_list: ['field_description', 'field_extent'] # 'view_paramters' is optinal, and used only if your View uses Contextual Filters. # In this setting you identify any URL parameters configured as Contextual Filters # for the View. Note that values in the 'view_parameters' configuration setting # are literal parameter=value strings that include the =, not YAML key: value # pairs used elsewhere in the Workbench configuration file. view_parameters: - 'date=20231202' The view_path setting should contain the value of the \"Path\" option in the Views configuration page's \"Path settings\" section. The export_csv_file_path is the location where you want your CSV file saved. In the View configuration page: Add a \"REST export\" display. 
Under \"Format\" > \"Serializer\" > \"Settings\", choose \"json\". In the View \"Fields\" settings, leave \"The selected style or row format does not use fields\" as is (see explanation below). Under \"Path\", add a path where your REST export will be accessible to Workbench. As noted above, this value is also what you should use in the view_path setting in your Workbench configuration file. Under \"Pager\" > \"Items to display\", choose \"Paged output, mini pager\". In \"Pager options\" choose 10 items to display. Under \"Path settings\" > \"Access\", choose \"Permission\" and \"View published content\". Under \"Authentication\", choose \"basic_auth\" and \"cookie\". Here is a screenshot illustrating these settings: To test your REST export, in your browser, join your Drupal hostname and the \"Path\" defined in your View configuration. Using the values in the configuration file above, that would be http://localhost:8000/workbench-export-test . You should see raw JSON (or formatted JSON if your browser renders JSON to be human readable) that lists the nodes in your View. Warning If your View includes nodes that you do not want to be seen by anonymous users, or if it contains unpublished nodes, adjust the access permissions settings appropriately, and ensure that the user identified in your Workbench configuration file has sufficient permissions. You can optionally configure your View to use a single Contextual Filters, and expose that Contextual Filter to use one or more query parameters. This way, you can include each query parameter's name and its value in your configuration file using Workbench's view_parameters config setting, as illustrated in the sample configuration file above. The configuration in the View's Contextual Filters for this type of parameter looks like this: By adding a Contextual Filterf to your View display, you can control what nodes end up in the output CSV by including the value you want to filter on in your Workbench configuration's view_parameters setting. In the screenshot of the \"Created date\" Contextual Filter shown here, the query parameter is date , so you include that parameter in your view_parameters list in your configuration file along with the value you want to assign to the parameter (separated by an = sign), e.g.: view_parameters: - 'date=20231202' will set the value of the date query parameter in the \"Created date\" Contextual Filter to \"20231202\". Some things to note: Note that the values you include in view_parameters apply only to your View's Contextual Filter. Any \"Filter Criteria\" you include in the main part of your View configuration also take effect. In other words, both \"Filter Criteria\" and \"Contextual Filters\" determine what nodes end up in your output CSV file. You can only include a single Contextual Filter in your View, but it can have multiple query parameters. REST export Views displays don't use fields in the same way that other Views displays do. In fact, Drupal says within the Views user interface that for REST export displays, \"The selected style or row format does not use fields.\" Instead, these displays export the entire node in JSON format. Workbench iterates through all fields on the node JSON that start with field_ and includes those fields, plus node_id and title , in the output CSV. If you don't want to export all the fields on a content type, you can list the fields you want to export in the export_csv_field_list configuration option. 
Only content from nodes that have the content type identified in the content_type configuration setting will be written to the CSV file. If you want to export term names instead of term IDs, include export_csv_term_mode: name in your configuration file. The warning about this option slowing down the export applies to this task and the export_csv task.","title":"Using a Drupal View to identify content to export as CSV"},{"location":"generating_csv_files/#using-a-drupal-view-to-generate-a-media-report-as-csv","text":"You can get a report of which media a set of nodes has using a View. This report is generated using a get_media_report_from_view task, and the View configuration it uses is the same as the View configuration described above (in fact, you can use the same View with both get_data_from_view and get_media_report_from_view tasks). A sample Workbench configuration file looks like: task: get_media_report_from_view host: \"http://localhost:8000/\" view_path: daily_nodes_created username: admin password: islandora export_csv_file_path: /tmp/media_report.csv # view_paramters is optinal, and used only if your View uses Contextual Filters. view_parameters: - 'date=20231201' The output contains colums for Node ID, Title, Content Type, Islandora Model, and Media. For each node in the View, the Media column contains the media use terms for all media attached to the node separated by semicolons, and sorted alphabetically:","title":"Using a Drupal View to generate a media report as CSV"},{"location":"generating_csv_files/#exporting-image-video-etc-files-along-with-csv-data","text":"In export_csv and get_data_from_view tasks, you can optionally export media files. To do this, add the following settings to your configuration file: export_file_directory : Required. This is the path to the directory where Workbench will save the exported files. export_file_media_use_term_id : Optional. This setting tells Workbench which Islandora Media Use term to use to identify the file to export. Defaults to http://pcdm.org/use#OriginalFile (for Original File). Can be either a term ID or a term URI. Note that currently only a single file per node can be exported, and that files need to be accessible to the anonymous Drupal user to be exported.","title":"Exporting image, video, etc. files along with CSV data"},{"location":"generating_csv_files/#using-the-csv-id-to-node-id-map","text":"By default, Workbench maintains a simple database that maps values in your CSV's ID column (or whatever column you define in the id_field config setting) to node IDs created in create tasks. Note You do not need to install anything extra for Workbench to create this database. Workbench provides a utility script, manage_csv_to_node_id_map.py (described below), for exporting and pruning the data. You only need to install the sqlite3 client or a third-party utility if you want to access the database in ways that surpass the capabilities of the manage_csv_to_node_id_map.py script. A useful third-party tool for viewing and modifying SQLite databases is DB Browser for SQLite . Here is a sample screenshot illustrating the CSV to node ID map database table in DB Browser for SQLite (CSV ID and node ID are the two right-most columns): Workbench optionally uses this database to determine the node ID of parent nodes when creating paged and compound content, so, for example, you can use parent_id values in your input CSV that refer to parents created in earlier Workbench sessions. But, you may find other uses for this data. 
Since it is stored in an SQLite database, it can be queried using SQL, or can be dumped using into a CSV file using the dump_id_map.py script provided in Workbench's scripts directory. Note In create_from_files tasks, which don't use an input CSV file, the filename is recorded instead of an \"id\". One configuration setting applies to this feature, csv_id_to_node_id_map_path . By default, its value is [your temporary directory]/csv_id_to_node_id_map.db (see the temp_dir config setting's documentation for more information on where that directory is). This default can be overridden in your config file. If you want to disable population of this database completely, set csv_id_to_node_id_map_path to false . Warning Some systems clear out their temporary directories on restart. You may want to define the absolute path to your ID map database in the csv_id_to_node_id_map_path configuration setting so it is stored in a location that will not get deleted on system restart. The SQLite database contains one table, \"csv_id_to_node_id_map\". This table has five columns: timestamp : the current timestamp in yyyy-mm-dd hh:mm:ss format or a truncated version of that format config_file : the name of the Workbench configuration file active when the row was added parent_csv_id : if the node was created along with its parent, the parent's CSV ID parent_node_id : if the node was create along with its parent, the parent's node ID csv_id : the value in the node's CSV ID field (or in create_from_files tasks, which don't use an input CSV file, the filename) node_id : the node's Drupal node ID If you don't want to query the database directly, you can use scripts/manage_csv_to_node_id_map.py to: Export the contents of the entire database to a CSV file. To do this, in the Workbench directory, run the script, specifying the path to the database file and the path to the CSV output: python scripts/manage_csv_to_node_id_map.py --db_path /tmp/csv_id_to_node_id_map.db --csv_path /tmp/output.csv Export the rows that have duplicate (i.e., identical) CSV ID values, or duplicate values in any specific field. To limit the rows that are dumped to those with duplicate values in a specific database field, add the --nonunique argument and the name of the field, e.g., --nonunique csv_id . The resulting CSV will only contain those entries from your database. Delete entries from the database that have a specific value in their config_file column. To do this, provide the script with the --remove_entries_with_config_files argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_with_config_files create.yml . You can also specify a comma-separated list of config file names (for example --remove_entries_with_config_files create.yml,create_testing.yml ) to remove all entries with those config file names with one command. Delete entries from the database create before a specific date. To do this, provide the script with the --remove_entries_before argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_before \"2023-05-29 19:17\" . Delete entries from the database created after a specific date. To do this, provide the script with the --remove_entries_after argument, e.g., python scripts/manage_csv_to_node_id_map.py --db_path csv_id_to_node_id_map.db --remove_entries_after \"2023-05-29 19:17\" . 
The value of the --remove_entries_before and --remove_entries_after arguments is a date string that can take the form yyyy-mm-dd hh:mm:ss or any truncated version of that format, e.g. yyyy-mm-dd hh:mm , yyyy-mm-dd hh , or yyyy-mm-dd . Any rows in the database table that have a timestamp value that matches the date value will be deleted from the database. Note that if your timestamp value has a space in it, you need to wrap it quotation marks as illustrated above; if you don't, the script will delete all the entries on the timestamp value before the space, in other words, that day.","title":"Using the CSV ID to node ID map"},{"location":"generating_sample_content/","text":"If you want to quickly generate some sample images to load into Islandora, Workbench provides a utility script to do that. Running python3 scripts/generate_image_files.py from within the Islandora Workbench directory will generate PNG images from the list of titles in the sample_filenames.txt file. Running this script will result in a group of images whose filenames are normalized versions of the lines in the sample title file. You can then load this sample content into Islandora using the create_from_files task. If you want to have Workbench generate the sample content automatically, configure the generate_image_files.py script as a bootstrap script. See the autogen_content.yml configuration file for an example of how to do that.","title":"Generating sample Islandora content"},{"location":"hooks/","text":"Hooks Islandora Workbench offers three \"hooks\" that can be used to run scripts at specific points in the Workbench execution lifecycle. The three hooks are: Bootstrap CSV preprocessor Post-action Hook scripts can be in any language, and need to be executable by the user running Workbench. Execution (whether successful or not) of hook scripts is logged, including the scripts' exit code. Bootstrap and shutdown scripts Bootstrap scripts execute prior to Workbench connecting to Drupal. For an example of using this feature to run a script that generates sample Islandora content, see the \" Generating sample Islandora content \" section. To register a bootstrap script in your configuration file, add it to the bootstrap option, like this, indicating the absolute path to the script: bootstrap: [\"/home/mark/Documents/hacking/workbench/scripts/generate_image_files.py\"] Each bootstrap script gets passed a single argument, the path to the Workbench config file that was specified in Workbench's --config argument. For example, if you are running Workbench with a config file called create.yml , \"create.yml\" will automatically be passed as the argument to your bootstrap script (you do not specify it in the configuration), like this: generate_image_files.py create.yml Shutdown scripts work the same way as bootstrap scripts but they execute after Workbench has finished connecting to Drupal. Like bootstrap scripts, shutdown scripts receive a single argument from Workbench, the path to your configuration file. A common situation where a shutdown script is useful is to check the Workbench log for failures, and if any are detected, to email someone. The script email_log_if_errors.py in the scripts directory shows how this can be used for this. To register a shutdown script, add it to the shutdown option: shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example.py\"] --check will check for the existence of bootstrap and shutdown scripts, and that they are executable, but does not execute them. 
The scripts are only executed when Workbench is run without --check . A shutdown script that you might find useful is scripts/generate_iiif_manifests.py , which generates the IIIF Manifest (book-manifest) for each node in the current create task that is a parent. You might want to do this since pregenerating the manifest usually speeds up rendering of paged content in Mirador and OpenSeadragon. To use it, simply add the following to your create task config file, adjusting the path to your Workbench scripts directory: shutdown: [\"/home/mark/hacking/workbench/scripts/generate_iiif_manifests.py\"] Warning Bootstrap and shutdown scripts get passed the path to your configuration file, but they only have access to the configuration settings explicitly defined in that file. In other words, any configuration setting with a default value, and therefore not necessarily included in your configuration file, is not known to bootstrap/shutdown scripts. Therefore, it is good practice to include in your configuration file all configuration settings your script will need. The presence of a configuration setting set to its default value has no effect on Workbench. CSV preprocessor scripts CSV preprocessor scripts are applied to CSV values in a specific CSV field prior to the values being ingested into Drupal. They apply to the entire value from the CSV field and not to split field values, e.g., if a field is multivalued, the preprocessor must split it and then reassemble it back into a string. Note that preprocessor scripts work only on string data (not on binary data like images) and only on custom fields (so not on the title field). Preprocessor scripts are applied in create and update tasks. Note If you are interested in seeing preprocessor scripts act on binary data such as images, see this issue . For example, if you want to convert all the values in the field_description CSV field to sentence case, you can do this by writing a small Python script that uses the capitalize() method and registering it as a preprocessor. To register a preprocessor script in your configuration file, add it to the preprocessors setting, mapping a CSV column header to the path of the script, like this: preprocessors: - field_description: /home/mark/Documents/hacking/workbench/scripts/samplepreprocessor.py You must provide the absolute path to the script, and the script must be executable. Each preprocessor script gets passed two arguments: the character used as the CSV subdelimiter (defined in the subdelimiter config setting, which defaults to | ) unlike bootstrap, shutdown, and post-action scripts, preprocessor scripts do not get passed the path to your Workbench configuration file; they only get passed the value of the subdelimiter config setting. the CSV field value When executed, the script processes the string content of the CSV field, and Workbench then replaces the original version of the CSV field value with the version processed by the script. An example preprocessor script is available in scripts/samplepreprocessor.py (a minimal sketch along the same lines appears just below). Post-action scripts Post-action scripts execute after a node is created or updated, or after a media is created.
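Here is that minimal preprocessor sketch, using the field_description sentence-case example above. It assumes, following the general pattern of scripts/samplepreprocessor.py , that the script receives the subdelimiter and the field value as its two arguments and returns the processed value by writing it to standard output; check the sample script for the authoritative contract.
#!/usr/bin/env python3
"""Sketch of a CSV preprocessor: sentence-cases each subvalue in a (possibly multivalued) field."""

import sys

subdelimiter = sys.argv[1]  # e.g. "|", the value of the subdelimiter config setting
field_value = sys.argv[2]   # the entire CSV field value

# Split on the subdelimiter, process each part, then reassemble into one string.
parts = field_value.split(subdelimiter)
processed = [part.strip().capitalize() for part in parts]

# Assumption: the processed value is handed back to Workbench via standard output.
print(subdelimiter.join(processed), end='')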
To register post-action scripts in your configuration file, add them to either the node_post_create , node_post_update , or media_post_create configuration setting: node_post_create: [\"/home/mark/Documents/hacking/workbench/post_node_create.py\"] node_post_update: [\"/home/mark/Documents/hacking/workbench/post_node_update.py\"] media_post_create: [\"/home/mark/Documents/hacking/workbench/post_media_update.py\"] The arguments passed to each post-action hook are: the path to the Workbench config file that was specified in the --config argument the HTTP response code returned from the action (create, update), e.g. 201 or 403 . Note that this response code is a string, not an integer. the entire HTTP response body; this will be raw JSON. These arguments are passed to post-action scripts automatically. You don't specify them when you register your scripts in your config file. The script scripts/entity_post_task_example.py illustrates these arguments (a minimal sketch also appears a little further below). Your scripts can find the entity ID and other information within the (raw JSON) HTTP response body. Using the way Python decodes JSON as an example, if the entity is a node, its nid is in entity_json['nid'][0]['value'] ; if the entity is a media, the mid is in entity_json['mid'][0]['value'] . The exact location of the nid and mid may differ if your script is written in a language that decodes JSON differently than Python (used in this example) does. Warning Not all Workbench configuration settings are available in post-action scripts. Only the settings that are explicitly defined in the configuration YAML are available. As with bootstrap and shutdown scripts, when using post-action scripts, it is good practice to include in your configuration file all configuration settings your script will need. The presence of a configuration setting set to its default value has no effect on Workbench. Running multiple scripts in one hook For all types of hooks, you can register multiple scripts, like this: bootstrap: [\"/home/mark/Documents/hacking/workbench/bootstrap_example_1.py\", \"/home/mark/Documents/hacking/workbench/bootstrap_example_2.py\"] shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example_1.py\", \"/home/mark/Documents/hacking/workbench/shutdown_example_2.py\"] node_post_create: [\"/home/mark/scripts/email_someone.py\", \"/tmp/hit_remote_api.py\"] They are executed in the order in which they are listed.","title":"Hooks"},{"location":"hooks/#hooks","text":"Islandora Workbench offers three \"hooks\" that can be used to run scripts at specific points in the Workbench execution lifecycle. The three hooks are: Bootstrap CSV preprocessor Post-action Hook scripts can be in any language, and need to be executable by the user running Workbench. Execution (whether successful or not) of hook scripts is logged, including the scripts' exit code.","title":"Hooks"},{"location":"hooks/#bootstrap-and-shutdown-scripts","text":"Bootstrap scripts execute prior to Workbench connecting to Drupal. For an example of using this feature to run a script that generates sample Islandora content, see the \" Generating sample Islandora content \" section. To register a bootstrap script in your configuration file, add it to the bootstrap option, like this, indicating the absolute path to the script: bootstrap: [\"/home/mark/Documents/hacking/workbench/scripts/generate_image_files.py\"] Each bootstrap script gets passed a single argument, the path to the Workbench config file that was specified in Workbench's --config argument.
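Returning to post-action scripts for a moment, here is the minimal sketch referred to above. It is grounded only in the argument list described earlier (config file path, HTTP response code as a string, raw JSON response body) and in the documented location of the node ID in the decoded JSON; it could be registered under node_post_create .
#!/usr/bin/env python3
"""Sketch of a node_post_create hook: reports the node ID found in the HTTP response body."""

import json
import sys

config_file_path = sys.argv[1]    # path passed to Workbench's --config argument
http_response_code = sys.argv[2]  # a string, e.g. "201"
http_response_body = sys.argv[3]  # raw JSON

if http_response_code == '201':
    entity_json = json.loads(http_response_body)
    nid = entity_json['nid'][0]['value']
    print(f'Created node {nid} using config {config_file_path}.')
else:
    print(f'Node creation returned HTTP {http_response_code}.')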
For example, if you are running Workbench with a config file called create.yml , \"create.yml\" will automatically be passed as the argument to your bootstrap script (you do not specify it in the configuration), like this: generate_image_files.py create.yml Shutdown scripts work the same way as bootstrap scripts but they execute after Workbench has finished connecting to Drupal. Like bootstrap scripts, shutdown scripts receive a single argument from Workbench, the path to your configuration file. A common situation where a shutdown script is useful is to check the Workbench log for failures, and if any are detected, to email someone. The script email_log_if_errors.py in the scripts directory shows how this can be used for this. To register a shutdown script, add it to the shutdown option: shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example.py\"] --check will check for the existence of bootstrap and shutdown scripts, and that they are executable, but does not execute them. The scripts are only executed when Workbench is run without --check . A shutdown script that you might find useful is scripts/generate_iiif_manifests.py , which generates the IIIF Manifest (book-manifest) for each node in the current create task that is a parent. You might want to do this since pregenerating the manifest usually speeds up rendering of paged content in Mirador and OpenSeadragon. To use it, simply add the following to your create task config file, adjusting the path to your Workbench scripts directory: shutdown: [\"/home/mark/hacking/workbench/scripts/generate_iiif_manifests.py\"] Warning Bootstrap and shutdown scripts get passed the path to your configuration file, but they only have access to the configuration settings explicitly defined in that file. In other words, any configuration setting with a default value, and therefore no necessarily included in your configuration file, is not known to bootstrap/shutdown scripts. Therefore, it is good practice to include in your configuration file all configuration settings your script will need. The presence of a configuration setting set to its default value has no effect on Workbench.","title":"Bootstrap and shutdown scripts"},{"location":"hooks/#csv-preprocessor-scripts","text":"CSV preprocessor scripts are applied to CSV values in a specific CSV field prior to the values being ingested into Drupal. They apply to the entire value from the CSV field and not split field values, e.g., if a field is multivalued, the preprocessor must split it and then reassemble it back into a string. Note that preprocessor scripts work only on string data and not on binary data like images, etc. and only on custom fields (so not title). Preprocessor scripts are applied in create and update tasks. Note If you are interested in seeing preprocessor scripts act on binary data such as images, see this issue . For example, if you want to convert all the values in the field_description CSV field to sentence case, you can do this by writing a small Python script that uses the capitalize() method and registering it as a preprocessor. To register a preprocessor script in your configuration file, add it to the preprocessors setting, mapping a CSV column header to the path of the script, like this: preprocessors: - field_description: /home/mark/Documents/hacking/workbench/scripts/samplepreprocessor.py You must provide the absolute path to the script, and the script must be executable. 
Each preprocessor script gets passed two arguments: the character used as the CSV subdelimiter (defined in the subdelimiter config setting, which defaults to | ) unlike bootstrap, shutdown, and post-action scripts, preprocessor scripts do not get passed the path to your Workbench configuration file; they only get passed the value of the subdelimiter config setting. the CSV field value When executed, the script processes the string content of the CSV field, and then replaces the original version of the CSV field value with the version processed by the script. An example preprocessor script is available in scripts/samplepreprocessor.py .","title":"CSV preprocessor scripts"},{"location":"hooks/#post-action-scripts","text":"Post-action scripts execute after a node is created or updated, or after a media is created. To register post-action scripts in your configuration file, add them to either the node_post_create , node_post_update , or media_post_create configuration setting: node_post_create: [\"/home/mark/Documents/hacking/workbench/post_node_create.py\"] node_post_update: [\"/home/mark/Documents/hacking/workbench/post_node_update.py\"] media_post_create: [\"/home/mark/Documents/hacking/workbench/post_media_update.py\"] The arguments passed to each post-action hook are: the path to the Workbench config file that was specified in the --config argument the HTTP response code returned from the action (create, update), e.g. 201 or 403 . Note that this response code is a string, not an integer. the entire HTTP response body; this will be raw JSON. These arguments are passed to post-action scripts automatically. You don't specific them when you register your scripts in your config file. The scripts/entity_post_task_example.py illustrates these arguments. Your scripts can find the entity ID and other information within the (raw JSON) HTTP response body. Using the way Python decodes JSON as an example, if the entity is a node, its nid is in entity_json['nid'][0]['value'] ; if the entity is a media, the mid is in entity_json['mid'][0]['value'] . The exact location of the nid and mid may differ if your script is written in a language that decodes JSON differently than Python (used in this example) does. Warning Not all Workbench configuration settings are available in post-action scripts. Only the settings are explicitly defined in the configuration YAML are available. As with bootstrap and shutdown scripts, when using post-action scripts, it is good practice to include in your configuration file all configuration settings your script will need. 
The presence of a configuration setting set to its default value has no effect on Workbench.","title":"Post-action scripts"},{"location":"hooks/#running-multiple-scripts-in-one-hook","text":"For all types of hooks, you can register multiple scripts, like this: bootstrap: [\"/home/mark/Documents/hacking/workbench/bootstrap_example_1.py\", \"/home/mark/Documents/hacking/workbench/bootstrap_example_2.py\"] shutdown: [\"/home/mark/Documents/hacking/workbench/shutdown_example_1.py\", \"/home/mark/Documents/hacking/workbench/shutdown_example_2.py\"] node_post_create: [\"/home/mark/scripts/email_someone.py\", \"/tmp/hit_remote_api.py\"] They are executed in the order in which they are listed.","title":"Running multiple scripts in one hook"},{"location":"ignoring_csv_rows_and_columns/","text":"Commenting out CSV rows In create and update tasks, you can comment out rows in your input CSV, Excel file, or Google Sheet by adding a hash mark ( # ) as the first character of the value in the first column. Workbench ignores these rows, both when it is run with and without --check . Commenting out rows works in all tasks that use CSV data. For example, the third row in the following CSV file is commented out: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,Weather was windy. #IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Since column order doesn't matter to Workbench, the same row is commented out in both the previous example and in this one: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. 04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Commenting works the same with in Excel and Google Sheets. Here is the CSV file used above in a Google Sheet: You can also use commenting to include actual comments in your CSV/Google Sheet/Excel file: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # Let's not load the following record right now. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. 04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Using CSV row ranges The csv_start_row and csv_stop_row configuration settings allow you to tell Workbench to only process a specific subset of input CSV records. Both settings are optional and can be used in any task, and apply when using text CSV, Google Sheets, or Excel input files. Each setting takes as its value a row number (ignoring the header row). For example, row number 2 is the second row of data after the CSV header row. Below are some example configurations. 
Process CSV rows 10 to the end of the CSV file (ignoring rows 1-9): csv_start_row: 10 Process only CSV rows 10-15 (ignoring all other rows): csv_start_row: 10 csv_stop_row: 15 Process CSV from the start of the file to row 20 (ignoring rows 21 and higher): csv_stop_row: 20 If you only want to process a single row, use its position in the CSV for both csv_start_row and csv_stop_row (for example, to only process row 100): csv_start_row: 100 csv_stop_row: 100 Note When the csv_start_row or csv_stop_row options are in use, Workbench will display a message similar to the following when run: Using a subset of the input CSV (will start at row 10, stop at row 15). Processing specific CSV rows You can tell Workbench to process only specific rows in your CSV file (or, looked at from another perspective, to ignore all rows other than the specified ones). To do this, add the csv_rows_to_process setting to your config file with a list of \"id\" column values, e.g.: csv_rows_to_process: [\"test_001\", \"test_007\", \"test_103\"] This will tell Workbench to process only the CSV rows that have those values in their \"id\" column. This works with whatever you have configured as your \"id\" column header using the id_field configuration setting. Processing or ignoring rows based on field values Workbench provides a simple mechanism to filter rows in your input CSV. For example, you can tell it to process only CSV rows that have a field_model of \"Image\", or rows that have a field_model of either \"Image\" or \"Digital document\". This is done using the csv_row_filters config setting, which defines a set of filters that are applied to your CSV at runtime. There are two types of filters, is and isnot . Here is an example configuration using an is filter to reduce the input CSV to only rows that have a field_model of \"Image\": csv_row_filters: - field_model:is:Image This example filters the CSV input to only rows that have a field_model of \"Image\" or \"Digital document\": csv_row_filters: - field_model:is:Image - field_model:is:Digital document Multiple filters are ORed together, i.e. in the above example, the row is kept in the input if its field_model is either \"Image\" or \"Digital document\". isnot filters work similarly, but they exclude rows that match ( is filters include rows that match). For example, csv_row_filters: - field_model:isnot:Image will filter out rows that have a field_model of \"Image\". Some things to keep in mind when using CSV row filters: You can filter on any column in your input CSV, not just field_model as used in examples above. You can use filters on multivalued fields; Workbench will check each value in the field against each filter. You can add as many filters as you want (both is and isnot ) but as the number of filters increases, the contents of your resulting input CSV become less predictable. In other words, these filters are intended for, and work best in, configurations that have a small number (e.g., one, two, or three) filters. isnot filters are applied before is filters. Within is and isnot filters, each filter is not necessarily applied in the order they appear in your configuration file. Filters are applied every time you run Workbench, regardless of whether you are in --check mode or not. Regardless of whether you are using CSV row filters, or any other technique of ignoring CSV rows or columns, Workbench converts your input CSV into a \"preprocessed\" version and uses it to perform its task.
This file can be found in the temporary directory defined in your Workbench config's temp_dir setting, which by default is your computer's temporary directory. If you want to inspect this file after running --check , you can see which rows result after the filters have been applied. Ignoring CSV columns Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names and reserved Workbench column names. To accommodate CSV columns that do not correspond to either of those types, or to eliminate a column during testing or troubleshooting, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the column headers in the ignore_csv_columns configuration setting. The value of this setting is a list. For example, if you want to include a date_generated column in your CSV (which is neither a Workbench reserved column nor a Drupal field name), include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated'] If you want Workbench to ignore the \"date_generated\" and \"field_description\" columns, your configuration would look like this: ignore_csv_columns: ['date_generated', 'field_description'] Note that if a column name is listed in the csv_field_templates setting , the value for the column defined in the template is used. In other words, the values in the CSV are ignored, but the field in Drupal is still populated using the value from the field template.","title":"Ignoring CSV rows and columns"},{"location":"ignoring_csv_rows_and_columns/#commenting-out-csv-rows","text":"In create and update tasks, you can comment out rows in your input CSV, Excel file, or Google Sheet by adding a hash mark ( # ) as the first character of the value in the first column. Workbench ignores these rows, both when it is run with and without --check . Commenting out rows works in all tasks that use CSV data. For example, the third row in the following CSV file is commented out: file,id,title,field_model,field_description IMG_1410.tif,01,Small boats in Havana Harbour,25,Taken on vacation in Cuba. IMG_2549.jp2,02,Manhatten Island,25,Weather was windy. #IMG_2940.JPG,03,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. IMG_2958.JPG,04,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. IMG_5083.JPG,05,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Since column order doesn't matter to Workbench, the same row is commented out in both the previous example and in this one: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain. 04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\" Commenting works the same way in Excel and Google Sheets. Here is the CSV file used above in a Google Sheet: You can also use commenting to include actual comments in your CSV/Google Sheet/Excel file: id,file,title,field_model,field_description 01,IMG_1410.tif,Small boats in Havana Harbour,25,Taken on vacation in Cuba. 02,IMG_2549.jp2,Manhatten Island,25,Weather was windy. # Let's not load the following record right now. # 03,IMG_2940.JPG,Looking across Burrard Inlet,25,View from Deep Cove to Burnaby Mountain.
04,IMG_2958.JPG,Amsterdam waterfront,25,Amsterdam waterfront on an overcast day. 05,IMG_5083.JPG,Alcatraz Island,25,\"Taken from Fisherman's Wharf, San Francisco.\"","title":"Commenting out CSV rows"},{"location":"ignoring_csv_rows_and_columns/#using-csv-row-ranges","text":"The csv_start_row and csv_stop_row configuration settings allow you to tell Workbench to only process a specific subset of input CSV records. Both settings are optional and can be used in any task, and apply when using text CSV, Google Sheets, or Excel input files. Each setting takes as its value a row number (ignoring the header row). For example, row number 2 is the second row of data after the CSV header row. Below are some example configurations. Process CSV rows 10 to the end of the CSV file (ignoring rows 1-9): csv_start_row: 10 Process only CSV rows 10-15 (ignoring all other rows): csv_start_row: 10 csv_stop_row: 15 Process CSV from the start of the file to row 20 (ignoring rows 21 and higher): csv_stop_row: 20 If you only want to process a single row, use its position in the CSV for both csv_start_row or csv_stop_row (for example, to only process row 100): csv_start_row: 100 csv_stop_row: 100 Note When the csv_start_row or csv_stop_row options are in use, Workbench will display a message similar to the following when run: Using a subset of the input CSV (will start at row 10, stop at row 15).","title":"Using CSV row ranges"},{"location":"ignoring_csv_rows_and_columns/#processing-specific-csv-rows","text":"You can tell Workbench to process only specific rows in your CSV file (or, looked at from another perspective, to ignore all rows other than the specified ones). To do this, add the csv_rows_to_process setting to your config file with a list of \"id\" column values, e.g.: csv_rows_to_process: [\"test_001\", \"test_007\", \"test_103\"] This will tell Workbench to process only the CSV rows that have those values in their \"id\" column. This works with whatever you have configured as your \"id\" column header using the id_field configuration setting.","title":"Processing specific CSV rows"},{"location":"ignoring_csv_rows_and_columns/#processing-or-ignoring-rows-based-on-field-values","text":"Workbench provides a simple mechanism to filter rows in your input CSV. For example, you can tell it to process only CSV rows that have a field_model of \"Image\", or fields that have a field_model of either \"Image\" or \"Digital document\". This is done using the csv_row_filters config setting, which defines a set of filters that are applied to your CSV at runtime. There are two types of filters, is and isnot . Here is an example configuration using an is filter to reduce the input CSV to only rows that have a field_model of \"Image\": csv_row_filters: - field_model:is:Image This example filters the CSV input to only rows that have a field_model of \"Image\" or \"Digital document\": csv_row_filters: - field_model:is:Image - field_model:is:Digital document Multiple filters are ORed together, i.e. in the above example, the row in kept in the input if its field_model is either \"Image\" or \"Digital document\". isnot filters work similarly, but they exclude rows that match ( is filters include rows that match). For example, csv_row_filters: - field_model:isnot:Image will filter out rows that have a field_model of \"Image\". Some things to keep in mind when using CSV row filters: You can filter on any column in your input CSV, not just field_model as used in examples above. 
You can use filters on multivalued fields; Workbench will check each value in the field against each filter. You can add as many filters as you want (both is and isnot ) but as the number of filters increases, the contents of your resulting input CSV becomes less predictable. In other words, these filters are intended for, and work best in, configurations that have a small number (e.g., one, two, or three) filters. isnot filters are applied before is filters. Within is and isnot filters, each filter is not necessarily applied in the order they appear in your configuration file. Filters are applied every time you run Workbench, regardless of whether you are in --check mode or not. Regardless of whether you are using CSV row filters, or any other technique of ignoring CSV rows or columns, Workbench converts your input CSV into a \"preprocessed\" version and uses it to perform its task. This file can be found in the temporary directory defined in your Workbench config's temp_dir setting, which by default is your computer's temporary directory. If you want to inspect this file after running --check , you can see which rows result after the filters have been applied.","title":"Processing or ignoring rows based on field values"},{"location":"ignoring_csv_rows_and_columns/#ignoring-csv-columns","text":"Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names and reserved Workbench column names. To accommodate CSV columns that do not correspond to either of those types, or to eliminate a column during testing or troubleshooting, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the column headers in the ignore_csv_columns configuration setting. The value of this setting is a list. For example, if you want to include a date_generated column in your CSV (which is neither a Workbench reserved column or a Drupal field name), include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated'] If you want Workbench to ignore the \"data_generated\" column and the \"field_description\" columns, your configuration would look like this: ignore_csv_columns: ['date_generated', 'field_description'] Note that if a column name is listed in the csv_field_templates setting , the value for the column defined in the template is used. In other words, the values in the CSV are ignored, but the field in Drupal is still populated using the value from the field template.","title":"Ignoring CSV columns"},{"location":"installation/","text":"Requirements An Islandora repository using Drupal 8 or 9, with the Islandora Workbench Integration module enabled. If you are using Drupal 8.5 or earlier, please refer to the \"Using Drupal 8.5 or earlier\" section below. Python 3.7 or higher The following Python libraries: ruamel.yaml Requests Requests-Cache progress_bar openpyxl unidecode edtf-validate rich If you want to have these libraries automatically installed, you will need Python's setuptools Islandora Workbench has been installed and used on Linux, Mac, and Windows. Warning Some systems have both Python 2 and Python 3 installed. It's a good idea to check which version is used when you run python . To do this, run python --version , which will output something like \"Python 2.7.17\" or \"Python 3.8.10\". If python --version indicates you're running version 2, try running python3 --version to see if you have version 3 installed. 
Also, if you installed an alternate version of Python 3.x on your system (for example via Homebrew on a Mac), you may need to run Workbench by calling that Python interpreter directly. For Python 3.x installed via Homebrew, that will be at /opt/homebrew/bin/python3 , so to run Workbench you would use /opt/homebrew/bin/python3 workbench while in the islandora_workbench directory. Installing Islandora Workbench Installation involves two steps: cloning the Islandora Workbench Github repo running setup.py to install the required Python libraries (listed above) Step 1: cloning the Islandora Workbench Github repo In a terminal, run: git clone https://github.com/mjordan/islandora_workbench.git This will create a directory named islandora_workbench where you will run the ./workbench command. Step 2: running setup.py to install the required Python libraries For most people, the preferred place to install Python libraries is in your user directory. To do this, change into the \"islandora_workbench\" directory created by cloning the repo, and run the following command: python3 setup.py install --user A less common method is to install the required Python libraries into your computer's central Python environment. To do this, omit the --user (note: you must have administrator privileges on the computer to do this): sudo python3 setup.py install Updating Islandora Workbench Since Islandora Workbench is under development, you will want to update it often. To do this, within the islandora_workbench directory, run the following git command: git pull origin main After you pull in the latest changes using git , it's a good idea to rerun the setup tools in case new Python libraries have been added since you last ran the setup tools (same command as above): python3 setup.py install --user or if you originally installed the required Python libraries centrally, without the --user option (again, you will need administrator privileges on the machine): sudo python3 setup.py install Keeping the Islandora Workbench Integration Drupal module up to date Islandora Workbench communicates with Drupal using REST endpoints and Views. The Islandora Workbench Integration module (linked above in the \"Requirements\" section) ensures that the target Drupal has all required REST endpoints and Views enabled. Therefore, keeping it in sync with Islandora Workbench is important. Workbench checks the version of the Integration module and tells you if you need to upgrade it. To upgrade the module, update its code via Git or Composer, and follow the instructions in the \"Updates\" section of its README . Configuring Drupal's media URLs Islandora Workbench uses Drupal's default form of media URLs. You should not need to do anything to allow this, since the admin setting in admin/config/media/media-settings (under \"Security\") that determines what form of media URLs your site uses defaults to the correct setting (unchecked): If your site needs to have this option checked (so it supports URLs like /media/{id} ), you will need to add the following entry to all configuration files for tasks that create or delete media: standalone_media_url: true Note If you change the checkbox in Drupal's media settings admin page, be sure you clear your Drupal cache to make the new media URLs work. Using Drupal 8.5 or earlier When ingesting media in Drupal versions 8.5 and earlier, Islandora Workbench has two significant limitations/bugs that you should be aware of: Approximately 10% of media creation attempts will likely fail. 
Workbench will log these failures. Additional information is available in this issue . A file with a filename that already exists in Islandora will overwrite the existing file, as reported in this issue . To avoid these issues, you need to be running Drupal version 8.6 or higher. Warning If you are using Drupal 8.5 or earlier, you need to use the version of Workbench tagged with drupal_8.5_and_lower (commit 542325fb6d44c2ac84a4e2965289bb9f9ed9bf68). Later versions no longer support Drupal 8.5 and earlier. Password management Islandora Workbench requires user credentials that have administrator-level permissions in the target Drupal. Therefore you should exercise caution when managing those credentials. Workbench configuration files must contain a username setting, but you can provide the corresponding password in three ways: in the password setting in your YAML configuration file in the ISLANDORA_WORKBENCH_PASSWORD environment variable in response to a prompt when you run Workbench. If the password setting is present in your configuration files, Workbench will use its value as the user password and will ignore the other two methods of providing a password. If the password setting is absent, Workbench will look for the ISLANDORA_WORKBENCH_PASSWORD environment variable and if it is present, use its value. If both the password setting and the ISLANDORA_WORKBENCH_PASSWORD environment variable are absent, Workbench will prompt the user for a password before proceeding. Warning If you put the password in configuration files, you should not leave the files in directories that are widely readable, send them in emails or share them in Slack, commit the configuration files to public Git repositories, etc.","title":"Requirements and installation"},{"location":"installation/#requirements","text":"An Islandora repository using Drupal 8 or 9, with the Islandora Workbench Integration module enabled. If you are using Drupal 8.5 or earlier, please refer to the \"Using Drupal 8.5 or earlier\" section below. Python 3.7 or higher The following Python libraries: ruamel.yaml Requests Requests-Cache progress_bar openpyxl unidecode edtf-validate rich If you want to have these libraries automatically installed, you will need Python's setuptools Islandora Workbench has been installed and used on Linux, Mac, and Windows. Warning Some systems have both Python 2 and Python 3 installed. It's a good idea to check which version is used when you run python . To do this, run python --version , which will output something like \"Python 2.7.17\" or \"Python 3.8.10\". If python --version indicates you're running version 2, try running python3 --version to see if you have version 3 installed. Also, if you installed an alternate version of Python 3.x on your system (for example via Homebrew on a Mac), you may need to run Workbench by calling that Python interpreter directly. 
For Python 3.x installed via Homebrew, that will be at /opt/homebrew/bin/python3 , so to run Workbench you would use /opt/homebrew/bin/python3 workbench while in the islandora_workbench directory.","title":"Requirements"},{"location":"installation/#installing-islandora-workbench","text":"Installation involves two steps: cloning the Islandora Workbench Github repo running setup.py to install the required Python libraries (listed above)","title":"Installing Islandora Workbench"},{"location":"installation/#step-1-cloning-the-islandora-workbench-github-repo","text":"In a terminal, run: git clone https://github.com/mjordan/islandora_workbench.git This will create a directory named islandora_workbench where you will run the ./workbench command.","title":"Step 1: cloning the Islandora Workbench Github repo"},{"location":"installation/#step-2-running-setuppy-to-install-the-required-python-libraries","text":"For most people, the preferred place to install Python libraries is in your user directory. To do this, change into the \"islandora_workbench\" directory created by cloning the repo, and run the following command: python3 setup.py install --user A less common method is to install the required Python libraries into your computer's central Python environment. To do this, omit the --user (note: you must have administrator privileges on the computer to do this): sudo python3 setup.py install","title":"Step 2: running setup.py to install the required Python libraries"},{"location":"installation/#updating-islandora-workbench","text":"Since Islandora Workbench is under development, you will want to update it often. To do this, within the islandora_workbench directory, run the following git command: git pull origin main After you pull in the latest changes using git , it's a good idea to rerun the setup tools in case new Python libraries have been added since you last ran the setup tools (same command as above): python3 setup.py install --user or if you originally installed the required Python libraries centrally, without the --user option (again, you will need administrator privileges on the machine): sudo python3 setup.py install","title":"Updating Islandora Workbench"},{"location":"installation/#keeping-the-islandora-workbench-integration-drupal-module-up-to-date","text":"Islandora Workbench communicates with Drupal using REST endpoints and Views. The Islandora Workbench Integration module (linked above in the \"Requirements\" section) ensures that the target Drupal has all required REST endpoints and Views enabled. Therefore, keeping it in sync with Islandora Workbench is important. Workbench checks the version of the Integration module and tells you if you need to upgrade it. To upgrade the module, update its code via Git or Composer, and follow the instructions in the \"Updates\" section of its README .","title":"Keeping the Islandora Workbench Integration Drupal module up to date"},{"location":"installation/#configuring-drupals-media-urls","text":"Islandora Workbench uses Drupal's default form of media URLs. 
You should not need to do anything to allow this, since the admin setting in admin/config/media/media-settings (under \"Security\") that determines what form of media URLs your site uses defaults to the correct setting (unchecked): If your site needs to have this option checked (so it supports URLs like /media/{id} ), you will need to add the following entry to all configuration files for tasks that create or delete media: standalone_media_url: true Note If you change the checkbox in Drupal's media settings admin page, be sure you clear your Drupal cache to make the new media URLs work.","title":"Configuring Drupal's media URLs"},{"location":"installation/#using-drupal-85-or-earlier","text":"When ingesting media in Drupal versions 8.5 and earlier, Islandora Workbench has two significant limitations/bugs that you should be aware of: Approximately 10% of media creation attempts will likely fail. Workbench will log these failures. Additional information is available in this issue . A file with a filename that already exists in Islandora will overwrite the existing file, as reported in this issue . To avoid these issues, you need to be running Drupal version 8.6 or higher. Warning If you are using Drupal 8.5 or earlier, you need to use the version of Workbench tagged with drupal_8.5_and_lower (commit 542325fb6d44c2ac84a4e2965289bb9f9ed9bf68). Later versions no longer support Drupal 8.5 and earlier.","title":"Using Drupal 8.5 or earlier"},{"location":"installation/#password-management","text":"Islandora Workbench requires user credentials that have administrator-level permissions in the target Drupal. Therefore you should exercise caution when managing those credentials. Workbench configuration files must contain a username setting, but you can provide the corresponding password in three ways: in the password setting in your YAML configuration file in the ISLANDORA_WORKBENCH_PASSWORD environment variable in response to a prompt when you run Workbench. If the password setting is present in your configuration files, Workbench will use its value as the user password and will ignore the other two methods of providing a password. If the password setting is absent, Workbench will look for the ISLANDORA_WORKBENCH_PASSWORD environment variable and if it is present, use its value. If both the password setting and the ISLANDORA_WORKBENCH_PASSWORD environment variable are absent, Workbench will prompt the user for a password before proceeding. Warning If you put the password in configuration files, you should not leave the files in directories that are widely readable, send them in emails or share them in Slack, commit the configuration files to public Git repositories, etc.","title":"Password management"},{"location":"limitations/","text":"Note If you are encountering problems not described here, please open an issue and help improve Islandora Workbench! parent_id CSV column can only contain one ID The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. Non-ASCII filenames are normalized to their ASCII equivalents. The HTTP client library Workbench uses, Requests, requires filenames to be encoded as Latin-1 , while Drupal requires filenames to be encoded as UTF-8. Normalizing filenames that contain diacritics or non-Latin characters to their ASCII equivalents is a compromise. 
See this issue for more information. If Workbench normalizes a filename, it logs the original and the normalized version. Updating nodes does not create revisions. This is a limitation of Drupal (see this issue ). Password prompt always fails first time, and prompts a second time (which works) More information is available at this issue . Workbench doesn't support taxonomy reference fields that use the \"Filter by an entity reference View\" reference type Only taxonomy reference fields that use the \"Default\" reference type are currently supported. As a workaround, to populate a \"Filter by an entity reference View\" field, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Note that Workbench will not validate values in fields that are configured to use this type of reference. Also, term IDs that are not in the View results will result in the node not being created (Drupal will return a 422 response). Setting destination filesystem for media is not possible Drupal's REST interface for file fields does not allow overriding the \"upload destination\" (filesystem) that is defined in a media type's file field configuration. For example, if a file field is configured to use the \"Flysystem: fedora\" upload destination, you cannot tell Workbench to use the \"Public Files\" upload destination instead. Note that the drupal_filesystem configuration setting only applied to Drupal versions 8.x - 9.1. In Drupal 9.2 or later, this setting is ignored. Workbench cannot modify media automatically generated by Islandora's microservices Islandora uses Contexts to initiate the generation of derivative media. Configuration for these Contexts is available in the \"Derivatives\" section of admin/structure/context . One example of a derivative media is a thumbnail generated on the ingestion of a JPEG original file. During create or add_media tasks, Workbench cannot modify or even be aware of derivative media automatically generated by Islandora. For example, Workbench can't add alt text to a thumbnail image automatically created by the \"Image derivatives\" Context. This is because the derivative media are generated asynchronously by Islandora's job processing queue. In other words, there is no predictable relationship between when an \"Original file\" media is created by Workbench (or uploaded by a user in the Drupal content management forms) and when the \"Thumbnail\" media is generated by Islandora's microservices. This unpredictability prevents Workbench from knowing when the derivative will be or has been created, making it impossible to have Workbench automatically add alt text to that thumbnail.","title":"Known limitations"},{"location":"limitations/#parent_id-csv-column-can-only-contain-one-id","text":"The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field.","title":"parent_id CSV column can only contain one ID"},{"location":"limitations/#non-ascii-filenames-are-normalized-to-their-ascii-equivalents","text":"The HTTP client library Workbench uses, Requests, requires filenames to be encoded as Latin-1 , while Drupal requires filenames to be encoded as UTF-8. Normalizing filenames that contain diacritics or non-Latin characters to their ASCII equivalents is a compromise.
See this issue for more information. If Workbench normalizes a filename, it logs the original and the normalized version.","title":"Non-ASCII filenames are normalized to their ASCII equivalents."},{"location":"limitations/#updating-nodes-does-not-create-revisions","text":"This is limitation of Drupal (see this issue ).","title":"Updating nodes does not create revisions."},{"location":"limitations/#password-prompt-always-fails-first-time-and-prompts-a-second-time-which-works","text":"More information is available at this issue .","title":"Password prompt always fails first time, and prompts a second time (which works)"},{"location":"limitations/#workbench-doesnt-support-taxonomy-reference-fields-that-use-the-filter-by-an-entity-reference-view-reference-type","text":"Only taxonomy reference fields that use the \"Default\" reference type are currently supported. As a work around, to populate a \"Filter by an entity reference View\" field, you can do the following: use term IDs instead of term names or URIs in your input CSV and include require_entity_reference_views: false in your configuration file. Note that Workbench will not validate values in fields that are configured to use this type of reference. Also, term IDs that are not in the View results will result in the node not being created (Drupal will return a 422 response).","title":"Workbench doesn't support taxonomy reference fields that use the \"Filter by an entity reference View\" reference type"},{"location":"limitations/#setting-destination-filesystem-for-media-is-not-possible","text":"Drupal's REST interface for file fields does not allow overriding the \"upload destination\" (filesystem) that is defined in a media type's file field configuration. For example, if a file field is configured to use the \"Flysystem: fedora\" upload destination, you cannot tell Workbench to use the \"Public Files\" upload destination instead. Note that the drupal_filesystem configuration setting only applied to Drupal versions 8.x - 9.1. In Drupal 9.2 or later, this setting is ignored.","title":"Setting destination filesystem for media is not possible"},{"location":"limitations/#workbench-cannot-modify-media-automatically-generated-by-islandoras-microservices","text":"Islandora uses Contexts to initiate the generation of derivative media. Configuration for these Contexts is available in the \"Derivatives\" section of admin/structure/context . One example of a derivative media is a thumbnail generated on the ingestion of a JPEG original file. During create or add_media tasks, Workbench cannot modify or even be aware of derivative media automatically generated by Islandora. For example, Workbench can't add alt text to a thumbnail image automatically created by the \"Image derivatives\" Context. This is because the derivative media are generated asynchronously by Islandora's job processing queue. In other words, there is no predictable relationship between when an \"Original file\" media is created by Workbench (or uploaded by a user in the Drupal content management forms) and when the \"Thumbnail\" media is generated by Islandora's microservices. 
This unpredictability prevents Workbench from knowing when the derivative will be or has been created, making it impossible to have Workbench automatically add alt text to that thumbnail.","title":"Workbench cannot modify media automatically generated by Islandora's microservices"},{"location":"logging/","text":"Islandora Workbench writes a log file for all tasks to a file named \"workbench.log\" in the directory Workbench is run from, unless you specify an alternative log file location using the log_file_path configuration option, e.g.: log_file_path: /tmp/mylogfilepath.log Note The only times that the default log file name is used instead of one defined in log_file_path is 1) when Workbench can't find the specified configuration file and 2) when Workbench finds the configuration file but detects that the file is not valid YAML, and therefore can't understand the value of log_file_path . The log contains information that is similar to what you see when you run Workbench, but with time stamps: 24-Dec-20 15:05:06 - INFO - Starting configuration check for \"create\" task using config file create.yml. 24-Dec-20 15:05:07 - INFO - OK, configuration file has all required values (did not check for optional values). 24-Dec-20 15:05:07 - INFO - OK, CSV file input_data/metadata.csv found. 24-Dec-20 15:05:07 - INFO - OK, all 5 rows in the CSV file have the same number of columns as there are headers (5). 24-Dec-20 15:05:21 - INFO - OK, CSV column headers match Drupal field names. 24-Dec-20 15:05:21 - INFO - OK, required Drupal fields are present in the CSV file. 24-Dec-20 15:05:23 - INFO - OK, term IDs/names in CSV file exist in their respective taxonomies. 24-Dec-20 15:05:23 - INFO - OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies. 24-Dec-20 15:05:23 - INFO - OK, files named in the CSV \"file\" column are all present. 24-Dec-20 15:05:23 - INFO - Configuration checked for \"create\" task using config file create.yml, no problems found. It may also contain additional detail that would clutter up the console output, for example which term is being added to a vocabulary. Appending to vs. overwriting your log file By default, new entries are appended to this log, unless you indicate that the log file should be overwritten each time Workbench is run by providing the log_file_mode configuration option with a value of \"w\": log_file_mode: w Logging debugging information Workbench doesn't provide a way to set the amount of detail in its log, but several options are available that are useful for debugging and troubleshooting. These options, when set to true , write raw values used in the REST requests to Drupal: log_request_url : Logs the request URL and its method (GET, POST, etc.). log_json : Logs the raw JSON that Workbench uses in POST, PUT, and PATCH requests. log_headers : Logs the raw HTTP headers used in all requests. log_response_status_code : Logs the HTTP response code. log_response_body : Logs the raw HTTP response body. Another configuration setting that is useful during debugging is log_file_name_and_line_number , which, as the name suggests, adds to all log entries the filename and line number where the entry was generated. These options can be used independently of each other, but they are often more useful for debugging when used together. 
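For example, a configuration file that turns on several of these options at once, so that each logged request can be correlated with its JSON payload and response, might contain something like this (the combination shown is just an illustration):
log_request_url: true
log_json: true
log_response_status_code: true
log_response_body: true
log_file_name_and_line_number: true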
Warning Using these options, especially log_json and log_response_body , can add a lot of data to your log file.","title":"Logging"},{"location":"logging/#appending-to-vs-overwriting-your-log-file","text":"By default, new entries are appended to this log, unless you indicate that the log file should be overwritten each time Workbench is run by providing the log_file_mode configuration option with a value of \"w\": log_file_mode: w","title":"Appending to vs. overwriting your log file"},{"location":"logging/#logging-debugging-information","text":"Workbench doesn't provide a way to set the amount of detail in its log, but several options are available that are useful for debugging and troubleshooting. These options, when set to true , write raw values used in the REST requests to Drupal: log_request_url : Logs the request URL and its method (GET, POST, etc.). log_json : Logs the raw JSON that Workbench uses in POST, PUT, and PATCH requests. log_headers : Logs the raw HTTP headers used in all requests. log_response_status_code : Logs the HTTP response code. log_response_body : Logs the raw HTTP response body. Another configuration setting that is useful during debugging is log_file_name_and_line_number , which, as the name suggests, adds to all log entries the filename and line number where the entry was generated. These options can be used independently of each other, but they are often more useful for debugging when used together. Warning Using these options, especially log_json and log_response_body , can add a lot of data to your log file.","title":"Logging debugging information"},{"location":"media_track_files/","text":"Video and audio service file media can have accompanying track files. These files can be added in \"create\" tasks by including a column in your CSV file that contains the information Islandora Workbench needs to create these files. Note 1) This feature of Workbench only works with media track files that are stored in a file field on Audio and Video media. In the Starter Site, these fields are named field_track . If you store your track files in other fields on the Audio or Video media, you can identify those fields using the instructions below. 2) In order to create media track files as described below, you must configure Drupal in the following ways. First, the \"File\" media type's \"field_media_file\" field must be configured to accept files with the \".vtt\" extension. To do this, visit /admin/structure/media/manage/file/fields/media.file.field_media_file and add \"vtt\" to the list of allowed file extensions. 3) You must ensure that captions are enabled in the respective media display configurations. To do this for video, visit /admin/structure/media/manage/video/display and select \"Video with Captions\" in the \"Video file\" entry. For audio, visit /admin/structure/media/manage/audio/display and select \"Audio with Captions\" in the \"Audio file\" entry. Since track files are part of the media's service file, Workbench will only add track files to media that are tagged as \"Service File\" in your Workbench CSV using a \"media_use_tid\" column (or using the media_use_tid configuration setting). Tagging video and audio files as both \"Original File\" and \"Service File\" will also work. Running --check will tell you if your media_use_tid value (either in the media_use_tid configuration setting or in row-level values in your CSV) is compatible with creating media track files.
Workbench cannot add track files to service files generated by Islandora - it can only create track files that accompany service files it creates. The header for this CSV column looks like media:video:field_track - the string \"media\" followed by a colon, which is then followed by a media type (in a standard Islandora configuration, either \"video\" or \"audio\"), which is then followed by another colon and the machine name of the field on the media that holds the track file (in a standard Islandora configuration, this is \"field_track\" for both video and audio). If you have a custom setup and need to override these defaults, you can do so using the media_track_file_fields configuration setting: media_track_file_fields: audio: field_my_custom_track_file_field mycustommmedia: field_another_custom_track_file_field In this case, your column header would look like media:mycustommmedia:field_another_custom_track_file_field . You only need to use this configuration setting if you have a custom media type, or you have configured your Islandora so that audio or video uses a nonstandard track file field. Note that this setting replaces the default values of video: field_track and audio: field_track , so if you wanted to retain those values and add your own, your configuration file would need to contain something like this: media_track_file_fields: audio: field_my_custom_track_file_field video: field_track mycustommmediatype: field_another_custom_track_file_field In the track column in your input CSV, you specify, in the following order and separated by colons, the label for the track, its \"kind\" (one of \"subtitles\", \"descriptions\", \"metadata\", \"captions\", or \"chapters\"), the language code for the track (\"en\", \"fr\", \"es\", etc.), and the absolute path to the track file, which must have the extension \".vtt\" (the extension may be in upper or lower case). Here is an example CSV file that creates track files for accompanying video files: id,field_model,title,file,media:video:field_track 001,Video,First video,first.mp4,Transcript:subtitles:en:/path/to/video001/vtt/file.vtt 002,Image,An image,test.png, 003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt 004,Video,Third video,third.mp4, Warning Since a colon is used to separate the parts of the track data, you can't use a colon in the label. A label value like \"Transcript: The Making Of\" will invalidate the entire track data for that CSV row, and Workbench will skip creating it. If you need to have a colon in your track label, you will need to update the label value manually in the video or audio media's add/edit form. However, if you are running Workbench on Windows, you can use absolute file paths that contain colons, such as c:\\Users\\mark\\Documents\\my_vtt_file.vtt . You can mix CSV entries that contain track file information with those that do not (as with row 002 above, which is for an image), and also omit the track data for video and audio files that don't have an accompanying track file. If there is no value in the media track column, Workbench will not attempt to create a media track file. 
You can also add multiple track files for a single video or audio file in the same way you add multiple values in any field: id,field_model,title,file,media:video:field_track 003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt|Transcript:subtitles:en:/path/to/another/track_file.vtt Note If you add multiple track files to a single video or audio, only the first one listed in the CSV data for that entry will be configured as the \"Default\" track. If your CSV has entries for both audio and video files, you will need to include separate track columns, one for each media type: id,field_model,title,file,media:video:field_track,media:audio:field_track 001,Video,First video,first.mp4,Transcript:subtitles:en:/path/to/video001/vtt/file.vtt, 002,Image,An image,test.png,, 003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt, 004,Audio,An audio,,Transcript:subtitles:en:/path/to/audio004/vtt/file.vtt","title":"Creating media track files"},{"location":"media_types/","text":"Overriding Workbench's default extension to media type mappings Note Drupal's use of Media types (image, video, document, etc.) is distinct from Islandora's use of \"model\", which identifies an intellectual entity as an image, video, collection, compound object, newspaper, etc. By default, Workbench defines the following file extension to media type mapping: File extensions Media type png, gif, jpg, jpeg image pdf, doc, docx, ppt, pptx document tif, tiff, jp2, zip, tar file mp3, wav, aac audio mp4 video txt extracted_text If a file's extension is not defined in this default mapping, the media is assigned the \"file\" type. If you need to override this default mapping, you can do so in two ways: If the override applies to all files named in your CSV's file column, use the media_type configuration option (for example media_type: document ). Use this option if all of the files in your batch are to be assigned the same media type, but their extensions are not defined in the default mapping or you wish to override the default mapping. On a per file extension basis, via a mapping in the media_types_override option in your configuration file like this one: media_types_override: - video: ['mp4', 'ogg'] Use the media_types_override option if each of the files named in your CSV's file column is to be assigned an extension-specific media type, and their extensions are not defined in the default mapping (or add to the extensions in the default mapping, as in this example). Note that: If a file's extension is not present in the default mapping or in the media_types_override custom mapping, the media is assigned the \"file\" type. If you use the media_types_override configuration option, your mapping replaces Workbench's default mappings for the specified media type. This means that if you want to retain the default media type mapping for a file extension, you need to include it in the mapping, as illustrated by the presence of \"mp4\" in the example above. If both media_type and media_types_override are included in the config file, the mapping in media_types_override is ignored and the media type assigned in media_type is used. Overriding Workbench's default MIME type to file extension mappings For remote files, in other words files that start with http or https , Workbench relies on the MIME type provided by the remote web server to determine the extension of the temporary file that it writes locally.
If you are getting errors indicating that a file extension is not registered for a given media type, and you suspect that the extensions are wrong, you can include the mimetype_extensions setting in your config file to tell Workbench which extensions to use for a given MIME type. Here is a (hypothetical) example that tells Workbench to assign the '.foo' extension to files with the MIME type 'image/jp2' and the extension '.bar' to files with the MIME type 'image/jpeg': mimetype_extensions: 'image/jp2': '.foo' 'image/jpeg': '.bar' Overriding Workbench's default file extension to MIME type mappings It may be necessary to assign a media a MIME type that is different from the MIME type that Drupal ordinarily derives from a given extension. The best example of this is that media for hOCR files need to have the MIME type text/vnd.hocr+html . Without explicitly indicating that MIME type, Drupal will assign the media for an .hocr file, on creation, the catch-all MIME type application/octet-stream . Workbench automatically assigns files ending in .hocr the correct MIME type. But, if you want to override that for some reason, or want to tell Workbench to create a media with a specific MIME type from a file with a specific extension, you can add to your configuration file an extension-to-MIME-type mapping like this (the leading . in the extension on the left is optional): extensions_to_mimetypes: 'mbox': 'application/mbox' Assigning a media type by Media Use URI We typically assign a media type that corresponds with the asset's MIME type. However, there are instances where two media types share the same MIME type. A notable example is FITS XML: while assets with the MIME type application/xml are usually assigned a file media type, the fits_technical_metadata designation is more suitable in this case. To assign a specific media type to an asset based on its intended use, use media_type_by_media_use . Below is an example demonstrating how to assign a FITS-tagged XML file to the fits_technical_metadata media type . media_type_by_media_use: - https://projects.iq.harvard.edu/fits: fits_technical_metadata Configuring a custom media type Islandora ships with a set of default media types, including audio, document, extracted text, file, FITS technical metadata, image, and video. If you want to add your own custom media type, you need to tell Workbench two things: which file extension(s) should map to the new media type, and which field on the new media type is used to store the file associated with the media. To satisfy the first requirement, use the media_type or media_types_override option as described above. To satisfy the second requirement, use Workbench's media_type_file_fields option. The values in the media_type_file_fields option are the machine name of the media type and the machine name of the \"File\" field configured for that media. To determine the machine name of your media type, go to the field configuration of your media types (Admin > Structure > Media types), choose your custom media type, then choose the \"Manage fields\" operation for the media type. The URL of the Drupal page you are now at should look like /admin/structure/media/manage/my_custom_media/fields . The machine name of the media is in the second-last position in the URL. In this example, it's my_custom_media . In the list of fields, look for the one that says \"File\" in the \"Field type\" column; the field machine name you want is in that row's \"Machine name\" column.
Here's an example that tells Workbench that the custom media type \"Custom media\" uses the \"field_media_file\" field: media_type_file_fields: - my_custom_media: field_media_file Put together, the two configuration options would look like this: media_types_override: - my_custom_media: ['cus'] media_type_file_fields: - my_custom_media: field_media_file In this example, your Workbench job is creating media of varying types (for example, images, videos, and documents, all using the default extension-to-media type mappings). If all the files you are adding in the Workbench job have the same media type (in the following example, your \"my_custom_media\" type), you could use this configuration: media_type: my_custom_media media_type_file_fields: - my_custom_media: field_media_file","title":"Configuring media types"},{"location":"media_types/#overriding-workbenchs-default-extension-to-media-type-mappings","text":"Note Drupal's use of Media types (image, video, document, etc.) is distinct from Islandora's use of \"model\", which identifies an intellectual entity as an image, video, collection, compound object, newspaper, etc. By default, Workbench defines the following file extension to media type mapping: File extensions Media type png, gif, jpg, jpeg image pdf, doc, docx, ppt, pptx document tif, tiff, jp2, zip, tar file mp3, wav, aac audio mp4 video txt extracted_text If a file's extension is not defined in this default mapping, the media is assigned the \"file\" type. If you need to override this default mapping, you can do so in two ways: If the override applies to all files named in your CSV's file column, use the media_type configuration option (for example media_type: document ). Use this option if all of the files in your batch are to be assigned the same media type, but their extensions are not defined in the default mapping or you wish to override the default mapping. On a per file extension basis, via a mapping in the media_types_override option in your configuration file like this one: media_types_override: - video: ['mp4', 'ogg'] Use the media_types_override option if each of the files named in your CSV's file column is to be assigned an extension-specific media type, and their extensions are not defined in the default mapping (or add to the extensions in the default mapping, as in this example). Note that: If a file's extension is not present in the default mapping or in the media_types_override custom mapping, the media is assigned the \"file\" type. If you use the media_types_override configuration option, your mapping replaces Workbench's default mappings for the specified media type. This means that if you want to retain the default media type mapping for a file extension, you need to include it in the mapping, as illustrated by the presence of \"mp4\" in the example above. If both media_type and media_types_override are included in the config file, the mapping in media_types_override is ignored and the media type assigned in media_type is used.","title":"Overriding Workbench's default extension to media type mappings"},{"location":"media_types/#overriding-workbenchs-default-mime-type-to-file-extension-mappings","text":"For remote files, in other words files that start with http or https , Workbench relies on the MIME type provided by the remote web server to determine the extension of the temporary file that it writes locally.
If you are getting errors indicating that a file extension is not registered for a given media type, and you suspect that the extensions are wrong, you can include the mimetype_extensions setting in your config file to tell Workbench which extensions to use for a given MIME type. Here is a (hypothetical) example that tells Workbench to assign the '.foo' extension to files with the MIME type 'image/jp2' and the extension '.bar' to files with the MIME type 'image/jpeg': mimetype_extensions: 'image/jp2': '.foo' 'image/jpeg': '.bar'","title":"Overriding Workbench's default MIME type to file extension mappings"},{"location":"media_types/#overriding-workbenchs-default-file-extension-to-mime-type-mappings","text":"It may be necessary to assign a media a MIME type that is different from the MIME type that Drupal ordinarily derives from a given extension. The best example of this is that media for hOCR files need to have the MIME type text/vnd.hocr+html . Without explicitly indicating that MIME type, Drupal will assign the media for an .hocr file, on creation, the catch-all MIME type application/octet-stream . Workbench automatically assigns files ending in .hocr the correct MIME type. But, if you want to override that for some reason, or want to tell Workbench to create a media with a specific MIME type from a file with a specific extension, you can add to your configuration file a an extension-to-MIME-type mapping like this (the leading . in the extension on the left is optional): extensions_to_mimetypes: 'mbox': 'application/mbox'","title":"Overriding Workbench's default file extension to MIME type mappings"},{"location":"media_types/#assigning-a-media-type-by-media-use-uri","text":"We typically assign a media type that corresponds with the asset's MIME type. However, there are instances where two media types share the same MIME type. A notable example is FITS XML: while assets with the MIME type application/xml are usually assigned a file media type, the fits_technical_metadata designation is more suitable in this case. To assign a specific media to an asset type based on its intended use, use media_type_by_media_use . Below is an example demonstrating how to assign a FITS-tagged XML file to the fits_technical_metadata media type . media_type_by_media_use: - https://projects.iq.harvard.edu/fits: fits_technical_metadata","title":"Assigning a media type by Media Use URI"},{"location":"media_types/#configuring-a-custom-media-type","text":"Islandora ships with a set of default media types, including audio, document, extracted text, file, FITS technical metadata, image, and video. If you want to add your own custom media type, you need to tell Workbench two things: which file extension(s) should map to the new media type, and which field on the new media type is used to store the file associated with the media. To satisfy the first requirement, use the media_type or media_types_override option as described above. To satisfy the second requirement, use Workbench's media_type_file_fields option. The values in the media_type_file_fields option are the machine name of the media type and the machine name of the \"File\" field configured for that media. To determine the machine name of your media type, go to the field configuration of your media types (Admin > Structure > Media types) choose your custom media type choose the \"Manage fields\" operation for the media type. The URL of the Drupal page you are now at should look like /admin/structure/media/manage/my_custom_media/fields . 
The machine name of the media is in the second-last position in the URL. In this example, it's my_custom_media . in the list of fields, look for the one that says \"File\" in the \"Field type\" column the field machine name you want is in that row's \"Machine name\" column. Here's an example that tells Workbench that the custom media type \"Custom media\" uses the \"field_media_file\" field: media_type_file_fields: - my_custom_media: field_media_file Put together, the two configuration options would look like this: media_types_override: - my_custom_media: ['cus'] media_type_file_fields: - my_custom_media: field_media_file In this example, your Workbench job is creating media of varying types (for example, images, videos, and documents, all using the default extension-to-media type mappings. If all the files you are adding in the Workbench job all have the same media type (in the following example, your \"my_custom_media\" type), you could use this configuration: media_type: my_custom_media media_type_file_fields: - my_custom_media: field_media_file","title":"Configuring a custom media type"},{"location":"nodes_only/","text":"During a create task, if you want to create nodes but not any accompanying media, for example if you are testing your metadata values or creating collection nodes, you can include the nodes_only: true option in your configuration file: task: create host: \"http://localhost:8000\" username: admin password: islandora nodes_only: true If this is present, Islandora Workbench will only create nodes and will skip all media creation. During --check , it will ignore anything in your CSV's files field (in fact, your CSV doesn't even need a file column). If nodes_only is true , your configuration file for the create task doesn't need a media_use_tid , drupal_filesystem , or media_type / media_types_override option.","title":"Creating nodes without media"},{"location":"paged_and_compound/","text":"Islandora Workbench provides three ways to create paged and compound content: using a subdirectory structure to define the relationship between the parent item and its children using page-level metadata in the CSV to establish that relationship using a secondary task. Using subdirectories Note Information in this section applies to all compound content, not just \"paged content\". That term is used here since the most common use of this method will be for creating paged content. In other words, where \"page\" is used below, it can be substituted with \"child\". Enable this method by including paged_content_from_directories: true in your configuration file. Use this method when you are creating books, newspaper issues, or other paged content where your pages don't have their own metadata. CSV and directory structure This method groups page-level files into subdirectories that correspond to each parent, and does not require (or allow) page-level metadata in the CSV file. Only the parent (book, newspaper issue, etc.) has a row on the CSV file, e.g.: id,title,field_model book1,How to Use Islandora Workbench like a Pro,Paged Content book2,Using Islandora Workbench for Fun and Profit,Paged Content Note Unlike every other Islandora Workbench \"create\" configuration, the metadata CSV should not contain a file column (however, you can include a directory column as described below). This means that content created using this method cannot be created using the same CSV file as other content. 
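Before getting into the CSV and directory layout in detail, here is a minimal configuration sketch for this method (the host, credentials, and input directory are placeholders to adjust for your own site; the paged_content_page_model_tid value is discussed below and can also be a numeric term ID): task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: books paged_content_from_directories: true paged_content_page_model_tid: http://id.loc.gov/ontologies/bibframe/part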
Each parent's pages are located in a subdirectory of the input directory that is named by default to match the value of the id field of the parent item they are pages of: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv If you don't want to use your id column to name the directory that stores pages, you can include a directory column in your input CSV and add the page_files_source_dir_field: directory setting to your config file. The values in the directory column can then contain the names of the page directories. If you do that, your CSV would look like this: id,title,field_model,directory sfu_book_1,How to Use Islandora Workbench like a Pro,Paged Content,book1 sfu_book_2,Using Islandora Workbench for Fun and Profit,Paged Content,book2 Filename conventions The page filenames have significance. The sequence of the page is determined by the last segment of each filename before the extension, and is separated from the rest of the filename by a dash ( - ), although you can use another character by setting the paged_content_sequence_separator option in your configuration file. These sequence indicators are essentially physical page numbers, starting at \"1\" (not \"0\"). For example, using the filenames for \"book1\" above, the sequence of \"page-001.jpg\" is \"001\". Dashes (or whatever your separator character is) can exist elsewhere in filenames, since Workbench will always use the string after the last dash as the sequence number; for example, the sequence of \"isbn-1843341778-001.jpg\" for \"book2\" is also \"001\". Workbench takes this sequence number, strips all leading zeros, and uses it to populate the field_weight in the page nodes, so \"001\" becomes a weight value of 1, \"002\" becomes a weight value of 2, and so on. Important things to note when using this method: To use this method of creating paged content, you must include paged_content_page_model_tid in your configuration file and set it to your Islandora's term ID for the \"Page\" term in the Islandora Models vocabulary (or to http://id.loc.gov/ontologies/bibframe/part ). The Islandora model of the parent is not set automatically. You need to include a field_model value for each item in your CSV file, commonly \"Paged content\" or \"Publication issue\". You can apply CSV value templates to paged/child items using values from their respective parents. See the \" CSV value templates \" documentation for more information. You should also include a field_display_hints column in your CSV. This value is applied to the parent nodes and also the page nodes, unless the paged_content_page_display_hints setting is present in your configuration file. However, if you normally don't set the \"Display hints\" field in your objects but use a Context to determine how objects display, you should not include a field_display_hints column in your CSV file. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/paged item relationships will still work.
The Drupal content type for page nodes is inherited from the parent, unless you specify a different content type in the paged_content_page_content_type setting in your configuration file. If your page directories contain files other than page images, you need to include the paged_content_image_file_extension setting in your configuration. Otherwise, Workbench can't tell which files to create pages from. If you don't want to use your id column to name the directories that contain each item's pages, you can include page_files_source_dir_field: directory to your config file and add a directory column to your input CSV to name the directories. Applying field data to pages/children created from subdirectories Titles for pages/children created from subdirectories are generated automatically using the pattern parent_title + , page + sequence_number , where \"parent title\" is inherited from the page's parent node and \"sequence number\" is the page's sequence. For example, if a page's parent has the title \"How to Write a Book\" and its sequence number is 450, its automatically generated title will be \"How to Write a Book, page 450\". You can override this pattern by including the page_title_template setting in your configuration file. The value of this setting is a simple string template. The default, which generates the page title pattern described above, is '$parent_title, page $weight' . There are only two variables you can include in the template, $parent_title and $weight , although you do not need to include either one if you don't want that information appearing in your page titles. The Islandora Model applied to all page/child nodes is the one defined in the paged_content_page_model_tid configuration setting. This model is automatically applied to all pages/children created from subdirectories. Fields on pages/children that are configured as required in the parent and page content type are automatically inherited from the parent. No special configuration is necessary. You can add additional (non-required field) metadata to pages/children using CSV value templates during the create task that creates the pages/children from subdirectories. Ingesting pages, their parents, and their \"grandparents\" using a single CSV file In the \"books\" example above, each row in the CSV (i.e., book1, book2) describes a node with the \"Paged Content\" Islandora model; each of the books is the direct parent of the individual page nodes. However, in some cases, you may want to create the pages, their direct parents (each book), and a parent of the parents (let's call it a \"grandparent\" of the pages) at the same time, using the same Workbench job and the same input CSV. Some common use cases for this ability are: creating a node describing a periodical, some nodes describing issues of the periodical, and the pages of each issue, and creating a node describing a book series, a set of nodes describing books in the series, and page nodes for each book. paged_content_from_directories: true in your config file tells Workbench to look in a directory containing page files for each row in your input CSV. If you want to include the pages, the immediate parent of the pages, and the grandparent of the pages in the same CSV, you can create an empty directory for the grandparent node, named after its id value like the other items in your CSV. 
In addition, and importantly, you also need to include a parent_id column in your CSV file to define the relationship between the grandparent and its direct children (in our example, the book nodes). The presence of the parent_id column does not have impact on the parent-child relationship between the books and their pages; that relationship is created automatically, like it is in the \"books\" example above. To illustrate this, let's extend the \"books\" example above to include a higher-level (grandparent to the pages) node that describes the series of books used in that example. Here is the CSV with the new top-level item, and with the addition of the parent_id column to indicate that the paged content items are children of the new \"book000\" node: id,parent_id,title,field_model book000,,How-to Books: A Best-Selling Genre of Books,Compound Object book1,book000,How to Use Islandora Workbench like a Pro,Paged Content book2,book000,Using Islandora Workbench for Fun and Profit,Paged Content The directory structure looks like this (note that the book000 directory should be empty since it doesn't have any pages as direct children): books/ \u251c\u2500\u2500 book000 \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Workbench will warn you that the book000 directory is empty, but that's OK - it will look for, but not find, any pages for that item. The node corresponding to that directory will be created as expected, and values in the parent_id column will ensure that the intended hierarchical relationship between \"book000\" and its child items (the book nodes) is created. Ingesting OCR (and other) files with page images You can tell Workbench to add OCR and other media related to page images when using the \"Using subdirectories\" method of creating paged content. To do this, add the OCR files to your subdirectories, using the base filenames of each page image plus an extension like .txt : books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 isbn-1843341778-001.txt \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Then, add the following settings to your configuration file: paged_content_from_directories: true (as described above) paged_content_page_model_tid (as described above) paged_content_image_file_extension : this is the file extension, without the leading . , of the page images, for example tif , jpg , etc. paged_content_additional_page_media : this is a list of mappings from Media Use term IDs or URIs to the file extensions of the OCR or other files you are ingesting. See the example below. 
An example configuration is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data/paged_content_example standalone_media_url: true paged_content_from_directories: true paged_content_page_model_tid: http://id.loc.gov/ontologies/bibframe/part paged_content_image_file_extension: jpg paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt You can add multiple additional files (for example, OCR and hOCR) if you provide a Media Use term-to-file-extension mapping for each type of file: paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt - https://discoverygarden.ca/use#hocr: hocr You can also use your Drupal's numeric Media Use term IDs in the mappings, like: paged_content_additional_page_media: - 354: txt - 429: hocr Note Using hOCR media for Islandora paged content nodes may not be configured on your Islandora repository; hOCR and the corresponding URI are used here as an example only. In this case, Workbench looks for files with the extensions txt and hocr and creates media for them with respective mapped Media Use terms. The paged content input directory would look like this: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-001.hocr \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-002.hocr \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u251c\u2500\u2500 page-003.hocr \u2502 \u2514\u2500\u2500 page-003.jpg Warning It is important to temporarily disable actions in Contexts that generate media/derivatives that would conflict with additional media you are adding using the method described here. For example, if you are adding OCR files, in the \"Page Derivatives\" Context listed at /admin/structure/context , disable the \"Extract text from PDF or image\" action prior to running Workbench, and be sure to re-enable it afterwards. If you do not do this, the OCR media added by Workbench will get overwritten with the one that Islandora generates using the \"Extract text from PDF or image\" action. Ignoring files in page directories Sometimes files such as \"Thumbs.db\" (on Windows) can creep into page directories. You can tell Workbench to ignore specific files within directories by including the paged_content_ignore_files configuration setting in your config file. Note that the default setting is to ignore \"Thumbs.db\" files. If you want to add additional files, or override that default setting, include the paged_content_ignore_files setting followed by a list of filenames, e.g.: paged_content_ignore_files: [\"Thumbs.db\", \"scanning_manifest.txt\"] Note that Workbench converts all filenames in the directories and filenames listed in the paged_content_ignore_files setting to lower case before checking to see if they are in this list. For example, if Workbench encounters a filename Scanning_Manifest.TXT , it will match \"scanning_manifest.txt\" in the configuration above. With page/child-level metadata Using this method, the metadata CSV file contains a row for every item, both parents and children. You should use this method when you are creating books, newspaper issues, or other paged or compound content where each page has its own metadata, or when you are creating compound objects of any Islandora model. The file for each page/child is named explicitly in the page/child's file column rather than being in a subdirectory.
To link the pages to the parent, Workbench establishes parent/child relationships between items with a special parent_id CSV column. Values in the parent_id column, which only apply to rows describing pages/children, are the id value of their parent. For this to work, your CSV file must contain a parent_id field plus the standard Islandora fields field_weight , field_member_of , and field_model (the role of these last three fields will be explained below). The id field is required in all CSV files used to create content, so in this case, your CSV needs both an id field and a parent_id field. The following example illustrates how this works. Here is the raw CSV data: id,parent_id,field_weight,file,title,field_description,field_model,field_member_of 001,,,,Postcard 1,The first postcard,28,197 003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29, 004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29, 002,,,,Postcard 2,The second postcard,28,197 006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29, 007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29, The empty cells make this CSV difficult to read. Here is the same data in a spreadsheet: The data contains rows for two postcards (rows with id values \"001\" and \"002\") plus a back and front for each (the remaining four rows). The parent_id value for items with id values \"003\" and \"004\" is the same as the id value for item \"001\", which will tell Workbench to make both of those items children of item \"001\"; the parent_id value for items with id values \"006\" and \"007\" is the same as the id value for item \"002\", which will tell Workbench to make both of those items children of the item \"002\". We can't populate field_member_of for the child pages in our CSV because we won't have node IDs for the parents until they are created as part of the same batch as the children. In this example, the rows for our postcard objects have empty parent_id , field_weight , and file columns because our postcards are not children of other nodes and don't have their own media. (However, the records for our postcard objects do have a value in field_member_of , which is the node ID of the \"Postcards\" collection that already/hypothetically exists.) Rows for the postcard front and back image objects have a value in their field_weight field, and they have values in their file column because we are creating objects that contain image media. Importantly, they have no value in their field_member_of field because the node ID of the parent isn't known when you create your CSV; instead, Islandora Workbench assigns each child's field_member_of dynamically, just after its parent node is created. Some important things to note: The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. Currently, you need to include the option allow_missing_files: true in your configuration file when using this method to create paged/compound content. See this issue for more information. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/child relationships will still work. The values of the id and parent_id columns do not have to follow any sequential pattern. 
Islandora Workbench treats them as simple strings and matches them on that basis, without looking for sequential relationships of any kind between the two fields. The CSV records for children items don't need to come immediately after the record for their parent, but they do need to come after that CSV record. ( --check will tell you if it finds any child rows that come before their parent rows.) This is because Workbench creates nodes in the order their records are in the CSV file (top to bottom). As long as the parent node has already been created when a child node is created, the parent/child relationship via the child's field_member_of will be correct. See the next paragraph for some suggestions on planning for large ingests of paged or compound items. Currently, you must include values in the children's field_weight column (except when creating a collection and its members at the same time; see below). It may be possible to automatically generate values for this field (see this issue ). Currently, Islandora model values (e.g. \"Paged Content\", \"Page\") are not automatically assigned. You must include the correct \"Islandora Models\" taxonomy term IDs in your field_model column for all parent and child records, as you would for any other Islandora objects you are creating. Like for field_weight , it may be possible to automatically generate values for this field (see this issue ). Since parent items (collections, book-level items, newspaper issue-level items, top-level items in compound items, etc.) need to exist in Drupal before their children can be ingested, you need to plan your \"create\" tasks accordingly. For example: If you want to use a single \"create\" task to ingest all the parents and children at the same time, for each compound item, the parent CSV record must come before the records for the children/pages. If you would rather use multiple \"create\" tasks, you can create all your collections first, then, in subsequent \"create\" tasks, use their respective node IDs in the field_member_of CSV column for their members. If you use a separate \"create\" task to create members of a single collection, you can define the value of field_member_of in a CSV field template . If you are ingesting a large set of books, you can ingest the book-level items first, then use their node IDs in a separate CSV for the pages of all books (each using their parent book node's node ID in their field_member_of column). Or, you could run a separate \"create\" task for each book, and use a CSV field template containing a field_member_of entry containing the book item's node ID. For newspapers, you could create the top-level newspaper first, then use its node ID in a subsequent \"create\" task for both newspaper issues and pages. In this task, the field_member_of column in rows for newspaper issues would contain the newspaper's node ID, but the rows for newspaper pages would have a blank field_member_of and a parent_id using the parent issue's id value. Using a secondary task You can configure Islandora Workbench to execute two \"create\" tasks - a primary and a secondary - that will result in all of the objects described in both CSV files being ingested during the same Workbench job. Parent/child relationships between items are created by referencing the row IDs in the primary task's CSV file from the secondary task's CSV file. 
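To illustrate that newspaper scenario, here is a hypothetical sketch of the CSV for the second \"create\" task, assuming the newspaper node already exists with node ID 250 and that 22 and 23 are your Islandora Models term IDs for \"Publication Issue\" and \"Page\" (the IDs, titles, and filenames are illustrative only): id,parent_id,field_weight,file,title,field_model,field_member_of issue_001,,,,Daily Gazette volume 1 number 1,22,250 page_001,issue_001,1,gazette-v1n1-001.tif,Daily Gazette volume 1 number 1 page 1,23, page_002,issue_001,2,gazette-v1n1-002.tif,Daily Gazette volume 1 number 1 page 2,23,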
The benefit of using this method is that each task has its own configuration file, allowing you to create children that have a different Drupal content type than their parents. The primary task's CSV describes the parent objects, and the secondary task's CSV describes the children. The two are linked via references from children CSV's parent_id values to their parent's id values, much the same way as in the \"With page/child-level metadata\" method described above. The difference is that the references span CSV files. The parents and children each have their own CSV input file (and also their own configuration file). Each task is a standard Islandora Workbench \"create\" task, joined by one setting in the primary's configuration file, secondary_tasks , as described below. In the following example, the top CSV file (the primary) describes the parents, and the bottom CSV file (the secondary) describes the children: As you can see, values in the parent_id column in the secondary CSV reference values in the id column in the primary CSV: parent_id 001 in the secondary CSV matches id 001 in the primary, parent_id 003 in the secondary matches id 003 in the primary, and so on. You configure secondary tasks by adding the secondary_tasks setting to your primary configuration file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora # This is the setting that links the two configuration files together. secondary_tasks: ['children.yml'] input_csv: parents.csv nodes_only: true In the secondary_tasks setting, you name the configuration file of the secondary task. The secondary task's configuration file (in this example, named \"children.yml\") contains no indication that it's a secondary task: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: kids.csv csv_field_templates: - field_model: http://purl.org/coar/resource_type/c_c513 query_csv_id_to_node_id_map_for_parents: true Note The CSV ID to node ID map is required in secondary create tasks. Workbench will automatically change the query_csv_id_to_node_id_map_for_parents to true , regardless of whether that setting is in your secondary task's config file. Note The nodes_only setting in the above example primary configuration file and the csv_field_templates setting in the secondary configuration file are not relevant to the primary/secondary task functionality; they're included to illustrate that the two configuration files can differ. When you run Workbench, it executes the primary task first, then the secondary task. Workbench keeps track of pairs of id + node IDs created in the primary task, and during the execution of the secondary task, uses these to populate the field_member_of values in the secondary task with the node IDs corresponding to the referenced primary id values. Some things to note about secondary tasks: Only \"create\" tasks can be used as the primary and secondary tasks. When you have a secondary task configured, running --check will validate both tasks' configuration and input data. The secondary CSV must contain parent_id and field_member_of columns. field_member_of must be empty, since it is auto-populated by Workbench using node IDs from the newly created parent objects. If you want to assign an order to the child objects within each parent object, include field_weight with the appropriate values (1, 2, 3, etc., the lower numbers being earlier/higher in sort order). 
If a row in the secondary task CSV does not have a parent_id that matches an id of a row in the primary CSV, or if there is a matching row in the primary CSV and Workbench failed to create the described node, Workbench will skip creating the child and add an entry to the log indicating it did so. As already stated, each task has its own configuration file, which means that you can specify a content_type value in your secondary configuration file that differs from the content_type of the primary task. You can include more than one secondary task in your configuration. For example, secondary_tasks: ['first.yml', 'second.yml'] will execute the primary task, then the \"first.yml\" secondary task, then the \"second.yml\" secondary task in that order. You would use multiple secondary tasks if you wanted to add children of different content types to the parent nodes. Specifying paths to the python interpreter and to the workbench script When using secondary tasks, there are a couple of situations where you may need to tell Workbench where the python interpreter is located, and where the \"workbench\" script is located. The first is when you use a secondary task within a scheduled job (such as running Workbench via Linux's cron). Depending on how you configure the cron job, you will likely need to tell Workbench what the absolute path to the python interpreter is and what the path to the workbench script is. This is because, unless your cronjob changes into Workbench's working directory, Workbench will be looking in the wrong directory for the secondary task. The two config options you should use are: path_to_python path_to_workbench_script An example of using these settings is: secondary_tasks: ['children.yml'] path_to_python: '/usr/bin/python' path_to_workbench_script: '/home/mark/islandora_workbench/workbench' The second situation is when using a secondary task when running Workbench in Windows and \"python.exe\" is not in the PATH of the user running the scheduled job. Specifying the absolute path to \"python.exe\" will ensure that Workbench can execute the secondary task properly, like this: secondary_tasks: ['children.yml'] path_to_python: 'c:/program files/python39/python.exe' path_to_workbench_script: 'd:/users/mark/islandora_workbench/workbench' Creating parent/child relationships across Workbench sessions It is possible to use parent_id values in your CSV that refer to id values from earlier Workbench sessions. In other words, you don't need to create parents and their member/child nodes within the same Workbench job; you can create parents in an earlier job and refer to their id values in later jobs. This is possible because during create tasks, Workbench records each newly created node ID and its corresponding value from the input CSV's id (or configured equivalent) column. It also records any values from the CSV parent_id column, if they exist. This data is stored in a simple SQLite database called the \" CSV ID to node ID map \". Because this database persists across Workbench sessions, you can use id values in your input CSV's parent_id column from previously loaded CSV files. The mapping between the previously loaded parents' id values and the values in your current CSV's parent_id column are stored in the CSV ID to node ID map database. Note It is important to use unique values in your CSV id (or configured equivalent) column, since if duplicate ID values exist in this database, Workbench can't know which corresponding node ID to use. 
In this case, Workbench will create the child node, but it won't assign a parent to it. --check will inform you if this happens with messages like Warning: Query of ID map for parent ID \"0002\" returned multiple node IDs: (771, 772, 773, 774, 778, 779). , and your Workbench log will also document that there are duplicate IDs. Warning By default, Workbench only checks the CSV ID to node ID map for parent IDs created in the same session as the children. If you want to assign children to parents created in previous Workbench sessions, you need to set the query_csv_id_to_node_id_map_for_parents configuration setting to true . Creating collections and members together Using a variation of the \"With page/child-level metadata\" approach, you can create a collection node and assign members to it at the same time (i.e., in a single Workbench job). Here is a simple example CSV which shows the references from the members' parent_id field to the collections' id field: id,parent_id,file,title,field_model,field_member_of,field_weight 1,,,A collection of animal photos,24,, 2,1,cat.jpg,Picture of a cat,25,, 3,1,dog.jpg,Picture of a dog,25,, 4,1,horse.jpg,Picture of a horse,25,, The use of the parent_id and field_member_of fields is the same here as when creating paged or compound children. However, unlike with paged or compound objects, in this case we leave the values in field_weight empty, since Islandora collections don't use field_weight to determine order of members. Collection Views are sorted using other fields. Warning Creating collection nodes and member nodes using this method assumes that collection nodes and member nodes have the same Drupal content type. If your collection objects have a Drupal content type that differs from their members' content type, you need to use the \"Using a secondary task\" method to ingest collections and members in the same Workbench job. Summary The following table summarizes the different ways Workbench can be used to create parent/child relationships between nodes: Method Relationships created by field_weight Advantage Subdirectories Directory structure Do not include column in CSV; autopopulated. Useful for creating paged content where pages don't have their own metadata. Parent/child-level metadata in same CSV References from child's parent_id to parent's id in same CSV data Column required; values required in child rows Allows including parent and child metadata in same CSV. Secondary task References from parent_id in child CSV file to id in parent CSV file Column and values recommended in secondary (child) CSV data Primary and secondary tasks have their own configuration and CSV files, which allows children to have a Drupal content type that differs from their parents' content type. Allows creation of parents and children in same Workbench job. Collections and members together References from child (member) parent_id fields to parent (collection) id fields in same CSV data Column required in CSV but must be empty (collections do not use weight to determine sort order) Allows creation of collection and members in same Islandora Workbench job.","title":"Creating paged, compound, and collection content"},{"location":"paged_and_compound/#using-subdirectories","text":"Note Information in this section applies to all compound content, not just \"paged content\". That term is used here since the most common use of this method will be for creating paged content. In other words, where \"page\" is used below, it can be substituted with \"child\".
Enable this method by including paged_content_from_directories: true in your configuration file. Use this method when you are creating books, newspaper issues, or other paged content where your pages don't have their own metadata.","title":"Using subdirectories"},{"location":"paged_and_compound/#csv-and-directory-structure","text":"This method groups page-level files into subdirectories that correspond to each parent, and does not require (or allow) page-level metadata in the CSV file. Only the parent (book, newspaper issue, etc.) has a row on the CSV file, e.g.: id,title,field_model book1,How to Use Islandora Workbench like a Pro,Paged Content book2,Using Islandora Workbench for Fun and Profit,Paged Content Note Unlike every other Islandora Workbench \"create\" configuration, the metadata CSV should not contain a file column (however, you can include a directory column as described below). This means that content created using this method cannot be created using the same CSV file as other content. Each parent's pages are located in a subdirectory of the input directory that is named by default to match the value of the id field of the parent item they are pages of: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv If you don't want to use your id column to name the directory that stores pages, you can include a directory column in your input CSV and add the page_files_source_dir_field: directory setting to your config file. The values in the directory column can then contain the names of the page directories. If you do that, your CSV would look like this: id,title,field_model,directory sfu_book_1,How to Use Islandora Workbench like a Pro,Paged Content,book1 sfu_book_2,Using Islandora Workbench for Fun and Profit,Paged Content,book2","title":"CSV and directory structure"},{"location":"paged_and_compound/#filename-conventions","text":"The page filenames have significance. The sequence of the page is determined by the last segment of each filename before the extension, and is separated from the rest of the filename by a dash ( - ), although you can use another character by setting the paged_content_sequence_separator option in your configuration file. These sequence indicators are essentially physical page numbers, starting a \"1\" (not \"0\"). For example, using the filenames for \"book1\" above, the sequence of \"page-001.jpg\" is \"001\". Dashes (or whatever your separator character is) can exist elsewhere in filenames, since Workbench will always use the string after the last dash as the sequence number; for example, the sequence of \"isbn-1843341778-001.jpg\" for \"book2\" is also \"001\". Workbench takes this sequence number, strips all leading zeros, and uses it to populate the field_weight in the page nodes, so \"001\" becomes a weight value of 1, \"002\" becomes a weight value of 2, and so on. Important things to note when using this method: To use this method of creating paged content, you must include paged_content_page_model_tid in your configuration file and set it to your Islandora's term ID for the \"Page\" term in the Islandora Models vocabulary (or to http://id.loc.gov/ontologies/bibframe/part ). The Islandora model of the parent is not set automatically. 
You need to include a field_model value for each item in your CSV file, commonly \"Paged content\" or \"Publication issue\". You can apply CSV value templates to paged/child items using values from their respective parents. See the \" CSV value templates \" documentation for more information. You should also include a field_display_hints column in your CSV. This value is applied to the parent nodes and also the page nodes, unless the paged_content_page_display_hints setting is present in you configuration file. However, if you normally don't set the \"Display hints\" field in your objects but use a Context to determine how objects display, you should not include a field_display_hints column in your CSV file. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/paged item relationships will still work. The Drupal content type for page nodes is inherited from the parent, unless you specify a different content type in the paged_content_page_content_type setting in your configuration file. If your page directories contain files other than page images, you need to include the paged_content_image_file_extension setting in your configuration. Otherwise, Workbench can't tell which files to create pages from. If you don't want to use your id column to name the directories that contain each item's pages, you can include page_files_source_dir_field: directory to your config file and add a directory column to your input CSV to name the directories.","title":"Filename conventions"},{"location":"paged_and_compound/#applying-field-data-to-pageschildren-created-from-subdirectories","text":"Titles for pages/children created from subdirectories are generated automatically using the pattern parent_title + , page + sequence_number , where \"parent title\" is inherited from the page's parent node and \"sequence number\" is the page's sequence. For example, if a page's parent has the title \"How to Write a Book\" and its sequence number is 450, its automatically generated title will be \"How to Write a Book, page 450\". You can override this pattern by including the page_title_template setting in your configuration file. The value of this setting is a simple string template. The default, which generates the page title pattern described above, is '$parent_title, page $weight' . There are only two variables you can include in the template, $parent_title and $weight , although you do not need to include either one if you don't want that information appearing in your page titles. The Islandora Model applied to all page/child nodes is the one defined in the paged_content_page_model_tid configuration setting. This model is automatically applied to all pages/children created from subdirectories. Fields on pages/children that are configured as required in the parent and page content type are automatically inherited from the parent. No special configuration is necessary. 
You can add additional (non-required field) metadata to pages/children using CSV value templates during the create task that creates the pages/children from subdirectories.","title":"Applying field data to pages/children created from subdirectories"},{"location":"paged_and_compound/#ingesting-pages-their-parents-and-their-grandparents-using-a-single-csv-file","text":"In the \"books\" example above, each row in the CSV (i.e., book1, book2) describes a node with the \"Paged Content\" Islandora model; each of the books is the direct parent of the individual page nodes. However, in some cases, you may want to create the pages, their direct parents (each book), and a parent of the parents (let's call it a \"grandparent\" of the pages) at the same time, using the same Workbench job and the same input CSV. Some common use cases for this ability are: creating a node describing a periodical, some nodes describing issues of the periodical, and the pages of each issue, and creating a node describing a book series, a set of nodes describing books in the series, and page nodes for each book. paged_content_from_directories: true in your config file tells Workbench to look in a directory containing page files for each row in your input CSV. If you want to include the pages, the immediate parent of the pages, and the grandparent of the pages in the same CSV, you can create an empty directory for the grandparent node, named after its id value like the other items in your CSV. In addition, and importantly, you also need to include a parent_id column in your CSV file to define the relationship between the grandparent and its direct children (in our example, the book nodes). The presence of the parent_id column does not have impact on the parent-child relationship between the books and their pages; that relationship is created automatically, like it is in the \"books\" example above. To illustrate this, let's extend the \"books\" example above to include a higher-level (grandparent to the pages) node that describes the series of books used in that example. Here is the CSV with the new top-level item, and with the addition of the parent_id column to indicate that the paged content items are children of the new \"book000\" node: id,parent_id,title,field_model book000,,How-to Books: A Best-Selling Genre of Books,Compound Object book1,book000,How to Use Islandora Workbench like a Pro,Paged Content book2,book000,Using Islandora Workbench for Fun and Profit,Paged Content The directory structure looks like this (note that the book000 directory should be empty since it doesn't have any pages as direct children): books/ \u251c\u2500\u2500 book000 \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Workbench will warn you that the book000 directory is empty, but that's OK - it will look for, but not find, any pages for that item. 
The node corresponding to that directory will be created as expected, and values in the parent_id column will ensure that the intended hierarchical relationship between \"book000\" and its child items (the book nodes) is created.","title":"Ingesting pages, their parents, and their \"grandparents\" using a single CSV file"},{"location":"paged_and_compound/#ingesting-ocr-and-other-files-with-page-images","text":"You can tell Workbench to add OCR and other media related to page images when using the \"Using subdirectories\" method of creating paged content. To do this, add the OCR files to your subdirectories, using the base filenames of each page image plus an extension like .txt : books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u251c\u2500\u2500 book2 \u2502 \u251c\u2500\u2500 isbn-1843341778-001.jpg \u2502 \u251c\u2500\u2500 isbn-1843341778-001.txt \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.jpg \u2502 \u251c\u2500\u2500 using-islandora-workbench-page-002.txt \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u2514\u2500\u2500 page-003.jpg \u2514\u2500\u2500 metadata.csv Then, add the following settings to your configuration file: paged_content_from_directories: true (as described above) paged_content_page_model_tid (as described above) paged_content_image_file_extension : this is the file extension, without the leading . , of the page images, for example tif , jpg , etc. paged_content_additional_page_media : this is a list of mappings from Media Use term IDs or URIs to the file extensions of the OCR or other files you are ingesting. See the example below. An example configuration is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: input_data/paged_content_example standalone_media_url: true paged_content_from_directories: true paged_content_page_model_tid: http://id.loc.gov/ontologies/bibframe/part paged_content_image_file_extension: jpg paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt You can add multiple additional files (for example, OCR and hOCR) if you provide a Media Use term-to-file-extension mapping for each type of file: paged_content_additional_page_media: - http://pcdm.org/use#ExtractedText: txt - https://discoverygarden.ca/use#hocr: hocr You can also use your Drupal's numeric Media Use term IDs in the mappings, like: paged_content_additional_page_media: - 354: txt - 429: hocr Note Using hOCR media for Islandora paged content nodes may not be configured on your Islandora repository; hOCR and the corresponding URI are used here as an example only. In this case, Workbench looks for files with the extensions txt and hocr and creates media for them with respective mapped Media Use terms. 
The paged content input directory would look like this: books/ \u251c\u2500\u2500 book1 \u2502 \u251c\u2500\u2500 page-001.jpg \u2502 \u251c\u2500\u2500 page-001.txt \u2502 \u251c\u2500\u2500 page-001.hocr \u2502 \u251c\u2500\u2500 page-002.jpg \u2502 \u251c\u2500\u2500 page-002.txt \u2502 \u251c\u2500\u2500 page-002.hocr \u2502 \u251c\u2500\u2500 page-003.txt \u2502 \u251c\u2500\u2500 page-003.hocr \u2502 \u2514\u2500\u2500 page-003.jpg Warning It is important to temporarily disable actions in Contexts that generate media/derivatives that would conflict with additional media you are adding using the method described here. For example, if you are adding OCR files, in the \"Page Derivatives\" Context listed at /admin/structure/context , disable the \"Extract text from PDF or image\" action prior to running Workbench, and be sure to re-enable it afterwards. If you do not do this, the OCR media added by Workbench will get overwritten with the one that Islandora generates using the \"Extract text from PDF or image\" action.","title":"Ingesting OCR (and other) files with page images"},{"location":"paged_and_compound/#ignoring-files-in-page-directories","text":"Sometimes files such as \"Thumbs.db\" (on Windows) can creep into page directories. You can tell Workbench to ignore specific files within directories by including the paged_content_ignore_files configuration setting in your config file. Note that the default setting is to ignore \"Thumbs.db\" files. If you want to add additional files, or override that default setting, include the paged_content_ignore_files followed by a list of filenames, e.g.: paged_content_ignore_files: [\"Thumbs.db\", \"scanning_manifest.txt\"] Note that Workbench converts all filenames in the directories and filenames listed in the paged_content_ignore_files setting to lower case before checking to see if they are in this list. For example, if Workbench encounters a filename Scanning_Manifest.TXT , it will match \"scanning_manifest.txt\" in the configuration above configuration.","title":"Ignoring files in page directories"},{"location":"paged_and_compound/#with-pagechild-level-metadata","text":"Using this method, the metadata CSV file contains a row for every item, both parents and children. You should use this method when you are creating books, newspaper issues, or other paged or compound content where each page has its own metadata, or when you are creating compound objects of any Islandora model. The file for each page/child is named explicitly in the page/child's file column rather than being in a subdirectory. To link the pages to the parent, Workbench establishes parent/child relationships between items with a special parent_id CSV column. Values in the parent_id column, which only apply to rows describing pages/children, are the id value of their parent. For this to work, your CSV file must contain a parent_id field plus the standard Islandora fields field_weight , field_member_of , and field_model (the role of these last three fields will be explained below). The id field is required in all CSV files used to create content, so in this case, your CSV needs both an id field and a parent_id field. The following example illustrates how this works. 
Here is the raw CSV data: id,parent_id,field_weight,file,title,field_description,field_model,field_member_of 001,,,,Postcard 1,The first postcard,28,197 003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29, 004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29, 002,,,,Postcard 2,The second postcard,28,197 006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29, 007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29, The empty cells make this CSV difficult to read. Here is the same data in a spreadsheet: The data contains rows for two postcards (rows with id values \"001\" and \"002\") plus a back and front for each (the remaining four rows). The parent_id value for items with id values \"003\" and \"004\" is the same as the id value for item \"001\", which will tell Workbench to make both of those items children of item \"001\"; the parent_id value for items with id values \"006\" and \"007\" is the same as the id value for item \"002\", which will tell Workbench to make both of those items children of the item \"002\". We can't populate field_member_of for the child pages in our CSV because we won't have node IDs for the parents until they are created as part of the same batch as the children. In this example, the rows for our postcard objects have empty parent_id , field_weight , and file columns because our postcards are not children of other nodes and don't have their own media. (However, the records for our postcard objects do have a value in field_member_of , which is the node ID of the \"Postcards\" collection that already/hypothetically exists.) Rows for the postcard front and back image objects have a value in their field_weight field, and they have values in their file column because we are creating objects that contain image media. Importantly, they have no value in their field_member_of field because the node ID of the parent isn't known when you create your CSV; instead, Islandora Workbench assigns each child's field_member_of dynamically, just after its parent node is created. Some important things to note: The parent_id column can contain only a single value. In other words, values like id_0029|id_0030 won't work. If you want an item to have multiple parents, you need to use a later update task to assign additional values to the child node's field_member_of field. Currently, you need to include the option allow_missing_files: true in your configuration file when using this method to create paged/compound content. See this issue for more information. id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/child relationships will still work. The values of the id and parent_id columns do not have to follow any sequential pattern. Islandora Workbench treats them as simple strings and matches them on that basis, without looking for sequential relationships of any kind between the two fields. The CSV records for children items don't need to come immediately after the record for their parent, but they do need to come after that CSV record. ( --check will tell you if it finds any child rows that come before their parent rows.) This is because Workbench creates nodes in the order their records are in the CSV file (top to bottom). As long as the parent node has already been created when a child node is created, the parent/child relationship via the child's field_member_of will be correct. 
See the next paragraph for some suggestions on planning for large ingests of paged or compound items. Currently, you must include values in the children's field_weight column (except when creating a collection and its members at the same time; see below). It may be possible to automatically generate values for this field (see this issue ). Currently, Islandora model values (e.g. \"Paged Content\", \"Page\") are not automatically assigned. You must include the correct \"Islandora Models\" taxonomy term IDs in your field_model column for all parent and child records, as you would for any other Islandora objects you are creating. Like for field_weight , it may be possible to automatically generate values for this field (see this issue ). Since parent items (collections, book-level items, newspaper issue-level items, top-level items in compound items, etc.) need to exist in Drupal before their children can be ingested, you need to plan your \"create\" tasks accordingly. For example: If you want to use a single \"create\" task to ingest all the parents and children at the same time, for each compound item, the parent CSV record must come before the records for the children/pages. If you would rather use multiple \"create\" tasks, you can create all your collections first, then, in subsequent \"create\" tasks, use their respective node IDs in the field_member_of CSV column for their members. If you use a separate \"create\" task to create members of a single collection, you can define the value of field_member_of in a CSV field template . If you are ingesting a large set of books, you can ingest the book-level items first, then use their node IDs in a separate CSV for the pages of all books (each using their parent book node's node ID in their field_member_of column). Or, you could run a separate \"create\" task for each book, and use a CSV field template containing a field_member_of entry containing the book item's node ID. For newspapers, you could create the top-level newspaper first, then use its node ID in a subsequent \"create\" task for both newspaper issues and pages. In this task, the field_member_of column in rows for newspaper issues would contain the newspaper's node ID, but the rows for newspaper pages would have a blank field_member_of and a parent_id using the parent issue's id value.","title":"With page/child-level metadata"},{"location":"paged_and_compound/#using-a-secondary-task","text":"You can configure Islandora Workbench to execute two \"create\" tasks - a primary and a secondary - that will result in all of the objects described in both CSV files being ingested during the same Workbench job. Parent/child relationships between items are created by referencing the row IDs in the primary task's CSV file from the secondary task's CSV file. The benefit of using this method is that each task has its own configuration file, allowing you to create children that have a different Drupal content type than their parents. The primary task's CSV describes the parent objects, and the secondary task's CSV describes the children. The two are linked via references from children CSV's parent_id values to their parent's id values, much the same way as in the \"With page/child-level metadata\" method described above. The difference is that the references span CSV files. The parents and children each have their own CSV input file (and also their own configuration file). 
Each task is a standard Islandora Workbench \"create\" task, joined by one setting in the primary's configuration file, secondary_tasks , as described below. In the following example, the top CSV file (the primary) describes the parents, and the bottom CSV file (the secondary) describes the children: As you can see, values in the parent_id column in the secondary CSV reference values in the id column in the primary CSV: parent_id 001 in the secondary CSV matches id 001 in the primary, parent_id 003 in the secondary matches id 003 in the primary, and so on. You configure secondary tasks by adding the secondary_tasks setting to your primary configuration file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora # This is the setting that links the two configuration files together. secondary_tasks: ['children.yml'] input_csv: parents.csv nodes_only: true In the secondary_tasks setting, you name the configuration file of the secondary task. The secondary task's configuration file (in this example, named \"children.yml\") contains no indication that it's a secondary task: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: kids.csv csv_field_templates: - field_model: http://purl.org/coar/resource_type/c_c513 query_csv_id_to_node_id_map_for_parents: true Note The CSV ID to node ID map is required in secondary create tasks. Workbench will automatically change the query_csv_id_to_node_id_map_for_parents to true , regardless of whether that setting is in your secondary task's config file. Note The nodes_only setting in the above example primary configuration file and the csv_field_templates setting in the secondary configuration file are not relevant to the primary/secondary task functionality; they're included to illustrate that the two configuration files can differ. When you run Workbench, it executes the primary task first, then the secondary task. Workbench keeps track of pairs of id + node IDs created in the primary task, and during the execution of the secondary task, uses these to populate the field_member_of values in the secondary task with the node IDs corresponding to the referenced primary id values. Some things to note about secondary tasks: Only \"create\" tasks can be used as the primary and secondary tasks. When you have a secondary task configured, running --check will validate both tasks' configuration and input data. The secondary CSV must contain parent_id and field_member_of columns. field_member_of must be empty, since it is auto-populated by Workbench using node IDs from the newly created parent objects. If you want to assign an order to the child objects within each parent object, include field_weight with the appropriate values (1, 2, 3, etc., the lower numbers being earlier/higher in sort order). If a row in the secondary task CSV does not have a parent_id that matches an id of a row in the primary CSV, or if there is a matching row in the primary CSV and Workbench failed to create the described node, Workbench will skip creating the child and add an entry to the log indicating it did so. As already stated, each task has its own configuration file, which means that you can specify a content_type value in your secondary configuration file that differs from the content_type of the primary task. You can include more than one secondary task in your configuration. 
For example, secondary_tasks: ['first.yml', 'second.yml'] will execute the primary task, then the \"first.yml\" secondary task, then the \"second.yml\" secondary task in that order. You would use multiple secondary tasks if you wanted to add children of different content types to the parent nodes.","title":"Using a secondary task"},{"location":"paged_and_compound/#specifying-paths-to-the-python-interpreter-and-to-the-workbench-script","text":"When using secondary tasks, there are a couple of situations where you may need to tell Workbench where the python interpreter is located, and where the \"workbench\" script is located. The first is when you use a secondary task within a scheduled job (such as running Workbench via Linux's cron). Depending on how you configure the cron job, you will likely need to tell Workbench what the absolute path to the python interpreter is and what the path to the workbench script is. This is because, unless your cronjob changes into Workbench's working directory, Workbench will be looking in the wrong directory for the secondary task. The two config options you should use are: path_to_python path_to_workbench_script An example of using these settings is: secondary_tasks: ['children.yml'] path_to_python: '/usr/bin/python' path_to_workbench_script: '/home/mark/islandora_workbench/workbench' The second situation is when using a secondary task when running Workbench in Windows and \"python.exe\" is not in the PATH of the user running the scheduled job. Specifying the absolute path to \"python.exe\" will ensure that Workbench can execute the secondary task properly, like this: secondary_tasks: ['children.yml'] path_to_python: 'c:/program files/python39/python.exe' path_to_workbench_script: 'd:/users/mark/islandora_workbench/workbench'","title":"Specifying paths to the python interpreter and to the workbench script"},{"location":"paged_and_compound/#creating-parentchild-relationships-across-workbench-sessions","text":"It is possible to use parent_id values in your CSV that refer to id values from earlier Workbench sessions. In other words, you don't need to create parents and their member/child nodes within the same Workbench job; you can create parents in an earlier job and refer to their id values in later jobs. This is possible because during create tasks, Workbench records each newly created node ID and its corresponding value from the input CSV's id (or configured equivalent) column. It also records any values from the CSV parent_id column, if they exist. This data is stored in a simple SQLite database called the \" CSV ID to node ID map \". Because this database persists across Workbench sessions, you can use id values in your input CSV's parent_id column from previously loaded CSV files. The mapping between the previously loaded parents' id values and the values in your current CSV's parent_id column are stored in the CSV ID to node ID map database. Note It is important to use unique values in your CSV id (or configured equivalent) column, since if duplicate ID values exist in this database, Workbench can't know which corresponding node ID to use. In this case, Workbench will create the child node, but it won't assign a parent to it. --check will inform you if this happens with messages like Warning: Query of ID map for parent ID \"0002\" returned multiple node IDs: (771, 772, 773, 774, 778, 779). , and your Workbench log will also document that there are duplicate IDs. 
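As an illustrative sketch (the id values, titles, and filename below are hypothetical), a first create task might ingest a parent item with this CSV:
id,title,field_model,field_member_of
issue_1950_01,Periodical issue for January 1950,Publication Issue,197
A later create task could then attach children to that parent by reusing its id value in the parent_id column:
id,parent_id,field_weight,file,title,field_model
page_1950_01_001,issue_1950_01,1,page-001.jpg,Page 1 of the January 1950 issue,Page
For the second job to find the parent created in the earlier job, its configuration must enable querying the map for previously created parents, as described in the warning below.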
Warning By default, Workbench only checks the CSV ID to node ID map for parent IDs created in the same session as the children. If you want to assign children to parents created in previous Workbench sessions, you need to set the query_csv_id_to_node_id_map_for_parents configuration setting to true .","title":"Creating parent/child relationships across Workbench sessions"},{"location":"paged_and_compound/#creating-collections-and-members-together","text":"Using a variation of the \"With page/child-level metadata\" approach, you can create a collection node and assign members to it at the same time (i.e., in a single Workbench job). Here is a simple example CSV which shows the references from the members' parent_id field to the collections' id field: id,parent_id,file,title,field_model,field_member_of,field_weight 1,,,A collection of animal photos,24,, 2,1,cat.jpg,Picture of a cat,25,, 3,1,dog.jpg,Picture of a dog,25,, 3,1,horse.jpg,Picture of a horse,25,, The use of the parent_id and field_member_of fields is the same here as when creating paged or compound children. However, unlike with paged or compound objects, in this case we leave the values in field_weight empty, since Islandora collections don't use field_weight to determine order of members. Collection Views are sorted using other fields. Warning Creating collection nodes and member nodes using this method assumes that collection nodes and member nodes have the same Drupal content type. If your collection objects have a Drupal content type that differs from their members' content type, you need to use the \"Using a secondary task\" method to ingest collections and members in the same Workbench job.","title":"Creating collections and members together"},{"location":"paged_and_compound/#summary","text":"The following table summarizes the different ways Workbench can be used to create parent/child relationships between nodes: Method Relationships created by field_weight Advantage Subdirectories Directory structure Do not include column in CSV; autopopulated. Useful for creating paged content where pages don't have their own metadata. Parent/child-level metadata in same CSV References from child's parent_id to parent's id in same CSV data Column required; values required in child rows Allows including parent and child metadata in same CSV. Secondary task References from parent_id in child CSV file to id in parent CSV file Column and values recommended in secondary (child) CSV data Primary and secondary tasks have their own configuration and CSV files, which allows children to have a Drupal content type that differs from their parents' content type. Allows creation of parents and children in same Workbench job. Collections and members together References from child (member) parent_id fields to parent (collection) id fields in same CSV data Column required in CSV but must be empty (collections do not use weight to determine sort order) Allows creation of collection and members in same Islandora Workbench job.","title":"Summary"},{"location":"preparing_data/","text":"Islandora Workbench allows you to arrange your input data in a variety of ways. The two basic sets of data you need to prepare (depending on what task you are performing) are: a CSV file, containing data that will populate node fields (or do other things depending on what task you are performing), described here files that will be used as Drupal media. The options for arranging your data are detailed below. 
Note Some of Workbench's functionality depends on a specific directory structure not described here, for example \" Creating nodes from files \" and \" Creating paged, compound, and collection content .\" However, the information on this page applies to the vast majority of Workbench usage. Using an input directory In this configuration, you define an input directory (identified by the input_dir config option) that contains a CSV file with field content (identified by the input_csv config option) and any accompanying media files you want to add to the newly created nodes: input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv Here is the same input directory, with some explanation of how the files relate to each other: input_data/ <-- This is the directory named in the \"input_dir\" configuration setting. \u251c\u2500\u2500 image1.JPG <-- This and the other JPEG files are named in the \"file\" column in the CSV file. \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv <-- This is the CSV file named in the \"input_csv\" configuration setting. The names of the image/PDF/video/etc. files are included in the file column of the CSV file. Files with any extension that you can upload to Drupal are allowed. Islandora Workbench reads the CSV file and iterates through it, performing the current task for each record. In this configuration, files other than the CSV and your media files are allowed in this directory (although for some configurations, your input directory should not contain any files that are not going to be ingested). This is Islandora Workbench's default configuration. If you do not specify an input_dir or an input_csv , as illustrated in the following minimal configuration file, Workbench will assume your files are in a directory named \"input_data\" in the same directory as the Workbench script, and that within that directory, your CSV file is named \"metadata.csv\": task: create host: \"http://localhost:8000\" username: admin password: islandora Workbench ignores the other files in the input directory, and only looks for files in that directory if the filename alone (no directory component) is in the file column. workbench <-- The \"workbench\" script. \u251c\u2500\u2500 input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv For example, in this configuration, in the following \"metadata.csv\" file, Workbench looks for \"image1.JPG\", \"image-27262.jpg\", and \"someimage.jpg\" at \"input_data/image1.JPG\", \"input_data/image-27262.jpg\", and \"input_data/someimage.jpg\" respectively, relative to the location of the \"workbench\" script: id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog Workbench completely ignores \"pic_saturday.jpg\" and \"IMG_2958.JPG\" because they are not named in any of the file columns in the \"metadata.csv\" file. If the configuration file specified an input_dir value, or identified a CSV file in input_csv , Workbench would use those values: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: myfiles input_csv: mymetadata.csv workbench <-- The \"workbench\" script. 
\u251c\u2500\u2500 myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv The value of input_dir doesn't need to be relative to the workbench script, it can be absolute: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: /tmp/myfiles \u251c\u2500\u2500 /tmp/myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog In this case, even though the CSV file entries contain only filenames and no path information, Workbench looks for the image files at \"/tmp/myfiles/image1.JPG\", \"/tmp/myfiles/image-27262.jpg\", and \"/tmp/myfiles/someimage.jpg\". Using absolute file paths We saw in the previous section that the path specified in your configuration file's input_dir configuration option need not be relative to the location of the workbench script, it can be absolute. That is also true for both the configuration value of input_csv and for the values in your input CSV's file column. You can also mix absolute and relative filenames in the same CSV file, but all relative filenames are considered to be in the directory named in input_dir . An example configuration file for this is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: media_files input_csv: /tmp/input.csv And within the file column of the CSV, values like: id,file,title 001,/tmp/mydata/file01.png,A very good file 0002,/home/me/Documents/files/cat.jpg,My cat 003,dog.png,My dog Notice that the file values in the first two rows are absolute, but the file value in the last row is relative. Workbench will look for that file at \"media_files/dog.png\". Note In general, Workbench doesn't care if any file path used in configuration or CSV data is relative or absolute, but if it's relative, it's relative to the directory where the workbench script lives. Note Most of the example paths used in this documentation are Linux paths. In general, paths on Mac computers look and work the same way. On Windows, relative paths and absolute paths like C:\\Users\\Mark\\Downloads\\myfile.pdf and UNC paths like \\\\some.windows.file.share.org\\share_name\\files\\myfile.png work fine. These paths also work in Workbench configuration files in settings such as input_dir . Using URLs as file paths In the file column, you can also use URLs to files, like this: id,file,title 001,http://www.mysite.com/file01.png,A very good file 0002,https://mycatssite.org/images/cat.jpg,My cat 003,dog.png,My dog More information is available on using URLs in your file column. 
Using a local or remote .zip archive as input data If you register the location of a local .zip archive or a remote (available over http(s)) zip archive in your configuration file, Workbench will unzip the contents of the archive into the directory defined in your input_dir setting: input_data_zip_archives: - /tmp/mytest.zip - https://myremote.host.org/zips/another_zip.zip The archive is unzipped with its internal directory structure intact; for example, if your zip has the following structure: bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 In this case, your CSV file column values should include the intermediate directory's path, e.g. bpnichol/003 Partial Side A.mp3 . You can also include an input CSV in your zip archive if you want: bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Within input_data , the unzipped content would look like: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Alternatively, all of your files can also be at the root of the zip archive. In that case, they would be unzipped into the directory named in your input_dir setting. A zip archive with this structure: \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 would be unzipped into: input_data/ \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 If you are zipping up directories to create paged content , all of the directories containing page files should be at the root of your zip archive, with no intermediate parent directory: \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif This is because if you include paged_content_from_directories: true in your configuration file, Workbench looks within your input_dir for a directory named after the paged content item's id value, without an intermediate directory. 
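Putting these pieces together, here is a hedged configuration sketch for ingesting the \"rungh\" paged content example above from a zip archive (the archive URL is hypothetical; the other settings reuse values shown elsewhere in this documentation):
task: create
host: \"http://localhost:8000\"
username: admin
password: islandora
input_dir: input_data
input_csv: rungh.csv
paged_content_from_directories: true
paged_content_page_model_tid: http://id.loc.gov/ontologies/bibframe/part
paged_content_image_file_extension: tif
input_data_zip_archives:
  - https://myremote.host.org/zips/rungh_issues.zip
Workbench would download and extract the archive into \"input_data\", then read \"rungh.csv\" from there and create a paged content item for each of the two issue directories.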
A few things to note if you are using a zip archive as your input data: Remote URLs to zip archives do not need to end in \".zip\", but the remote files must be directly accessible for downloading without any authentication. You can register a single or multiple zip file in your input_data_zip_archives setting. Workbench doesn't check for the existence of files at extracted destination paths, so if a file with the same extracted path exists in more than one archive (or is already at a path the same as that of a file from an archive), the file from the last archive in the input_data_zip_archives list will overwrite existing files at the same path. You can include in your zip archive(s) any files that you want to put in the directory indicated in your input_dir config setting, including files named in your CSV file column, files named in columns defined by your additional_files configuration, or the CSV or Excel file named in your input_csv setting (as illustrated in the Rungh example above). Workbench will automatically delete the archive file after extracting it unless you add delete_zip_archive_after_extraction: false to your config file. Using a Google Sheet as the input CSV file With this option, your configuration's input_csv option contains the URL to a publicly readable Google Sheet. To do this, simply provide the URL to the Google spreadsheet in your configuration file's input_csv option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit google_sheets_gid: 430969348 That's all you need to do. It's most reliable to have your Google Sheets URLs end in \"/edit\" and explicitly add the \"gid\" value using the google_sheets_gid config setting. Every time Workbench runs, it fetches the CSV content of the spreadsheet and saves it to a local file in the directory named in your input_directory configuration option, and from that point onward in its execution, uses the locally saved version of the spreadsheet. The default filename for this CSV file is google_sheet.csv but you can change it if you need to by including the google_sheets_csv_filename option in your configuration file, e.g., google_sheets_csv_filename: my_filename.csv . Islandora Workbench fetches a new copy of the CSV data every time it runs (even with the --check option), so if you make changes to the contents of that local file, the changes will be overwritten with the data from the Google spreadsheet the next time you run Workbench. If you don't want to overwrite your local copy of the data, rename the local CSV file manually before running Workbench, and update the input_csv option in your configuration file to use the name of the CSV file you copied. Note Using a Google Sheet is currently the fastest and most convenient way of managing CSV data for use with Islandora Workbench. Since Sheets saves changes in realtime, and since Workbench fetches a fresh copy of the CSV data every time you run it, it's easy to iterate by making changes to your data in Sheets, running Workbench (don't forget to use --check first to identify any problems!), seeing the effects of your changes in the nodes you've just created, rolling back your nodes , tweaking your data in Sheets, and starting a new cycle. If you are focused on refining your CSV metadata, you can save time by skipping the creation of media by including nodes_only: true in your configuration file. 
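For example, a metadata-refinement configuration along the lines described in the note above (reusing the example spreadsheet URL and gid) could look like this:
task: create
host: \"http://localhost:8000\"
username: admin
password: islandora
input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit'
google_sheets_gid: 430969348
nodes_only: true
Once the CSV data validates cleanly and the node metadata looks right, you can roll back the test nodes, remove nodes_only: true , and run the full ingest.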
Some things to note about using Google Sheets: You can use a Google Sheet in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Google spreadsheet. The URL in the configuration file needs single or double quotes around it, like any other value that contains a colon. You can use either the URL you copy from your browser when you are viewing the spreadsheet (which ends in \"/edit#gid=0\" or something similar), or the \"sharing\" URL you copy into your clipboard from within the \"Share\" dialog box (which ends in \"edit?usp=sharing\"). Either is OK, but, as noted above, it's best to have your URL end in \"/edit\" and explicitly state the gid using the google_sheets_gid config setting. The Google spreadsheet must be publicly readable, e.g. with \"Anyone on the Internet with this link can view\" permission. Spreadsheets work best for descriptive metadata if all cells are formatted as \"Plain text\". To do this in Google Sheets, select all cells, then choose the menu items Format > Number > Plain text before adding any content to the cells . If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. However, you can also use absolute file paths and URLs in the file column, as described above. Selecting a specific worksheet within a Google Sheet Worksheets within a given Google Sheet are identified by a \"gid\". If a Sheet has only a single worksheet, its \"gid\" is \"0\" (zero): https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=0 If you add additional worksheets, they get a randomly generated \"gid\", such as \"1094504353\". You can see this \"gid\" in the URL when you are in the worksheet: https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=1094504353 By default, Workbench extracts CSV data from the worksheet with a \"gid\" of \"0\". If you want Workbench to extract the CSV data from a specific worksheet that is not the one with a \"gid\" of \"0\", specify the \"gid\" in your configuration file using the google_sheets_gid option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit?usp=sharing' google_sheets_gid: 1094504353 Using an Excel file as the input CSV file With this option, your configuration's input_csv option contains the filename of an Excel 2010 (or higher) file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx Islandora Workbench extracts the content of this file as CSV data, and uses that extracted data as its input the same way it would use a raw CSV file. Note that: You can use an Excel file in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Excel spreadsheet. Spreadsheets work best for descriptive metadata if all cells are formatted as \"text\". To do this, in Excel, select all cells, alt-click on the selected area, then choose the \"Format Cells\" context menu item. In the \"Number\" tab, choose \"Text\", then click on the \"OK\" button. If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. 
However, you can also use absolute file paths and URLs in the file column, as described above. Selecting a specific worksheet within an Excel file The worksheet that the CSV data is taken from is the one named \"Sheet1\", unless you specify another worksheet using the excel_worksheet configuration option. As with Google Sheets, you can tell Workbench to use a specific worksheet in an Excel file. Here is an example of a config file using that setting: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx excel_worksheet: Second sheet How Workbench cleans your input data Regardless of whether your input data is raw CSV, a Google Sheet, or Excel, Workbench applies a small number of cleansing operations on it. These are: replaces smart/curly quotes (both double and single) with regular quotes replaces multiple whitespaces within strings with a single space removes leading and trailing spaces (including newlines). removes leading and trailing subdelimiter characters (i.e., the value of the subdelimiter config setting, default of | ). If you do not want Workbench to do one or more of these cleanups, include the clean_csv_values_skip setting in your configuration, specifying in a list one or more of the following: smart_quotes inside_spaces outside_spaces outside_subdelimiters An example of using this configuration setting is: clean_csv_values_skip: [\"smart_quotes\", \"inside_spaces\"] When Workbench skips invalid CSV data Running --check will tell you when any of the data in your CSV file is invalid, or in other words, does not conform to its target Drupal field's configuration and is likely to cause the creation/updating of content to fail. Currently, for the following types of fields: text geolocation link Workbench will validate CSV values and skip values that fail its validation tests. Work is underway to complete this feature, including skipping of invalid entity reference and typed relation fields. Blank or missing \"file\" values By default, if the file value for a row is empty, Workbench's --check option will show an error. But, in some cases you may want to create nodes but not add any media. If you add allow_missing_files: true to your config file for \"create\" tasks, you can leave the file column in your CSV empty. Creating nodes but not media If you want to only create nodes and not media, you can do so by including nodes_only: true in your configuration file. More detail is available . Encoding of text files All text files used as input to Islandora Workbench, including CSV data, files that are going to be used to create \"Extracted Text\" media such as OCR/hOCR files, and media track files, must use a standard UTF-8 encoding. This is generally not a problem unless you have created any of these files on Microsoft Windows. Even on Windows, UTF-8 files can contain a Microsoft-specific feature called a \"Byte-order mark\", or \"BOM\". 
This type of UTF-8 file is not valid; only standard UTF-8 files are.","title":"Preparing your data"},{"location":"preparing_data/#using-an-input-directory","text":"In this configuration, you define an input directory (identified by the input_dir config option) that contains a CSV file with field content (identified by the input_csv config option) and any accompanying media files you want to add to the newly created nodes: input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv Here is the same input directory, with some explanation of how the files relate to each other: input_data/ <-- This is the directory named in the \"input_dir\" configuration setting. \u251c\u2500\u2500 image1.JPG <-- This and the other JPEG files are named in the \"file\" column in the CSV file. \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv <-- This is the CSV file named in the \"input_csv\" configuration setting. The names of the image/PDF/video/etc. files are included in the file column of the CSV file. Files with any extension that you can upload to Drupal are allowed. Islandora Workbench reads the CSV file and iterates through it, performing the current task for each record. In this configuration, files other than the CSV and your media files are allowed in this directory (although for some configurations, your input directory should not contain any files that are not going to be ingested). This is Islandora Workbench's default configuration. If you do not specify an input_dir or an input_csv , as illustrated in following minimal configuration file, Workbench will assume your files are in a directory named \"input_data\" in the same directory as the Workbench script, and that within that directory, your CSV file is named \"metadata.csv\": task: create host: \"http://localhost:8000\" username: admin password: islandora Workbench ignores the other files in the input directory, and only looks for files in that directory if the filename alone (no directory component) is in file column. workbench <-- The \"workbench\" script. \u251c\u2500\u2500 input_data/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 metadata.csv For example, in this configuration, in the following \"metadata.csv\" file, Workbench looks for \"image1.JPG\", \"image-27626.jpg\", and \"someimage.jpg\" at \"input_data/image1.JPG\", \"input_data/image1.JPG\", and \"input_data/someimage.jpg\" respectively, relative to the location of the \"workbench\" script: id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog Workbench complete ignores \"pic_saturday.jpg\" and \"IMG_2958.JPG\" because they are not named in any of the file columns in the \"metadata.csv\" file. If the configuration file specified an input_dir value, or identified a CSV file in input_csv , Workbench would use those values: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: myfiles input_csv: mymetadata.csv workbench <-- The \"workbench\" script. 
\u251c\u2500\u2500 myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 pic_saturday.jpg \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 IMG_2958.JPG \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv The value of input_dir doesn't need to be relative to the workbench script, it can be absolute: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: /tmp/myfiles \u251c\u2500\u2500 /tmp/myfiles/ \u251c\u2500\u2500 image1.JPG \u251c\u2500\u2500 image-27262.jpg \u251c\u2500\u2500 someimage.jpg \u2514\u2500\u2500 mymetadata.csv id,file,title 001,image1.JPG,A very good file 0002,image-27262.jpg,My cat 003,someimage.jpg,My dog In this case, even though only the CSV file entries contain only filenames and no path information, Workbench looks for the image files at \"/tmp/myfiles/image1.JPG\", \"/tmp/myfiles/image1.JPG\", and \"/tmp/myfiles/someimage.jpg\".","title":"Using an input directory"},{"location":"preparing_data/#using-absolute-file-paths","text":"We saw in the previous section that the path specified in your configuration file's input_dir configuration option need not be relative to the location of the workbench script, it can be absolute. That is also true for both the configuration value of input_csv and for the values in your input CSV's file column. You can also mix absolute and relative filenames in the same CSV file, but all relative filenames are considered to be in the directory named in input_dir . An example configuration file for this is: task: create host: \"http://localhost:8000\" username: admin password: islandora input_dir: media_files input_csv: /tmp/input.csv And within the file column of the CSV, values like: id,file,title 001,/tmp/mydata/file01.png,A very good file 0002,/home/me/Documents/files/cat.jpg,My cat 003,dog.png,My dog Notice that the file values in the first two rows are absolute, but the file value in the last row is relative. Workbench will look for that file at \"media_files/dog.png\". Note In general, Workbench doesn't care if any file path used in configuration or CSV data is relative or absolute, but if it's relative, it's relative to the directory where the workbench script lives. Note Most of the example paths used in this documentation are Linux paths. In general, paths on Mac computers look and work the same way. On Windows, relative paths and absolute paths like C:\\Users\\Mark\\Downloads\\myfile.pdf and UNC paths like \\\\some.windows.file.share.org\\share_name\\files\\myfile.png work fine. 
These paths also work in Workbench configuration files in settings such as input_dir .","title":"Using absolute file paths"},{"location":"preparing_data/#using-urls-as-file-paths","text":"In the file column, you can also use URLs to files, like this: id,file,title 001,http://www.mysite.com/file01.png,A very good file 0002,https://mycatssite.org/images/cat.jpg,My cat 003,dog.png,My dog More information is available on using URLs in your file column.","title":"Using URLs as file paths"},{"location":"preparing_data/#using-a-local-or-remote-zip-archive-as-input-data","text":"If you register the location of a local .zip archive or a remote (available over http(s)) zip archive in your configuration file, Workbench will unzip the contents of the archive into the directory defined in your input_dir setting: input_data_zip_archives: - /tmp/mytest.zip - https://myremote.host.org/zips/another_zip.zip The archive is unzipped with its internal directory structure intact; for example, if your zip has the following structure: bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 In this case, your CSV file column values should include the intermediate directory's path, e.g. bpnichol/003 Partial Side A.mp3 . You can also include an input CSV in your zip archive if you want: bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Within input_data , the unzipped content would look like: input_data/ \u251c\u2500\u2500bpnichol \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 Alternatively, all of your files can also be at the root of the zip archive. In that case, they would be unzipped into the directory named in your input_dir setting. 
A zip archive with this structure: \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 would be unzipped into: input_data/ \u251c\u2500\u2500 bpn_metadata.csv \u251c\u2500\u2500 003 Partial Side A.mp3 \u2514\u2500\u2500 MsC12.Nichol.Tape15.mp3 If you are zipping up directories to create paged content , all of the directories containing page files should be at the root of your zip archive, with no intermediate parent directory: \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif and your input_dir value is \"input_data\", the archive will be unzipped into: input_data/ \u251c\u2500\u2500 rungh.csv \u251c\u2500\u2500 rungh_v2_n1-2 \u2502 \u251c\u2500\u2500 Vol.2-1-2-001.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-002.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-003.tif \u2502 \u251c\u2500\u2500 Vol.2-1-2-004.tif \u2502 \u2514\u2500\u2500 Vol.2-1-2-005.tif \u2514\u2500\u2500 rungh_v2_n3 \u251c\u2500\u2500 Vol.2-3-01.tif \u251c\u2500\u2500 Vol.2-3-02.tif \u251c\u2500\u2500 Vol.2-3-03.tif \u251c\u2500\u2500 Vol.2-3-04.tif \u251c\u2500\u2500 Vol.2-3-05.tif \u2514\u2500\u2500 Vol.2-3-07.tif This is because if you include paged_content_from_directories: true in your configuration file, Workbench looks within your input_dir for a directory named after the paged content item's id value, without an intermediate directory. A few things to note if you are using a zip archive as your input data: Remote URLs to zip archives do not need to end in \".zip\", but the remote files must be directly accessible for downloading without any authentication. You can register a single or multiple zip file in your input_data_zip_archives setting. Workbench doesn't check for the existence of files at extracted destination paths, so if a file with the same extracted path exists in more than one archive (or is already at a path the same as that of a file from an archive), the file from the last archive in the input_data_zip_archives list will overwrite existing files at the same path. You can include in your zip archive(s) any files that you want to put in the directory indicated in your input_dir config setting, including files named in your CSV file column, files named in columns defined by your additional_files configuration, or the CSV or Excel file named in your input_csv setting (as illustrated in the Rungh example above). Workbench will automatically delete the archive file after extracting it unless you add delete_zip_archive_after_extraction: false to your config file.","title":"Using a local or remote .zip archive as input data"},{"location":"preparing_data/#using-a-google-sheet-as-the-input-csv-file","text":"With this option, your configuration's input_csv option contains the URL to a publicly readable Google Sheet. 
To do this, simply provide the URL to the Google spreadsheet in your configuration file's input_csv option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/13Mw7gtBy1A3ZhYEAlBzmkswIdaZvX18xoRBxfbgxqWc/edit google_sheets_gid: 430969348 That's all you need to do. It's most reliable to have your Google Sheets URLs end in \"/edit\" and explicitly add the \"gid\" value using the google_sheets_gid config setting. Every time Workbench runs, it fetches the CSV content of the spreadsheet and saves it to a local file in the directory named in your input_directory configuration option, and from that point onward in its execution, uses the locally saved version of the spreadsheet. The default filename for this CSV file is google_sheet.csv but you can change it if you need to by including the google_sheets_csv_filename option in your configuration file, e.g., google_sheets_csv_filename: my_filename.csv . Islandora Workbench fetches a new copy of the CSV data every time it runs (even with the --check option), so if you make changes to the contents of that local file, the changes will be overwritten with the data from the Google spreadsheet the next time you run Workbench. If you don't want to overwrite your local copy of the data, rename the local CSV file manually before running Workbench, and update the input_csv option in your configuration file to use the name of the CSV file you copied. Note Using a Google Sheet is currently the fastest and most convenient way of managing CSV data for use with Islandora Workbench. Since Sheets saves changes in realtime, and since Workbench fetches a fresh copy of the CSV data every time you run it, it's easy to iterate by making changes to your data in Sheets, running Workbench (don't forget to use --check first to identify any problems!), seeing the effects of your changes in the nodes you've just created, rolling back your nodes , tweaking your data in Sheets, and starting a new cycle. If you are focused on refining your CSV metadata, you can save time by skipping the creation of media by including nodes_only: true in your configuration file. Some things to note about using Google Sheets: You can use a Google Sheet in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Google spreadsheet. The URL in the configuration file needs single or double quotes around it, like any other value that contains a colon. You can use either the URL you copy from your browser when you are viewing the spreadsheet (which ends in \"/edit#gid=0\" or something similar), or the \"sharing\" URL you copy into your clipboard from within the \"Share\" dialog box (which ends in \"edit?usp=sharing\"). Either is OK, but, as noted above, it's best to have your URL end in \"/edit\" and explicitly state the gid using the google_sheets_gid config setting. The Google spreadsheet must be publicly readable, e.g. with \"Anyone on the Internet with this link can view\" permission. Spreadsheets work best for descriptive metadata if all cells are formatted as \"Plain text\". To do this in Google Sheets, select all cells, then choose the menu items Format > Number > Plain text before adding any content to the cells . If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. 
However, you can also use absolute file paths and URLs in the file column, as described above.","title":"Using a Google Sheet as the input CSV file"},{"location":"preparing_data/#selecting-a-specific-worksheet-within-a-google-sheet","text":"Worksheets within a given Google Sheet are identified by a \"gid\". If a Sheet has only a single worksheet, its \"gid\" is \"0\" (zero): https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=0 If you add additional worksheets, they get a randomly generated \"gid\", such as \"1094504353\". You can see this \"gid\" in the URL when you are in the worksheet: https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit#gid=1094504353 By default, Workbench extracts CSV data from the worksheet with a \"gid\" of \"0\". If you want Workbench to extract the CSV data from a specific worksheet that is not the one with a \"gid\" of \"0\", specify the \"gid\" in your configuration file using the google_sheets_gid option, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: 'https://docs.google.com/spreadsheets/d/1RLrjb5BrlJNaasFIKrKV4l2rw/edit?usp=sharing' google_sheets_gid: 1094504353","title":"Selecting a specific worksheet within a Google Sheet"},{"location":"preparing_data/#using-an-excel-file-as-the-input-csv-file","text":"With this option, your configuration's input_csv option contains the filename of an Excel 2010 (or higher) file, like this: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx Islandora Workbench extracts the content of this file as CSV data, and uses that extracted data as its input the same way it would use a raw CSV file. Note that: You can use an Excel file in all tasks that use a CSV file as input. All of the columns required in a local CSV file are also required in the Excel spreadsheet. Spreadsheets work best for descriptive metadata if all cells are formatted as \"text\". To do this, in Excel, select all cells, alt-click on the selected area, then choose the \"Format Cells\" context menu item. In the \"Number\" tab, choose \"Text\", then click on the \"OK\" button. If the values in the file column of the spreadsheet are relative, they are assumed to point to files within your local input_directory , just like they do in a local CSV input file. However, you can also use absolute file paths and URLs in the file column, as described above.","title":"Using an Excel file as the input CSV file"},{"location":"preparing_data/#selecting-a-specific-worksheet-within-an-excel-file","text":"The worksheet that the CSV data is taken from is the one named \"Sheet1\", unless you specify another worksheet using the excel_worksheet configuration option. As with Google Sheets, you can tell Workbench to use a specific worksheet in an Excel file. Here is an example of a config file using that setting: task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: my_file.xlsx excel_worksheet: Second sheet","title":"Selecting a specific worksheet within an Excel file"},{"location":"preparing_data/#how-workbench-cleans-your-input-data","text":"Regardless of whether your input data is raw CSV, a Google Sheet, or Excel, Workbench applies a small number of cleansing operations on it. These are: replaces smart/curly quotes (both double and single) with regular quotes replaces multiple whitespaces within strings with a single space removes leading and trailing spaces (including newlines). 
removes leading and trailing subdelimiter characters (i.e., the value of the subdelimiter config setting, default of | ). If you do not want Workbench to do one or more of these cleanups, include the clean_csv_values_skip setting in your configuration, specifying in a list one or more of the following: smart_quotes inside_spaces outside_spaces outside_subdelimiters An example of using this configuration setting is: clean_csv_values_skip: [\"smart_quotes\", \"inside_spaces\"]","title":"How Workbench cleans your input data"},{"location":"preparing_data/#when-workbench-skips-invalid-csv-data","text":"Running --check will tell you when any of the data in your CSV file is invalid, or in other words, does not conform to its target Drupal field's configuration and is likely to cause the creation/updating of content to fail. Currently, for the following types of fields: text geolocation link Workbench will validate CSV values and skip values that fail its validation tests. Work is underway to complete this feature, including skipping of invalid entity reference and typed relation fields.","title":"When Workbench skips invalid CSV data"},{"location":"preparing_data/#blank-or-missing-file-values","text":"By default, if the file value for a row is empty, Workbench's --check option will show an error. But, in some cases you may want to create nodes but not add any media. If you add allow_missing_files: true to your config file for \"create\" tasks, you can leave the file column in your CSV empty.","title":"Blank or missing \"file\" values"},{"location":"preparing_data/#creating-nodes-but-not-media","text":"If you want to only create nodes and not media, you can do so by including nodes_only: true in your configuration file. More detail is available .","title":"Creating nodes but not media"},{"location":"preparing_data/#encoding-of-text-files","text":"All text files used as input to Islandora Workbench, including CSV data, files that are going to be used to create \"Extracted Text\" media such as OCR/hOCR files, and media track files, must use a standard UTF-8 encoding. This is generally not a problem unless you have created any of these files on Microsoft Windows. Even on Windows, UTF-8 files can contain a Microsoft-specific feature called a \"Byte-order mark\", or \"BOM\". This type of UTF-8 file is not valid; only standard UTF-8 files are.","title":"Encoding of text files"},{"location":"prompts/","text":"You can define prompts for the user by adding settings to your configuration file. There are two types: 1) a prompt to ask the user whether they have run --check , and 2) configurable prompts that allow you to define your own \"y/n\" questions to the user. The \"Have you run --check? (y/n)\" prompt simply displays that question to the end user and asks for a \"y\" or \"n\" response. \"y\" resumes normal operation, \"n\" (or any other response) causes Workbench to exit. To show the user this prompt, add the following to your config file: remind_user_to_run_check: true If this configuration setting is present, the user will be prompted to answer \"Have you run --check? (y/n)\" when Workbench is run without the --check argument. (This prompt is separate from the next type of prompt because it contains special logic such that it is never presented to the user in --check mode.) The second type of prompt, the configurable prompts, allows the creation of custom \"y/n\" questions that the user must answer \"y\" to in order to proceed.
These prompts are shown to the user regardless of whether they are running Workbench with the --check argument. A situation where you may want to ask the user a question of this sort is when creating media that are normally created by Islandora as derivatives. In that case, it is useful to remind the user that they need to temporarily disable the Contexts that generate the derivatives. These \"y/n\" questions you want to prompt users with are listed within the user_prompts config setting like this: user_prompts: - Have you inspected your Workbench log to look for issues you can fix? (y/n) - If you are adding \"additional files\", have you temporarily disabled the Contexts that would generate corresponding derivatives? (y/n) Each prompt is shown to the user in the order they are listed, each asking for a \"y/n\" response. If the user responds by entering \"y\", the next prompt is shown. If they respond with \"n\" or any other key, Workbench logs the prompt and response, and exits. Note that you must include the \"(y/n)\" text in your prompts; Workbench doesn't add it automatically. A couple of notes: The user must hit Enter after entering \"y\" or \"n\". You can include both the remind_user_to_run_check setting and the user_prompts settings in your config file. If you include both, \"Have you run --check? (y/n)\" is presented to the user first, then the customizable prompts. If the user responds \"n\" to any prompt (or any response other than \"y\"), the prompt along with their response is logged before Workbench exits. Prompts can be overridden with an implied \"y\" answer by including the --skip_user_prompts argument when running Workbench. This is useful if your config file includes either the remind_user_to_run_check or user_prompts setting during testing or scripted use of Workbench.","title":"Prompting the user"},{"location":"quick_delete/","text":"You can delete individual nodes and media by using the --quick_delete_node and --quick_delete_media command-line options, respectively. In both cases, you need to provide the full URL (or URL alias) of the node or media you are deleting. For example: ./workbench --config anyconfig.yml --quick_delete_node http://localhost:8000/node/393 ./workbench --config anyconfig.yml --quick_delete_media http://localhost:8000/media/552/edit In the case of the --quick_delete_node option, all associated media and files are also deleted (but see the exception below); in the case of the --quick_delete_media option, all associated files are deleted. You can use any valid Workbench configuration file regardless of the task setting, since the only settings the quick delete tool requires are username and password . All other configuration settings are ignored with the exception of: delete_media_with_nodes . If you do not want to delete a node's media when using the --quick_delete_node option, your configuration file should include delete_media_with_nodes: false . standalone_media_url . If your Drupal instance has the \"Standalone media URL\" option at /admin/config/media/media-settings unchecked (the default), you will need to include /edit at the end of media URLs. Workbench will tell you if you include or exclude the /edit incorrectly. If your Drupal has this option checked, you will need to include standalone_media_url: true in your configuration file.
In this case, you should not include the /edit at the end of the media URL.","title":"Quick delete"},{"location":"redirects/","text":"Islandora Workbench enables you to create redirects managed by the Redirect contrib module. One of the most common uses for redirects is to retain old URLs that have been replaced by newer ones. Within the Islandora context, this means being able to redirect from Islandora 7.x islandora/object/[PID] style URLs to their new Islandora 2 replacements such as node/[NODEID] . Using redirects, you can ensure that the old URLs continue to lead the user to the expected content. In order to use Workbench for this, you must install the Redirect module and configure the \"Redirect\" REST endpoint at /admin/config/services/rest so that it has the following settings: A sample configuration file for a create_redirects task looks like this: task: create_redirects host: https://islandora.traefik.me username: admin password: password input_csv: my_redirects.csv # This setting is explained below. # redirect_status_code: 302 The input CSV contains only two columns, redirect_source and redirect_target . Each value in the redirect_source column is a relative (to the hostname running Drupal) URL path that, when visited, will automatically redirect the user to the relative (to the hostname running Drupal) URL path (or external absolute URL) named in the redirect_target column. Here is a brief example CSV: redirect_source,redirect_target contact_us,how_can_we_help islandora/object/alpine:482,node/1089 islandora/object/awesomeimage:collection,https://galleries.sfu.ca node/16729,node/3536 Assuming that the hostname of the Drupal instance running the Redirects module is https://mydrupal.org , when these redirects are created, the following will happen: When a user visits https://mydrupal.org/contact_us , they will automatically be redirected to https://mydrupal.org/how_can_we_help . When a user visits https://mydrupal.org/islandora/object/alpine:482 , they will be automatically redirected to https://mydrupal.org/node/1089 . When a user visits https://mydrupal.org/islandora/object/awesomeimage:collection , they will be automatically redirected to the external (to https://mydrupal.org ) URL https://galleries.sfu.ca . When a user visits https://mydrupal.org/node/16729 , they will be automatically redirected to https://mydrupal.org/node/3536 . A few things to note about creating redirects using Islandora Workbench: The values in the redirect_source column are always relative to the root of the Drupal hostname URL. Drupal expects them to not begin with a leading / , but if you include it, Workbench will trim it off automatically. The redirect_source values do not need to represent existing nodes. For example, a value like islandora/object/awesomeimage:collection has no underlying node; it's just a path that Drupal listens at, and when requested, redirects the user to the corresponding target URL. Values in redirect_source that do have underlying nodes will redirect users to the corresponding redirect_target but don't make a lot of sense, since the user will always be automatically redirected and never get to see the underlying node at the source URL. However, you may want to use a redirect_source value that has a local node if you don't want users to see that node temporarily for some reason (after that reason is no longer valid, you would remove the redirect). Currently, Islandora Workbench can only create redirects; it can't update or delete them.
If you have a need for either of those funtionalities, open a Github issue. HTTP redirects work by issuing a special response code to the browser. In most cases, and by default in Workbench, this is 301 . However, you can change this to another redirect HTTP status code by including the redirect_status_code setting in your config file specifying the code you want Drupal to send to the user's browser, e.g., redirect_status_code: 302 .","title":"Creating redirects"},{"location":"reducing_load/","text":"Workbench can put substantial stress on Drupal. In some cases, this stress can lead to instability and errors. Note The options described below modify Workbench's interaction with Drupal. They do not have a direct impact on the load experienced by the microservices Islandora uses to do things like indexing node metadata in its triplestore, extracting full text for indexing, and generating thumbnails. However, since Drupal is the \"controller\" for these microservices, reducing the number of Workbench requests to Drupal will also indirectly reduce the load experienced by Islandora's microservices. Workbench provides two ways to reduce this stress: pausing between Workbench's requests to Drupal, and caching Workbench's requests to Drupal. The first way to reduce stress on Drupal is by telling Workbench to pause between each request it makes. There are two types of pausing, 1) basic and 2) adaptive. Both types of pausing improve stability and reliability by slowing down Workbench's overall execution time. Pausing Basic pause The pause configuration setting tells Workbench to temporarily halt execution before every request, thereby spreading load caused by the requests over a longer period of time. To enable pause , include the setting in your configuration file, indicating the number of seconds to wait between requests: pause: 2 Using pause will help decrease load-induced errors, but it is inefficient because it causes Workbench to pause between all requests, even ones that are not putting stress on Drupal. A useful strategy for refining Workbench's load-reduction capabilities is to try pause first, and if it reduces errors, then disable pause and try adaptive_pause instead. pause will confirm that Workbench is adding load to Drupal, but adaptive_pause will tell Workbench to pause only when it detects its requests are putting load on Drupal. Note pause and adaptive_pause are mutually exclusive. If you include one in your configuration files, you should not include the other. Adaptive pause Adaptive pause only halts execution between requests if Workbench detects that Drupal is slowing down. It does this by comparing Drupal's response time for the most recent request to the average response time of the 20 previous requests made by Islandora Workbench. If the response time for the most recent request reaches a specific threshold, Workbench's adaptive pause will kick in and temporarily halt execution to allow Drupal to catch up. The number of previous requests used to determine the average response time, 20, cannot be changed with a configuration setting. The threshold that needs to be met is configured using the adaptive_pause_threshold setting. This setting's default value is 2, which means that the adaptive pause will kick in if the response time for the most recent request Workbench makes to Drupal is 2 times (double) the average of Workbench's last 20 requests. 
The amount of time that Workbench will pause is determined by the value of adaptive_pause , which, like the value for pause , is a number of seconds (e.g., adaptive_pause: 3 ). You enable adaptive pausing by adding the adaptive_pause setting to your configuration file. Here are a couple of examples. Keep in mind that adaptive_pause_threshold has a default value (2), but adaptive_pause does not have a default value. The first example enables adaptive_pause using the default value for adaptive_pause_threshold , telling it to pause for 3 seconds between requests if Drupal's response time to the last request is 2 times slower ( adaptive_pause_threshold 's default value) than the average of the last 20 requests: adaptive_pause: 3 In the next example, we override adaptive_pause_threshold 's default by including the setting in the configuration: adaptive_pause: 2 adaptive_pause_threshold: 2.5 In this example, adaptive pausing kicks in only if the response time for the most recent request is 2.5 times the average of the response time for the last 20 requests. You can increment adaptive_pause_threshold 's value by .5 (e.g., 2.5, 3, 3.5, etc.) until you find a sweet spot that balances reliability with overall execution time. You can also decrease or increase the value of adaptive_pause incrementally by intervals of .5 to further refine the balance - increasing adaptive_pause 's value lessens Workbench's impact on Drupal at the expense of speed, and decreasing its value increases speed but also impact on Drupal. Since adaptive_pause doesn't have a default value, you need to define its value in your configuration file. Because of this, using adaptive_pause_threshold on its own in a configuration, e.g.: adaptive_pause_threshold: 3 doesn't do anything because adaptive_pause has no value. In other words, you can use adaptive_pause on its own, or you can use adaptive_pause and adaptive_pause_threshold together, but it's pointless to use adaptive_pause_threshold on its own. Logging Drupal's response time If a request if paused by adaptive pausing, Workbench will automatically log the response time for the next request, indicating that adaptive_pause has temporarily halted execution. If you want to log Drupal's response time regardless of whether adaptive_pause had kicked in or not, add log_response_time: true to your configuration file. All logging of response time includes variation from the average of the last 20 response times. Caching The second way that Workbench reduces stress on Drupal is by caching its outbound HTTP requests, thereby reducing the overall number of requests. This caching both reduces the load on Drupal and speeds up Workbench considerably. By default, this caching is enabled for requests that Workbench makes more than once and that are expected to have the same response each time, such as requests for field configurations or for checks for the existence of taxonomy terms. Note that you should not normally have to disable this caching, but if you do (for example, if you are asked to during troubleshooting), you can do so by including the following setting in your configuration file: enable_http_cache: false","title":"Reducing Workbench's impact on Drupal"},{"location":"reducing_load/#pausing","text":"","title":"Pausing"},{"location":"reducing_load/#basic-pause","text":"The pause configuration setting tells Workbench to temporarily halt execution before every request, thereby spreading load caused by the requests over a longer period of time. 
To enable pause , include the setting in your configuration file, indicating the number of seconds to wait between requests: pause: 2 Using pause will help decrease load-induced errors, but it is inefficient because it causes Workbench to pause between all requests, even ones that are not putting stress on Drupal. A useful strategy for refining Workbench's load-reduction capabilities is to try pause first, and if it reduces errors, then disable pause and try adaptive_pause instead. pause will confirm that Workbench is adding load to Drupal, but adaptive_pause will tell Workbench to pause only when it detects its requests are putting load on Drupal. Note pause and adaptive_pause are mutually exclusive. If you include one in your configuration files, you should not include the other.","title":"Basic pause"},{"location":"reducing_load/#adaptive-pause","text":"Adaptive pause only halts execution between requests if Workbench detects that Drupal is slowing down. It does this by comparing Drupal's response time for the most recent request to the average response time of the 20 previous requests made by Islandora Workbench. If the response time for the most recent request reaches a specific threshold, Workbench's adaptive pause will kick in and temporarily halt execution to allow Drupal to catch up. The number of previous requests used to determine the average response time, 20, cannot be changed with a configuration setting. The threshold that needs to be met is configured using the adaptive_pause_threshold setting. This setting's default value is 2, which means that the adaptive pause will kick in if the response time for the most recent request Workbench makes to Drupal is 2 times (double) the average of Workbench's last 20 requests. The amount of time that Workbench will pause is determined by the value of adaptive_pause , which, like the value for pause , is a number of seconds (e.g., adaptive_pause: 3 ). You enable adaptive pausing by adding the adaptive_pause setting to your configuration file. Here are a couple of examples. Keep in mind that adaptive_pause_threshold has a default value (2), but adaptive_pause does not have a default value. The first example enables adaptive_pause using the default value for adaptive_pause_threshold , telling it to pause for 3 seconds between requests if Drupal's response time to the last request is 2 times slower ( adaptive_pause_threshold 's default value) than the average of the last 20 requests: adaptive_pause: 3 In the next example, we override adaptive_pause_threshold 's default by including the setting in the configuration: adaptive_pause: 2 adaptive_pause_threshold: 2.5 In this example, adaptive pausing kicks in only if the response time for the most recent request is 2.5 times the average of the response time for the last 20 requests. You can increment adaptive_pause_threshold 's value by .5 (e.g., 2.5, 3, 3.5, etc.) until you find a sweet spot that balances reliability with overall execution time. You can also decrease or increase the value of adaptive_pause incrementally by intervals of .5 to further refine the balance - increasing adaptive_pause 's value lessens Workbench's impact on Drupal at the expense of speed, and decreasing its value increases speed but also impact on Drupal. Since adaptive_pause doesn't have a default value, you need to define its value in your configuration file. 
Because of this, using adaptive_pause_threshold on its own in a configuration, e.g.: adaptive_pause_threshold: 3 doesn't do anything because adaptive_pause has no value. In other words, you can use adaptive_pause on its own, or you can use adaptive_pause and adaptive_pause_threshold together, but it's pointless to use adaptive_pause_threshold on its own.","title":"Adaptive pause"},{"location":"reducing_load/#logging-drupals-response-time","text":"If a request is paused by adaptive pausing, Workbench will automatically log the response time for the next request, indicating that adaptive_pause has temporarily halted execution. If you want to log Drupal's response time regardless of whether adaptive_pause had kicked in or not, add log_response_time: true to your configuration file. All logging of response time includes variation from the average of the last 20 response times.","title":"Logging Drupal's response time"},{"location":"reducing_load/#caching","text":"The second way that Workbench reduces stress on Drupal is by caching its outbound HTTP requests, thereby reducing the overall number of requests. This caching both reduces the load on Drupal and speeds up Workbench considerably. By default, this caching is enabled for requests that Workbench makes more than once and that are expected to have the same response each time, such as requests for field configurations or for checks for the existence of taxonomy terms. Note that you should not normally have to disable this caching, but if you do (for example, if you are asked to during troubleshooting), you can do so by including the following setting in your configuration file: enable_http_cache: false","title":"Caching"},{"location":"roadmap/","text":"Islandora Workbench development priorities for Fall 2021 are: ability to create taxonomy terms with fields ( issue ) - Done. ability to create hierarchical taxonomy terms ( issue ) - Done. ability to add/update remote video and audio ( issue ) ability to add/update multilingual field data ( issue ) ability to add/update a non-base Drupal title field ( issue ) - Done. allow --check to report more than one error at a time ( issue ) - Done. a desktop spreadsheet editor ( issue ) ability to add/update Paragraph fields ( issue ) - Done. support the TUS resumable upload protocol ( issue )","title":"Roadmap (Fall 2021)"},{"location":"rolling_back/","text":"In the create and create_from_files tasks, Workbench generates a configuration file and accompanying input CSV file in the format described in the \"Deleting nodes\" documentation. These files allow you to easily roll back (i.e., delete) all the nodes and accompanying media you just created. Specifically, this configuration file defines a delete task. See the \" Deleting nodes \" section for more information. Running --check will also write an entry in your log file indicating the location of the rollback config and CSV files, and it will test to ensure that the rollback config and CSV files can be written. If either one cannot, Workbench exits with an error. Warning The rollback configuration file contains the username used in the create or create_from_files task that generates it. If you also want to include the accompanying password, add include_password_in_rollback_config_file: true to your configuration. Also note that rollback configuration and CSV files are overwritten every time you run a create or create_from_files task. It is highly recommended that you timestamp your rollback files (see below) to produce unique filenames.
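For example, a create task configuration that both includes the password in the rollback configuration file and timestamps the rollback filenames (using the timestamp_rollback setting described below) might look something like this (a sketch only; the host, credentials, and input CSV filename are placeholders): task: create host: \"http://localhost:8000\" username: admin password: islandora input_csv: metadata.csv include_password_in_rollback_config_file: true timestamp_rollback: true With settings like these, the generated rollback configuration file contains the password used by the create task, and both rollback filenames include a timestamp so later runs do not overwrite them.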
By default, the configuration file is named \"rollback.yml\" and is written into the Workbench directory. The input CSV file is named \"rollback.csv\" and is written into the directory defined in your input_dir configuration setting. If either of these files exists, it is overwritten during the next create or create_from_files task. You can also specify the relative (to workbench) or absolute path to your rollback config and CSV files by including rollback_config_file_path and rollback_csv_file_path , respectively, in your configuration. To roll back all the nodes and media you just created, run ./workbench --config rollback.yml . There are several configuration settings that let you control the names of these two files, and there is also an option to include comments in the files, as described below. Note When secondary tasks are configured, each task will get its own rollback file. Each secondary task's rollback file will have a normalized version of the path to the task's configuration file appended to the rollback filename, e.g., rollback.csv.home_mark_hacking_islandora_workbench_example_secondary_task . Using these separate rollback files, you can delete only the nodes created in a specific secondary task. Setting the directory where the rollback CSV file is written You can determine where the rollback CSV file is written by including the rollback_dir setting in your configuration. This overrides the default location defined in input_dir . The rollback configuration file is always written to the Workbench working directory unless that behavior is overridden by rollback_config_file_path . Using rollback filename templates For both the rollback config file and the rollback CSV file, you can define a template that provides four placeholders: $config_filename , which contains the name of the create (or create_from_files ) configuration file $input_csv_filename , which contains the name of the input CSV file (only available in create tasks) csv_start_row , which contains the value of the csv_start_row configuration setting, or \"0\" if none is set (only available in create tasks) csv_stop_row , which contains the value of the csv_stop_row configuration setting, or \"None\" if not set (only available in create tasks). You can embed these placeholders in your template, which is then used to create the names of the two files. The templates for the two filenames are defined in two separate configuration settings. For the rollback configuration file, this setting is rollback_config_filename_template , e.g.: rollback_config_filename_template: my-custom_filename_${config_filename}_$input_csv_filename Assuming the configuration file for the create task that generates this rollback configuration has the name \"mjtest.yml\", and the input_csv file has the filename \"sample.csv\", this template will result in a configuration file named my-custom_filename_mjtest_sample.yml . Note that the \".yml\" extension is added automatically; you do not need to include it in your template. For the rollback CSV file, the configuration setting is rollback_csv_filename_template , e.g.: rollback_csv_filename_template: my_custom_rollback_filename_${config_filename}_$input_csv_filename Using the same create task configuration filename and input CSV filename as in the previous example, this template will result in a CSV file named my_custom_rollback_filename_mjtest_sample.csv . The \".csv\" extension is added automatically; you do not need to include it in your template.
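These settings can be combined. For example, a create task configuration that writes the rollback CSV to a specific directory and templates both filenames might include something like the following (a sketch only; the directory and template names are arbitrary examples): rollback_dir: /tmp/rollback_output rollback_config_filename_template: rollback_${config_filename} rollback_csv_filename_template: rollback_${config_filename}_$input_csv_filename Here the rollback CSV file is written to /tmp/rollback_output , and both generated filenames incorporate the name of the create task's configuration file.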
Warning Python's built-in templating system has a quirk: when a character that is valid in a Python variable name follows a template placeholder, that character is added to the template placeholder. When templating filenames, the two most common characters of this type are _ and - . When this happens, Workbench will exit with the error message \"One or more parts of the configured rollback configuration filename template ([your template here]) need adjusting.\" You can work around this behavior by wrapping your template variable name, without the leading $ , in {} . The example rollback config filename template above ( my-custom_filename_$config_filename_$input_csv_filename ) will trigger this error because the _ following the $config_filename placeholder is valid in Python variable names. If you see this type of message, adjust your template to my-custom_filename_${config_filename}_$input_csv_filename . Adding a timestamp to the rollback filenames By default, Workbench overwrites the rollback configuration and CSV files each time it runs, so these files only apply to the most recent create and create_from_files runs. If you add timestamp_rollback: true to your configuration file, a (to-the-second) timestamp will be added to the rollback.yml and corresponding rollback.csv files, for example, rollback.2024_11_03_21_10_28.yml and rollback.2024_11_03_21_10_28.csv . The name of the CSV is also written to workbench.log . Running ./workbench --config rollback.2024_11_03_21_10_28.yml will delete the nodes identified in input_data/rollback.2024_11_03_21_10_28.csv . Timestamps are added in the same way to custom rollback configuration and CSV filenames created using templates. Adding custom comments to your rollback configuration and CSV files Workbench always adds two lines of comments to rollback configuration and CSV files, indicating when the files were generated and the names of the configuration and input CSV files they were generated from, like this: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". You can add additional, custom comment lines by including the rollback_file_comments configuration setting in your create or create_from_files configuration, like this: rollback_file_comments: - Keep this file! It might be useful if something goes wrong with this job. - Have a nice day! This will result in the following comments in your rollback configuration and CSV files: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". # Keep this file! It might be useful if something goes wrong with this job. # Have a nice day!","title":"Rolling back nodes and media"},{"location":"rolling_back/#setting-the-directory-where-the-rollback-cvs-file-is-written","text":"You can determine where the rollback CSV file is written by including the rollback_dir setting in your configuration. This overrides the default location defined in input_dir .
The rollback configuration file is always written to the Workbench working directory unless that behavior is overwritten by rollback_config_file_path .","title":"Setting the directory where the rollback CVS file is written"},{"location":"rolling_back/#using-rollback-filename-templates","text":"For both the rollback config file and the rollback CSV file, you can define a template that provides four placeholders: $config_filename , which contains the name of the create (or create_from_files ) configuration file $input_csv_filename , which contain the name of the input CSV file (only available in create tasks) csv_start_row , which contains the value of the csv_start_row configuration setting, or \"0\" if none is set (only available in create tasks) csv_stop_row , which contains the value of the csv_stop_row configuration setting, or \"None\" if not set (only available in create tasks). You can embed these placeholders in your template, which is then used to create the names of the two files. The templates for the two filenames are defined in two separate configuration settings. For the rollback configuration file, this setting is rollback_config_filename_template , e.g.: rollback_config_filename_template: my-custom_filename_${config_filename}_$input_csv_filename Assuming the configuration file for the create task that generates this rollback configuration has the name \"mjtest.yml\", and the input_csv filename has the filename \"sample.csv\", this template will result in a configuration file named my-custom_filename_mjtest_sample.yml . Note that the \".yml\" extension is added automatically; you do not need to include it in your template. For the rollback CSV file, the configuration setting is rollback_csv_filename_template , e.g.: rollback_csv_filename_template: my_custom_rollback_filename_${config_filename}_$input_csv_filename Using the same create task configuration filename and input CSV filename as in the previous example, this template will result in a CSV file named my_custom_rollback_filename_mjtest_sample.csv . The \".csv\" extension is added automatically; you do not need to include it in your template. Warning Python's built-in templating system has a quirk where when a character that is valid in a Python variable name follows a template placeholder, that character is added to the template placeholder. When templating filenames, the two most common characters of this type are _ and - . When this happens, Workbench will exit with the error message \"One or more parts of the configured rollback configuration filename template ([your template here]) need adjusting.\" You can work around this behavior by wrapping your template variable name, without the leading $ , in {} . The example rollback config filename template above ( my-custom_filename_$config_filename_$input_csv_filename ) will trigger this error because the _ following the $config_filename placeholder is valid in Python variable names. If you see this type of message, adjust your template to my-custom_filename_${config_filename}_$input_csv_filename .","title":"Using rollback filename templates"},{"location":"rolling_back/#adding-a-timestamp-to-the-rollback-filenames","text":"By default, Workbench overwrites the rollback configuration and CSV files each time it runs, so these files only apply to the most recent create and create_from_files runs. 
If you add timestamp_rollback: true to your configuration file, a (to-the-second) timestamp will be added to the rollback.yml and corresponding rollback.csv files, for example, rollback.2024_11_03_21_10_28.yml and rollback.2024_11_03_21_10_28.csv . The name of the CSV is also written to workbench.log . Running ./workbench --config rollback.2024_11_03_21_10_28.yml will delete the nodes identified in input_data/rollback.2024_11_03_21_10_28.csv . Timestamps are added in the same way to custom rollback configuration and CSV filenames created using templates.","title":"Adding a timestamp to the rollback filenames"},{"location":"rolling_back/#adding-custom-comments-to-your-rollback-configuration-and-csv-files","text":"Workbench always adds two lines of comments to rollback configuration and CSV files, indicating when the files were generated and the names of the configuration and input CSV files they were generated from, like this: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". You can add additional, custom comment lines by including the rollback_file_comments configuration setting in your create or create_from_files configuration, like this: rollback_file_comments: - Keep this file! It might be useful if something goes wrong with this job. - Have a nice day! This will result in the following comments in your rollback configuration and CSV files: # Generated by a \"create\" task started 2024:11:10 09:52:57 using # config file \"mjtest.yml\" and input CSV \"sample.csv\". # Keep this file! It might be useful if something goes wrong with this job. # Have a nice day!","title":"Adding custom comments to your rollback configuration and CSV files"},{"location":"sample_data/","text":"Islandora Workbench comes with some sample data. Running ./workbench --config create.yml --check will result in the following output: OK, connection to Drupal at http://localhost:8000 verified. OK, configuration file has all required values (did not check for optional values). OK, CSV file input_data/metadata.csv found. OK, all 5 rows in the CSV file have the same number of columns as there are headers (5). OK, CSV column headers match Drupal field names. OK, required Drupal fields are present in the CSV file. OK, term IDs/names in CSV file exist in their respective taxonomies. OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies. OK, files named in the CSV \"file\" column are all present. Configuration and input data appear to be valid. Then running Workbench without --check will result in something like: Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52. +File media for IMG_1410.tif created. Node for 'Manhatten Island' created at http://localhost:8000/node/53. +File media for IMG_2549.jp2 created. Node for 'Looking across Burrard Inlet' created at http://localhost:8000/node/54. +Image media for IMG_2940.JPG created. Node for 'Amsterdam waterfront' created at http://localhost:8000/node/55. +Image media for IMG_2958.JPG created. Node for 'Alcatraz Island' created at http://localhost:8000/node/56. +Image media for IMG_5083.JPG created.","title":"Creating nodes from the sample data"},{"location":"troubleshooting/","text":"If you encounter a problem, take a look at the \"things that might sound familiar\" section below. But, if the problem you're encountering isn't described below, you can ask for help.
Ask for help The #islandoraworkbench Slack channel is a good place to ask a question if Workbench isn't working the way you expect or if it crashes. You can also open a Github issue . If Workbench \"isn't working the way you expect\", the documentation is likely unclear. Crashes are usually caused by sloppy Python coding. Reporting either is a great way to contribute to Islandora Workbench. But before you ask... The first step you should take while troubleshooting a Workbench failure is to use Islandora's graphical user interface to create/edit/delete a node/media/taxonomy term (or whatever it is you're trying to do with Workbench). If Islandora works without error, you have confirmed that the problem you are experiencing is likely isolated to Workbench and is not being caused by an underlying problem with Islandora. Next, if you have eliminated Islandora as the cause of the Workbench problem you are experiencing, you might be able to fix the problem by pulling in the most recent Workbench code. The best way to keep it up to date is to pull in the latest commits from the Github repository periodically, but if you haven't done that in a while, within the \"islandora_workbench\" directory, run the following git commands: git branch , which should tell you whether you're currently in the \"main\" branch. If you are: git pull , which will fetch the most recent code and merge it into the code you are running. If git tells you it has pulled in any changes to Workbench, you will be running the latest code. If you get an error while running git, ask for help. Also, you might be asked to provide one or more of the following: your configuration file (without username and password!). You can also print your configuration to your terminal by including the --print_config argument to workbench, e.g. python workbench --config test.yml --check --print_config . some sample input CSV your Workbench log file details from your Drupal log, available at Admin > Reports > Recent log messages whether you have certain contrib modules installed, or other questions about how your Drupal is configured. Some things that might sound familiar Running Workbench results in lots of messages like InsecureRequestWarning: Unverified HTTPS request is being made to host 'islandora.dev' . If you see this, and you are using an ISLE installation whose Drupal hostname uses the traefik.me domain (for example, https://islandora.traefik.me), the HTTPS certificate for the domain has expired. This problem will be widespread so please check the Islandora Slack for any current discussion about it. If you don't get any help in Slack, try redownloading the certificates following the instructions in the Islandora documentation . If that doesn't work, you can temporarily avoid the warning messages by adding secure_ssl_only: false to your Workbench config file and updating an environment variable using the following command: export PYTHONWARNINGS=\"ignore:Unverified HTTPS request\" Workbench is slow. True, it can be slow. However, users have found that the following strategies increase Workbench's speed substantially: Running Workbench on the same server that Drupal is running on (e.g. using \"localhost\" as the value of host in your config file). While doing this negates Workbench's most important design principle - that it does not require access to the Drupal server's command line - during long-running jobs such as those that are part of migrations, this is the best way to speed up Workbench. Using local instead of remote files.
If you populate your file or \"additional files\" fields with filenames that start with \"http\", Workbench downloads each of those files before ingesting them. Providing local copies of those files in advance of running Workbench will eliminate the time it takes Workbench to download them. Avoid confirming taxonomy terms' existence during --check . If you add validate_terms_exist: false to your configuration file, Workbench will not query Drupal for each taxonomy term during --check . This option is suitable if you know that the terms don't exist in the target Drupal. Note that this option only speeds up --check ; it does not have any effect when creating new nodes. Generate FITS XML files offline and then add them as media like any other file. Doing this allows Islandora to not generate FITS data, which can slow down ingests substantially. If you do this, be sure to disable the \"FITS derivatives\" Context first. I've pulled in updates to Islandora Workbench from Github but when I run it, Python complains about not being able to find a library. This won't happen very often, and the cause of this message will likely have been announced in the #islandoraworkbench Slack channel. This error is caused by the addition of a new Python library to Workbench. Running setup.py will install the missing library. Details are available in the \"Updating Islandora Workbench\" section of the Requirements and Installation docs. You do not need to run setup.py every time you update the Workbench code. Introducing a new library is not a common occurrence. Workbench is failing to ingest some nodes and is leaving messages in the log mentioning HTTP response code 422. This is probably caused by unexpected data in your CSV file that Workbench's --check validation is not finding. If you encounter these messages, please open an issue and share any relevant entries in your Workbench log and Drupal log (as an admin user, go to Admin > Reports > Recent log messages) so we can track down the problem. One of the most common causes of this error is that one or more of the vocabularies being populated in your create task CSV contain required fields other than the default term name. It is possible to have Workbench create these fields, but you must do so as a separate create_terms task. See \" Creating taxonomy terms \" for more information. Workbench is crashing and telling me there are problems with SSL certificates. To determine if this issue is specific to Workbench, from the same computer Workbench is running on, try hitting your Drupal server (or server your remote files are on) with curl . If curl also complains about SSL certificates, the problem lies in the SSL/HTTPS configuration on the server. An example curl command is curl https://wwww.lib.sfu.ca . If curl doesn't complain, the problem is specific to Workbench. --check is telling me that one of the rows in my CSV file has more columns than headers. The most likely problem is that one of your CSV values contains a comma but is not wrapped in double quotes. My Drupal has the \"Standalone media URL\" option at /admin/config/media/media-settings checked, and I'm using Workbench's standalone_media_url: true option in my config, but I'm still getting lots of errors. Be sure to clear Drupal's cache every time you change the \"Standalone media URL\" option. More information can be found here . Workbench crashes or slows down my Drupal server. If Islandora Workbench is putting too much strain on your Drupal server, you should try enabling the pause configuration option.
If that works, replace it with the adaptive_pause option and see if that also works. The former option pauses between all communication requests between Workbench and Drupal, while the latter pauses only if the server's response time for the last request is longer than the average of the last 20 requests. Note that both of these settings will slow Workbench down, which is their purpose. However, adaptive_pause should have less impact on overall speed since it only pauses between requests if it detects the server is getting slower over time. If you use adaptive_pause , you can also tune the adaptive_pause_threshold option by incrementing the value by .5 intervals (for example, from the default of 2 to 2.5, then 3, etc.) to see if doing so reduces strain on your Drupal server while keeping overall speed acceptable. You can also lower the value of adaptive_pause incrementally to balance strain with overall speed. Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file. Some web applications, including Drupal 7, return a human-readable HTML page instead of an expected HTTP response code when they encounter an error. If Workbench is complaining that a remote file in your file or other file column in your input CSV has an extension of \".htm\" or \".html\" and you know that the file is not an HTML page, what Workbench is seeing is probably an error message. For example, Workbench might leave a message like this in your log: Error: File \"https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view\" in CSV row \"text:302175\" has an extension (html) that is not allowed in the \"field_media_file\" field of the \"file\" media type. This error can be challenging to track down since the HTML error page might have been specific to the request that Workbench just made (e.g. a timeout or some other temporary server condition). One way of determining if the error is temporary (i.e. specific to the request) is to use curl to fetch the file (e.g., curl -o test.tif https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view ). If the returned file (in this example, it will be named test.tif ) is in fact HTML, the error is probably permanent or at least persistent; if the file is the one you expected to retrieve, the error was temporary and you can ignore it. The text in my CSV does not match how it looks when I view it in Drupal. If a field is configured in Drupal to use text filters , the HTML that is displayed to the user may not be exactly the same as the content of the node add/edit form field. If you check the node add/edit form, the content of the field should match the content of the CSV field. If it does, it is likely that Drupal is applying a text filter. See this issue for more information. My Islandora uses a custom media type and I need to tell Workbench what file field to use.
If you need to create a media that is not one of the standard Islandora types (Image, File, Digital Document, Video, Audio, Extracted Text, or FITS Technical metadata), you will need to include the media_file_fields setting in your config file, like this: media_file_fields: - mycustommedia_machine_name: field_custom_file - myothercustommedia_machine_name: field_other_custom_file This configuration setting adds entries to the following default mapping of media types to file field names: 'file': 'field_media_file', 'document': 'field_media_document', 'image': 'field_media_image', 'audio': 'field_media_audio_file', 'video': 'field_media_video_file', 'extracted_text': 'field_media_file', 'fits_technical_metadata': 'field_media_file' EDTF 'interval' values are not rendering in Islandora properly. Islandora can display EDTF interval values (e.g., 2004-06/2006-08 , 193X/196X ) properly, but by default, the configuration that allows this is disabled (see this issue for more information). To enable it, for each field in your Islandora content types that use EDTF fields, visit the \"Manage form display\" configuration for the content type, and for each field that uses the \"Default EDTF widget\", within the widget configuration (click on the gear), check the \"Permit date intervals\" option and click \"Update\": My CSV file has a url_alias column, but the aliases are not being created. First thing to check is whether you are using the Pathauto module. It also creates URL aliases, and since by default Drupal only allows one URL alias, in most cases, the aliases it creates will take precedence over aliases created by Workbench. I'm having trouble getting Workbench to work in a cronjob The most common problem you will encounter when running Islandora Workbench in a cronjob is that Workbench can't find its configuration file, or input/output directories. The easiest way to avoid this is to use absolute file paths everywhere, including as the value of Workbench's --config parameter, in configuration files, and in file and additional_files columns in your input CSV files. In some cases, particularly if you are using a secondary task to create pages or child items, you many need to use the path_to_python and path_to_workbench_script configuration settings. get_islandora_7_content.py crashes with the error \"illegal request: Server returned status of 400. The default query may be too long for a url request.\" Islandora 7's Solr contains a lot of redundant fields. You need to reduce the number of fields to export. See the \" Exporting Islandora 7 content \" documentation for ways to reduce the number of fields. Ask in the #islandoraworkbench Slack channel if you need help.","title":"Troubleshooting"},{"location":"troubleshooting/#ask-for-help","text":"The #islandoraworkbench Slack channel is a good place to ask a question if Workbench isn't working the way you expect or if it crashes. You can also open a Github issue . If Workbench \"isn't working the way you expect\", the documentation is likely unclear. Crashes are usually caused by sloppy Python coding. Reporting either is a great way to contribute to Islandora Workbench.","title":"Ask for help"},{"location":"troubleshooting/#but-before-you-ask","text":"The first step you should take while troubleshooting a Workbench failure is to use Islandora's graphical user interface to create/edit/delete a node/media/taxonomy term (or whatever it is you're trying to do with Workbench). 
If Islandora works without error, you have confirmed that the problem you are experiencing is likely isolated to Workbench and is not being caused by an underlying problem with Islandora. Next, if you have eliminated Islandora as the cause of the Workbench problem you are experiencing, you might be able to fix the problem by pulling in the most recent Workbench code. The best way to keep it up to date is to pull in the latest commits from the Github repository periodically, but if you haven't done that in a while, within the \"islandora_workbench\" directory, run the following git commands: git branch , which should tell whether you're currently in the \"main\" branch. If you are: git pull , which will fetch the most recent code and and merge it into the code you are running. If git tells you it has pulled in any changes to Workbench, you will be running the latest code. If you get an error while running git, ask for help. Also, you might be asked to provide one or more of the following: your configuration file (without username and password!). You can also print your configuration to your terminal by includeing the --print_config argument to workbench, e.g. python workbench --config test.yml --check --print_config . some sample input CSV your Workbench log file details from your Drupal log, available at Admin > Reports > Recent log messages whether you have certain contrib modules installed, or other questions about how your Drupal is configured.","title":"But before you ask..."},{"location":"troubleshooting/#some-things-that-might-sound-familiar","text":"","title":"Some things that might sound familiar"},{"location":"troubleshooting/#running-workbench-results-in-lots-of-messages-like-insecurerequestwarning-unverified-https-request-is-being-made-to-host-islandoradev","text":"If you see this, and you are using an ISLE istallation whose Drupal hostname uses the traefik.me domain (for example, https://islandora.traefik.me), the HTTPS certificate for the domain has expired. This problem will be widespread so please check the Islandora Slack for any current discussion about it. If you don't get any help in Slack, try redownloading the certificates following the instructions in the Islandora documentation . If that doesn't work, you can temporarily avoid the warning messages by adding secure_ssl_only: false to your Workbench config file and updating an environment variable using the following command: export PYTHONWARNINGS=\"ignore:Unverified HTTPS request\"","title":"Running Workbench results in lots of messages like InsecureRequestWarning: Unverified HTTPS request is being made to host 'islandora.dev'."},{"location":"troubleshooting/#workbench-is-slow","text":"True, it can be slow. However, users have found that the following strategies increase Workbench's speed substantially: Running Workbench on the same server that Drupal is running on (e.g. using \"localhost\" as the value of host in your config file). While doing this negates Workbench's most important design principle - that it does not require access to the Drupal server's command line - during long-running jobs such as those that are part of migrations, this is the best way to speed up Workbench. Using local instead of remote files. If you populate your file or \"additional files\" fields with filenames that start with \"http\", Workbench downloads each of those files before ingesting them. Providing local copies of those files in advance of running Workbench will eliminate the time it takes Workbench to download them. 
Avoid confirming taxonomy terms' existence during --check . If you add validate_terms_exist: false to your configuration file, Workbench will not query Drupal for each taxonomy term during --check . This option is suitable if you know that the terms don't exist in the target Drupal. Note that this option only speeds up --check ; it does not have any effect when creating new nodes. Generate FITS XML files offline and then add them as media like any other file. Doing this means Islandora does not need to generate FITS data, which can slow down ingests substantially. If you do this, be sure to disable the \"FITS derivatives\" Context first.","title":"Workbench is slow."},{"location":"troubleshooting/#ive-pulled-in-updates-to-islandora-workbench-from-github-but-when-i-run-it-python-complains-about-not-being-able-to-find-a-library","text":"This won't happen very often, and the cause of this message will likely have been announced in the #islandoraworkbench Slack channel. This error is caused by the addition of a new Python library to Workbench. Running setup.py will install the missing library. Details are available in the \"Updating Islandora Workbench\" section of the Requirements and Installation docs. You do not need to run setup.py every time you update the Workbench code. Introducing a new library is not a common occurrence.","title":"I've pulled in updates to Islandora Workbench from Github but when I run it, Python complains about not being able to find a library."},{"location":"troubleshooting/#workbench-is-failing-to-ingest-some-nodes-and-is-leaving-messages-in-the-log-mentioning-http-response-code-422","text":"This is probably caused by unexpected data in your CSV file that Workbench's --check validation is not finding. If you encounter these messages, please open an issue and share any relevant entries in your Workbench log and Drupal log (as an admin user, go to Admin > Reports > Recent log messages) so we can track down the problem. One of the most common causes of this error is that one or more of the vocabularies being populated in your create task CSV contain required fields other than the default term name. It is possible to have Workbench populate these fields, but you must do so as a separate create_terms task. See \" Creating taxonomy terms \" for more information.","title":"Workbench is failing to ingest some nodes and is leaving messages in the log mentioning HTTP response code 422."},{"location":"troubleshooting/#workbench-is-crashing-and-telling-me-there-are-problems-with-ssl-certificates","text":"To determine if this issue is specific to Workbench, from the same computer Workbench is running on, try hitting your Drupal server (or the server your remote files are on) with curl . If curl also complains about SSL certificates, the problem lies in the SSL/HTTPS configuration on the server. An example curl command is curl https://www.lib.sfu.ca .
If curl doesn't complain, the problem is specific to Workbench.","title":"Workbench is crashing and telling me there are problems with SSL certificates."},{"location":"troubleshooting/#-check-is-telling-me-that-one-the-rows-in-my-csv-file-has-more-columns-than-headers","text":"The most likely problem is that one of your CSV values contains a comma but is not wrapped in double quotes.","title":"--check is telling me that one of the rows in my CSV file has more columns than headers."},{"location":"troubleshooting/#my-drupal-has-the-standalone-media-url-option-at-adminconfigmediamedia-settings-checked-and-im-using-workbenchs-standalone_media_url-true-option-in-my-config-but-im-still-getting-lots-of-errors","text":"Be sure to clear Drupal's cache every time you change the \"Standalone media URL\" option. More information can be found here .","title":"My Drupal has the \"Standalone media URL\" option at /admin/config/media/media-settings checked, and I'm using Workbench's standalone_media_url: true option in my config, but I'm still getting lots of errors."},{"location":"troubleshooting/#workbench-crashes-or-slows-down-my-drupal-server","text":"If Islandora Workbench is putting too much strain on your Drupal server, you should try enabling the pause configuration option. If that works, replace it with the adaptive_pause option and see if that also works. The former option pauses between all communication requests between Workbench and Drupal, while the latter pauses only if the server's response time for the last request is longer than the average of the last 20 requests. Note that both of these settings will slow Workbench down, which is their purpose. However, adaptive_pause should have less impact on overall speed since it only pauses between requests if it detects the server is getting slower over time. If you use adaptive_pause , you can also tune the adaptive_pause_threshold option by incrementing the value by .5 intervals (for example, from the default of 2 to 2.5, then 3, etc.) to see if doing so reduces strain on your Drupal server while keeping overall speed acceptable. You can also lower the value of adaptive_pause incrementally to balance strain with overall speed.","title":"Workbench crashes or slows down my Drupal server."},{"location":"troubleshooting/#workbench-thinks-that-a-remote-file-is-an-html-file-when-i-know-its-a-video-or-audio-or-image-etc-file","text":"Some web applications, including Drupal 7, return a human-readable HTML page instead of an expected HTTP response code when they encounter an error. If Workbench is complaining that a remote file in your file or other file columns in your input CSV has an extension of \".htm\" or \".html\" and you know that the file is not an HTML page, what Workbench is seeing is probably an error message. For example, Workbench might leave a message like this in your log: Error: File \"https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view\" in CSV row \"text:302175\" has an extension (html) that is not allowed in the \"field_media_file\" field of the \"file\" media type. This error can be challenging to track down since the HTML error page might have been specific to the request that Workbench just made (e.g. a timeout or some other temporary server condition). One way of determining if the error is temporary (i.e. specific to the request) is to use curl to fetch the file (e.g., curl -o test.tif https://digital.lib.sfu.ca/islandora/object/edcartoons:158/datastream/OBJ/view ).
If the returned file (in this example, it will be named test.tif ) is in fact HTML, the error is probably permanent or at least persistent; if the file is the one you expected to retrieve, the error was temporary and you can ignore it.","title":"Workbench thinks that a remote file is an .html file when I know it's a video (or audio, or image, etc.) file."},{"location":"troubleshooting/#the-text-in-my-csv-does-not-match-how-it-looks-when-i-view-it-in-drupal","text":"If a field is configured in Drupal to use text filters , the HTML that is displayed to the user may not be exactly the same as the content of the node add/edit form field. If you check the node add/edit form, the content of the field should match the content of the CSV field. If it does, it is likely that Drupal is apply a text filter. See this issue for more information.","title":"The text in my CSV does not match how it looks when I view it in Drupal."},{"location":"troubleshooting/#my-islandora-uses-a-custom-media-type-and-i-need-to-tell-workbench-what-file-field-to-use","text":"If you need to create a media that is not one of the standard Islandora types (Image, File, Digital Document, Video, Audio, Extracted Text, or FITS Technical metadata), you will need to include the media_file_fields setting in your config file, like this: media_file_fields: - mycustommedia_machine_name: field_custom_file - myothercustommedia_machine_name: field_other_custom_file This configuration setting adds entries to the following default mapping of media types to file field names: 'file': 'field_media_file', 'document': 'field_media_document', 'image': 'field_media_image', 'audio': 'field_media_audio_file', 'video': 'field_media_video_file', 'extracted_text': 'field_media_file', 'fits_technical_metadata': 'field_media_file'","title":"My Islandora uses a custom media type and I need to tell Workbench what file field to use."},{"location":"troubleshooting/#edtf-interval-values-are-not-rendering-in-islandora-properly","text":"Islandora can display EDTF interval values (e.g., 2004-06/2006-08 , 193X/196X ) properly, but by default, the configuration that allows this is disabled (see this issue for more information). To enable it, for each field in your Islandora content types that use EDTF fields, visit the \"Manage form display\" configuration for the content type, and for each field that uses the \"Default EDTF widget\", within the widget configuration (click on the gear), check the \"Permit date intervals\" option and click \"Update\":","title":"EDTF 'interval' values are not rendering in Islandora properly."},{"location":"troubleshooting/#my-csv-file-has-a-url_alias-column-but-the-aliases-are-not-being-created","text":"First thing to check is whether you are using the Pathauto module. It also creates URL aliases, and since by default Drupal only allows one URL alias, in most cases, the aliases it creates will take precedence over aliases created by Workbench.","title":"My CSV file has a url_alias column, but the aliases are not being created."},{"location":"troubleshooting/#im-having-trouble-getting-workbench-to-work-in-a-cronjob","text":"The most common problem you will encounter when running Islandora Workbench in a cronjob is that Workbench can't find its configuration file, or input/output directories. The easiest way to avoid this is to use absolute file paths everywhere, including as the value of Workbench's --config parameter, in configuration files, and in file and additional_files columns in your input CSV files. 
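For example, a crontab entry that runs Workbench nightly using only absolute paths might look like the following (the schedule, Python path, and directory names here are illustrative and will differ on your system): 30 22 * * * /usr/bin/python3 /home/user/islandora_workbench/workbench --config /home/user/islandora_workbench/nightly_create.yml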
In some cases, particularly if you are using a secondary task to create pages or child items, you may need to use the path_to_python and path_to_workbench_script configuration settings.","title":"I'm having trouble getting Workbench to work in a cronjob"},{"location":"troubleshooting/#get_islandora_7_contentpy-crashes-with-the-error-illegal-request-server-returned-status-of-400-the-default-query-may-be-too-long-for-a-url-request","text":"Islandora 7's Solr contains a lot of redundant fields. You need to reduce the number of fields to export. See the \" Exporting Islandora 7 content \" documentation for ways to reduce the number of fields. Ask in the #islandoraworkbench Slack channel if you need help.","title":"get_islandora_7_content.py crashes with the error \"illegal request: Server returned status of 400. The default query may be too long for a url request.\""},{"location":"updating_media/","text":"You can update media by providing a CSV file with a media_id column and other columns representing the fields of the media that you wish to update. Here is a very simple example CSV that updates the file attached to the media with ID 100: media_id,file 100,test.txt Values in the media_id column can be numeric media IDs (as illustrated above), full URLs, or full URL aliases. The minimum configuration file for \"update_media\" tasks looks like this (note the task option is update_media ): task: update_media host: \"http://localhost:8000\" username: admin password: islandora input_csv: update_media.csv media_type: file media_type is required, and its value is the Drupal machine name of the type of media you are updating (e.g. image , document , file , video , etc.) Currently, the update_media routine has support for the following operations: Updating files attached to media Updating the set of track files attached to media Updating the Media Use TIDs associated with media Updating the published status of media Updating custom fields of any supported field type When updating field values on the media, the update_mode configuration option allows you to determine whether the field values are replaced, appended, or deleted: replace (the default) will replace all existing values in a field with the values in the input CSV. append will add values in the input CSV to any existing values in a field. delete will delete all values in a field. Updating files attached to media Note Updating files attached to media is currently only supported for media attached to a node. Warning This operation will delete the existing file attached to the media and replace it with the file specified in the CSV file. The update_mode setting has no effect on replacing files. To update the file attached to a media, you must provide a CSV file with, at minimum, a media_id column and a file column. The media_id column should contain the ID of the media you wish to update, and the file column should contain the path to the file you wish to attach to the media (always use file and not the media-type-specific file fieldname). Here is an example CSV that updates the file attached to the media with ID 100: media_id,file 100,test.txt Values in the file column can be paths to files on the local filesystem, full URLs, or full URL aliases. Updating the track files attached to media Note This functionality is currently only supported for media attached to a node. Note This operation will delete all existing track files attached to the media and replace them with the track files specified in the CSV file.
To update the set of track files attached to a media, you must provide a CSV file with, at minimum, a media_id column and a column with a name that matches the media_track_file_fields setting in the configuration file. By default, the media_track_file_fields setting in the configuration file is set to field_track for both audio and video. If you have a custom setup in which the field on the media that holds the track file has a different machine name and you need to override these defaults, you can do so using the media_track_file_fields configuration setting: media_track_file_fields: audio: field_my_custom_track_file_field mycustommedia: field_another_custom_track_file_field For the purposes of this example, we will assume that the media_track_file_fields setting is set to field_track for both audio and video. The media_id column should contain the ID of the media you wish to update, and the field_track column should contain information related to the track files you wish to attach to the media. The field_track column has a special format, which requires you to specify the following information separated by colons, in exactly the following order: - The label for the track - Its \"kind\" (one of \"subtitles\", \"descriptions\", \"metadata\", \"captions\", or \"chapters\") - The language code for the track (\"en\", \"fr\", \"es\", etc.) - The absolute path to the track file, which must have the extension \".vtt\" (the extension may be in upper or lower case) Here is an example CSV that updates the set of track files attached to the media with ID 100: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt You may also wish to attach multiple track files to a single media. To do this, you can specify items in the field_track column separated by the subdelimiter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the set of track files attached to the media with ID 100 to have multiple track files: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt|French Subtitles:subtitles:fr:/path/to/french_subtitles.vtt Updating the media use TIDs associated with media To update the Media Use TIDs associated with media, you must provide a CSV file with, at minimum, a media_id column and a media_use_tid column. The media_id column should contain the ID of the media you wish to update, and the media_use_tid column should contain the TID(s) of the media use term(s) you wish to associate with the media. If a value is not specified for the media_use_tid column in a particular row, the value for the media_use_tid setting in the configuration file (Service File by default) will be used. Here is an example CSV that updates the Media Use TID associated with the media with ID 100: media_id,media_use_tid 100,17 You may also wish to associate multiple Media Use TIDs with a single media. To do this, you can specify items in the media_use_tid column separated by the subdelimiter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the Media Use TIDs associated with the media with ID 100 to have multiple Media Use TIDs: media_id,media_use_tid 100,17|18 Values in the media_use_tid column can be the taxonomy term ID of the media use or the taxonomy term URL alias. Updating the published status of media To update the published status of media, you must provide a CSV file with, at minimum, a media_id column and a status column.
The media_id column should contain the ID of the media you wish to update, and the status column should contain one of the following case-insensitive values: \"1\" or \"True\" (to publish the media) \"0\" or \"False\" (to unpublish the media) Here is an example CSV that updates the published status of some media: media_id,status 100,tRuE 101,0 Updating custom fields attached to media To update custom fields attached to media, you must provide a CSV file with, at minimum, a media_id column and columns with the machine names of the fields you wish to update. The media_id column should contain the ID of the media you wish to update, and the other columns should contain the values you wish to set for the fields. Here is an example CSV that updates the published status of some media: media_id,name,field_my_custom_field 100,My Media,My Custom Value Leaving fields unchanged If you wish to leave a field unchanged, you can leave it blank in the column for that field. Here is an example CSV that updates the published status of some media and leaves others unchanged: media_id,status 100,1 101, 102,0","title":"Updating media"},{"location":"updating_media/#updating-files-attached-to-media","text":"Note Updating files attached to media is currently only supported for media attached to a node. Warning This operation will delete the existing file attached to the media and replace it with the file specified in the CSV file. The update_mode setting has no effect on replacing files. To update the file attached to a media, you must provide a CSV file with, at minimum, a media_id column and a file column. The media_id column should contain the ID of the media you wish to update, and the file column should contain the path to the file you wish to attach to the media (always use file and not the media-type-specific file fieldname). Here is an example CSV that updates the file attached to the media with ID 100: media_id,file 100,test.txt Values in the file column can be paths to files on the local filesystem, full URLs, or full URL aliases.","title":"Updating files attached to media"},{"location":"updating_media/#updating-the-track-files-attached-to-media","text":"Note This functionality is currently only supported for media attached to a node. Note This operation will delete all existing track files attached to the media and replace them with the track files specified in the CSV file. To update the set of track files attached to a media, you must provide a CSV file with, at minimum, a media_id column and a column with a name that matches the media_track_file_fields setting in the configuration file. By default, the media_track_file_fields setting in the configuration file is set to field_track for both audio and video. If you have a custom setup that has a different machine name of the field on the media that holds the track file and need to override these defaults, you can do so using the media_track_file_fields configuration setting: media_track_file_fields: audio: field_my_custom_track_file_field mycustommmedia: field_another_custom_track_file_field For the purposes of this example, we will assume that the media_track_file_fields setting is set to field_track for both audio and video. The media_id column should contain the ID of the media you wish to update, and the field_track column should contain information related to the track files you wish to attach to the media. 
The field_track has a special format, which requires you to specify the following information separated by colons, in exactly the following order: - The label for the track - Its \"kind\" (one of \"subtitles\", \"descriptions\", \"metadata\", \"captions\", or \"chapters\") - The language code for the track (\"en\", \"fr\", \"es\", etc.) - The absolute path to the track file, which must have the extension \".vtt\" (the extension may be in upper or lower case) Here is an example CSV that updates the set of track files attached to the media with ID 100: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt You may also wish to attach multiple track files to a single media. To do this, you can specify items in the field_track column separated by the subdelimeter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the set of track files attached to the media with ID 100 to have multiple track files: media_id,field_track 100,English Subtitles:subtitles:en:/path/to/subtitles.vtt|French Subtitles:subtitles:fr:/path/to/french_subtitles.vtt","title":"Updating the track files attached to media"},{"location":"updating_media/#updating-the-media-use-tids-associated-with-media","text":"To update the Media Use TIDs associated with media, you must provide a CSV file with, at minimum, a media_id column and a media_use_tid column. The media_id column should contain the ID of the media you wish to update, and the media_use_tid column should contain the TID(s) of the media use term(s) you wish to associate with the media. If a value is not specified for the media_use_tid column in a particular row, the value for the media_use_tid setting in the configuration file (Service File by default) will be used. Here is an example CSV that updates the Media Use TID associated with the media with ID 100: media_id,media_use_tid 100,17 You may also wish to associate multiple Media Use TIDs with a single media. To do this, you can specify items in the media_use_tid column separated by the subdelimeter specified in the configuration file (by default, this is the pipe character, \"|\"). Here is an example CSV that updates the Media Use TIDs associated with the media with ID 100 to have multiple Media Use TIDs: media_id,media_use_tid 100,17|18 Values in the media_use_tid column can be the taxonomy term ID of the media use or the taxonomy term URL alias.","title":"Updating the media use TIDs associated with media"},{"location":"updating_media/#updating-the-published-status-of-media","text":"To update the published status of media, you must provide a CSV file with, at minimum, a media_id column and a status column. The media_id column should contain the ID of the media you wish to update, and the status column should contain one of the following case-insensitive values: \"1\" or \"True\" (to publish the media) \"0\" or \"False\" (to unpublish the media) Here is an example CSV that updates the published status of some media: media_id,status 100,tRuE 101,0","title":"Updating the published status of media"},{"location":"updating_media/#updating-custom-fields-attached-to-media","text":"To update custom fields attached to media, you must provide a CSV file with, at minimum, a media_id column and columns with the machine names of the fields you wish to update. The media_id column should contain the ID of the media you wish to update, and the other columns should contain the values you wish to set for the fields. 
Here is an example CSV that updates the published status of some media: media_id,name,field_my_custom_field 100,My Media,My Custom Value","title":"Updating custom fields attached to media"},{"location":"updating_media/#leaving-fields-unchanged","text":"If you wish to leave a field unchanged, you can leave it blank in the column for that field. Here is an example CSV that updates the published status of some media and leaves others unchanged: media_id,status 100,1 101, 102,0","title":"Leaving fields unchanged"},{"location":"updating_nodes/","text":"You can update existing nodes by providing a CSV file with a node_id column plus field data you want to update. The type of update is determined by the value of the update_mode configuration option: replace (the default) will replace all existing values in a field with the values in the input CSV. append will add values in the input CSV to any existing values in a field. delete will delete all values in a field. The column headings in the CSV file other than node_id must match machine names of fields that exist in the target Islandora content type. Only include fields that you want to update. Workbench can update fields following the same CSV conventions used when creating nodes as described in the \" Fields \" documentation. For example, using the fields defined by the Islandora Defaults module for the \"Repository Item\" content type, your CSV file could look like this: node_id,field_description,field_rights,field_access_terms,field_member_of 100,This is my new title,I have changed my mind. This item is yours to keep.,27,45 The config file for update operations looks like this (note the task option is 'update'): task: update host: \"http://localhost:8000\" username: admin password: islandora content_type: my_content_type input_csv: update.csv If you want to append the values in the CSV to values that already exist in the target nodes, add the update_mode configuration option: task: update host: \"http://localhost:8000\" username: admin password: islandora content_type: my_content_type input_csv: update.csv update_mode: append Some things to note: If your target Drupal content type is not islandora_object (the default value), you must include content_type in your configuration file as illustrated above. The update_mode applies to all rows in your CSV; it cannot be specified for particular rows. Updates apply to entire fields. Workbench cannot replace individual values in field. Values in the node_id column can be numeric node IDs (e.g. 467 ) or full URLs, including URL aliases. If a node you are updating doesn't have a field named in your input CSV, Workbench will skip updating the node and add a log entry to that effect. For update tasks where the update_mode is \"replace\" or \"append\", blank/empty CSV values will do nothing; in other words, empty CSV values tell Workbench to not update the field. For update tasks where the update_mode is \"delete\", it doesn't matter if the column(s) in the input CSV are blank or contain values - the values in the corresponding Drupal fields are deleted in both cases. Islandora Workbench will never allow a field to contain more values than the field's configuration allows. Attempts to update a field with more values than the maximum number allowed will result in the surplus values being ignored during the \"update\" task. 
If Workbench does this, it will write an entry to the log indicating it has done so.","title":"Updating nodes"},{"location":"updating_terms/","text":"You can update existing taxonomy terms in an update_terms task by providing a CSV file with a term_id column plus field data you want to update. The type of update is determined by the value of the update_mode configuration option: replace (the default) will replace all existing values in a field with the values in the input CSV. append will add values in the input CSV to any existing values in a field. delete will delete all values in a field. Islandora Workbench will never allow a field to contain more values than the field's configuration allows. Attempts to update a field with more values than the maximum number allowed will result in the surplus values being ignored during the \"update\" task. If Workbench does this, it will write an entry to the log indicating it has done so. The column headings in the CSV file other than term_id must match either the machine names of fields that exist in the target vocabulary, or their human-readable labels, with exceptions for the following fields: term_name , description , parent , weight , and published (more information about these fields is available in the \" Creating taxonomy terms \" documentation). Only include fields that you want to update. Currently, fields with the data types described in the \" Fields \" documentation can be updated. Note If you are going to use published in your update_terms tasks, you need to remove the \"Published\" filter from the \"Term from term name\" View. For example, using the fields defined by the Islandora Defaults module for the \"Person\" vocabulary, your CSV file could look like this: term_id,term_name,description,field_authority_link 100,\"Jordan, Mark\",Mark Jordan's Person term.,http://viaf.org/viaf/106468077%%VIAF Record The config file for update operations looks like this (note the task option is 'update_terms'): task: update_terms host: \"http://localhost:8000\" username: admin password: islandora # vocab_id is required. vocab_id: person input_csv: update.csv If you want to append the values in the CSV to values that already exist in the target terms, add the update_mode configuration option: task: update_terms host: \"http://localhost:8000\" username: admin password: islandora vocab_id: person input_csv: update.csv update_mode: append Some things to note: The vocab_id config setting is required. The update_mode applies to all rows in your CSV; it cannot be specified for particular rows. Updates apply to entire fields. Workbench cannot replace individual values in a field. Values in the term_id column can be numeric term IDs (e.g. 467 ) or strings (e.g. Dogs ). If a string, it must match the existing term identically other than for trailing and leading whitespace. In tasks where you want to update the values in term_name , you should use term_id to identify the term entity. For update tasks where the update_mode is \"replace\" or \"append\", blank/empty CSV values will do nothing; in other words, empty CSV values tell Workbench to not update the field. For update tasks where the update_mode is \"delete\", it doesn't matter if the column(s) in the input CSV are blank or contain values - the values in the corresponding Drupal fields are deleted in both cases.","title":"Updating taxonomy terms"},{"location":"workflows/","text":"Islandora Workbench can be used in a variety of content ingest workflows. Several are outlined below.
Batch ingest This is the most common workflow. A user prepares a CSV file and accompanying media files, and runs Workbench to ingest the content: Note that within this basic workflow, options exist for creating nodes with no media , and creating stub nodes from files (i.e., no accompanying CSV file). Distributed batch ingest It is possible to separate the tasks of creating a node and its accompanying media. This can be done in a couple of ways: creating the nodes first, using the nodes_only: true configuration option, and adding media to those nodes separately creating stub nodes directly from media files , and updating the nodes separately In this workflow, the person creating the nodes and the person updating them later need not be the same. In both cases, Workbench can create an output CSV that can be used in the second half of the workflow. Migrations Islandora Workbench is not intended to replace Drupal's Migrate framework, but it can be used in conjunction with other tools and processes as part of an \" extract, transform, load \" (ETL) workflow. The source could be any platform. If it is Islandora 7, several tools exist to extract content, including the get_islandora_7_content.py script that comes with Workbench or the Islandora Get CSV module for Islandora 7. This content can then be used as input for Islandora Workbench, as illustrated here: On the left side of the diagram, get_islandora_7_content.py or the Islandora Get CSV module are used in the \"extract\" phase of the ETL workflow, and on the right side, running on the user's computer, Islandora Workbench is used in the \"load\" phase. Before loading the content, the user would modify the extracted CSV file to conform to Workbench's CSV content requirements. The advantage of migrating to Islandora in this way is that the exported CSV file can be cleaned or supplemented (manually or otherwise) prior to using it as Workbench's input. The specific tasks required during this \"transform\" phase will vary depending on the quality and consistency of metadata and other factors. Note Workbench's ability to add multiple media to a node at one time is useful during migrations, if you want to reuse derivatives such as thumbnails and OCR transcripts from the source platform. Using this ability can speed up ingest substantially, since Islandora won't need to generate derivative media that are added this way . See the \" Adding multiple media \" section for more information. Watch folders Since Islandora Workbench is a command-line tool, it can be run in a scheduled job such as Linux \"cron\". If CSV and file content are present when Workbench runs, Workbench will operate on them in the same way as if a person ran Workbench manually. Note Islandora Workbench does not detect changes in directories. While tools to do so exist, Workbench's ability to ingest Islandora content in batches makes it well suited to scheduled jobs, as opposed to realtime detection of new files in a directory. An example of this workflow is depicted in the diagrams below, where the source of the files is the daily output of someone scanning images.
If these images are saved in the directory that is specified in Workbench's input_dir configuration option, and Workbench is run in a cron job using the \" create_from_files \" task, nodes will be created when the cron job executes (over night, for example): A variation on this workflow is to combine it with the \"Distributed\" workflow described above: In this workflow, the nodes are created overnight and then updated with CSV data the next day. Note If you are using a CSV file as input (that is, a standard create task), a feature that is useful in this workflow is that Workbench can check to see if a node already exists in the target Drupal before it creates the node. Using this feature, you could continue to append rows to your input CSV and not worry about accidentally creating duplicate nodes. Metadata maintenance Workbench can help you maintain your metadata using a variation of the extract, transform, load pattern mentioned above. For example, Rosie Le Faive demonstrates this round-tripping technique in this video (no need to log in), in which they move publisher data from the Linked Agent field to a dedicated Publisher field. Rosie uses a get_data_from_view task to export the Linked Agent field data from a group of nodes, then does some offline transformation of that data into a separate Publisher field (in this case, a Python script, but any suitable tool could be used), then finally uses a pair of update tasks to put the modified data back into Drupal. Another example of round-tripping metadata is if you need to change a Drupal field's configuration (for example, change a text field's maximum length) but Drupal won't allow you to do that directly. Using Workbench, you could export all the data in the field you want to modify, create a new Drupal field to replace it, and then use an update task to populate the replacement field. Drupal's Views Bulk Operations module (documented here ) lets you do simple metadata maintenance, but round-tripping techniques like the ones described here enables you to do things that VBO simply can't. Integrations with other systems A combination of the \"Migrations\" workflow and the \"Watch folder\" workflow can be used to automate the periodic movement of content from a source system (in the diagram below, Open Journal Systems or Archivematica) into Islandora: The extraction of data from the source system, conversion of it into the CSV and file arrangement Workbench expects, and running of Workbench can all be scripted and executed in sequence using scheduled jobs. The case study below provides a full, operational example of this workflow. Using hooks Islandora Workbench provides several \" hooks \" that allow you to execute external scripts at specific times. For example, the \"post-action script\" enables you to execute scripts immediately after a node is created or updated, or a media is created. Drupal informs Workbench if an action was successful or not, and in either case, post-action hook scripts registered in the Workbench configuration file execute. These scripts can interact with external applications: Potential uses for this ability include adding new Islandora content to external processing queues, or informing upstream applications like those described in the \"Integrations with other systems\" section above that content they provide has been (or has not been) ingested into Islandora. As a simpler example, post-action hook scripts can be used to write custom or special-purpose log files. 
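For example, a post-node-creation script is registered in the configuration file using the node_post_create setting (the same setting shown in the case study below); a minimal sketch, with an illustrative script path, looks like this: node_post_create: ['/path/to/add_to_processing_queue.py']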
Warning Schedulers such as Linux cron usually require that all file paths are absolute, unless the scheduler changes its current working directory when running a job. When running Islandora Workbench in a scheduled job, all paths to files and directories included in configuration files should be absolute, not relative to Workbench. Also, the path to the Workbench configuration file used as the value of --config should be absolute. If a scheduled job is not executing the way you expect, the first thing you should check is whether all paths to files and directories are expressed as absolute paths, not relative paths. Sharing the input CSV with other applications Some workflows can benefit from having Workbench share its input CSV with other scripts or applications. For example, you might use Workbench to ingest nodes into Islandora but want to use the same CSV file in a script to create metadata for loading into another application such as a library discovery layer. Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names. To accommodate CSV columns that do not correspond to Drupal field names, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the non-Workbench column headers in the ignore_csv_columns configuration setting. For example, if you want to include a date_generated and a qa by column in your CSV, include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated', 'qa by'] With this setting in place, Workbench will ignore the date_generated and qa by columns in the input CSV. More information on this feature is available . Sharing configuration files with other applications Islandora Workbench ignores entries in its YAML configuration files that it doesn't recognize. This means that you can include YAML data that you may need for an application you are using in conjunction with Workbench. For example, in an automated deployment, you may need to unzip an archive containing input CSV and sample images, PDFs, etc. that you want to load into Islandora as part of the deployment. If you put something like mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip in your config file, your deployment scripts could read the same config file used by Workbench, pick out the mylib_zip_location entry and get its value, download and unpack the content of the file into the Workbench input_dir location, and then run Workbench: task: create host: https://islandora.traefik.me username: admin password: password input_dir: /tmp input_csv: deployment_sample_data.csv mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip It's probably a good idea to namespace your custom/non-Workbench config entries so they are easy to identify and to reduce the chance they conflict with Workbench config settings, but it's not necessary to do so. Cross-environment deployment / Continuous Integration Workbench can be used to create the same content across different development, testing, and deployment environments. One application of this is to allow a team of developers to be sure they are all using the same Islandora content. For example, it is possible to commit one or more Workbench config files to a shared Git repository, and when a developer rebuilds their environment, automatically run Workbench using those configuration files to load the shared content.
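A minimal sketch of that rebuild step might simply invoke Workbench with the shared configuration file (the file name here is illustrative): ./workbench --config shared_fixtures/create_shared_content.yml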
Some tips on making Islandora more portable across environments include: Adding the Workbench user's password to an envrionment variable , eliminating the need to include the password in the configuration file Using a Google Sheet as the input_csv value Using a remote Zip archive as the input images, PDFs, etc. This Zip archive can also contain in input CSV, eliminating the need to point to a Google Sheet. Using URL aliases in your field_member_of CSV column to avoid relying on Drupal instance-specific node IDs (note that this assumes the node aliases will be consistent across Drupal instances) Configuring a View to check if nodes already exist , if you want to rerun Workbench using the same input data but avoid creating duplicate nodes and media The same capabilities apply to using Workbench to load data for automated testing during Continuous Integration workflows and configurations. An interesting facet of using the combination of remote Zip and (optionally) a Google Sheet as input is that people other than developers, such as content managers or testers, can add content to ingest without having to commit to the Git repository. All they need to do is update the Google Sheet and Zip file. As long as the URLs of these inputs do not change, the next time the developers run Workbench, the new content will be ingested. Automatically populating staging and production environments on build is a useful way to test that these environments have deployed successfully. Running a rollback task can then delete the sample content once deployment has been confirmed. The Islandora Sandbox is populated on build using Workbench. Case study Simon Fraser University Library uses Islandora Workbench to automate the transfer of theses from its locally developed thesis registration application (called, unsurprisingly, the Thesis Registration System , or TRS) to Summit , the SFU institutional research repository running Islandora. This transfer happens through a series of scheduled tasks that run every evening. This diagram depicts the automated workflow, with an explanation of each step below the diagram. This case study is an example of the \" Integration with other systems \" workflow described above. Steps 1 and 2 do not involve Workbench directly and run as separate cron jobs an hour apart - step 1 runs at 7:00 PM, step 2 runs at 8:00 PM. Steps 3 and 4 are combined into a single Workbench \"create\" task which is run as a cron job at 9:00 PM. Step 1: Fetch the day's theses from the TRS The first scheduled task runs a script that fetches a list of all the theses approved by the SFU Library Thesis Office staff during that day. Every thesis that has been approved and does not have in its metadata a URL in Summit is in the daily list. (The fact that a thesis doesn't have a Summit URL in its TRS metadata yet will come up again in Step 4; we'll come back to that later.) After retrieving the list of theses, the script retrieves the metadata for each thesis (as a JSON file), the thesis PDF file, and, if they are present, any supplemental data files such as Excel, CSV, or video files attached to the thesis. All the data for each thesis is written to a temporary directory, where it becomes input to the script described in Step 2. Step 2: Convert the TRS data into CSV The script executed in this step converts the thesis data into a Workbench input CSV file. 
If there are any theses in a daily batch that have supplemental files, the script generates a second CSV file that is used in a Workbench secondary task to create compound objects (described in more detail in the next step). Step 3: Run Islandora Workbench With the thesis CSV created in step 2 in place (and the accompanying supplemental file CSV, if there are any supplemental files in the day's batch), a scheduled job executes Islandora Workbench. The main Workbench configuration file looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: sfu_thesis allow_adding_terms: true require_entity_reference_views: false subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/theses_daily.csv secondary_tasks: ['/home/utilityuser/islandora_workbench/supplemental_files_secondary_task.yml'] log_file_path: /home/zlocal/islandora_workbench/theses_daily.log node_post_create: ['/home/utilityuser/islandora_workbench/patch_summit.py'] path_to_python: /opt/rh/rh-python38/root/usr/bin/python path_to_workbench_script: /home/utilityuser/islandora_workbench/workbench The input CSV, which describes the theses (and is named in the input_csv setting in the above config file), looks like this (a single CSV row shown here): id,file,title,field_sfu_abstract,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_sfu_department,field_identifier,field_tags,field_language,field_member_of,field_sfu_permissions,field_sfu_thesis_advisor,field_sfu_thesis_type,field_resource_type,field_extent,field_display_hints,field_model 6603,/home/utilityuser/summit_data/tmp/etd21603/etd21603.pdf,\"Additively manufactured digital microfluidics\",\"With the development of lithography techniques, microfluidic systems have drastically evolved in the past decades. Digital microfluidics (DMF), which enables discrete droplet actuation without any carrying liquid as opposed to the continuous-fluid-based microfluidics, emerged as the candidate for the next generation of lab-on-a-chip systems. The DMF has been applied to a wide variety of fields including electrochemical and biomedical assays, drug delivery, and point-of-care diagnosis of diseases. Most of the DMF devices are made with photolithography which requires complicated processes, sophisticated equipment, and cleanroom setting. Based on the fabrication technology being used, these DMF manipulate droplets in a planar format that limits the increase of chip density. The objective of this study is to introduce additive manufacturing (AM) into the fabrication process of DMF to design and build a 3D-structured DMF platform for droplet actuation between different planes. The creation of additively manufactured DMF is demonstrated by fabricating a planar DMF device with ion-selective sensing functions. Following that, the design of vertical DMF electrodes overcomes the barrier for droplets to move between different actuation components, and the application of AM helps to construct free-standing xylem DMF to drive the droplet upward. To form a functional system, the horizontal and xylem DMF are integrated so that a three-dimensional (3D) droplet manipulation is demonstrated. The integrated system performs a droplet actuation speed of 1 mm/s from horizontal to vertical with various droplet sizes. 
It is highly expected that the 3D-structured DMF open new possibilities for the design of DMF devices that can be used in many practical applications.\",\"relators:aut:Min, Xin\",575,2021-08-27,\"Applied Sciences: School of Mechatronic Systems Engineering\",etd21603,\"Digital microfluidics%%%Additive manufacturing%%%3D printing%%%Electrowetting\",531,30044%%%30035,\"This thesis may be printed or downloaded for non-commercial research and scholarly purposes.\",\"relators:ths:Soo, Kim, Woo\",\"(Thesis) Ph.D.\",545,\"125 pages.\",519,512 Supplemental files for a thesis are created as child nodes, with the thesis node containing the PDF media as the parent. Here is the thesis in Summit created from the CSV data above. This thesis has four supplemental video files, which are created as the child nodes using a Workbench secondary task . The configuration file for this secondary task looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: islandora_object allow_adding_terms: true subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] # In case the supplemental file doesn't download, etc. we still create the node. allow_missing_files: true input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/supplemental_daily.csv log_file_path: /home/utilityuser/islandora_workbench/theses_daily.log The input CSV for this secondary task, which describes the supplemental files (and is named in the input_csv setting in the \"secondary\" config file), looks like this (only rows for children of the above item shown here): id,parent_id,title,file,field_model,field_member_of,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_description,field_display_hints 6603.1,6603,\"DMF droplet actuation_Planar\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_Planar.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.2,6603,\"DMF droplet actuation_3D electrode\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_3D electrode.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.3,6603,\"DMF droplet actuation_xylem DMF\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_xylem DMF.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.4,6603,\"Horizontal to vertical movement\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-Horizontal to vertical movement.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, Step 4: Update the TRS If a thesis (and any supplemental files, if present) have been successfully added to Summit, Workbench uses a post-node-create hook to run a script that updates the TRS in two ways: it populates the \"Summit URL\" field in the thesis record in the TRS, and it updates the student's user record in the TRS to prevent the student from logging into the TRS. The \"Summit URL\" for a thesis not only links the metadata in the TRS with the migrated item in Summit, it is also used to prevent theses from entering the daily feed described in Step 1. Specifically, theses that have been migrated to Summit (and therefore have a \"Summit URL\") are excluded from the daily feed generated by the TRS. This prevents a thesis from being migrated more than once. 
If for some reason the Thesis Office wants a thesis to be re-migrated to Summit, all they need to do is delete the first copy from Summit, then remove the \"Summit URL\" from the thesis metadata in the TRS. Doing so will ensure that the thesis gets into the next day's feed. Thesis Office staff do not want students logging into the TRS after their theses have been published in Summit. To prevent a student from logging in, once a thesis has been successfully ingested, Workbench executes a post-node-creation hook script that disables the student's user record in the TRS. If a student wants to log into the TRS after their thesis has been migrated, they need to contact the Thesis Office. Not depicted in the diagram After Workbench has completed executing, a final daily job runs that parses out entries from the Workbench log and the log written by the TRS daily script, and emails those entries to the Summit administrator to warn them of any errors that may have occurred. There is an additional monthly scheduled job (that runs on the first day of each month) that generates MARC records for each thesis added to Summit in the previous month. Before it finishes executing, this script emails the resulting MARC communications file to the Library's metadata staff, who load it into the Library's catalogue.","title":"Workflows"},{"location":"workflows/#batch-ingest","text":"This is the most common workflow. A user prepares a CSV file and accompanying media files, and runs Workbench to ingest the content: Note that within this basic workflow, options exist for creating nodes with no media , and creating stub nodes from files (i.e., no accompanying CSV file).","title":"Batch ingest"},{"location":"workflows/#distributed-batch-ingest","text":"It is possible to separate the tasks of creating a node and its accompanying media. This can be done in a couple of ways: creating the nodes first, using the nodes_only: true configuration option, and adding media to those nodes separately creating stub nodes directly from media files , and updating the nodes separately In this workflow, the person creating the nodes and the person updating them later need not be the same. In both cases, Workbench can create an output CSV that can be used in the second half of the workflow.","title":"Distributed batch ingest"},{"location":"workflows/#migrations","text":"Islandora Workbench is not intended to replace Drupal's Migrate framework, but it can be used in conjunction with other tools and processes as part of an \" extract, transform, load \" (ETL) workflow. The source could be any platform. If it is Islandora 7, several tools exist to extract content, including the get_islandora_7_content.py script that comes with Workbench or the Islandora Get CSV module for Islandora 7. This content can then be used as input for Islandora Workbench, as illustrated here: On the left side of the diagram, get_islandora_7_content.py or the Islandora Get CSV module are used in the \"extract\" phase of the ETL workflow, and on the right side, running the user's computer, Islandora Workbench is used in the \"load\" phase. Before loading the content, the user would modify the extracted CSV file to confirm with Workbench's CSV content requirements. The advantage of migrating to Islandora in this way is that the exported CSV file can be cleaned or supplemented (manually or otherwise) prior to using it as Workbench's input. 
The specific tasks required during this \"transform\" phase will vary depending on the quality and consistency of metadata and other factors. Note Workbench's ability to add multiple media to a node at one time is useful during migrations, if you want to reuse derivatives such as thumbnails and OCR transcripts from the source platform. Using this ability can speed up ingest substantially, since Islandora won't need to generate derivative media that are added this way . See the \" Adding multiple media \" section for more information.","title":"Migrations"},{"location":"workflows/#watch-folders","text":"Since Islandora workbench is a command-line tool, it can be run in a scheduled job such as Linux \"cron\". If CSV and file content are present when Workbench runs, Workbench will operate on them in the same way as if a person ran Workbench manually. Note Islandora Workbench does not detect changes in directories. While tools to do so exist, Workbench's ability to ingest Islandora content in batches makes it useful to scheduled jobs, as opposed to realtime detection of new files in a directory. An example of this workflow is depicted in the diagrams below, the source of the files is the daily output of someone scanning images. If these images are saved in the directory that is specified in Workbench's input_dir configuration option, and Workbench is run in a cron job using the \" create_from_files \" task, nodes will be created when the cron job executes (over night, for example): A variation on this workflow is to combine it with the \"Distributed\" workflow described above: In this workflow, the nodes are created overnight and then updated with CSV data the next day. Note If you are using a CSV file as input (that is, a standard create task), a feature that is useful in this workflow is that Workbench can check to see if a node already exists in the target Drupal before it creates the node. Using this feature, you could continue to append rows to your input CSV and not worry about accidentally creating duplicate nodes.","title":"Watch folders"},{"location":"workflows/#metadata-maintenance","text":"Workbench can help you maintain your metadata using a variation of the extract, transform, load pattern mentioned above. For example, Rosie Le Faive demonstrates this round-tripping technique in this video (no need to log in), in which they move publisher data from the Linked Agent field to a dedicated Publisher field. Rosie uses a get_data_from_view task to export the Linked Agent field data from a group of nodes, then does some offline transformation of that data into a separate Publisher field (in this case, a Python script, but any suitable tool could be used), then finally uses a pair of update tasks to put the modified data back into Drupal. Another example of round-tripping metadata is if you need to change a Drupal field's configuration (for example, change a text field's maximum length) but Drupal won't allow you to do that directly. Using Workbench, you could export all the data in the field you want to modify, create a new Drupal field to replace it, and then use an update task to populate the replacement field. 
Drupal's Views Bulk Operations module (documented here) lets you do simple metadata maintenance, but round-tripping techniques like the ones described here enable you to do things that VBO simply can't.","title":"Metadata maintenance"},{"location":"workflows/#integrations-with-other-systems","text":"A combination of the \"Migrations\" workflow and the \"Watch folder\" workflow can be used to automate the periodic movement of content from a source system (in the diagram below, Open Journal Systems or Archivematica) into Islandora: The extraction of data from the source system, conversion of it into the CSV and file arrangement Workbench expects, and running of Workbench can all be scripted and executed in sequence using scheduled jobs. The case study below provides a full, operational example of this workflow.","title":"Integrations with other systems"},{"location":"workflows/#using-hooks","text":"Islandora Workbench provides several \"hooks\" that allow you to execute external scripts at specific times. For example, the \"post-action script\" enables you to execute scripts immediately after a node is created or updated, or a media is created. Drupal informs Workbench whether an action was successful or not, and in either case, post-action hook scripts registered in the Workbench configuration file execute. These scripts can interact with external applications: Potential uses for this ability include adding new Islandora content to external processing queues, or informing upstream applications like those described in the \"Integrations with other systems\" section above that content they provide has been (or has not been) ingested into Islandora. As a simpler example, post-action hook scripts can be used to write custom or special-purpose log files. Warning Schedulers such as Linux cron usually require that all file paths be absolute, unless the scheduler changes its current working directory when running a job. When running Islandora Workbench in a scheduled job, all paths to files and directories included in configuration files should be absolute, not relative to Workbench. Also, the path to the Workbench configuration file used as the value of --config should be absolute. If a scheduled job is not executing the way you expect, the first thing you should check is whether all paths to files and directories are expressed as absolute paths, not relative paths.","title":"Using hooks"},{"location":"workflows/#sharing-the-input-csv-with-other-applications","text":"Some workflows can benefit from having Workbench share its input CSV with other scripts or applications. For example, you might use Workbench to ingest nodes into Islandora but want to use the same CSV file in a script to create metadata for loading into another application such as a library discovery layer. Islandora Workbench strictly validates the columns in the input CSV to ensure that they match Drupal field names. To accommodate CSV columns that do not correspond to Drupal field names, you can tell Workbench to ignore specific columns that are present in the CSV. To do this, list the non-Workbench column headers in the ignore_csv_columns configuration setting. For example, if you want to include a date_generated and a qa by column in your CSV, include the following in your Workbench configuration file: ignore_csv_columns: ['date_generated', 'qa by'] With this setting in place, Workbench will ignore the date_generated and qa by columns in the input CSV.
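As a sketch of how another application might consume the same CSV, the following Python reads the file Workbench ingests from and uses the two extra (ignored) columns, together with a couple of standard columns, to build a simple export for another system. The file names, the id and title column names, and the JSON output format are assumptions for illustration only.

import csv
import json

# Hypothetical file names; "date_generated" and "qa by" are the extra columns
# that Workbench has been told to ignore via ignore_csv_columns.
records = []
with open("metadata.csv", newline="") as infile:
    for row in csv.DictReader(infile):
        records.append(
            {
                "identifier": row.get("id", ""),
                "title": row.get("title", ""),
                "date_generated": row.get("date_generated", ""),
                "qa_by": row.get("qa by", ""),
            }
        )

# Write a simple JSON export that a discovery layer (or any other system) could load.
with open("discovery_layer_export.json", "w") as outfile:
    json.dump(records, outfile, indent=2)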
More information on this feature is available.","title":"Sharing the input CSV with other applications"},{"location":"workflows/#sharing-configuration-files-with-other-applications","text":"Islandora Workbench ignores entries in its YAML configuration files that it doesn't recognize. This means that you can include YAML data that you may need for an application you are using in conjunction with Workbench. For example, in an automated deployment, you may need to unzip an archive containing an input CSV and sample images, PDFs, etc. that you want to load into Islandora as part of the deployment. If you put something like mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip in your config file, your deployment scripts could read the same config file used by Workbench, pick out the mylib_zip_location entry and get its value, download and unpack the content of the file into the Workbench input_dir location, and then run Workbench (a sketch of such a script appears below, at the end of the next section): task: create host: https://islandora.traefik.me username: admin password: password input_dir: /tmp input_csv: deployment_sample_data.csv mylib_zip_location: https://static.mylibrary.ca/deployment_sample_data.zip It's probably a good idea to namespace your custom/non-Workbench config entries so they are easy to identify and to reduce the chance they conflict with Workbench config settings, but it's not necessary to do so.","title":"Sharing configuration files with other applications"},{"location":"workflows/#cross-environment-deployment-continuous-integration","text":"Workbench can be used to create the same content across different development, testing, and deployment environments. One application of this is to allow a team of developers to be sure they are all using the same Islandora content. For example, it is possible to commit one or more Workbench config files to a shared Git repository, and when a developer rebuilds their environment, automatically run Workbench using those configuration files to load the shared content. Some tips on making Islandora more portable across environments include: Adding the Workbench user's password to an environment variable, eliminating the need to include the password in the configuration file Using a Google Sheet as the input_csv value Using a remote Zip archive as the source of the input images, PDFs, etc. This Zip archive can also contain an input CSV, eliminating the need to point to a Google Sheet. Using URL aliases in your field_member_of CSV column to avoid relying on Drupal instance-specific node IDs (note that this assumes the node aliases will be consistent across Drupal instances) Configuring a View to check if nodes already exist, if you want to rerun Workbench using the same input data but avoid creating duplicate nodes and media The same capabilities apply to using Workbench to load data for automated testing during Continuous Integration workflows and configurations. An interesting facet of using the combination of remote Zip and (optionally) a Google Sheet as input is that people other than developers, such as content managers or testers, can add content to ingest without having to commit to the Git repository. All they need to do is update the Google Sheet and Zip file. As long as the URLs of these inputs do not change, the next time the developers run Workbench, the new content will be ingested. Automatically populating staging and production environments on build is a useful way to test that these environments have deployed successfully.
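The following is a rough Python sketch of the kind of deployment script described in the \"Sharing configuration files with other applications\" section above, tying several of these tips together: it reads the shared config file, downloads and unpacks the remote Zip archive named in the custom mylib_zip_location entry into input_dir, and then runs Workbench. The config file name and local paths are assumptions, and the sketch assumes PyYAML is installed and that the script is run from the islandora_workbench directory.

import pathlib
import subprocess
import urllib.request
import zipfile

import yaml  # PyYAML; any YAML parser will do

CONFIG_FILE = "deployment.yml"  # hypothetical name for the shared config file

# Read the same config file Workbench will use and pick out the custom entry.
with open(CONFIG_FILE) as f:
    config = yaml.safe_load(f)

zip_url = config["mylib_zip_location"]
input_dir = pathlib.Path(config.get("input_dir", "input_data"))
input_dir.mkdir(parents=True, exist_ok=True)

# Download the sample content and unpack it into Workbench's input directory.
zip_path = input_dir / "deployment_sample_data.zip"
urllib.request.urlretrieve(zip_url, zip_path)
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(input_dir)

# Validate, then run Workbench against the same config file.
subprocess.run(["./workbench", "--config", CONFIG_FILE, "--check"], check=True)
subprocess.run(["./workbench", "--config", CONFIG_FILE], check=True)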
Running a rollback task can then delete the sample content once deployment has been confirmed. The Islandora Sandbox is populated on build using Workbench.","title":"Cross-environment deployment / Continuous Integration"},{"location":"workflows/#case-study","text":"Simon Fraser University Library uses Islandora Workbench to automate the transfer of theses from its locally developed thesis registration application (called, unsurprisingly, the Thesis Registration System, or TRS) to Summit, the SFU institutional research repository running Islandora. This transfer happens through a series of scheduled tasks that run every evening. This diagram depicts the automated workflow, with an explanation of each step below the diagram. This case study is an example of the \"Integrations with other systems\" workflow described above. Steps 1 and 2 do not involve Workbench directly and run as separate cron jobs an hour apart: step 1 runs at 7:00 PM, step 2 runs at 8:00 PM. Steps 3 and 4 are combined into a single Workbench \"create\" task, which is run as a cron job at 9:00 PM.","title":"Case study"},{"location":"workflows/#step-1-fetch-the-days-theses-from-the-trs","text":"The first scheduled task runs a script that fetches a list of all the theses approved by the SFU Library Thesis Office staff during that day. Every thesis that has been approved and does not yet have a Summit URL in its metadata is included in the daily list. (The fact that a thesis doesn't have a Summit URL in its TRS metadata yet will come up again in Step 4; we'll come back to that later.) After retrieving the list of theses, the script retrieves the metadata for each thesis (as a JSON file), the thesis PDF file, and, if they are present, any supplemental data files such as Excel, CSV, or video files attached to the thesis. All the data for each thesis is written to a temporary directory, where it becomes input to the script described in Step 2.","title":"Step 1: Fetch the day's theses from the TRS"},{"location":"workflows/#step-2-convert-the-trs-data-into-csv","text":"The script executed in this step converts the thesis data into a Workbench input CSV file. If there are any theses in a daily batch that have supplemental files, the script generates a second CSV file that is used in a Workbench secondary task to create compound objects (described in more detail in the next step).","title":"Step 2: Convert the TRS data into CSV"},{"location":"workflows/#step-3-run-islandora-workbench","text":"With the thesis CSV created in step 2 in place (and the accompanying supplemental file CSV, if there are any supplemental files in the day's batch), a scheduled job executes Islandora Workbench.
The main Workbench configuration file looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: sfu_thesis allow_adding_terms: true require_entity_reference_views: false subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/theses_daily.csv secondary_tasks: ['/home/utilityuser/islandora_workbench/supplemental_files_secondary_task.yml'] log_file_path: /home/zlocal/islandora_workbench/theses_daily.log node_post_create: ['/home/utilityuser/islandora_workbench/patch_summit.py'] path_to_python: /opt/rh/rh-python38/root/usr/bin/python path_to_workbench_script: /home/utilityuser/islandora_workbench/workbench The input CSV, which describes the theses (and is named in the input_csv setting in the above config file), looks like this (a single CSV row shown here): id,file,title,field_sfu_abstract,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_sfu_department,field_identifier,field_tags,field_language,field_member_of,field_sfu_permissions,field_sfu_thesis_advisor,field_sfu_thesis_type,field_resource_type,field_extent,field_display_hints,field_model 6603,/home/utilityuser/summit_data/tmp/etd21603/etd21603.pdf,\"Additively manufactured digital microfluidics\",\"With the development of lithography techniques, microfluidic systems have drastically evolved in the past decades. Digital microfluidics (DMF), which enables discrete droplet actuation without any carrying liquid as opposed to the continuous-fluid-based microfluidics, emerged as the candidate for the next generation of lab-on-a-chip systems. The DMF has been applied to a wide variety of fields including electrochemical and biomedical assays, drug delivery, and point-of-care diagnosis of diseases. Most of the DMF devices are made with photolithography which requires complicated processes, sophisticated equipment, and cleanroom setting. Based on the fabrication technology being used, these DMF manipulate droplets in a planar format that limits the increase of chip density. The objective of this study is to introduce additive manufacturing (AM) into the fabrication process of DMF to design and build a 3D-structured DMF platform for droplet actuation between different planes. The creation of additively manufactured DMF is demonstrated by fabricating a planar DMF device with ion-selective sensing functions. Following that, the design of vertical DMF electrodes overcomes the barrier for droplets to move between different actuation components, and the application of AM helps to construct free-standing xylem DMF to drive the droplet upward. To form a functional system, the horizontal and xylem DMF are integrated so that a three-dimensional (3D) droplet manipulation is demonstrated. The integrated system performs a droplet actuation speed of 1 mm/s from horizontal to vertical with various droplet sizes. 
It is highly expected that the 3D-structured DMF open new possibilities for the design of DMF devices that can be used in many practical applications.\",\"relators:aut:Min, Xin\",575,2021-08-27,\"Applied Sciences: School of Mechatronic Systems Engineering\",etd21603,\"Digital microfluidics%%%Additive manufacturing%%%3D printing%%%Electrowetting\",531,30044%%%30035,\"This thesis may be printed or downloaded for non-commercial research and scholarly purposes.\",\"relators:ths:Soo, Kim, Woo\",\"(Thesis) Ph.D.\",545,\"125 pages.\",519,512 Supplemental files for a thesis are created as child nodes, with the thesis node containing the PDF media as the parent. Here is the thesis in Summit created from the CSV data above. This thesis has four supplemental video files, which are created as the child nodes using a Workbench secondary task . The configuration file for this secondary task looks like this: task: create host: https://summit.sfu.ca username: xxxxxxxxxxxxxxxx password: xxxxxxxxxxxxxxxx content_type: islandora_object allow_adding_terms: true subdelimiter: '%%%' media_types_override: - video: ['mp4', 'mov', 'mpv'] - file: ['zip', 'xls', 'xlsx'] # In case the supplemental file doesn't download, etc. we still create the node. allow_missing_files: true input_dir: /home/utilityuser/islandora_workbench/input_data input_csv: /home/utilityuser/summit_data/tmp/supplemental_daily.csv log_file_path: /home/utilityuser/islandora_workbench/theses_daily.log The input CSV for this secondary task, which describes the supplemental files (and is named in the input_csv setting in the \"secondary\" config file), looks like this (only rows for children of the above item shown here): id,parent_id,title,file,field_model,field_member_of,field_linked_agent,field_sfu_rights_ref,field_edtf_date_created,field_description,field_display_hints 6603.1,6603,\"DMF droplet actuation_Planar\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_Planar.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.2,6603,\"DMF droplet actuation_3D electrode\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_3D electrode.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.3,6603,\"DMF droplet actuation_xylem DMF\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-DMF droplet actuation_xylem DMF.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,, 6603.4,6603,\"Horizontal to vertical movement\",\"/home/utilityuser/summit_data/tmp/etd21603/etd21603-xin-min-Horizontal to vertical movement.mp4\",511,,\"relators:aut:Min, Xin\",575,2021-08-27,,","title":"Step 3: Run Islandora Workbench"},{"location":"workflows/#step-4-update-the-trs","text":"If a thesis (and any supplemental files, if present) have been successfully added to Summit, Workbench uses a post-node-create hook to run a script that updates the TRS in two ways: it populates the \"Summit URL\" field in the thesis record in the TRS, and it updates the student's user record in the TRS to prevent the student from logging into the TRS. The \"Summit URL\" for a thesis not only links the metadata in the TRS with the migrated item in Summit, it is also used to prevent theses from entering the daily feed described in Step 1. Specifically, theses that have been migrated to Summit (and therefore have a \"Summit URL\") are excluded from the daily feed generated by the TRS. This prevents a thesis from being migrated more than once. 
If for some reason the Thesis Office wants a thesis to be re-migrated to Summit, all they need to do is delete the first copy from Summit, then remove the \"Summit URL\" from the thesis metadata in the TRS. Doing so will ensure that the thesis gets into the next day's feed. Thesis Office staff do not want students logging into the TRS after their theses have been published in Summit. To prevent a student from logging in, once a thesis has been successfully ingested, Workbench executes a post-node-creation hook script that disables the student's user record in the TRS. If a student wants to log into the TRS after their thesis has been migrated, they need to contact the Thesis Office.","title":"Step 4: Update the TRS"},{"location":"workflows/#not-depicted-in-the-diagram","text":"After Workbench has completed executing, a final daily job runs that parses out entries from the Workbench log and the log written by the TRS daily script, and emails those entries to the Summit administrator to warn them of any errors that may have occurred. There is an additional monthly scheduled job (that runs on the first day of each month) that generates MARC records for each thesis added to Summit in the previous month. Before it finishes executing, this script emails the resulting MARC communications file to the Library's metadata staff, who load it into the Library's catalogue.","title":"Not depicted in the diagram"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index ef3ffb7c..4bdcb75b 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ