Common examples of OpenRefine expressions I use to work with archival description and digital collections metadata.
- Find typos
- Find duplicates
- Create new column with pre-populated value
- Add leading zeros
- Merge columns (with the same separator)
- Merge columns (with different separators)
- Split column into multiple columns
- Create a custom facet
- Add slug to all files from one series
- Trim data
- Clean up dates
- Split dates into two columns (e.g. start and end)
- When you have start and endDate columns and you want to populate the endDate column with the same value as start (if there was nothing to split!)
- Remove various characters, punctuation, and circa notations
- Converting free-text dates to ISO-8601 machine readable
- Merge multiple worksheets into one
- Reconciling data against another source
- Other common transformations
- Sources
Why this is useful?
- To avoid adding additional terms to a controlled taxonomy / vocabulary list
- Sometimes controlled vocabulary columns are turned into facets or drop-down lists, and you wouldn’t want these to include typos
Column > Facet > Text Facet
- Sort by count to quickly identify typos
Why this is useful?
- You may have columns that require unique values, e.g. identifier in a file list
Column > Customized facets > Duplicates facet
- True = duplicates
If you want to facet the identified duplicates, then:
Column > Facet > Text Facet
- Can sort by name or count
Why this is useful?
- Quickly populate a column with the same value, e.g. Level of Description
Column > Edit column > Add column based on this column...
- Give column a name
- Replace
value
with the term you want to fill down in quotation marks- “File”
Also useful for creating a URL column based on an ID column. Example:
'https://archive.org/details/'+value
Why this is useful?
- Quickly add leading zeros to boxes or file numbers so that they are consistent lengths
Column > Edit cells > Transform
Expression:
"000"[0,3-length(value)] + value
- 3 is the length, so if you want more or less leading zeros, adjust accordingly
Why this is useful?
- Sometimes you may want to merge data from multiple columns into one, e.g. when creating a ‘Citation’ field for a digital collection.
Go to one of the columns you would like to join, then
Edit column > Join columns
- You can add separator between the contents of each column.
- You can overwrite combining information into the original column or create a new column for the combining contents.
Why this is useful?
- Merge columns to create accession/box(file)
- Similar to CONCATENATE function in Excel
Go to one of the columns you would like to merge (i.e box no.), then
Edit cells > Transform
Expression:
‘B1991-0013’ + ‘/’ + value + ‘(‘ + cells[‘columnName’].value + ‘)’
^ In this example, I am appending an accession number to a values in a box and file number column so that it reads B1991-0013/box(file)
Why this is useful?
- Sometimes you want to split a column into more useful pieces of data, i.e Surname and First Name
Column > Edit column > Split into several columns...
- Can split by separator or by field lengths
- To facet multiple columns at once (e.g. subject terms spread out over multiple columns)
Select first subject column then
Facet > Custom Text Facet
Expression:
[cells.Subject1.value, cells.Subject2.value, cells.Subject3.value]
where Subject1
= column name
If column name has spaces, put it into brackets and remove . before cells, like so:
[cells["Subject 1"].value, cells["Subject 2"].value]
This assumes you have a column identifying what series each row is in
Select series column > Facet > Text facet
If there is already a slug column:
- Slug column >
Edit cells > Transform
- Expression: change
“value”
to“series-slug”
If there is no slug column:
- Series column > Edit column > Add column based on this column
- New column name:
qubitParentSlug
- Expression: change
“value”
to“your-series-slug”
For example: B2001-0003/001(01) to B2001-0003/001
- Slug colum >
Edit cells > Transform
- Expression:
value.substring(0,14)
- Keep first 14 characters
Image: "ISO 8601", XKCD
Why this is useful?
- Sometimes you need to turn free text field with approximate dates, into machine-readable dates to support date range searching and sorting
- See Cassie Schmitt, “Date Formats”, https://icantiemyownshoes.wordpress.com/2014/04/24/clean-up-dates-and-openrefine/
- Look and see what the separators are, most likely - and ,
Facet > Text facet
- Select column to split
Edit column > Split into several columns..
- By separator
[,\-]
- Make sure regular expression is checked ✓
- Remove this column is unchecked ☐
When you have start and endDate columns and you want to populate the endDate column with the same value as start (if there was nothing to split!)
- Select all blank cells in endDate column
Facet > Customized facets > Facet by blank > true
- Fill blank cells with the values from the startDate column
Edit cells > Transform
Expression: cells[‘startDate’].value
, where ‘startDate’ is the column header
Isolate rows that may have these things
- Column >
Text filter > [
then...
Edit cells > Transform
value.replace('[ca. ','').replace(']','').replace('[','')
...and whatever else might be in the date column
Depending on what you have, you will need to do a variety of things to the data such as replacing the month with it’s number.
- Again, see Cassie Schmitt, “Date Formats”, https://icantiemyownshoes.wordpress.com/2014/04/24/clean-up-dates-and-openrefine/
- Will depend on how the dates are written, but here are basic steps:
- Use
Facets
orText Filter
to isolate rows with dates that are more than just year - Split into 2 or 3 columns - day, month, year or month, year
- Replace month with numeric month (e.g. Jan to 01)
- Use
value.replace(‘Jan. ‘, ’01’).replace(‘Feb. ‘, ’02’).replace(‘Mar. ‘, ’03’).replace(‘Apr. ‘, ’04’).replace(‘May ‘, ’05’).replace(‘Jun. ’, ’06’).replace(‘Jul. ‘, ’07’).replace(‘Aug. ‘, ’08’).replace(‘Sep. ‘, ’09’).replace(‘Sept. ‘, ’09’).replace(‘Oct. ‘, ’10’).replace(‘Nov. ‘, ’11’).replace(‘Dec. ‘, ’12’)
- In the
eventStartDate
column, transform the freetext dates (4 Jan. 1972) with data from the day, month and year columnsEdit cells > Transform
cells[‘year’].value + ‘-’ + cells[‘month’].value + ‘-’ + cells[‘day’].value
When converting to date (Edit cells > Common transform > to date), output will often look like this:
1981-03-01T00:00:00Z
To change it to simple YYYY-MM-DD:
- Edit cells > Transform > value.toString('YYYY-MM-dd')
Sometime when drafting descriptions, it's easier to create seperate worksheets for each series. But to upload to AtoM, it is easier if it is all in one. If the column headers for each sheet are all the same (and in the same order).
- Create Project > select .xlsx file
- Select which worksheets to import
See “Common Transformations” - https://guides.library.illinois.edu/openrefine/commontransform
Column > Edit cells > Common transforms
- Delete blanks!
- Remove whitespace
- Unescape HTML entities
- To titlecase
- To uppercase
- To lowercase
- To number
- To date
- To text
You can also extract your steps if you think you’ll be repeating them again on another dataset
University of Illinois Library OpenRefine LibGuide