-
Notifications
You must be signed in to change notification settings - Fork 6
Module 1D: ASCAP Data (Cleaning)
The first thing we're going to do is change the values in the RoleType column to actual words. In the column drop down, select: Edit Cells > Transform.
In the Transformation window, we're to write our first Google Refine Expression Language (GREL) command:
value.replace('W', 'Writer')
In this expression, we're telling Refine to change everything in that column in the first set of quotes to the that in the second set. As you can see, it works like a charm:
Click okay to see your data transform.
Now let's try turning 'P' into Publisher:
Uh oh, what happened? Refine can't distinguish between the solo 'P' and the P at the end of 'ASCAP'. So this won't work. Luckily, there's another way.
As we saw in fixing the Twilight Zone records, the core of Refine’s power lies in its use of Facets. On the RoleType column, click the drop-down and go to Facets. Since these categories are going to be a form of text, choose Text Facet.
Our facet now tells us how many times each of the unique values appear within that column. Once again, hover your mouse over the 'P' value in the facet -- you will see options appear next to it. Click on Edit.
In the text box, remove 'P' and type 'Publisher'. Click apply. The text is now transformed; don't forget to close the RoleType facet.
Let's try moving some data around. If you click on the the dropdown menu of any particular column, under the Edit Column menu, you'll see Add Column Based on This Column, but not a command to add a new, blank column; that's because Refine isn't usually used for creating new data inside of a project. But we can work around this! Go to the column dropdown menu of the Name column and select Edit Column > Add Column Based on This Column. Type the word null into the Expression box, and title this new column 'Writers'. Repeat the same task, except title the second column 'Publishers'. Voila -- new empty columns!
Next, we will need to change the view from 'Show As Records' to 'Show As Rows'.
Then, open a Text Facet on 'RoleType'. Click the 'Writer' heading, which will turn orange. The data in the window will now be limited to those match that value in the 'RoleType' column:
Next, we will copy the data from the 'Name' column that corresponds to Writer in 'RoleType'. On the column dropdown for 'Writers', select Edit Cells > Transform and enter the following GREL expression:
cells['Name'].value
Back in the 'RoleType' facet, click on 'Publisher' and repeat the same GREL expression on the new 'Publishers' column. You should now have two columns with data pulled in from the 'Name' column; the columns 'RoleType' and 'Name' can be removed.
Our data is getting cleaner, but each record still has a lot of blank cells. These can be cleared up by collapsing records.
First, let's change the view from 'Show As Rows' back to 'Show As Records'. Because we removed 'Name' column, we lost the cells with "Total Controlled By ASCAP" next to the values in the 'Shares' column. We'll simplify this by changing the name of the 'Shares' column to 'Total Shares Controlled By ASCAP'. In the column's dropdown, select Edit Column > Rename Column and enter the new name. There should now be four named columns: 'Title', 'Total Shares Controlled By ASCAP', 'Writers', and 'Publishers'.
Most of the blank cells we need to clear up are under 'Writers' and 'Publishers'. For both of these columns, we are going to go to their respective dropdown boxes and select Edit Cells > Join Multi-Valued Cells. Refine will ask for a separator between metadata values; while some use commas, I tend to prefer a double pipe (||) because it's a character sequence that doesn't usually come up in data or metadata.
Once both are collapsed, your Refine data should look something like this:
To pull apart the data in the 'Writers' or 'Publishers' columns, you simply go to the column's dropdown box and select Edit Cells > Split Multi-Valued Cells, using the double-pipe as the separator.
Congratulations! You have just cleaned a dataset.