layout | title | permalink |
---|---|---|
default |
Converting Raw Results to Data |
/guide/converting |
Some states provide results data in a way that is not easily loaded using the data pipeline. This could be because the data is provided in a format that isn't easily machine readable, such as an image PDF, or a format that isn't easy to inspect or load incrementally, such as a database dump stored in monolithic files. Also, results may not be publicly accessible and may have been obtained on physical media or by email.
In these cases, the data needs to be converted to CSV and placed in a separate repository that is accessible over HTTP. Thes datasource should be implemented so that it points to the files in the repository.
Examples of states with data repositories include:
The inputs to this task are usually PDFs and sometimes database files provided by state-level and county-level officials (i.e. county clerks).
The outputs to this task are CSVs whose rows look like this: county,precinct,office,district,candidate,party,votes
You should look through the OpenElections github for repos with names that start with openelections-data
. We're working on writing issues that have easy PDFs for everyone to take on.
The two most common ways of performing this task are by data entry and by using Tabula, a Java-based software that extracts data from compuer-generated PDFs. Both are great methods for non-technical contributors, so we recommend both depending on the size of the PDFs.
Generally, results with less than 10 precincts can easily be done by data entry. Anything more than that should be done by Tabula, unless the PDFs are image-based PDFs (someone scanned in a page vs. a computer generated the PDF directly).
If you are looking at doing data entry:
- Download and open the raw results PDF
- In Excel or OpenOffice Calc, set your first row as
county,precinct,office,district,candidate,party,votes
- In the results PDF, count the number of precincts in the county
- For each precinct, write a row that fills in the first two cells. Example:
Harney,Burns 21
,Harney,Hines 25
- In the results PDF, count the total number of candidates that appear in the races. A couple of things to note here. First, we're only looking at federal (e.g. U.S. Senate, President) and state (e.g. Attorney General, Secretary of State, State House) races. Second, you should count write-ins, over votes and under votes each as a candidate. So, an uncontested race will really have four candidates: the candidate, write-ins, over votes and under votes.
- Copy and paste the set of precinct rows for each candidate so that each precinct-candidate combination will have a row (use
CTRL+C
,CTRL+V
here) - Next, go through each race and label the appropriate number of rows with the office and district. Example: a race for the State Senate in the 5th district will have the labels
State Senate
and5
for office and district (useCTRL+D
to label once and copy all the way down) - Fill in each race with the candidate names (use
CTRL+D
to label once and copy all the way down) - Finally, fill in the vote counts for each row (resize windows so that PDF and CSV are open at the same time)
Useful shortcuts for this method:
CTRL+Arrow key
- move to the last cell in the direction you chose that is filled with contentCTRL+C
,CTRL+V
- copy and pasteSHIFT+Arrow key
- expand the selection in the direction you chose. Combine withCTRL+Arrow key
to select all cells from current to the last one that is filledCTRL+D
- copy value/formula down in the current selection
If you are looking at using Tabula, use this tutorial video to learn how to use Tabula. The output will still require a lot of wrangling in Excel or OpenOffice Calc. These sorts of outputs require things like Text to Column, Find and Replace, LEFT() MID() RIGHT(), TRIM(), UPPER() LOWER(), INDEX() MATCH(), etc. to properly wrangle the data. There is no standard way to wrangle the data - the best you can do is use as many Excel/Calc shortcuts and tools as you can to make this go as quickly as possible.
No matter which method you used, save the CSV using the format YYYYMMDD__stateabrev__electiontype__countyname__doctype.csv
Most of the time, converting raw results is at the precinct level, so you'd use something like this: 20121106__or__general__harney__precinct.csv
. Usually, the Github issue that you are helping with will have the CSV file name to use. Otherwise, email us for help.
We standardize office names as the following:
- President
- U.S. Senate
- U.S. House
- State Senate
- State House
- Governor
- Attorney General
- State Treasurer
- Secretary of State
You should email us, ask in the Google Group, or use completed docs as reference.