This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

Adding New Datasets

Jonathan Speiser edited this page Jan 25, 2017 · 25 revisions

Overview

General Guidelines

  • All column names should be lowercase and use only alphanumeric characters or underscores.
  • All data should be prepared in CSV format.
  • For every column in a CSV file, there should be an accompanying data dictionary entry: a sentence describing what the column represents and stating the units of any measurement.
  • All column names must be less than 63 characters in length.
  • Every desired geographic level must be pre-aggregated: if it is not in the data, it will not be shown on the site.
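The column-name rules above can be checked mechanically before a file is submitted. The following is a minimal sketch, not part of the Data USA tooling; the function name `invalid_column_names` is hypothetical, and the header is assumed to be available as a list of strings.

```python
import re

# Lowercase letters, digits, and underscores only, as the guidelines require.
COLUMN_NAME_RE = re.compile(r"[a-z0-9_]+")
MAX_LENGTH = 63  # names must be strictly shorter than this


def invalid_column_names(header):
    """Return the column names in `header` that break the naming rules."""
    return [
        name for name in header
        if not COLUMN_NAME_RE.fullmatch(name) or len(name) >= MAX_LENGTH
    ]


# "Total Pop" fails on the capital letters and the space; the last name is too long.
print(invalid_column_names(["geo", "median_income", "Total Pop", "x" * 63]))
```

An empty result means every column name passes both the character and the length rule.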

Part I: Integrating with Existing Attributes

Data USA is built around four major types of data: geographies, occupations, industries and educational courses. In order to integrate new data with existing data in the platform, the new data must be linkable to these types. Below are details on how each of the four major data types is structured and how it should be structured in any new data source.

Geographies

Geographic identifiers are special strings that denote different levels of the US geographic hierarchy. Here are the currently supported identifiers for geographies in Data USA:

Description                    Code Format         Notes
Nation (United States)         01000US
State                          04000USXX           Where XX is a 2-digit FIPS state code
County                         05000USXXAAA        Where XX is a 2-digit FIPS state code and AAA is a 3-digit county code
Place                          16000USXXBBBBB      Where XX is a 2-digit FIPS state code and BBBBB is a 5-digit place code
Metropolitan Statistical Area  31000USCCCCC        Where CCCCC is a 5-digit MSA code
Tract                          14000USXXAAADDDDDD  Where XX is a 2-digit FIPS state code, AAA is a 3-digit county code and DDDDDD is a 6-digit tract code

By convention, all geography columns should be text fields named geo. For example:

geo income
04000US25 58000
04000US36 57000
16000US2511000 52000
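As an illustration of the identifier formats listed above, the sketch below classifies a geo value by level using one regular expression per row of the table. It is an assumption-laden example rather than Data USA code: the names `GEO_PATTERNS` and `geo_level` are hypothetical, and the digit counts are read directly from the code-format column.

```python
import re

# One pattern per geographic level, transcribed from the code formats in the table.
GEO_PATTERNS = {
    "nation": re.compile(r"01000US"),
    "state": re.compile(r"04000US[0-9]{2}"),
    "county": re.compile(r"05000US[0-9]{5}"),   # 2 state + 3 county digits
    "place": re.compile(r"16000US[0-9]{7}"),    # 2 state + 5 place digits
    "msa": re.compile(r"31000US[0-9]{5}"),
    "tract": re.compile(r"14000US[0-9]{11}"),   # 2 state + 3 county + 6 tract digits
}


def geo_level(geo_id):
    """Return the level name for a geography ID, or None if it matches no format."""
    for level, pattern in GEO_PATTERNS.items():
        if pattern.fullmatch(geo_id):
            return level
    return None


print(geo_level("04000US25"))      # a state-level ID
print(geo_level("05000US25017"))   # a county-level ID
print(geo_level("04000USXX"))      # placeholder letters are not a valid code
```

A check like this only confirms that an ID is well formed; it does not confirm that the ID exists in the Data USA geography attributes.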

Industries (NAICS)

Data USA is primarily built around the PUMS NAICS. For a full listing of all PUMS NAICS codes, visit the attribute list at https://api.datausa.io/attrs/naics/. As a secondary option, datasets may also use BLS NAICS codes, provided as an attribute list at https://api.datausa.io/attrs/bls_naics/.

For a dataset to work appropriately, it should either be completely contained by the list of PUMS NAICS or completely contained by the list of BLS NAICS codes (mixing the two lists is considered invalid). Every row of data in a new source should correspond to a valid NAICS code (either PUMS or BLS) to be considered valid data.

By convention, all NAICS industry columns should be text fields named naics.
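The "completely contained by one list, never mixed" rule can be expressed as a subset test. The sketch below assumes the two code lists have already been loaded into Python collections (how they are fetched from the attribute endpoints is not shown, since the response format is not described here). The function name `classify_code_list` and the sample codes are hypothetical placeholders, not real NAICS codes.

```python
def classify_code_list(dataset_codes, pums_codes, bls_codes):
    """Return "pums" or "bls" if every dataset code belongs to that list.

    Raises ValueError when the codes are not fully contained in either list,
    which also covers a dataset that mixes the two lists.
    """
    codes = set(dataset_codes)
    pums, bls = set(pums_codes), set(bls_codes)
    if codes <= pums:
        return "pums"
    if codes <= bls:
        return "bls"
    unknown = sorted(codes - pums - bls)
    raise ValueError(
        f"codes are not contained in a single list; codes in neither list: {unknown}"
    )


# Placeholder codes for illustration only.
pums = {"P1", "P2", "SHARED"}
bls = {"B1", "B2", "SHARED"}

print(classify_code_list(["P1", "SHARED"], pums, bls))
print(classify_code_list(["B1", "B2"], pums, bls))
```

Passing `["P1", "B1"]` raises `ValueError` even though both codes are individually valid, because the dataset mixes the two lists. The same pattern would apply to any pair of alternative code lists.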

Occupations (SOC)

Like the industry (NAICS) codes, the occupational data in Data USA is built primarily around the PUMS SOC standard. For a complete listing of valid PUMS codes, visit the SOC attribute list at https://api.datausa.io/attrs/soc/. Also available as a secondary option are the BLS-style SOC codes from https://api.datausa.io/attrs/bls_soc/. Again, for a dataset to be considered valid, it should either be completely contained by the list of PUMS SOC codes or completely contained by the list of BLS SOC codes (mixing the two lists is not considered valid). Every row of data in a new source should contain a SOC code that is present in the attribute data.

By convention, all SOC occupation columns should be text fields named soc.

Classification of Instructional Programs (CIP)

For data on educational majors, Data USA uses the 2010 CIP classification standard. A complete list of valid CIP codes and their descriptions may be found at https://api.datausa.io/attrs/cip/.

When including CIP data, ensure that each CIP code is found in the CIP attribute list.

All CIP fields should be text fields named cip.

Crosswalking

When referencing existing entity codes or classification systems that don't align

Occasionally, when working with a new dataset, data may be provided in a classification system that does not match the Geographic, Occupation, Industry or Educational Course classification systems used in Data USA. In these instances, the way to proceed is to provide an additional table that maps the new codes into the existing code space. An important element of any new crosswalk table is that every existing attribute should map to at least one new attribute if possible, as this will allow the new data to be pulled in on every existing profile page.

The crosswalk table should be of the format new_attr_name to existing_attr_name. For instance, if we were to crosswalk industry codes to the ISIC classification, we would create a crosswalk table with two text columns: isic and naics. Note that attribute columns should always be represented as strings.

Example crosswalk table

BLS to PUMS SOC code conversion

bls_soc pums_soc
475010 4750YY
49209X 492094
452011 452011
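To make the crosswalk requirements concrete, the sketch below loads the example rows above from CSV text and reports which existing codes are not reached by any new code. It is an illustration under stated assumptions: the helper names `load_crosswalk` and `unmapped_existing` are hypothetical, the rows are assumed to be stored as comma-separated values, and `999999` is an invented code used only to show an uncovered attribute.

```python
import csv
import io

# The example crosswalk rows from above, as CSV text.
CROSSWALK_CSV = """bls_soc,pums_soc
475010,4750YY
49209X,492094
452011,452011
"""


def load_crosswalk(text, new_col, existing_col):
    """Map each new code to the set of existing codes it corresponds to."""
    mapping = {}
    # csv.DictReader yields every value as a string, so codes are never
    # coerced to numbers (which would break codes such as "49209X").
    for row in csv.DictReader(io.StringIO(text)):
        mapping.setdefault(row[new_col], set()).add(row[existing_col])
    return mapping


def unmapped_existing(mapping, existing_codes):
    """Existing codes that no new code maps to; their profiles would get no new data."""
    covered = set().union(*mapping.values())
    return sorted(set(existing_codes) - covered)


crosswalk = load_crosswalk(CROSSWALK_CSV, "bls_soc", "pums_soc")
print(crosswalk["475010"])
# "999999" is a made-up existing code with no crosswalk row.
print(unmapped_existing(crosswalk, ["4750YY", "492094", "452011", "999999"]))
```

Storing the mapping as sets leaves room for one new code that corresponds to several existing codes, which a plain one-to-one dictionary would silently overwrite.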

Part II: Adding New Top Level Attributes

Certain new datasets may introduce new top-level attribute types. For instance, a health dataset may include hospital-level data. Instead of repeating the hospital name in each row of the raw data, the raw data should include a reference to the hospital's ID, and a new attribute table must be created and provided.

Attribute table format

Each attribute table must consist (minimally) of two columns: "id" (text field) and "name" (text field). The table may contain as many more fields as needed but it must contain at least the id and name fields.
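A short sketch of how these requirements might be checked for the hospital case: it verifies that the attribute table has the id and name columns and that every ID referenced in the raw data has a matching attribute row. All hospital IDs, names, column names and helper functions here are invented for illustration.

```python
import csv
import io

# Hypothetical hospital attribute table; IDs and names are invented.
ATTR_CSV = """id,name
H001,Example General Hospital
H002,Sample Community Clinic
"""

# Hypothetical raw data that references hospitals by ID only.
DATA_CSV = """hospital,num_beds
H001,250
H003,40
"""


def load_attribute_table(text):
    """Read an attribute table, requiring at least the `id` and `name` columns."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"id", "name"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"attribute table is missing columns: {sorted(missing)}")
    return {row["id"]: row for row in reader}


def dangling_ids(data_text, column, attrs):
    """IDs used in the raw data that have no row in the attribute table."""
    reader = csv.DictReader(io.StringIO(data_text))
    return sorted({row[column] for row in reader} - set(attrs))


attrs = load_attribute_table(ATTR_CSV)
print(attrs["H001"]["name"])
# H003 appears in the data but not in the attribute table.
print(dangling_ids(DATA_CSV, "hospital", attrs))
```

Extra columns in the attribute table are kept as-is, since only id and name are mandatory.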

Supporting Hierarchical Sumlevel Filtering for New Attributes

In some attribute ID systems it is useful to distinguish among different levels of detail in the raw data. Imagine an attribute with two basic ID types: child attributes, which are the deepest level of detail in the data, and parent attributes, which are a summation of multiple child attributes. If the ID nesting structure is simple (e.g. a truncation of digits), no additional work is required. However, if the ID nesting structure is not trivial (that is, not derivable from the IDs themselves), then the hierarchical relationship must be specified in the new attribute table. In these non-trivial cases, each row in the new attribute table should have a parent, grandparent, great_grandparent, etc. field to identify its hierarchical relationships.

Example Attribute Table

CIP Course

id      name                         description
0110    Food Science and Technology  Instructional content for this group of programs...
011001  Food Science                 A program that focuses on the application of biological, chemical...
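The two hierarchy cases can be contrasted in a small sketch. It assumes that "simple truncation" means the parent is the longest proper prefix of an ID that is itself a known ID, and that non-trivial hierarchies carry an explicit parent field. The function `parent_id` and the hospital rows are hypothetical; the CIP rows come from the example table above.

```python
def parent_id(attr_id, attrs):
    """Return the parent ID of `attr_id`, or None for a top-level attribute.

    An explicit `parent` field wins; otherwise the longest proper prefix of
    the ID that is itself a known ID is treated as the parent (truncation).
    """
    explicit = attrs[attr_id].get("parent")
    if explicit:
        return explicit
    for end in range(len(attr_id) - 1, 0, -1):
        if attr_id[:end] in attrs:
            return attr_id[:end]
    return None


# Rows from the example CIP table (descriptions omitted).
cip = {
    "0110": {"name": "Food Science and Technology"},
    "011001": {"name": "Food Science"},
}

# Hypothetical non-trivial IDs, where the hierarchy must be spelled out.
hospitals = {
    "SYS_A": {"name": "Example Health System"},
    "H001": {"name": "Example General Hospital", "parent": "SYS_A"},
}

print(parent_id("011001", cip))     # derived by truncating the ID
print(parent_id("H001", hospitals)) # read from the explicit parent field
```

For the CIP rows no extra column is needed, because `011001` truncates to `0110`; the hospital IDs share no prefix structure, so the relationship has to be stored in the table.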