Labor Classification Script

There is a full document describing this process at this link, but this is a summary.

Table Of Contents

What It Is
What To Do
What You Get
Where To Learn More

What It Is

The labor classification script is a Python script, available both as a Jupyter notebook and as plain Python code, that processes the national PUMS person file and classifies workers into their PECAS occupation and industry/space type codes. The tasks it accomplishes are:

Develop estimates of education level for each detailed PUMS occupation code.
Classify each detailed occupation code into one of 13 occupation types used in PECAS. (See below for more detail). Most types involve splitting workers by education level; this can be specified to use either a relative amount (e.g. the lowest x% of detailed occupation codes by average worker education) or an absolute amount (e.g. occupation codes involving workers with fewer than y years of education). The absolute method was specified initially; changes in the labor force suggest that the percentile method is more reasonable going forward.
Split industry activities into production and office support activities. The 13 occupation types are used to identify which portions of an industry are office support and which are ‘production’, which could be (for example) workers in a factory, a store, a school or a warehouse, as appropriate for a given industry.
Perform stochastic industry splits, where a given industry is split into multiple activities (typically using different floorspace types) based on probabilistically assigning workers.
Produce a number of outputs both reporting on the process and for use in other components of the model run.
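
As a minimal sketch of the first task, the weighted average years of education per occupation code could be computed along these lines. OCCP, SCHL and PWGTP are PUMS field names; the records, the occ code labels, the years-per-level values and the function name are made up for illustration and are not taken from the script (the real years-per-level mapping comes from key_SCHL.csv).

```python
from collections import defaultdict

# Illustrative SCHL -> years mapping (assumed values, not the contents of key_SCHL.csv).
SCHL_YEARS = {"16": 12.0, "21": 16.0, "22": 18.0}

# Made-up person records using the PUMS field names OCCP, SCHL and PWGTP.
records = [
    {"OCCP": "occ_x", "SCHL": "16", "PWGTP": 30},
    {"OCCP": "occ_x", "SCHL": "21", "PWGTP": 10},
    {"OCCP": "occ_y", "SCHL": "22", "PWGTP": 20},
]


def mean_education_years(records, schl_years):
    """Return the PWGTP-weighted mean years of education per OCCP code."""
    weighted_years = defaultdict(float)
    weights = defaultdict(float)
    for rec in records:
        weight = float(rec["PWGTP"])
        weighted_years[rec["OCCP"]] += weight * schl_years[rec["SCHL"]]
        weights[rec["OCCP"]] += weight
    return {occ: weighted_years[occ] / weights[occ] for occ in weights}


education = mean_education_years(records, SCHL_YEARS)
print(education)  # {'occ_x': 13.0, 'occ_y': 18.0}
```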

13 occupation types

The hundreds of detailed PUMS occupations (occ codes) are grouped into four top-level lettered supertypes by SOC, and then within those into a total of 13 occupation types by SOC and education level. The education levels are described below the list.

A: Management
- SOC 11, 13
- A1-Mgmt Bus
  - Management and business workers, like operations managers and accountants; SOC 11, 13
B: White Collar
- SOC 15-31
- B1-Prof Specialty
  - Computer, engineering, science, social science and entertainment workers with high education, like software developers and lawyers; SOC 15-23, 27
- B2-Education
  - Education workers with high education, like schoolteachers and professors; SOC 25
- B3-Health
  - Healthcare workers with high education, like registered nurses and physicians; SOC 29, 31
- B4-Technical Unskilled
  - Technical workers in any of the above SOCs with less education, like personal care aides and teaching assistants.
C: Sales / Service / Clerical
- SOC 35, 39, 41, 43
- C1-Sales Clerical Professionals
  - Workers in these occupations with a high level of education, like insurance clerks and real estate agents.
- C2-Sales Service
  - Sales and service workers with a medium amount of education, like retail supervisors and hairdressers; SOC 35, 39, 41
- C3-Clerical
  - Clerical workers with a medium amount of education, like customer service representatives and admin assistants; SOC 43
- C4-Sales Clerical Unskilled
  - Workers in these occupations with a low level of education, like cashiers and cooks.
D: Blue Collar
- SOC 33, 37, 45-55
- D1-Production Specialists
  - Workers in farming, forestry and production occupations with a high/medium amount of education, like production welding workers and inspectors; SOC 45, 51
- D2-MaintConstRepair Specialists
  - Workers in maintenance, construction or repair occupations with a high/medium amount of education, like electricians and automotive mechanics; SOC 37, 47, 49
- D3-ProtectTrans Specialists
  - Workers in protective service, military or transportation occupations with a high/medium amount of education, like truck drivers and security guards; SOC 33, 53, 55
- D4-Blue Collar Unskilled
  - Workers in any of the above occupations with a low level of education, like laborers, janitors and agricultural workers.
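
The SOC groupings in the list above can be written down as a lookup from two-digit SOC major group to 'base' occupation type, before any education split moves an occ code into B4, C1, C4 or D4. This is only a transcription of the list for illustration; the script itself reads this mapping from soc2_to_occ_code.csv, whose layout is not shown here, and the names `BASE_TYPE_SOC` and `base_type` are hypothetical.

```python
# Two-digit SOC major groups per base occupation type, transcribed from the list above.
BASE_TYPE_SOC = {
    "A1": (11, 13),
    "B1": (15, 17, 19, 21, 23, 27),
    "B2": (25,),
    "B3": (29, 31),
    "C2": (35, 39, 41),
    "C3": (43,),
    "D1": (45, 51),
    "D2": (37, 47, 49),
    "D3": (33, 53, 55),
}

# Invert to a direct lookup: SOC major group -> base occupation type.
SOC_TO_BASE = {soc: base for base, socs in BASE_TYPE_SOC.items() for soc in socs}


def base_type(soc_code):
    """Return the base occupation type for a SOC code such as '15-1252'."""
    major_group = int(str(soc_code).strip()[:2])
    return SOC_TO_BASE[major_group]


print(base_type("15-1252"))  # B1
print(base_type("43-4051"))  # C3
```

The first letter of the result is the supertype, so `base_type(code)[0]` gives A, B, C or D.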

Education splits

Some of the occupation definitions involve splitting workers by education into higher and lower levels; higher levels tend to be more specialized, have lower elasticities, and earn higher wages. The splits were originally defined in 2010 using absolute amounts of education; while the software still supports this option, it is recommended that the splits move to percentiles, as education has increased over time. The splits are:

B1/B2/B3 vs B4
- This split pulls out low education technicians from more specialized white collar workers. The split uses the average number of years of education. Originally, occ codes with under 14.7 years of education were placed in B4; it is now recommended to place the lowest 36.5% of occ codes by years of education in B4.
C1 vs C2/C3
- This split pulls out high education workers that are often pseudo-white-collar (e.g. stockbrokers) from more general sales, service and clerical workers. Originally, occ codes with over 13.5 years of education were placed in C1; it is now recommended to place the highest 18.8% of occ codes by years of education in C1.
C2/C3 vs C4
- This split pulls out low education workers that have very low wages and can shift easily between jobs in this sector. Originally, occ codes with under 89% high school graduation were placed in C4; it is now recommended to place the lowest 33.4% of occ codes by high school graduation rate in C4.
D1/D2/D3 vs D4
- This split pulls out low education workers that have low wages and can shift easily between jobs in this sector from more specialized trades. Originally, occ codes with under 76% high school graduation were placed in D4; it is now recommended to place the lowest 49.2% of occ codes by high school graduation rate in D4.
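
To illustrate the difference between the two split methods, here is a rough sketch of selecting the low-education occ codes either by an absolute threshold or by a share of occ codes. The function name, the occ code labels and the education values are hypothetical, and the handling of ties and rounding is an assumption; the script's own rules may differ.

```python
def low_education_codes(education, *, threshold=None, fraction=None):
    """Pick the low-education occ codes by absolute threshold or by share of codes."""
    if (threshold is None) == (fraction is None):
        raise ValueError("specify exactly one of threshold or fraction")
    if threshold is not None:
        # Absolute method: every occ code below the threshold.
        return {occ for occ, value in education.items() if value < threshold}
    # Percentile method: the lowest share of occ codes, ranked by education.
    ranked = sorted(education, key=education.get)
    n_low = round(fraction * len(ranked))
    return set(ranked[:n_low])


# Made-up average years of education per occ code.
education = {
    "occ_a": 12.1, "occ_b": 13.0, "occ_c": 14.8, "occ_d": 14.9,
    "occ_e": 15.5, "occ_f": 16.0, "occ_g": 16.8, "occ_h": 18.2,
}

absolute = low_education_codes(education, threshold=14.7)
relative = low_education_codes(education, fraction=0.365)
print(sorted(absolute))  # ['occ_a', 'occ_b']
print(sorted(relative))  # ['occ_a', 'occ_b', 'occ_c']
```

In this made-up data the percentile method keeps three of the eight codes in the low group even though only two fall under the fixed threshold, which is the kind of drift the move to percentiles is meant to avoid.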

A Bayesian averaging (also called Laplace smoothing) method is used to pull in all national occupation education information to use as a prior for the education levels; if there are a lot of observations in the model region, then the education levels in the model region will dominate, but for occ codes with few observations in the model region, the national average will be weighted more heavily. This reduces year-to-year variance, as some occupations only have a few records in the model region.
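
The behavior described here matches the usual form of a Bayesian average, sketched below. The exact formula and the weight given to the national prior are not stated in this summary, so the function, the `prior_weight` parameter and all numbers are assumptions for illustration only.

```python
def bayes_smoothed(local_mean, local_weight, national_mean, prior_weight):
    """Blend a regional mean with the national mean, weighting by observation count."""
    total = local_weight + prior_weight
    return (local_weight * local_mean + prior_weight * national_mean) / total


# Many regional observations: the regional value dominates.
well_observed = bayes_smoothed(12.0, 5000, 14.0, prior_weight=100)

# Few regional observations: the result is pulled toward the national value.
sparse = bayes_smoothed(12.0, 10, 14.0, prior_weight=100)

print(round(well_observed, 2))  # 12.04
print(round(sparse, 2))         # 13.82
```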

Industry splits

The workers are allocated to industries, which are in most cases split into office support and "production" (which can be manufacturing production space but, in other industries, is a classroom, a store, a farm, a forest, etc.). The above 13 occupation types are used for these splits.

In most cases (see the documentation), supertype D workers are assigned to production and all others to office support. For retail and wholesale activities, C and D workers are production workers, and for K12 education, B2, B4 and D workers are production workers.
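
These rules can be sketched as a small lookup. The industry group keys (`retail_wholesale`, `k12_education`) and the function name are hypothetical labels chosen for this example; the script takes its actual rules from its settings and input files.

```python
# Rules from the paragraph above; each returns True if the occupation type is 'production'.
PRODUCTION_RULES = {
    "default": lambda occ: occ.startswith("D"),
    "retail_wholesale": lambda occ: occ[0] in "CD",
    "k12_education": lambda occ: occ in {"B2", "B4"} or occ.startswith("D"),
}


def activity_part(industry_group, occ_type):
    """Return 'production' or 'office support' for an occupation type such as 'C3'."""
    rule = PRODUCTION_RULES.get(industry_group, PRODUCTION_RULES["default"])
    return "production" if rule(occ_type) else "office support"


print(activity_part("manufacturing", "C3"))     # office support
print(activity_part("retail_wholesale", "C3"))  # production
print(activity_part("k12_education", "B2"))     # production
```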

Some industry activities are further split stochastically based on specified probabilities; this is used for allocating construction into different types (residential, nonresidential, etc.) and for splitting manufacturing activities into heavy and light industrial space use.
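
A stochastic split of this kind might look like the following sketch, which assigns each worker record to one activity with the specified probabilities. The function name, the 30/70 shares and the seed are illustrative assumptions; how the script draws its random numbers and treats record weights is not covered in this summary.

```python
import random


def split_workers(worker_ids, shares, seed=0):
    """Assign each worker to one activity, drawn with the given probabilities."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    activities = list(shares)
    weights = [shares[activity] for activity in activities]
    draws = rng.choices(activities, weights=weights, k=len(worker_ids))
    return dict(zip(worker_ids, draws))


# Made-up shares for splitting a manufacturing activity by space type.
shares = {"heavy industrial": 0.3, "light industrial": 0.7}
worker_ids = list(range(10_000))

assignment = split_workers(worker_ids, shares, seed=1)
light_share = sum(a == "light industrial" for a in assignment.values()) / len(worker_ids)
print(round(light_share, 2))
```

With many workers the realized share lands close to the specified probability, while any individual record can fall into either activity.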

PUMS adds and changes industry codes through time, both as NAICS changes and based on data availability. If PUMS industry codes aren't in the pums_industry_to_activity.csv input file, the program will attempt to automatically recode them and will print a message about what it was able to do (which may include dropping industries). You should verify these codings manually and add them to the input file.
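
Before a run, a quick check for industry codes that the crosswalk does not cover could look like this sketch. INDP is the PUMS industry field; the crosswalk column layout, the activity names, the code values and the function name are all assumptions made for the example, so adjust them to the real input file.

```python
import csv
import io


def unmapped_industry_codes(pums_codes, crosswalk_file, code_column="INDP"):
    """Return the PUMS industry codes that have no row in the crosswalk."""
    known = {row[code_column] for row in csv.DictReader(crosswalk_file)}
    return sorted(set(pums_codes) - known)


# Stand-in for the crosswalk file, with an assumed two-column layout.
crosswalk = io.StringIO("INDP,activity\n0170,Agriculture\n0770,Construction\n")

# Stand-in for the INDP values found in the PUMS person records.
missing = unmapped_industry_codes(["0170", "0770", "0770", "9999"], crosswalk)
print(missing)  # ['9999']
```

Any codes reported this way are the ones to review by hand and add to the input file.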

What To Do

Start by downloading the national PUMS data; that'll take a while. The following are the input files:

occ_settings.yml -- this contains the majority of the settings, including the paths for the input and output files, and how to split occ codes and industries. It's well commented; more details are in the full documentation.
psam_pusa.csv through psam_pusd.csv – national PUMS person files from above (the names are specifiable in occ_settings.yml)
key_SCHL.csv – crosswalk file showing how many years of education are assumed for each level of educational attainment; this should be checked against the PUMS data dictionary to verify it's still correct
soc2_to_occ_code.csv – crosswalk file categorizing workers into ‘base’ occupation codes by two-digit SOC code; unlikely to change through time
occ_code_to_commodity.csv – crosswalk file recoding the occupation groups with letters into PECAS commodities; unlikely to change through time
pums_industry_to_activity.csv – crosswalk file converting PUMS industry codes to long name PECAS activity codes; unlikely to change through time
activity.csv – crosswalk file coding the long name occupation groups into PECAS activities; unlikely to change through time

The process itself takes under five minutes to run; it uses Python 3 (tested on 3.8.8) with numpy and pandas.

Space to add details about file location in SWIM system

What You Get

The two output files that are used further down the line are:

acs_occupation_forPopSim.csv - Crosswalk from the detailed OCCP codes in PUMS to the 13 occupation types used in PECAS.
pums_to_split_industry.csv - Crosswalk between the detailed INDP industry and OCCP occupation codes in PUMS and the split industries used in PECAS, weighted by the numbers of workers.

For diagnostic and other use, the following files are also produced:

pums_per_processed.csv - A version of the PUMS person file, with the labor-related fields, but with PECAS occupation and activity codes added to the records. You can specify in occ_settings.yml any additional fields (e.g. age) to pass through into this table.
Occupation summary.csv - Summary statistics at the top level (13 occupation types): number of records, weighted records, average wage, average education.
Occupation detailed summary.csv - Same as above, but at the detailed PUMS OCCP code level.
Occupation x Industry summary.csv - Occupation commodity use at the 13 occupation type level by the split PECAS industry groups.
bayesian_education.csv - Summary of Bayesian education smoothing; for each occ code, the education levels in the model region (OR), the national data (US), and the resulting Bayesian smoothed value (bayes), along with the count of weighted records (PWGTP).

Where To Learn More

There is a detailed document available here: Tech Memo 02 - Enhanced classification 06 - draft final.pdf
