Deep Dive Notes
Kwame Robinson
(to be fleshed out with more details and graphics next week)
Feel free to comment re: Friday's Tech Deep Dive.
Topic: Canonical Job Normalization Test Set
The WDI is in the process of opening up its source code and making cooperative job data available.
- Inviting researchers to collaborate on open problems in normalization, representation, classification
- We have a large amount of private NDA'ed data that can't be shared out to corporate entities :(
- WDI-side machine learning depends on NDA'ed data, so some code will be less useful to the public, and it may also reveal NDA'ed metadata?
On Generating a Public Test Set
- However, I'm pretty sure we're allowed to release derived anonymized data
- This includes vector representations of the data and components of interest
Thoughts/Proposal
## Public Job Normalization Benchmark
- Get permission (or confirm we already have it in place) to release job titles from data partners; if so, release them as a living, versioned table quarterly. This sets us up for the next point:
- "Grow the benchmark virally": Require that research partners release/provide annotated job normalization test data, specifically for each test instance: a) their machine readable representation of the instance (e.g. a vector) b) human readable representation of the instance (e.g., the unnormalized job title), c) the correct job title (probably noisy) and d) the research partner's normalized job title.
- We then build upon the test set, validate it and improve it, by using an active learning framework, with human annotators, to compare (b) to (c) (improves normalized labeling), (c) to (d) (finds where correct title is in fact wrong). Can we ask users to go from wrong (d) to correct (c), from wrong (c) to correct (c)?
- The representations in (a) are used w/in the active learning frame work to only request labeling for those instances were the confidence is low (or between class boundaries). This allows us to use highly confidence instances as additional labeled data, adding to the benchmark dataset, w/o requiring human intervention.
- Furthermore, the full test instance data are made (anonymously) available as partner representations for follow on research (think transfer learning, EDA analysis, etc. etc.) by the world.
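
As a concrete sketch of the above, here is what a benchmark test instance (fields a-d) and the confidence-gated routing step might look like. This is a minimal sketch: the field names, the threshold, and the `model.predict` API (returning a (title, confidence) pair) are illustrative assumptions, not the WDI schema.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class TestInstance:
    """One benchmark row; field names are illustrative, not a fixed WDI schema."""
    vector: Sequence[float]  # (a) partner's machine-readable representation
    raw_title: str           # (b) human-readable, unnormalized job title
    correct_title: str       # (c) the "correct" title (probably noisy)
    partner_title: str       # (d) the partner's normalized title

def route_instances(instances, model, threshold=0.9):
    """Confidence-gated active learning step: predictions at or above the
    threshold join the benchmark as additional labeled data; the rest are
    queued for human annotation. `model.predict` is a hypothetical API
    returning a (normalized_title, confidence) pair."""
    auto_labeled, to_annotate = [], []
    for inst in instances:
        title, confidence = model.predict(inst.vector)
        if confidence >= threshold:
            auto_labeled.append((inst, title))
        else:
            to_annotate.append(inst)
    return auto_labeled, to_annotate
```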
Why do this?
- This a) forces partners to contribute data instead of just consuming it (a virtuous cycle), and b) provides an ongoing, growing pool of test data for humans to validate, enabling low-friction participation in WDI while expanding test data for the research side of things.
## Baseline Estimator (backing the API)
- Does not require partner data; make https://github.com/workforce-data-initiative/labor/blob/master/scripts/job_normalizer/esa_jobtitle_normalizer.py into a library
- Currently backed by ONET job descriptions only; could be expanded with other ONET information?
- Provides a nice entry point into a suite of ONET data manipulation classes
- Needs tweaking for better performance, but can already normalize `cupcake ninja` and `computer programmer`
- Can improve searching on the Elasticsearch solution; provide @Sam Leitner with a temporary ES instance to improve search querying and get a different perspective on improving data quality
- Could do query expansion, à la `hypernym_product` (https://github.com/workforce-data-initiative/labor/blob/master/scripts/job_normalizer/esa_jobtitle_normalizer.py#L58-L67), to improve results (see the query-expansion sketch after this list)
- Assume normalizer results can be described by a binomial distribution; take the statistic over marking results as correct/not correct from:
- Normalizer query on results of the same type (e.g., return N results of the same normalized job title and ask someone to mark them)
- Normalizer query on a stratified random sample (e.g., a stratified sample of job titles)
- Estimate a confidence interval for the proportion of correct results (see the interval sketch after this list).
- From the reference below and these assumptions, we can use N = 30 or even smaller sample sizes.
- Would be nice if we could wrap up sampling and user labelling into an app of sorts (a pre-"skills Tinder")
- References: Agresti, A., and B. A. Coull. "Approximate Is Better than 'Exact' for Interval Estimation of Binomial Proportions." The American Statistician 52(2), 1998; and the Wilson score interval.
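
For the query-expansion idea, here is a toy sketch in the spirit of `hypernym_product` (linked above). The `HYPERNYMS` map and `expand_query` helper are illustrative assumptions, not the repo's actual implementation; consult the linked source for the real thing.

```python
from itertools import product
from typing import Dict, List

# Toy hypernym table; a real implementation would draw these from a
# lexical resource rather than a hand-written dict.
HYPERNYMS: Dict[str, List[str]] = {
    "ninja": ["expert", "specialist"],
    "guru": ["expert", "consultant"],
}

def expand_query(title: str) -> List[str]:
    """Expand a raw job title into the product of per-token hypernym choices,
    so 'cupcake ninja' also matches 'cupcake expert' and 'cupcake specialist'."""
    options = [[tok] + HYPERNYMS.get(tok, []) for tok in title.lower().split()]
    return [" ".join(combo) for combo in product(*options)]

print(expand_query("cupcake ninja"))
# ['cupcake ninja', 'cupcake expert', 'cupcake specialist']
```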
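
And for the confidence-interval step, a minimal sketch of the two intervals from the Agresti & Coull reference above (the Wilson score interval and the Agresti-Coull "adjusted" interval); the function names are ours.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

def agresti_coull_interval(successes: int, n: int, z: float = 1.96):
    """Agresti-Coull interval: add z^2/2 successes and z^2/2 failures, then
    apply the standard Wald formula to the adjusted proportion."""
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

# e.g., 24 of 30 normalizer results marked correct:
print(wilson_interval(24, 30))         # ~ (0.63, 0.91)
print(agresti_coull_interval(24, 30))  # ~ (0.62, 0.91)
```

Both intervals behave well at small N, unlike the naive Wald interval, which degenerates when the observed proportion is near 0 or 1; that is what justifies judging as few as N = 30 results per query.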