
Technical Deep Dive Write-up, Notes

Oct. 5 2016


Kwame Robinson

(to be fleshed out with more details and graphics next week)

Feel free to comment regarding Friday's Tech Deep Dive.

Topic: Canonical Job Normalization Test Set

The WDI is in the process of opening up its source code and making cooperative job data available.

  • Inviting researchers to collaborate on open problems in normalization, representation, and classification
  • We have a large amount of private NDA'ed data that can't be shared out to corporate entities :(
  • WDI-side machine learning depends on NDA'ed data, so some of the code will be less useful to the public, and it may also reveal NDA'ed metadata?

On Generating a Public Test Set

  • However, I'm pretty sure we're allowed to release derived, anonymized data
  • This includes vector representations of the data and components of interest

Thoughts/Proposal

## Public Job Normalization Benchmark

  • Get permission (or confirm we already have it) to release job titles from data partners; if so, release them as a living, versioned table updated quarterly. This sets us up for the next point:
  • "Grow the benchmark virally": require that research partners release/provide annotated job normalization test data. Specifically, for each test instance: a) their machine-readable representation of the instance (e.g., a vector), b) a human-readable representation of the instance (e.g., the unnormalized job title), c) the correct job title (probably noisy), and d) the research partner's normalized job title.
  • We then build upon, validate, and improve the test set by using an active learning framework with human annotators to compare (b) to (c) (improves normalized labeling) and (c) to (d) (finds where the "correct" title is in fact wrong). Can we ask users to go from a wrong (d) to a correct (c), and from a wrong (c) to a correct (c)?
  • The representations in (a) are used within the active learning framework to request labeling only for those instances where the confidence is low (or near class boundaries). This lets us use high-confidence instances as additional labeled data, adding to the benchmark dataset without requiring human intervention (see the sketch after this list).
  • Furthermore, the full test instance data are made available (anonymized) as partner representations for follow-on research (think transfer learning, EDA, etc.) by the world.
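As a rough illustration of the selection step above, here is a minimal sketch (not WDI code). It assumes instances arrive as (vector, raw title) pairs and that a scikit-learn-style classifier exposing predict_proba() backs the normalizer; the threshold and all names are hypothetical.

```python
# Minimal sketch of the active-learning routing step described above.
# Assumptions (not part of the WDI codebase): `model` is any
# scikit-learn-style classifier exposing predict_proba(); the
# threshold and function names are hypothetical.

import numpy as np

CONFIDENCE_THRESHOLD = 0.90  # assumed cutoff; tune against annotation budget

def split_for_annotation(model, vectors, raw_titles, threshold=CONFIDENCE_THRESHOLD):
    """Route low-confidence instances to human annotators and keep
    high-confidence ones as additional (machine-labeled) benchmark rows."""
    probs = model.predict_proba(vectors)            # shape: (n_instances, n_classes)
    confidence = probs.max(axis=1)                  # top-class probability per instance
    predicted = model.classes_[probs.argmax(axis=1)]

    to_annotate, auto_labeled = [], []
    for title, conf, label in zip(raw_titles, confidence, predicted):
        if conf < threshold:
            to_annotate.append(title)               # ask a human for the correct title
        else:
            auto_labeled.append((title, label))     # add to benchmark without review
    return to_annotate, auto_labeled
```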

Why do this?

  • This a) forces partners to contribute data instead of just consuming it (a virtuous cycle), and b) provides an ongoing, growing pool of test data for humans to validate, enabling low-friction participation in WDI while expanding test data for the research side of things.

## Baseline Estimator (backing the API)

Publicly Usable Baseline ML Normalizer

WDI Usable Baseline ML Job Normalizer

Normalizer Testing

  • Assume normalizer results can be described by a binomial distribution; take the statistic over marking results as correct/incorrect from either of:
  • A normalizer query on results of the same type (e.g., return N results of the same normalized job title and ask someone to mark them)
  • A normalizer query on a stratified random sample (e.g., a stratified sample of job titles)
  • Estimate a confidence interval for the proportion of correct results.
  • From the reference below and these assumptions we can use N = 30 or even smaller values (see the sketch after this list).
  • It would be nice if we could wrap sampling and user labelling into an app of sorts (pre skills Tinder)
  • References: Agresti, A. and Coull, B. A., "Approximate Is Better than 'Exact' for Interval Estimation of Binomial Proportions", The American Statistician, 1998; and the Wilson score interval.
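As a rough sketch of the last few points (not WDI code), the Agresti-Coull "adjusted Wald" interval from the cited reference can be computed in a few lines; the sample counts below are made up for illustration.

```python
# Minimal sketch of the proposed accuracy check, assuming each audited
# normalization is marked correct/incorrect (binomial) and N is small
# (e.g., N = 30). Uses the Agresti-Coull interval from the cited
# reference; z = 1.96 gives an approximate 95% interval.

from math import sqrt

def agresti_coull_interval(correct, n, z=1.96):
    """Approximate confidence interval for the proportion of correct
    normalizations out of n audited results."""
    n_adj = n + z ** 2                       # adjusted sample size
    p_adj = (correct + z ** 2 / 2) / n_adj   # adjusted proportion
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Example (made-up numbers): 26 of 30 sampled titles judged correctly normalized.
low, high = agresti_coull_interval(correct=26, n=30)
print(f"Estimated accuracy: 95% CI [{low:.2f}, {high:.2f}]")
```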