Jingwu Xu
Build a training data creation framework, say, using Snorkel, for automatic ML schema extraction from data files: predict a feature's type based on its attribute name, sample values, and statistics. Compare with recent work that uses a manually labeled dataset.
Talk on this project:
https://docs.google.com/presentation/d/10WcjgF9gn27G3NHCuA87VFUsoimpPN60hMNQbwTMRdc/edit?usp=sharing
Machine learning (ML) engineers spend the majority of their time on data pre-processing, converting feature data into machine-understandable types such as numerical or categorical values. However, one major step that comes even before pre-processing is understanding each feature's data type. Traditionally, ML engineers manually inspect the source file and specify the intended data type of each feature column. In practice this is a time-consuming and expensive task, and the problem gets worse with large data files. Here, we present a label creation framework that outputs the feature type by applying a set of labeling heuristics to feature data statistics. Comparing a downstream model trained on programmatically generated feature types with one trained on manually labeled feature types, the label creation framework significantly speeds up the labeling process (from months to days of development effort) at lower cost, and the downstream model trained on generated feature types achieves accuracy comparable to the manually labeled one: 87% versus 93%.
Wednesday 1.30pm.
Category | Case | Label |
---|---|---|
Usable directly numeric | Case a. Should be usable directly as a number feature for ML | numeric |
Usable with extraction | Case b. A number present along with a unit-of-measure string<br>Case c. A text corpus with semantic meaning<br>Case d. Date or time stamp | textual |
Usable directly categorical | Case e. Yes/No type values, including binary 0/1 answers<br>Case f. Country names, city names, food type names, and other object type names that are not cases l or m below<br>Case g. Coded numbers that are short forms of names in case f that are not cases l or m below<br>Case h. Short names that indicate type from a known finite set/domain that are not case l below<br>Case i. Handful of coded numbers that repeat themselves but arbitrary arithmetic on them is not meaningful and that are also not case l or n below<br>Case j. A coded number that encodes real-world entities from a known finite domain set | categorical |
Unusable | Case k. A number indicating the position of a record in its dataset table<br>Case l. An attribute that is likely the primary key in its dataset table | unusable |
Context dependent | Case m. Person name, company name, or any entity name that is not generic<br>Case n. Coded numbers or ids for people, companies, or other entity names from case m that are not cases g, i, j, k, or l above | dependent |
Record_id | y_pred | y_act | Reason | y_Arun | Check | Attribute_name | Total_val | Num of dist_val | % of dist_val | Num of nans | mean | std_dev | min_val | max_val | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
53 | Unusable | Unusable | l | Unusable | | id | 50000 | 50000 | 100 | 0 | 44432.4548 | 15773.45744 | 17283 | 73469 | 17283 | 17284 | 17285 | 17286 | 17287 |
48 | Unusable | Unusable | l | Unusable | | BeerID | 73861 | 73861 | 100 | 0 | 36931 | 21321.83411 | 1 | 73861 | 1 | 2 | 3 | 4 | 5 |
51 | Unusable | Context_specific | n | Context_specific | check | animal_id | 29421 | 28209 | 95.88049353 | 0 | 0 | 0 | 0 | 0 | A684346 | A685067 | A678580 | A675405 | A670420 |
34 | Unusable | Context_specific | l | Context_specific | | id | 671205 | 671205 | 100 | 0 | 993248.5937 | 196611.129 | 653047 | 1340339 | 653051 | 653053 | 653068 | 653063 | 653084 |
50 | Unusable | Context_specific | m | Context_specific | interesting | ACTOR1_ID | 165808 | 3032 | 1.828621056 | 25061 | 2587.796692 | 1030.062165 | 1 | 3960 | 1071 | 2037 | 1077 | 2191 | |
37 | Usable with extraction | Context_specific | n | Context_specific | check | Loan Theme ID | 15736 | 718 | 4.562785968 | 0 | 0 | 0 | 0 | 0 | a1050000000slfi | a10500000068jPe | a1050000002X1Uu | a1050000007VvXr | a1050000000weyk |
Regarding Snorkel:

- If two labeling functions give two different labels, how does Snorkel deal with it? I assume the Snorkel model only generates one label for each data point. From the tutorial: "Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor."
- The labeling function comparison that Snorkel generates needs more explanation: how do I tell which labeling function is better than another, and which labeling function takes effect when making a prediction? (See the sketch below.)
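From the Snorkel documentation, my current understanding: the label model learns per-LF accuracies from agreements and conflicts in the LF output matrix and emits one noise-aware (probabilistic) label per data point, and `LFAnalysis` reports per-LF coverage, overlaps, and conflicts for comparing LFs. A minimal sketch, assuming the snorkel v0.9 API (the older tutorial API differs); the two LFs, the label ids, and the thresholds here are hypothetical:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

ABSTAIN, NUMERIC, CATEGORICAL = -1, 0, 1  # hypothetical label ids

@labeling_function()
def lf_few_distinct(x):
    # Few distinct values suggests a categorical column.
    return CATEGORICAL if x["nunique"] < 20 else ABSTAIN

@labeling_function()
def lf_float_sample(x):
    # A float-typed sample suggests a numeric column.
    return NUMERIC if isinstance(x["sample_1"], float) else ABSTAIN

# Toy input: one row per attribute column.
df = pd.DataFrame({"nunique": [3, 5000, 4], "sample_1": ["yes", 12.3, "no"]})

lfs = [lf_few_distinct, lf_float_sample]
L = PandasLFApplier(lfs).apply(df)        # LF vote matrix, one column per LF
print(LFAnalysis(L, lfs).lf_summary())    # coverage/overlaps/conflicts per LF

# Conflicting votes are weighed by the estimated LF accuracies; the model
# then outputs a single label (or a label distribution) per data point.
label_model = LabelModel(cardinality=2)
label_model.fit(L)
print(label_model.predict(L))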
Regarding the data:

- What does it mean when the model predicts '1,2,3,4,5'?
```
1062  68.146296  hm15life   NaN  6540  a  350  9597  Usable directly numeric  NaN  NaN ... NaN  NaN  1  5  7  2  1265.812302  NaN  Context_specific  1
1063  68.219235  hm15owner  NaN  6547  a  350  9597  Usable directly numeric  NaN  NaN ... NaN  NaN  3  1  2  0  0.486729     NaN  Context_specific  3
```

```
>>> print(*np.unique(pred), sep='\n')
1
2
3
4
5
Context_specific
Unusable
Usable directly categorical
Usable directly numeric
Usable with extraction
```
Going through the provided files, I noticed a lot of inconsistencies. In the confusion matrix (figure not reproduced here), the X-axis represents the predicted label and the Y-axis the actual label. Observations:
- The most problematic field is `context specific`, which is largely misclassified as `usable directly numeric`.
- Records 208-247: attribute names containing 'xxxid' are classified as `usable directly numeric`.
- Records 1041-1079: classified as numbers (1-5)?
- What do the columns 'Unnamed: 2', 'Unnamed: 9', 'check' mean?
1/30:
- walk through Snorkel tutorials
- play around with data and get some insights
- look for related papers; ask the professor for suggestions?
- http://www.vldb.org/pvldb/vol12/p223-varma.pdf
- http://cidrdb.org/cidr2019/papers/p58-ratner-cidr19.pdf
2/6:

- Extract features from the raw data.
- Ran a character-level LSTM on the variable names: 91% accuracy (5000 train vs. 1000 test); a sketch follows the feature list below. Problems: lots of duplicated records; hard to generalize to unseen names.
- How could I design hierarchical labeling functions? (See the pseudocode below.)
- How do I fit a trained NN model into a labeling function? (See the sketch after the pseudocode.)
- Choice between RNN and CNN: which fits best for the different types of features?
- What other information from the data could be valuable?
- Can I get some existing labeling functions from previous work?
- Plan to fit an LSTM on the histogram.
- Plan to fit a CNN on the other features.
- How do I fit the end model after Snorkel?
- Feature extraction takes an extremely long time.
Extracted features:

```
['name', 'total', 'nunique', '%unique', '#null', 'std', 'var', 'min', 'max',
 'mean', 'median', 'mode', 'hist0', 'hist1', 'hist2', 'hist3', 'hist4',
 'hist5', 'hist6', 'hist7', 'hist8', 'hist9']
```
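A minimal sketch of the character-level LSTM over attribute names mentioned above, assuming TensorFlow/Keras; the vocabulary, max length, layer sizes, and the five-class output are my assumptions, not the exact settings used:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 30                                        # assumed max name length
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789_"
CHAR_IDX = {c: i + 1 for i, c in enumerate(VOCAB)}  # 0 is reserved for padding
NUM_CLASSES = 5                                     # five feature-type categories

def encode(name):
    # Map an attribute name to a fixed-length sequence of character ids;
    # unknown characters fall back to the padding id.
    ids = [CHAR_IDX.get(c, 0) for c in name.lower()[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

model = tf.keras.Sequential([
    layers.Embedding(len(VOCAB) + 1, 16, mask_zero=True),
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(np.stack([encode(n) for n in names]), labels, epochs=10)
```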
Sketch of a hierarchical LF dispatch (LF2/LF3 are the type-specific labeling functions):

```python
def LF1(values):
    # Dispatch to a type-specific labeling function based on the sample values.
    if all(isinstance(v, str) for v in values):
        return LF2(values)  # string-valued columns
    if all(isinstance(v, (int, float)) for v in values):
        return LF3(values)  # numeric columns
    return None             # abstain otherwise
```
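On fitting a trained NN model into a labeling function: one option is to wrap its prediction in an LF that abstains below a confidence threshold. A sketch, assuming the snorkel v0.9 decorator, the `model`/`encode` from the LSTM sketch above, a `name` column in the input rows, and an arbitrary 0.9 threshold:

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1
THRESHOLD = 0.9  # abstain unless the model is confident

@labeling_function()
def lf_name_model(x):
    # Vote with the char-level LSTM's predicted class; abstain when unsure.
    probs = model.predict(encode(x["name"])[None, :], verbose=0)[0]
    return int(probs.argmax()) if probs.max() >= THRESHOLD else ABSTAIN
```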
2/13:

- Learned that Snorkel takes in a labeling-function output matrix.
- Wrote labeling functions for "Usable with extraction":

LF | Results | Explanation |
---|---|---|
lf_date_extraction_name | [6., 21., 149., 12., 1.] | datetime, time, date in name |
lf_date_extraction_samples | [5., 0., 187., 8., 16.] | samples in datetime format |
lf_extractable_name | [51., 16., 122., 32., 8.] | url, comment, etc. in name |
lf_extractable_list | [2., 3., 26., 5., 0.] | samples in list/dict format |
lf_extractable_sample_length | [123., 3., 274., 37., 57.] | samples with long length |
lf_extractable_units | [1., 0., 3., 0., 0.] | samples in "num, unit" format |
lf_extractable_number_sci | [1., 1., 0., 0., 0.] | samples with scientific representation |
lf_extractable_pattern | [25., 1., 78., 14., 14.] | samples where texts follow a pattern while differing in numbers |

- Q: does an LF produce only one category, or none (abstain)?
- TODO: write more labeling functions.
- TODO: feed the m×n label matrix into Snorkel (see the sketch below).
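For the m×n feed, a sketch assuming the snorkel v0.9 API, that the LFs above are implemented as `@labeling_function`s, and a hypothetical `df_train` with one row per attribute:

```python
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

lfs = [lf_date_extraction_name, lf_date_extraction_samples,
       lf_extractable_name, lf_extractable_list]
L_train = PandasLFApplier(lfs).apply(df_train)    # shape: (m data points, n LFs)

label_model = LabelModel(cardinality=5)           # five feature-type categories
label_model.fit(L_train)
probs_train = label_model.predict_proba(L_train)  # probabilistic training labels
```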
2/20:
LF | Checks | Reasoning | sample1 | sample2 |
---|---|---|---|---|
lf_cast_to_numbers | Case a. Should be usable directly as a number feature for ML | 4/5 samples are float values | 12.34 | 24.54 |
lf_extractable_units | Case b. A number present along with a unit-of-measure string | unit + number + unit | 50 hz | $10 |
lf_extractable_number_sci | Case b. A number with scientific representation | `\d+[eE^,]\d+` | 5,000 | 1e9 |
lf_extractable_pattern | Case c. A string representation following some pattern | pg 1, pg 2, pg 5 | HT-1, HT-2 | |
lf_date_extraction_name | Case d. Date or time stamp | date, time in attribute name | | |
lf_extractable_name | Case c. A text corpus with semantic meaning, URL, address | url, text in attribute name | review text | remarks |
lf_extractable_list | Case c. A list of items in a single sample separated by symbols | starts and ends with {} [] () | {man:clothing, woman:clothing} | |
lf_extractable_sample_length | Case c. Extremely long textual data (an integer could not be that long) | len(str) > 25 | url | |
lf_date_extraction_samples | Case d. Date or time stamp | regex match on datetime | 7/11/2018 | 12:20 |
lf_binary_category | Case e. Yes/No type values, including binary 0/1 answers | dist == 2 / 3 | | |
lf_name_category | Case f. Country names, city names, food type names, and other object type names | 'city, state, country, ...' in Attribute_name | | |
lf_coded_abbreviation | Case g. Coded numbers that are short forms of names | upper case + same length + only alpha | CHN, USA | CS, MATH |
lf_coded_number | Case g. Coded numbers that are short forms of names<br>Case i. Handful of coded numbers that repeat themselves but arbitrary arithmetic on them is not meaningful | 1. attribute name contains a code<br>2. all codes are of the same length<br>3. all codes consist of numbers | 80525, 92092 | 1995, 2018 |
lf_finite_set_name | Case h. Short names that indicate type from a known finite set/domain<br>Case j. A coded number that encodes real-world entities from a known finite domain set | 1. attribute name indicates the samples: ['job title', 'type', 'gender'] | | |
lf_finite_set_sample | Case h. Short names that indicate type from a known finite set/domain | 1. samples usually have median length 10-25<br>2. samples are mostly composed of alphabetic letters | | |
- A number present along with a unit-of-measure string: unit + number + unit
- Scientific representation: `\d+[eE,^]\d+` (see the sketch after this list)
- A string representation following some pattern among all its samples
- A text corpus with semantic meaning, URL, or address in its attribute name
- A text corpus with date/time in its attribute name
- A structured representation among all its samples
- Average length above 25 for all samples: long texts
- A structured date/time representation among all its samples
- Email or URL in samples
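As an example of turning one of these heuristics into code, a sketch of the scientific-representation check; the label id, the majority threshold, and the `samples` column are assumptions:

```python
import re
from snorkel.labeling import labeling_function

ABSTAIN = -1
USABLE_WITH_EXTRACTION = 1  # hypothetical label id

SCI_RE = re.compile(r"\d+[eE,^]\d+")

@labeling_function()
def lf_extractable_number_sci(x):
    # Vote "usable with extraction" if most samples look like 5,000 or 1e9.
    samples = [str(s) for s in x["samples"]]
    hits = sum(bool(SCI_RE.fullmatch(s)) for s in samples)
    return USABLE_WITH_EXTRACTION if samples and hits >= len(samples) / 2 else ABSTAIN
```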
Category | Total | Matched | Mismatched | Abstained | Accuracy | Coverage |
---|---|---|---|---|---|---|
Usable With Extraction | 650 | 581 | 559 | 69 | .5096 | .8938 |
Usable Directly Categorical | 2087 | 1482 | 823 | 605 | .6430 | .7101 |
Usable Directly Numeric | 5063 | 5055 | 3459 | 8 | .5937 | .9984 |
Unusable | 891 | 856 | 576 | 35 | .5978 | .9607 |
TODO:
- random forest on deterministic argmax labels
- CNN on the probability distribution
- CNN on deterministic argmax labels
2/26:
- Is the downstream model highly dependent on the Snorkel output?
- Should the CNN model output probabilities or a category? (See the sketch after the table below.)
- What can I do to increase the labeling accuracy?
- Why does the Snorkel model accuracy go down? Refer to the table.
Categories | Noisy | Disc. Model | Model Acc. | ne | md | epoch |
---|---|---|---|---|---|---|
4 | 0.855 | 0.867 | 0.875 | 128 | 64 | 20 |
4 | 0.855 | 0.848 | 0.846 | 128 | 64 | 10 |
4 | 0.855 | 0.850 | 0.868 | 128 | 64 | 50 |
4 | 0.855 | 0.855 | 0.875 | 128 | 128 | 20 |
4 | 0.855 | 0.860 | 0.875 | 64 | 32 | 20 |
4 | 0.855 | 0.864 | 0.885 | 32 | 32 | 50 |
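On probabilities vs. category: one option is noise-aware training, fitting the end model directly on the label model's probabilistic output instead of its argmax, since categorical cross-entropy accepts soft targets. A Keras sketch; the architecture, the 9 input features, and `probs_train` (from `label_model.predict_proba`) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Simple end model over the extracted column statistics (9 features assumed).
end_model = tf.keras.Sequential([
    layers.Input(shape=(9,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
# categorical_crossentropy accepts soft targets, so the probabilistic labels
# can be used as-is rather than collapsed to their argmax.
end_model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
# end_model.fit(X_train, probs_train, epochs=20)
```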
### 1. Train a random forest on the argmax of the probabilistic labels

Simply taking the argmax over all probabilities of the original probabilistic labels yields an accuracy of 0.68. A random forest was then trained to map the input features (9 features) to these deterministic labels. The best model reaches 92% accuracy on the test split measured against the Snorkel-generated labels, but only 0.66 accuracy against the true labels. A minimal sketch follows.
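Sketch of this step, assuming `X` (an n×9 feature matrix) and `probs` (n×5 probabilistic labels from the Snorkel label model) are precomputed; the hyperparameters are scikit-learn defaults, not necessarily the ones used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

y_noisy = probs.argmax(axis=1)  # collapse probabilistic labels to hard classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y_noisy, test_size=0.2,
                                          random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)
# Test accuracy here is measured against the Snorkel labels, not the true
# labels, which is why it can be much higher than the "real" accuracy.
print("accuracy vs. Snorkel labels:", rf.score(X_te, y_te))
```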