SampleClean User Guide
SCINITIALIZE creates a SampleClean view. The view is maintained in memory, and its schema is recorded in an in-memory catalog. We currently support only uniform samples drawn from the entire base table.
SCINITIALIZE sampleName(arg1, arg2, ..., argk) FROM table SAMPLEWITH ratio;
Internally, we maintain two views, sampleName_clean and sampleName_dirty. In addition to the sampled attributes, each view contains a hash and a duplicate count. Before any data cleaning technique is run, the two views are identical. Example creation query:
sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density) FROM cities SAMPLEWITH 0.1;
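To make the two-view layout concrete, here is a minimal Python sketch of a uniform sample at a given ratio, copied into a dirty and a clean view whose rows carry a hash and a duplicate count. It is only a conceptual model, not SampleClean's implementation: the function name `sc_initialize`, the row layout, the use of MD5 over the row position and values, and the initial duplicate count of 1 are all assumptions made for illustration.

```python
import copy
import hashlib
import random


def sc_initialize(rows, columns, ratio, seed=None):
    """Hypothetical model of SCINITIALIZE: returns (dirty_view, clean_view)."""
    rng = random.Random(seed)
    sample = []
    for position, row in enumerate(rows):
        # Uniform (Bernoulli) sampling: keep each row independently
        # with probability `ratio`.
        if rng.random() < ratio:
            # All attributes are held as strings, as the guide describes.
            values = [str(row[c]) for c in columns]
            record = dict(zip(columns, values))
            # Assumed bookkeeping columns: a row hash and a duplicate count.
            key = "|".join([str(position)] + values)
            record["hash"] = hashlib.md5(key.encode("utf-8")).hexdigest()
            record["dup"] = 1
            sample.append(record)
    # Before any cleaning runs, the clean view is an identical copy
    # of the dirty view.
    return sample, copy.deepcopy(sample)


cities = [
    {"city": "Paris", "country": "France", "population": 2100000},
    {"city": "Lyon", "country": "France", "population": 520000},
]
dirty, clean = sc_initialize(cities, ["city", "country", "population"], 0.5, seed=1)
```

With a ratio of 0.5 each base row is kept with probability one half, so the sample size varies from run to run unless a seed is fixed.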
Because samples can only be drawn from entire tables, a workaround for sampling a subset of a table is to first create a temporary table with a CREATE TABLE ... AS SELECT statement and then sample from that table. Furthermore, we do not currently support data types: all attributes are represented as Java Strings, which are automatically processed as numbers when you run a numerical query.
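The string-only representation can be pictured with a short Python sketch: values stay as strings in the view and are converted to numbers only when a numerical query touches them. The helper name `numeric_avg` and the choice to convert with `float` are assumptions for illustration; the guide does not specify how SampleClean performs the conversion.

```python
def numeric_avg(view, column):
    """Hypothetical numerical query: average a column stored as strings."""
    # Convert lazily, at query time; the stored values remain strings.
    values = [float(row[column]) for row in view]
    if not values:
        return None  # no rows in the sample, so there is nothing to average
    return sum(values) / len(values)


sample_view = [
    {"city": "Paris", "population": "2100000"},
    {"city": "Lyon", "population": "520000"},
]
average_population = numeric_avg(sample_view, "population")
```

In this model a non-numeric string in the column would raise a `ValueError`, which is one reason dirty values may need cleaning before numerical queries are run.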
If we are unhappy with our data cleaning results, we can reset our clean sample back to its original state with:
screset sampleName;
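Conceptually, a reset can be thought of as overwriting the clean view with a fresh copy of the dirty view, since the two are identical before any cleaning. The Python sketch below models that idea under this assumption; the `catalog` dictionary and the function name `sc_reset` are hypothetical and do not describe SampleClean's internals.

```python
import copy


def sc_reset(catalog, sample_name):
    """Hypothetical model of SCRESET: restore <name>_clean from <name>_dirty."""
    dirty = catalog[sample_name + "_dirty"]
    # Deep copy so later cleaning of the clean view cannot alter the dirty view.
    catalog[sample_name + "_clean"] = copy.deepcopy(dirty)


catalog = {
    "cities_sample_dirty": [{"city": "NYC", "population": "8400000"}],
    # State after some cleaning step that we now want to undo.
    "cities_sample_clean": [{"city": "New York", "population": "8400000"}],
}
sc_reset(catalog, "cities_sample")
```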
To delete a sample:
sampleclean> drop table sampleName_clean;
sampleclean> drop table sampleName_dirty;
TODO