-
Notifications
You must be signed in to change notification settings - Fork 4
SampleClean Retreat Summer 2014 Demo
Create the table
sampleclean> CREATE TABLE wikipedia (title string, text string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'
LINES TERMINATED BY '\n';
OK
Time taken: 0.035 seconds
Load the data into the table
sampleclean> LOAD DATA LOCAL INPATH 'data/files/wikipedia_abstracts.csv'
OVERWRITE INTO TABLE wikipedia;
OK
Time taken: 14.109 seconds
Run some queries on the dataset
sampleclean> SELECT COUNT(1) FROM wikipedia;
OK
4004479
Time taken: 8.535 seconds
Number of articles referring to "Apple"
sampleclean> SELECT COUNT(1) FROM wikipedia where lower(text) like '%apple%';
OK
11261
Time taken: 36.853 seconds
Number of articles referring to "Google"
sampleclean> SELECT COUNT(1) FROM wikipedia where lower(text) like '%google%';
OK
7400
Time taken: 33.109 seconds
We don't support stratified sampling yet, so in there interim run the following command:
sampleclean> CREATE TABLE wikipedia_stratified AS SELECT * FROM WIKIPEDIA
WHERE (lower(text) LIKE '%google%')
OR (lower(text) LIKE '%apple%')
OR (rand() < 0.01);
OK
Time taken: 65.346 seconds
sampleclean> select count(1) from wikipedia_stratified;
OK
58491
Time taken: 1.545 seconds
Initialize the SampleClean view sampleclean> scinitialize wikipedia_stratified_sample (title,text) from wikipedia_stratified samplewith 0.1;
Get the sample size: sampleclean> sccount wikipedia_stratified_sample; OK 5795 Time taken: 3.249 seconds sampleclean> set sampleclean.datset.size = 58491; sampleclean> set sampleclean.sample.size = 5795;
Get the approximate versions of the initial query:
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%apple%';
OK
11719.995426829268 +/- 1880.0 (95% Confidence Clean)
Time taken: 1.661 seconds
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%google%';
225.500: [GC 321692K->29898K(1025536K), 0.0028120 secs]
226.379: [GC 331978K->30925K(1025024K), 0.0030140 secs]
OK
7222.211890243902 +/- 1545.0 (95% Confidence Clean)
Filter out "fruits":
sampleclean> scfilter wikipedia_stratified_sample lower(text) not like '%fruit%';
OK
Time taken: 2.052 seconds
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%apple%';
OK
10659.94512195122 +/- 1813.0 (95% Confidence Clean)
Time taken: 1.47 seconds
Filter out "plants" (eg. apple blossom):
sampleclean>scfilter wikipedia_stratified_sample lower(text) not like '%plant%';
OK
Time taken: 1.991 seconds
sampleclean>selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%apple%';
OK
10154.6875 +/- 1779.0 (95% Confidence Clean)
Time taken: 1.452 seconds
Find apple only on word boundaries (eg. Appleton). First, we replace all punctuation with spaces:
sampleclean> scformat wikipedia_stratified_sample text replace ',' ' ';
OK
Time taken: 1.812 seconds
sampleclean> scformat wikipedia_stratified_sample text replace '\.' ' ';
OK
Time taken: 1.987 seconds
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '% apple %';
OK
5181.367378048781 +/- 1335.0 (95% Confidence Clean)
Time taken: 1.287 seconds
Apple now has a smaller count that Google. Next, we filter out lists.
sampleclean> scfilter wikipedia_stratified_sample lower(title) not like 'list%';
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '% apple %';
4963.413109756098 +/- 1309.0 (95% Confidence Clean)
Time taken: 1.324 seconds
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%google%';
1578.959: [GC 355552K->28115K(1038336K), 0.0028100 secs]
OK
5934.300304878048 +/- 1418.0 (95% Confidence Clean)
Time taken: 1.249 seconds
We can now run any query we want on the cleaned dataset and estimate results. For example, if we want to know how often they show up in the same article.
sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%google%' and
lower(text) like '%apple %';
OK
188.23323170731706 +/- 266.0 (95% Confidence Clean)
Time taken: 1.304 seconds