This repository has been archived by the owner on Aug 24, 2023. It is now read-only.

Create sample partition data

$ bundle install
$ ./console.rb
irb(main):001:0> data = SamplePartitionData.new('bucket_name')
irb(main):002:0> data.create # create a partition
irb(main):003:0> data.bulk_create(1000) # create 1000 partitions
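
The implementation of `SamplePartitionData` is not shown here. As a rough illustration only, the sketch below shows one way such sample data could be laid out so that `MSCK REPAIR TABLE` can discover it: one Hive-style `partition_id=<n>/` prefix per partition, each holding a CSV file with the four string columns from the table definition. The helper names (`partition_key`, `sample_row`, `sample_objects`), the file name `data.csv`, and the row count are all assumptions, not the repository's actual code.

```ruby
require 'securerandom'

# Hypothetical helper: S3 key of the single data file in one partition.
# MSCK REPAIR TABLE only discovers Hive-style "column=value/" prefixes.
def partition_key(partition_id, file_name = 'data.csv')
  "partition_id=#{partition_id}/#{file_name}"
end

# One CSV row with four random string fields (first, second, third, fourth).
def sample_row
  Array.new(4) { SecureRandom.hex(4) }.join(',')
end

# Builds a { key => body } hash for `count` partitions starting at `start_id`.
# Nothing is uploaded here; the hash only describes what would be written.
def sample_objects(count, start_id: 1, rows_per_file: 10)
  (start_id...(start_id + count)).each_with_object({}) do |id, objects|
    rows = Array.new(rows_per_file) { sample_row }
    objects[partition_key(id)] = rows.join("\n") + "\n"
  end
end

sample_objects(2).each_key { |key| puts key }
```

Each key/body pair could then be uploaded to the bucket, for example with `put_object` from the aws-sdk-s3 gem.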

Result

2017/03/30
Athena: us-west-2
S3 bucket: ap-northeast-1

Scan 10,000 partitions at once

  • 7 minutes 59 seconds
  • 265 minutes 17 seconds => cancel
  • 263 minutes 41 seconds => cancel
  • 263 minutes 16 seconds => cancel
  • 262 minutes 53 seconds => cancel
  • 262 minutes 57 seconds => cancel
  • 75 minutes 9 seconds => cancel

Although these queries ran for well over 30 minutes, Athena did not kill them.
This behavior confused me, because the documentation stated a query timeout of 30 minutes.

Scan 1 additional partition after 10,000 partitions have been scanned

  • 43 minutes 12 seconds
  • 46 minutes 1 second
  • 43 minutes 53 seconds
  • 44 minutes 15 seconds
  • 44 minutes 28 seconds
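
To summarize the five runs above in a single figure, the timings can be averaged. This is only a small arithmetic sketch; the helper names are made up for illustration and the durations are copied from the list.

```ruby
# Durations from the list above, as [minutes, seconds] pairs.
DURATIONS = [[43, 12], [46, 1], [43, 53], [44, 15], [44, 28]].freeze

# Converts a minutes/seconds pair to a total number of seconds.
def to_seconds(minutes, seconds)
  minutes * 60 + seconds
end

# Arithmetic mean of the durations, in (fractional) seconds.
def mean_seconds(durations)
  total = durations.sum { |minutes, seconds| to_seconds(minutes, seconds) }
  total.to_f / durations.size
end

# Formats a number of seconds, rounded to the nearest whole second.
def format_duration(seconds)
  minutes, rest = seconds.round.divmod(60)
  "#{minutes} minutes #{rest} seconds"
end

puts format_duration(mean_seconds(DURATIONS)) # prints "44 minutes 22 seconds"
```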

Query for Athena

CREATE EXTERNAL TABLE IF NOT EXISTS partitioning_test (
  `first` string,
  `second` string,
  `third` string,
  `fourth` string
) PARTITIONED BY (
  partition_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://bucket/';
MSCK REPAIR TABLE partitioning_test;
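
`MSCK REPAIR TABLE` discovers every partition in one pass. As a possible alternative, partitions can also be registered explicitly with `ALTER TABLE ... ADD PARTITION`. The sketch below generates such statements in batches; it assumes the data lives under Hive-style `partition_id=<n>/` prefixes, and the helper name `add_partitions_sql`, the bucket name, and the batch size of 100 are illustrative choices rather than anything measured in this test.

```ruby
# Hypothetical helper: builds one ALTER TABLE statement that registers
# the given partition ids, assuming a "partition_id=<n>/" prefix per id.
def add_partitions_sql(table, bucket, ids)
  clauses = ids.map do |id|
    n = Integer(id) # reject anything that is not an integer id
    "PARTITION (partition_id = #{n}) LOCATION 's3://#{bucket}/partition_id=#{n}/'"
  end
  "ALTER TABLE #{table} ADD IF NOT EXISTS\n  #{clauses.join("\n  ")};"
end

# Split 10,000 partition ids into batches of 100 ids per statement.
statements = (1..10_000).each_slice(100).map do |ids|
  add_partitions_sql('partitioning_test', 'bucket', ids)
end

puts statements.first.lines.first(3)
```

Batching keeps each statement small; the statements would still have to be submitted to Athena one at a time.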

License

This is available as open source under the terms of the MIT License.

Copyright © pataiji and Speee, Inc.