Datalake AWS S3
The OpenGeoScales datalake stores the different stages of the GHG emissions data that were selected after the exploration phase.
The first version of the OGS datalake is organized as follows:
- Two main folders distinguish between development work (`ogs-dev`) and production work (`ogs-prd`).
- A `data` folder contains all stored data. For now, only data is stored in the OGS datalake; source code or documents may also be integrated in the future.
- A first split of the `data` folder is made by data stage: `raw` --> `staging`.
  - `raw`: contains raw data as collected from the data providers, without any modification in terms of format, structure or content.
  - `staging`: contains a first-stage transformation of the raw data, mapped to a standardized data schema. All mapped data are stored in JSON files following the specification of the OGS staging data model.
The following schema presents the datalake structure:

```
ogs-dev/
├── data/                      # all stored data
│   ├── raw/                   # raw data as collected from the providers
│   │   ├── ghg-emissions/     # GHG emissions data, one folder per source
│   │   │   ├── source 1/
│   │   │   │   ├── dataset 1
│   │   │   │   ├── dataset 2
│   │   │   │   └── dataset n
│   │   │   ├── source 2/
│   │   │   │   ├── dataset 1
│   │   │   │   ├── dataset 2
│   │   │   │   └── dataset n
│   │   │   └── source n/
│   │   ├── socio-economic/    # socio-economic data
│   │   └── geo-ref/           # geographical reference data
│   └── staging/               # data mapped to the OGS staging data model
│       ├── ghg-emissions/
│       │   ├── source 1/
│       │   │   └── json file
│       │   ├── source 2/
│       │   │   └── json file
│       │   └── source n/
```
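As an illustration of this layout, here is a minimal sketch of a hypothetical helper (not part of the OGS codebase) that composes an object key from the stage, data domain, source and dataset; the bucket name (`ogs-dev` or `ogs-prd`) carries the environment:

```python
def build_key(stage, domain, source, dataset):
    """Compose an S3 object key such as 'data/raw/ghg-emissions/source 1/dataset 1'."""
    return "/".join(["data", stage, domain, source, dataset])

# Placeholder names taken from the schema above; real source and dataset names depend on the provider.
print(build_key("raw", "ghg-emissions", "source 1", "dataset 1"))
# -> data/raw/ghg-emissions/source 1/dataset 1
```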
- The first step is to install the `boto3` module using your preferred method (pip, conda).
- To access S3, we first need to authenticate by defining a boto3 `ServiceResource` object, using the following code:
```python
import boto3

s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecretkey'
)
```
- The service name should be `s3`.
- The region needs to be set to the one the S3 buckets are affiliated with (here `eu-west-3`).
- `aws_access_key_id` and `aws_secret_access_key` are the credentials that should have been transmitted to you when the account was created.
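As a side note, here is a minimal sketch of the same authentication that keeps the credentials out of the source code; it assumes they have been exported as the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables:

```python
import os
import boto3

# Read the credentials from environment variables instead of hardcoding them;
# the region stays the one affiliated with the OGS S3 buckets.
s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
```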
Once authentication is properly done, you can start accessing the different buckets on S3. To list all available buckets, you can use the following code:
```python
for bucket in s3.buckets.all():
    print(bucket.name)
```
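In the same spirit, a short sketch of listing the objects stored under one stage of the datalake, using the bucket's `objects.filter` collection; the prefix below is only an example path that follows the structure described above:

```python
# List all objects stored under the raw ghg-emissions stage of the dev datalake.
for obj in s3.Bucket('ogs-dev').objects.filter(Prefix='data/raw/ghg-emissions/'):
    print(obj.key, obj.size)
```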
- To upload a file, one can use the following line of code
```python
s3.Bucket(bucket_name).upload_file(Filename=local_file, Key=server_file)
```
- `bucket_name` should be one of the available buckets that were printed before
- `local_file` is the path to the local file that has to be uploaded to S3
- `server_file` is the path where the file will be stored in the bucket on S3
Example:

```python
s3.Bucket('ogs-dev').upload_file(Filename='grid_edgar.dat', Key='data/raw/geo-ref/grid_edgar.dat')
```

This uploads the file 'grid_edgar.dat' to the path ogs-dev/data/raw/geo-ref/grid_edgar.dat on S3.
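Conversely, a file already stored in the datalake can be retrieved locally with the symmetric `download_file` call; the sketch below simply reuses the paths from the example above:

```python
# Download the same file from the dev datalake to the current working directory.
s3.Bucket('ogs-dev').download_file(Key='data/raw/geo-ref/grid_edgar.dat', Filename='grid_edgar.dat')
```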