Datalake AWS S3
The OpenGeoScales datalake stores the different stages of the GHG emissions data that were selected after the exploration phase.
The first version of the OGS datalake is organized as follows:
- Two main folders distinguish between development work (`ogs-dev`) and production work (`ogs-prd`).
- A `data` folder contains all stored data. For now, only data is stored in the OGS datalake; source code or documents may also be integrated in the future.
- A first split of the `data` folder is made by data stage: `raw` --> `staging`.
  - `raw`: contains raw data as collected from the data providers, without any modification in terms of format, structure or content.
  - `staging`: contains a first-stage transformation of the raw data, mapped to a standardized data schema. All mapped data are stored in JSON files following the specification of the OGS staging data model.
The following schema presents the datalake structure:

```
ogs-dev/
├── data/                      # all stored data
│   ├── raw/                   # raw data as collected from the providers
│   │   ├── ghg-emissions/     # GHG emissions data, one folder per source
│   │   │   ├── source 1/
│   │   │   │   ├── dataset 1
│   │   │   │   ├── dataset 2
│   │   │   │   └── dataset n
│   │   │   ├── source 2/
│   │   │   │   ├── dataset 1
│   │   │   │   ├── dataset 2
│   │   │   │   └── dataset n
│   │   │   └── source n/
│   │   ├── socio-economic/    # socio-economic data
│   │   └── geo-ref/           # geographical reference data
│   └── staging/               # data mapped to the OGS staging data model
│       ├── ghg-emissions/
│       │   ├── source 1/
│       │   │   └── json file
│       │   ├── source 2/
│       │   │   └── json file
│       │   └── source n/
```
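As an illustration of this layout, here is a minimal sketch of a hypothetical helper (not part of the OGS codebase) that composes an object key from the stage, data domain, source and dataset; the bucket name (`ogs-dev` or `ogs-prd`) carries the environment:

```python
def build_key(stage, domain, source, dataset):
    """Compose an S3 object key such as 'data/raw/ghg-emissions/source 1/dataset 1'."""
    return "/".join(["data", stage, domain, source, dataset])

# Placeholder names taken from the schema above; real source and dataset names depend on the provider.
print(build_key("raw", "ghg-emissions", "source 1", "dataset 1"))
# -> data/raw/ghg-emissions/source 1/dataset 1
```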
- The first step is to install the `boto3` module using your preferred method (pip, conda).
- To access S3, we first need to authenticate by defining a boto3 `ServiceResource` object, using the following code:
```python
import boto3

s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecretkey'
)
```
- The service name should be `s3`.
- The region needs to be set to the one the S3 buckets are affiliated with (here `eu-west-3`).
- `aws_access_key_id` and `aws_secret_access_key` are the credentials that should have been transmitted to you when the account was created.
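As a side note, here is a minimal sketch of the same authentication that keeps the credentials out of the source code; it assumes they have been exported as the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables:

```python
import os
import boto3

# Read the credentials from environment variables instead of hardcoding them;
# the region stays the one affiliated with the OGS S3 buckets.
s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
```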
Once authentication is properly done, you can start accessing the different buckets on S3. To list all available buckets, you can use the following code:
```python
for bucket in s3.buckets.all():
    print(bucket.name)
```
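In the same spirit, a short sketch of listing the objects stored under one stage of the datalake, using the bucket's `objects.filter` collection; the prefix below is only an example path that follows the structure described above:

```python
# List all objects stored under the raw ghg-emissions stage of the dev datalake.
for obj in s3.Bucket('ogs-dev').objects.filter(Prefix='data/raw/ghg-emissions/'):
    print(obj.key, obj.size)
```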
- To upload a file, one can use the following line of code
```python
s3.Bucket(bucket_name).upload_file(Filename=local_file, Key=server_file)
```
- `bucket_name` should be one of the available buckets that were printed before
- `local_file` is the path to the local file that has to be uploaded to S3
- `server_file` is the path where the file will be stored in the bucket on S3
Example:

```python
s3.Bucket('ogs-dev').upload_file(Filename='grid_edgar.dat', Key='data/raw/geo-ref/grid_edgar.dat')
```

This uploads the file 'grid_edgar.dat' to the path ogs-dev/data/raw/geo-ref/grid_edgar.dat on S3.
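Conversely, a file already stored in the datalake can be retrieved locally with the symmetric `download_file` call; the sketch below simply reuses the paths from the example above:

```python
# Download the same file from the dev datalake to the current working directory.
s3.Bucket('ogs-dev').download_file(Key='data/raw/geo-ref/grid_edgar.dat', Filename='grid_edgar.dat')
```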