This library helps you read and write data from most common data sources. It accelerates ML and ETL workflows by removing the need to manage multiple data connectors.
pip install -U dataligo
Install from source
Alternatively, you can clone the latest version from the repository and install it directly from the source code:
pip install -e .
>>> from dataligo import Ligo
>>> from transformers import pipeline
>>> ligo = Ligo('./ligo_config.yaml') # Check the sample_ligo_config.yaml for reference
>>> print(ligo.get_supported_data_sources_list())
['s3',
'gcs',
'azureblob',
'bigquery',
'snowflake',
'redshift',
'starrocks',
'postgresql',
'mysql',
'oracle',
'mssql',
'mariadb',
'sqlite',
'elasticsearch',
'mongodb',
'dynamodb',
'redis']
>>> mongodb = ligo.connect('mongodb')
>>> df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='pandas') # The default return_type is pandas
>>> df.head()
_id Review
0 64272bb06a14f52787e0a09e good and interesting
1 64272bb06a14f52787e0a09f This class is very helpful to me. Currently, I...
2 64272bb06a14f52787e0a0a0 like!Prof and TAs are helpful and the discussi...
3 64272bb06a14f52787e0a0a1 Easy to follow and includes a lot basic and im...
4 64272bb06a14f52787e0a0a2 Really nice teacher!I could got the point eazl...
>>> classifier = pipeline("sentiment-analysis")
>>> reviews = df.Review.tolist()
>>> results = classifier(reviews, truncation=True)
>>> for result in results:
...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.9997
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.999
label: POSITIVE, with score: 0.9967
>>> df['predicted_label'] = [result['label'] for result in results]
>>> df['predicted_score'] = [round(result['score'], 4) for result in results]
>>> # Write the results back to MongoDB
>>> mongodb.write_dataframe(df, 'reviewdb', 'review_sentiments')
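
For copy-paste convenience, here is the same quickstart as a plain script; it uses only the calls shown in the transcript above.

```python
from dataligo import Ligo
from transformers import pipeline

ligo = Ligo('./ligo_config.yaml')
mongodb = ligo.connect('mongodb')

# Read the reviews into a pandas DataFrame (pandas is the default return_type)
df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews')

# Score each review with an off-the-shelf sentiment-analysis pipeline
classifier = pipeline('sentiment-analysis')
results = classifier(df.Review.tolist(), truncation=True)
df['predicted_label'] = [result['label'] for result in results]
df['predicted_score'] = [round(result['score'], 4) for result in results]

# Write the scored DataFrame back to MongoDB
mongodb.write_dataframe(df, 'reviewdb', 'review_sentiments')
```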
Data Sources | Type | pandas | polars | dask |
---|---|---|---|---|
S3 | datalake | ✅ | ✅ | ✅ |
GCS | datalake | ✅ | ✅ | ✅ |
Azure Blob Storage | datalake | ✅ | ✅ | ✅ |
Snowflake | datawarehouse | ✅ | ✅ | ✅ |
BigQuery | datawarehouse | ✅ | ✅ | ✅ |
StarRocks | datawarehouse | ✅ | ✅ | ✅ |
Redshift | datawarehouse | ✅ | ✅ | ✅ |
PostgreSQL | database | ✅ | ✅ | ✅ |
MySQL | database | ✅ | ✅ | ✅ |
MariaDB | database | ✅ | ✅ | ✅ |
MsSQL | database | ✅ | ✅ | ✅ |
Oracle | database | ✅ | ✅ | ✅ |
SQLite | database | ✅ | ✅ | ✅ |
MongoDB | nosql | ✅ | ✅ | ✅ |
ElasticSearch | nosql | ✅ | ✅ | ✅ |
DynamoDB | nosql | ✅ | ✅ | ✅ |
Redis (beta) | nosql | ✅ | ✅ | ✅ |
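
The three dataframe columns in the table correspond to the values the `return_type` parameter accepts. A minimal sketch, reusing the MongoDB connector from the quickstart and assuming `'polars'` and `'dask'` are passed the same way as `'pandas'`:

```python
# Same read as the quickstart, varying only return_type.
# Assumes 'polars' and 'dask' are accepted exactly as the table above suggests.
pandas_df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='pandas')
polars_df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='polars')
dask_df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='dask')
```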
Some functionalities of DataLigo are inspired by the following packages.

- DataLigo uses ConnectorX to read data from most RDBMS databases, both for its performance benefits and as the inspiration for the `return_type` parameter (see the sketch after this list).
- DataLigo uses dynamo-pandas to read and write data from DynamoDB.
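
A hedged sketch of a SQL read that would go through the ConnectorX-backed path. It assumes a `'postgresql'` connector follows the same `connect()` pattern as the MongoDB example; the `query` parameter name and the table name are illustrative assumptions, not documented API.

```python
from dataligo import Ligo

ligo = Ligo('./ligo_config.yaml')
postgres = ligo.connect('postgresql')  # assumes the same connect() pattern as the quickstart

# 'query' and the table name are assumptions for illustration only
df = postgres.read_as_dataframe(query='SELECT * FROM reviews', return_type='polars')
```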