diff --git a/README.md b/README.md
index 81289eea..8ec9f4d3 100644
--- a/README.md
+++ b/README.md
@@ -3,15 +3,30 @@
 ![Alt text](./logo.png?raw=true "Rosetta")
 ## Overview
-Rosetta is a declarative data modeler and transpiler that converts database objects from one database to another. Define your database in DBML and rosetta generates the target DDL for you.
+
+RosettaDB is an open-source, declarative data modeling and transpilation tool that simplifies database migrations, data quality assurance, and data exploration. With support for schema extraction, AI-driven querying, and automated code generation, RosettaDB equips data engineers and developers to manage complex data workflows across diverse platforms with ease.
 Rosetta utilizes JDBC to extract schema metadata from a database and generates declarative DBML models that can be used for conversion to alternate database targets.
-Generate DDL from a given source and transpile to the desired target.
+## Key Features
+
+- **Declarative Data Modeling**: Define your database schema using DBML (Database Markup Language), and RosettaDB automatically generates the target database-specific DDL (Data Definition Language).
+- **Transpilation**: Seamlessly convert database objects from one database platform to another. RosettaDB eliminates the manual effort of migrating between heterogeneous database systems.
+- **Data Quality and Validation**: Implement and automate data quality checks using built-in test rules to ensure data accuracy, consistency, and reliability.
+- **DBT Model Generation**: Generate dbt models from your database schema effortlessly, powering robust and scalable analytics workflows.
+- **AI-Powered Data Exploration**: Query and explore your data using natural language, leveraging AI to simplify complex SQL tasks and uncover insights.
+- **Spark Code Generation**: Automatically generate PySpark or Scala code for transferring data between source and target systems, streamlining data movement in big data pipelines.
+
+Whether you're modernizing your data architecture, migrating legacy systems, implementing data validation pipelines, or orchestrating data transfer in Spark environments, RosettaDB provides a comprehensive suite of tools tailored to your needs.
+
+### Get Involved
+
+Join our growing community of developers and data engineers on [RosettaDB Slack](https://join.slack.com/t/rosettadb/shared_invite/zt-1fq6ajsl3-h8FOI7oJX3T4eI1HjcpPbw), and visit our GitHub repository to explore supported databases, translations, and use cases.
+
-[Join RosettaDB Slack](https://join.slack.com/t/rosettadb/shared_invite/zt-1fq6ajsl3-h8FOI7oJX3T4eI1HjcpPbw)
+## Supported Databases and Translations
-Currently, supported databases and translations are shown below in the table.
+The table below lists the currently supported databases and their respective translation capabilities.
 |                          | **BigQuery** | **Snowflake** | **MySQL** | **Postgres** | **Kinetica** | **Google Cloud Spanner** | **SQL Server** | **DB2** | **Oracle** |
 |--------------------------|:--------------:|:-------------:|:------------:|:---------------:|:------------:|:--------------------------:|:---------------:|:-----------:|:--------------:|
@@ -25,62 +40,51 @@ Currently, supported databases and translations are shown below in the table.
 | **DB2**                  | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | / | ✅ |
 | **Oracle**               | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | / |
-## Getting Started
-### Prerequisites
+## Getting Started
-You need the JDBC drivers to connect to the sources/targets that you will use with the rosetta tool.
-The JDBC drivers for the rosetta supported databases can be downloaded from the following URLs:
+Follow these steps to get started with RosettaDB:
-- [BigQuery JDBC 4.2](https://storage.googleapis.com/simba-bq-release/jdbc/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip)
-- [Snowflake JDBC 3.13.19](https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.19/snowflake-jdbc-3.13.19.jar)
-- [Postgresql JDBC 42.3.7](https://jdbc.postgresql.org/download/postgresql-42.3.7.jar)
-- [MySQL JDBC 8.0.30](https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip)
-- [Kinetica JDBC 7.1.7.7](https://github.com/kineticadb/kinetica-client-jdbc/archive/refs/tags/v7.1.7.7.zip)
-- [Google Cloud Spanner JDBC 2.6.2](https://search.maven.org/remotecontent?filepath=com/google/cloud/google-cloud-spanner-jdbc/2.6.2/google-cloud-spanner-jdbc-2.6.2-single-jar-with-dependencies.jar)
-- [SQL Server JDBC 12.2.0](https://go.microsoft.com/fwlink/?linkid=2223050)
-- [DB2 JDBC jcc4](https://repo1.maven.org/maven2/com/ibm/db2/jcc/db2jcc/db2jcc4/db2jcc-db2jcc4.jar)
-- [Oracle JDBC 23.2.0.0](https://download.oracle.com/otn-pub/otn_software/jdbc/232-DeveloperRel/ojdbc11.jar)
+### 1. Download and initialize RosettaDB
-### ROSETTA_DRIVERS Environment
+**Linux/MacOS**:
-Set the ENV variable `ROSETTA_DRIVERS` to point to the location of your JDBC drivers.
+- **Linux (x64)**: Compatible with 64-bit Intel/AMD processors.
+- **MacOS (x64)**: For Intel-based Mac systems.
+- **MacOS (arm64)**: For Apple Silicon (M1/M2) Mac systems.
+Run the following command to download and set up RosettaDB:
 ```
-export ROSETTA_DRIVERS=
+curl -L "https://github.com/AdaptiveScale/rosetta/releases/download/v2.6.0/rosetta_setup.sh" -o rosetta_setup && chmod u+x rosetta_setup && ./rosetta_setup
 ```
-example:
+**Windows (x64)**:
+
+Compatible with 64-bit Intel/AMD processors running Windows.
 ```
-export ROSETTA_DRIVERS=/Users/adaptivescale/drivers/*
+curl -L "https://github.com/AdaptiveScale/rosetta/releases/download/v2.6.0/rosetta_setup.bat" -o rosetta_setup.bat && .\rosetta_setup.bat
 ```
-### rosetta binary
+### 2. Initialize a New Project
-1. Download the rosetta binary for the supported OS ([releases page](https://github.com/AdaptiveScale/rosetta/releases)).
-   ```
-   rosetta--linux-x64.zip
-   rosetta--mac_aarch64.zip
-   rosetta--mac_x64.zip
-   rosetta--win_x64.zip
-   ```
-2. Unzip the downloaded file
-3. Run rosetta commands using `./rosetta` which is located inside `bin` directory.
-4. Create new project using `rosetta init` command:
+This step is executed automatically if you completed Step 1.
+
+Create a new RosettaDB project with the following command:
 ```
 rosetta init database-migration
 ```
+This will create a `database-migration` directory containing the `main.conf` file, which is used to configure connections to data sources.
+
+The `rosetta init` command also prompts you to specify source and target databases and automatically downloads the necessary drivers.
-The `rosetta init` command will create a new rosetta project within `database-migration` directory containing the `main.conf` (for configuring the connections to data sources).
+### 3. Configure connections in `main.conf`
-5.
Configure connections in `main.conf` -example: connections for postgres and mysql +An example configuration for connecting to PostgreSQL and MySQL: ``` -# If your rosetta project is linked to a Git repo, during apply you can automatically commit/push the new version of your model.yaml -# The default value of git_auto_commit is false +# Automatically commit and push changes if linked to a Git repository (default: false). git_auto_commit: false connections: - name: mysql @@ -99,800 +103,42 @@ connections: password: sakila ``` -6. Extract the schema from postgres and translate it to mysql: +### 4. Extract and Transpile Your Schemas + +Extract the schema from PostgreSQL and transpile it to MySQL: ``` rosetta extract -s pg -t mysql ``` -The extract command will create two directories `pg` and `mysql`. `pg` directory will have the extracted schema -from Postgres DB. The `mysql` directory will contain the translated YAML which is ready to be used in MySQL DB. - -7. Migrate the translated schema to MySQL DB: +Migrate the translated schema to the target MySQL database: ``` rosetta apply -s mysql ``` -The apply command will migrate the translated Postgres schema to MySQL. - - -## Rosetta DB YAML Config - -### YAML Config File - -Rosetta by default expects the YAML config file to be named `main.conf` and looks for it by default in the current folder. The configuration file can be overridden by using the `--config, -c` command line argument (see [Command Line Arguments](#command-line-arguments) below for more details). 
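Connection definitions in `main.conf` can reference environment variables such as `${USER}` and `${PASSWORD}` instead of hard-coded credentials. As a minimal illustrative sketch of that `${VAR}` expansion pattern (RosettaDB's actual implementation may differ; the variable names here are examples):

```python
import os
import re

def expand_env_vars(text: str) -> str:
    """Replace ${VAR} placeholders with values from the environment.

    Unset variables are left untouched so the problem stays visible
    in the rendered configuration.
    """
    def repl(match):
        return os.environ.get(match.group(1), match.group(0))
    return re.sub(r"\$\{(\w+)\}", repl, text)

# Example: expand credentials in a connection snippet.
os.environ["DB_USER_EXAMPLE"] = "postgres"
os.environ["DB_PASSWORD_EXAMPLE"] = "sakila"
conf = "userName: ${DB_USER_EXAMPLE}\npassword: ${DB_PASSWORD_EXAMPLE}"
print(expand_env_vars(conf))
```

Keeping credentials in the environment rather than in `main.conf` also makes the file safe to commit when `git_auto_commit` is enabled.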
- -Here is the list of available configurations in the `main.conf` file: - -```yaml -connections: - # The name of the connection - - name: bigquery_prod - - # The name of the default database to use - databaseName: bigquery-public-data - - # The name of the default schema to use - schemaName: breathe - - # The type of the database - dbType: bigquery - - # The connection uri for the database - url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];... - - # The name of the database user - userName: user - - # The password of the database user - password: password - - # The name of tables to include which is optional - tables: - - table_one - - table_two -``` - -In the YAML config file you can also use environment variables. An example usage of environment variables in config file: - -``` -connections: - - name: snowflake_weather_prod - databaseName: SNOWFLAKE_SAMPLE_DATA - schemaName: WEATHER - dbType: snowflake - url: jdbc:snowflake://.snowflakecomputing.com/? 
- userName: ${USER} - password: ${PASSWORD} -``` - -### Example connection string configurations for databases - -### BigQuery (service-based authentication OAuth 0) -``` -url: jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=;AdditionalProjects=bigquery-public-data;OAuthType=0;OAuthServiceAcctEmail=;OAuthPvtKeyPath= -``` - -### BigQuery (pre-generated token authentication OAuth 2) -``` -jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=2;ProjectId=;OAuthAccessToken=;OAuthRefreshToken=;OAuthClientId=;OAuthClientSecret=; -``` - -### BigQuery (application default credentials authentication OAuth 3) -``` -jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=; -``` - -### Snowflake -``` -url: jdbc:snowflake://:443/?db=&user=&password= -``` - -### PostgreSQL -``` -url: jdbc:postgresql://:5432/?user=&password= -``` - -### MySQL -``` -url: jdbc:mysql://:@:3306/ -``` - -### Kinetica -``` -url: jdbc:kinetica:URL=http://:9191;CombinePrepareAndExecute=1 -``` - -### Google Cloud Spanner -``` -url: jdbc:cloudspanner:/projects/my-project/instances/my-instance/databases/my-db;credentials=/path/to/credentials.json -``` - -### Google CLoud Spanner (Emulator) -``` -url: jdbc:cloudspanner://localhost:9010/projects/test/instances/test/databases/test?autoConfigEmulator=true -``` - -### SQL Server -``` -url: jdbc:sqlserver://:1433;databaseName= -``` - -### DB2 -``` -url: jdbc:db2://:50000; -``` - -### ORACLE -``` -url: jdbc:oracle:thin::1521: -``` - -### Translation -This module will read the database structure from the source and map it to a target type. For example, source metadata was BigQuery and we want to convert it to Snowflake. 
This will be done by using a CSV file that contain mappings like in the following example: -```344;;bigquery;;string;;snowflake;;string -345;;bigquery;;timestamp;;snowflake;;timestamp -346;;bigquery;;int64;;snowflake;;integer -347;;bigquery;;float64;;snowflake;;float -348;;bigquery;;array;;snowflake;;array -349;;bigquery;;date;;snowflake;;date -350;;bigquery;;datetime;;snowflake;;datetime -351;;bigquery;;boolean;;snowflake;;boolean -352;;bigquery;;time;;snowflake;;time -353;;bigquery;;geography;;snowflake;;geography -354;;bigquery;;numeric;;snowflake;;numeric -355;;bigquery;;bignumeric;;snowflake;;number -356;;bigquery;;bytes;;snowflake;;binary -357;;bigquery;;struct;;snowflake;;object -``` - - -### Using external translator - -RosettaDB allows users to use their own translator. For the supported databases you can extend or create your version -of translation CSV file. To use an external translator you need to set the `EXTERNAL_TRANSLATION_FILE` ENV variable -to point to the external file. - -Set the ENV variable `EXTERNAL_TRANSLATION_FILE` to point to the location of your custom translator CSV file. - -``` -export EXTERNAL_TRANSLATION_FILE= -``` - -example: - -``` -export EXTERNAL_TRANSLATION_FILE=/Users/adaptivescale/translation.csv -``` - -Make sure you keep the same format as the CSV example given above. - -### Translation Attributes - -Rosetta uses an additional file to maintain translation specific attributes. -It stores translation_id, the attribute_name and attribute_value: - -``` -1;;302;;columnDisplaySize;;38 -2;;404;;columnDisplaySize;;30 -3;;434;;columnDisplaySize;;17 -``` - -The supported attribute names are: -- ordinalPosition -- autoincrement -- nullable -- primaryKey -- primaryKeySequenceId -- columnDisplaySize -- scale -- precision - -Set the ENV variable `EXTERNAL_TRANSLATION_ATTRIBUTE_FILE` to point to the location of your custom translation attribute CSV file. +Need More Help? 
-``` -export EXTERNAL_TRANSLATION_ATTRIBUTE_FILE= -``` - -example: - -``` -export EXTERNAL_TRANSLATION_ATTRIBUTE_FILE=/Users/adaptivescale/translation_attributes.csv -``` +For detailed installation instructions and advanced setup, refer to the Installation Guide [here](docs/markdowns/installation.md). -Make sure you keep the same format as the CSV example given above. - -### Indices (Index) - -Indices are supported in Google Cloud Spanner. An example on how they are represented in model.yaml - -``` -tables: -- name: "ExampleTable" - type: "TABLE" - schema: "" - indices: - - name: "PRIMARY_KEY" - schema: "" - tableName: "ExampleTable" - columnNames: - - "Id" - - "UserId" - nonUnique: false - indexQualifier: "" - type: 1 - ascOrDesc: "A" - cardinality: -1 - - name: "IDX_ExampleTable_AddressId_299189FB00FDAFA5" - schema: "" - tableName: "ExampleTable" - columnNames: - - "AddressId" - nonUnique: true - indexQualifier: "" - type: 2 - ascOrDesc: "A" - cardinality: -1 - - name: "TestIndex" - schema: "" - tableName: "ExampleTable" - columnNames: - - "ClientId" - - "DisplayName" - nonUnique: true - indexQualifier: "" - type: 2 - ascOrDesc: "A" - cardinality: -1 -``` ## Rosetta Commands -### Available commands -- init -- validate -- extract -- compile -- dbt -- diff -- test -- apply -- generate -- query - -#### init -This command will generate a project (directory) if specified, a default configuration file located in the current directory with example connections for `bigquery` and `snowflake`, and the model directory. - - rosetta init [PROJECT_NAME] - -Parameter | Description ---- | --- -(Optional) PROJECT_NAME | Project name (directory) where the configuration file and model directory will be created. - -Example: -```yaml -#example with 2 connections -connections: - - name: snowflake_weather_prod - databaseName: SNOWFLAKE_SAMPLE_DATA - schemaName: WEATHER - dbType: snowflake - url: jdbc:snowflake://.snowflakecomputing.com/? 
- userName: bob - password: bobPassword - - name: bigquery_prod - databaseName: bigquery-public-data - schemaName: breathe - dbType: bigquery - url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];... - userName: user - password: password - tables: - - bigquery_table -``` - -#### validate -This command validates the configuration and tests if rosetta can connect to the configured source. - - rosetta [-c, --config CONFIG_FILE] validate [-h, --help] [-s, --source CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection name to extract schema from. - - -#### extract -This is the command that extracts the schema from a database and generates declarative DBML models that can be used for conversion to alternate database targets. - - rosetta [-c, --config CONFIG_FILE] extract [-h, --help] [-s, --source CONNECTION_NAME] [-t, --convert-to CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection name to extract schema from. --t, --convert-to CONNECTION_NAME (Optional) | The target connection name in which source DBML converts to. 
- -Example: -```yaml ---- -safeMode: false -databaseType: bigquery -operationLevel: database -tables: -- name: "profiles" - type: "TABLE" - schema: "breathe" - columns: - - name: "id" - typeName: "INT64" - jdbcDataType: "4" - ordinalPosition: 0 - primaryKeySequenceId: 1 - columnDisplaySize: 10 - scale: 0 - precision: 10 - primaryKey: false - nullable: false - autoincrement: true - - name: "name" - typeName: "STRING" - jdbcDataType: "12" - ordinalPosition: 0 - primaryKeySequenceId: 0 - columnDisplaySize: 255 - scale: 0 - precision: 255 - primaryKey: false - nullable: false - autoincrement: false -``` - -#### compile -This command generates a DDL for a target database based on the source DBML which was generated by the previous command (`extract`). - - rosetta [-c, --config CONFIG_FILE] compile [-h, --help] [-t, --target CONNECTION_NAME] [-s, --source CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME (Optional) | The source connection name where models are generated. --t, --target CONNECTION_NAME | The target connection name in which source DBML converts to. --d, --with-drop | Add query to drop tables when generating ddl. - -Example: -```yaml -CREATE SCHEMA breathe; -CREATE TABLE breathe.profiles(id INTEGER not null AUTO_INCREMENT, name STRING not null); -``` - -#### dbt -This is the command that generates dbt models for a source DBML which was generated by the previous command (`extract`). - - rosetta [-c, --config CONFIG_FILE] dbt [-h, --help] [-s, --source CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. 
--s, --source CONNECTION_NAME | The source connection name where models are generated. - - -#### diff -Show the difference between the local model and the database. Check if any table is removed, or added or if any columns have changed. - - rosetta [-c, --config CONFIG_FILE] diff [-h, --help] [-s, --source CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. --m, --model MODEL_FILE (Optional) | The model file to use for apply. Default is `model.yaml` - - -Example: -``` -There are changes between local model and targeted source -Table Changed: Table 'actor' columns changed -Column Changed: Column 'actor_id' in table 'actor' changed 'Precision'. New value: '1', old value: '5' -Column Changed: Column 'actor_id' in table 'actor' changed 'Autoincrement'. New value: 'true', old value: 'false' -Column Changed: Column 'actor_id' in table 'actor' changed 'Primary key'. New value: 'false', old value: 'true' -Column Changed: Column 'actor_id' in table 'actor' changed 'Nullable'. New value: 'true', old value: 'false' -Table Added: Table 'address' -``` - -#### test -This command runs tests for columns using assertions. Then they are translated into query commands, executed, and compared with an expected value. Currently supported assertions are: `equals(=), not equals(!=), less than(<), more than(>), less than or equals(<=), more than or equals(>=), contains(in), is null, is not null, like, between`. 
Examples are shown below: - - rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME] - - rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection is used to specify which models and connections to use. --t, --target CONNECTION_NAME (Optional) | The target connection is used to specify the target connection to use for testing the data. The source tests needs to match the values from the tarrget connection. - -**Note:** Value for BigQuery Array columns should be comma separated value ('a,b,c,d,e'). - -Example: -```yaml ---- -safeMode: false -databaseType: "mysql" -operationLevel: database -tables: - - name: "actor" - type: "TABLE" - columns: - - name: "actor_id" - typeName: "SMALLINT UNSIGNED" - ordinalPosition: 0 - primaryKeySequenceId: 1 - columnDisplaySize: 5 - scale: 0 - precision: 5 - nullable: false - primaryKey: true - autoincrement: false - tests: - assertion: - - operator: '=' - value: 16 - expected: 1 - - name: "first_name" - typeName: "VARCHAR" - ordinalPosition: 0 - primaryKeySequenceId: 0 - columnDisplaySize: 45 - scale: 0 - precision: 45 - nullable: false - primaryKey: false - autoincrement: false - tests: - assertion: - - operator: '!=' - value: 'Michael' - expected: 1 -``` - -When running the tests against a target connection, you don't have to specify the expected value. 
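Conceptually, each assertion compiles to a filtered count query whose result is compared with the expected value. A rough, hypothetical sketch of that translation (the real query generation inside the test command may differ; table and column names are illustrative):

```python
# Operators mirror the supported assertion list above.
OPERATORS = {"=", "!=", "<", ">", "<=", ">=", "in", "is null",
             "is not null", "like", "between"}

def assertion_to_sql(table, column, operator, value=None, column_def=None):
    """Build a COUNT query; its result is compared against `expected`."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    target = column_def or column  # columnDef overrides the column expression
    if operator in ("is null", "is not null"):
        predicate = f"{target} {operator.upper()}"
    elif isinstance(value, str):
        predicate = f"{target} {operator} '{value}'"
    else:
        predicate = f"{target} {operator} {value}"
    return f"SELECT COUNT(*) FROM {table} WHERE {predicate}"

print(assertion_to_sql("actor", "actor_id", "=", 16))
# SELECT COUNT(*) FROM actor WHERE actor_id = 16
print(assertion_to_sql("actor", "wkt", ">", 434747, column_def="ST_AREA(wkt, 1)"))
# SELECT COUNT(*) FROM actor WHERE ST_AREA(wkt, 1) > 434747
```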
- -```yaml ---- -safeMode: false -databaseType: "mysql" -operationLevel: database -tables: - - name: "actor" - type: "TABLE" - columns: - - name: "actor_id" - typeName: "SMALLINT UNSIGNED" - ordinalPosition: 0 - primaryKeySequenceId: 1 - columnDisplaySize: 5 - scale: 0 - precision: 5 - nullable: false - primaryKey: true - autoincrement: false - tests: - assertion: - - operator: '=' - value: 16 - - name: "first_name" - typeName: "VARCHAR" - ordinalPosition: 0 - primaryKeySequenceId: 0 - columnDisplaySize: 45 - scale: 0 - precision: 45 - nullable: false - primaryKey: false - autoincrement: false - tests: - assertion: - - operator: '!=' - value: 'Michael' -``` - -If you need to overwrite the test column query (e.x. for Geospatial data), you can use `columnDef`. -```yaml ---- -safeMode: false -databaseType: "mysql" -operationLevel: database -tables: - - name: "actor" - type: "TABLE" - columns: - - name: "actor_id" - typeName: "SMALLINT UNSIGNED" - ordinalPosition: 0 - primaryKeySequenceId: 1 - columnDisplaySize: 5 - scale: 0 - precision: 5 - nullable: false - primaryKey: true - autoincrement: false - tests: - assertion: - - operator: '=' - value: 16 - expected: 1 - - name: "wkt" - typeName: "GEOMETRY" - ordinalPosition: 0 - primaryKeySequenceId: 0 - columnDisplaySize: 1000000000 - scale: 0 - precision: 1000000000 - columnProperties: [] - nullable: true - primaryKey: false - autoincrement: false - tests: - assertion: - - operator: '>' - value: 434747 - expected: 4 - columnDef: 'ST_AREA(wkt, 1)' -``` - -Output example: -```bash -Running tests for mysql. Found: 2 - -1 of 2, RUNNING test ('=') on column: 'actor_id' -1 of 2, FINISHED test on column: 'actor_id' (expected: '1' - actual: '1') ......................... [PASS in 0.288s] -2 of 2, RUNNING test ('!=') on column: 'first_name' -2 of 2, FINISHED test on column: 'first_name' (expected: '1' - actual: '219') ..................... 
[FAIL in 0.091s] -``` - -#### apply -Gets current model and compares with state of database, generates ddl for changes and applies to database. If you set `git_auto_commit` to `true` in `main.conf` it will automatically push the new model to your Git repo of the rosetta project. - - rosetta [-c, --config CONFIG_FILE] apply [-h, --help] [-s, --source CONNECTION_NAME] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. --m, --model MODEL_FILE (Optional) | The model file to use for apply. Default is `model.yaml` - - -Example: - -(Actual database) -```yaml ---- -safeMode: false -databaseType: "mysql" -operationLevel: database -tables: - - name: "actor" - type: "TABLE" - columns: - - name: "actor_id" - typeName: "SMALLINT UNSIGNED" - ordinalPosition: 0 - primaryKeySequenceId: 1 - columnDisplaySize: 5 - scale: 0 - precision: 5 - nullable: false - primaryKey: true - autoincrement: false - tests: - assertion: - - operator: '=' - value: 16 - expected: 1 -``` - -(Expected database) -```yaml ---- -safeMode: false -databaseType: "mysql" -operationLevel: database -tables: - - name: "actor" - type: "TABLE" - columns: - - name: "actor_id" - typeName: "SMALLINT UNSIGNED" - ordinalPosition: 0 - primaryKeySequenceId: 1 - columnDisplaySize: 5 - scale: 0 - precision: 5 - nullable: false - primaryKey: true - autoincrement: false - tests: - assertion: - - operator: '=' - value: 16 - expected: 1 - - name: "first_name" - typeName: "VARCHAR" - ordinalPosition: 0 - primaryKeySequenceId: 0 - columnDisplaySize: 45 - scale: 0 - precision: 45 - nullable: false - primaryKey: false - autoincrement: false - tests: - assertion: - - operator: '!=' - value: 'Michael' - expected: 1 -``` - -Description: Our actual database does not contain 
`first_name` so we expect it to alter the table and add the column, inside the source directory there will be the executed DDL and a snapshot of the current database. - -#### generate -This command will generate Spark Python (file) or Spark Scala (file), firstly it extracts a schema from a source database and gets connection properties from the source connection, then it creates a python (file) or scala (file) that translates schemas, which is ready to transfer data from source to target. - - rosetta [-c, --config CONFIG_FILE] generate [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME] [--pyspark] [--scala] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection name to extract schema from. --t, --target CONNECTION_NAME| The target connection name where the data will be transfered. ---pyspark | Generates the Spark SQL file. ---scala | Generates the Scala SQL file. - -#### query -The query command allows you to use natural language commands to query your databases, transforming these commands into SQL SELECT statements. By leveraging the capabilities of AI and LLMs, specifically OpenAI models, it interprets user queries and generates the corresponding SQL queries. For effective use of this command, users need to provide their OpenAI API Key and specify the OpenAI model to be utilized. The output will be written to a CSV file. The max number of rows that will be returned is 200. You can overwrite this value, or remove completely the limit. The default openai model that is used is gpt-3.5-turbo. 
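The default 200-row cap behaves like a simple truncation when the result set is written to CSV. A minimal illustration of that behavior (not the actual rosetta implementation, which executes the generated SQL over JDBC):

```python
import csv
import io

def write_rows_csv(rows, header, limit=200):
    """Write rows to CSV, truncating at `limit`; limit=None disables the cap."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for i, row in enumerate(rows):
        if limit is not None and i >= limit:
            break
        writer.writerow(row)
    return buf.getvalue()

rows = [(f"customer_{i}", i * 1000) for i in range(500)]
out = write_rows_csv(rows, ["customer_name", "total_revenue"], limit=200)
print(len(out.splitlines()))  # 201: header plus 200 data rows
```

Passing `limit=None` corresponds to the `--no-limit` flag, and any other integer to `-l/--limit`.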
- - rosetta [-c, --config CONFIG_FILE] query [-h, --help] [-s, --source CONNECTION_NAME] [-q, --query "Natural language QUERY"] [--output "Output DIRECTORY or FILE"] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. --s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. --q --query "Natural language QUERY" | pecifies the natural language query to be transformed into an SQL SELECT statement. --l --limit Response Row limit (Optional) | Limits the number of rows in the generated CSV file. If not specified, the default limit is set to 200 rows. ---no-limit (Optional) | Specifies that there should be no limit on the number of rows in the generated CSV file. - - -**Example** (Setting the key and model) : - -(Config file) -``` -openai_api_key: "sk-abcdefghijklmno1234567890" -openai_model: "gpt-4" -connections: - - name: mysql - databaseName: sakila - schemaName: - dbType: mysql - url: jdbc:mysql://root:sakila@localhost:3306/sakila - userName: root - password: sakila - - name: pg - databaseName: postgres - schemaName: public - dbType: postgres - url: jdbc:postgresql://localhost:5432/postgres?user=postgres&password=sakila - userName: postgres - password: sakila -``` - -***Example*** (Query) -``` - rosetta query -s mysql -q "Show me the top 10 customers by revenue." 
-``` -***CSV Output Example*** -```CSV -customer_name,total_revenue,location,email -John Doe,50000,New York,johndoe@example.com -Jane Smith,45000,Los Angeles,janesmith@example.com -David Johnson,40000,Chicago,davidjohnson@example.com -Emily Brown,35000,San Francisco,emilybrown@example.com -Michael Lee,30000,Miami,michaellee@example.com -Sarah Taylor,25000,Seattle,sarahtaylor@example.com -Robert Clark,20000,Boston,robertclark@example.com -Lisa Martinez,15000,Denver,lisamartinez@example.com -Christopher Anderson,10000,Austin,christopheranderson@example.com -Amanda Wilson,5000,Atlanta,amandawilson@example.com - -``` -**Note:** When giving a request that will not generate a SELECT statement the query will be generated but will not be executed rather be given to the user to execute on their own. - - -#### drivers -This command can list drivers that are listed in a `drivers.yaml` file and by choosing a driver you can download it to the `ROSETTA_DRIVERS` directory which will be automatically ready to use. - - rosetta drivers [-h, --help] [-f, --file] [--list] [-dl, --download] - -Parameter | Description ---- | --- --h, --help | Show the help message and exit. --f, --file DRIVERS_FILE | YAML drivers file path. If none is supplied it will use drivers.yaml in the current directory and then fallback to our default one. ---list | Used to list all available drivers. --dl, --download | Used to download selected driver by index. -indexToDownload | Chooses which driver to download depending on the index of the driver. - - -***Example*** (drivers.yaml) - -```yaml -- name: MySQL 8.0.30 - link: https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip -- name: Postgresql 42.3.7 - link: https://jdbc.postgresql.org/download/postgresql-42.3.7.jar -``` - -### Safety Operation -In `model.yaml` you can find the attribute `safeMode` which is by default disabled (false). If you want to prevent any DROP operation during -`apply` command, set `safeMode: true`. 
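In spirit, `safeMode: true` means the apply step refuses to emit destructive statements. A toy sketch of such a DROP filter (illustrative only; rosetta enforces this during DDL generation, and the statements below are examples):

```python
def filter_ddl(statements, safe_mode=False):
    """Remove DROP statements from generated DDL when safe_mode is enabled."""
    if not safe_mode:
        return list(statements)
    return [s for s in statements
            if not s.lstrip().upper().startswith("DROP ")]

ddl = [
    "CREATE TABLE breathe.profiles(id INTEGER NOT NULL)",
    "DROP TABLE breathe.obsolete",
    "ALTER TABLE breathe.profiles ADD COLUMN name VARCHAR(255)",
]
for stmt in filter_ddl(ddl, safe_mode=True):
    print(stmt)  # the DROP statement is withheld
```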
- -### Operation level -In `model.yaml` you can find the attribute `operationLevel` which is by default set to `schema`. If you want to apply changes on to database level in your model instead of the specific schema in -`apply` command, set `operationLevel: schema`. - -### Fallback Type -In `model.yaml` you can define the attribute `fallbackType` for columns that are of custom types, not supported for translations or not included in the translation matrix. -If a given column type cannot be translated then the fallbackType will be used for the translation. `fallbackType` is optional. - -## RosettaDB CLI JAR and RosettaDB Source - -### Setting Up the CLI JAR (Optional) - -1. Download the rosetta CLI JAR ([releases page](https://github.com/AdaptiveScale/rosetta/releases)) -2. Create an alias command - -```bash -alias rosetta='java -cp ":" com.adaptivescale.rosetta.cli.Main' -``` - -example: - -```bash -alias rosetta='java -cp "/Users/adaptivescale/cli-1.0.0.jar:/Users/adaptivescale/drivers/*" com.adaptivescale.rosetta.cli.Main' -``` - -**Note:** If we are using the **cli** JAR file, we need to specify the location of the JDBC drivers (directory). See the Getting Started section. - -### Build from the source (Optional) - -```gradle binary:runtimeZip``` - -### Google Cloud Spanner JDBC Fix - -**Note:** If you face one of the following errors with Google Cloud Spanner JDBC - -``` -java.sql.SQLException: No suitable driver - -or - -java.lang.SecurityException: Invalid signature file digest for Manifest main attributes -``` - -you can fix it by running the following command where your driver is located: -``` -zip -d google-cloud-spanner-jdbc-2.6.2-single-jar-with-dependencies.jar 'META-INF/.SF' 'META-INF/.RSA' 'META-INF/*SF' -``` +RosettaDB provides a comprehensive set of commands to cover various aspects of database modeling, validation, and migration. Each command is documented in detail for your convenience. 
+ +### Available Commands +- **[config](docs/markdowns/config.md)**: Manage RosettaDB configuration settings. +- **[init](docs/markdowns/init.md)**: Initialize a new RosettaDB project with required configuration files. +- **[validate](docs/markdowns/validate.md)**: Validate database connections. +- **[drivers](docs/markdowns/drivers.md)**: List and manage supported database drivers. +- **[extract](docs/markdowns/extract.md)**: Extract schema metadata from a source database. +- **[compile](docs/markdowns/compile.md)**: Compile DBML models into target DDL statements. +- **[apply](docs/markdowns/apply.md)**: Apply generated DDL to the target database. +- **[diff](docs/markdowns/diff.md)**: Compare and display differences between the DBML model and the database. +- **[test](docs/markdowns/test.md)**: Run data quality and validation tests against your database. +- **[dbt](docs/markdowns/dbt.md)**: Generate dbt models for analytics workflows. +- **[generate](docs/markdowns/generate.md)**: Generate Spark code for data transfers (Python or Scala). +- **[query](docs/markdowns/query.md)**: Explore and query your data using AI-driven capabilities. 
## Copyright and License Information Unless otherwise specified, all content, including all source code files and documentation files in this repository are: diff --git a/binary/src/main/resources/unix_template.txt b/binary/src/main/resources/unix_template.txt index 0a3f99a4..a88c68ca 100644 --- a/binary/src/main/resources/unix_template.txt +++ b/binary/src/main/resources/unix_template.txt @@ -66,14 +66,18 @@ esac CLASSPATH="\$APP_HOME/lib/*" +DRIVERS_PATH="\${ROSETTA_DRIVERS}" + if [ x"\${ROSETTA_DRIVERS}" = "x" ]; then DRIVERS_PATH="\${APP_HOME}/drivers/" - mkdir -p \$DRIVERS_PATH - CLASSPATH="\${CLASSPATH}:\${DRIVERS_PATH}*" -else - CLASSPATH="\${CLASSPATH}:\${ROSETTA_DRIVERS}" + mkdir -p \${DRIVERS_PATH} fi +# Add all .jar files from DRIVERS_PATH and its subdirectories to CLASSPATH +for jar in \$(find "\${DRIVERS_PATH}" -name "*.jar"); do + CLASSPATH="\${CLASSPATH}:\${jar}" +done + JAVA_HOME="\$APP_HOME" JAVACMD="\$JAVA_HOME/bin/java" diff --git a/binary/src/main/resources/windows_template.txt b/binary/src/main/resources/windows_template.txt index f557e906..71be17b8 100644 --- a/binary/src/main/resources/windows_template.txt +++ b/binary/src/main/resources/windows_template.txt @@ -49,13 +49,17 @@ set CMD_LINE_ARGS=%* :execute @rem Setup the command line -set CLASSPATH=%JAVA_HOME:"=%/lib/* +set CLASSPATH=%APP_HOME:"=%/lib/* -if x%ROSETTA_DRIVERS% == x ( - if not exist %APP_HOME%/drivers mkdir "%APP_HOME%/drivers/" - set CLASSPATH=%CLASSPATH%;%APP_HOME%/drivers/* -) else ( - set CLASSPATH=%CLASSPATH%;%ROSETTA_DRIVERS% +set DRIVERS_PATH="%ROSETTA_DRIVERS%" + +if "x%ROSETTA_DRIVERS%" == "x" ( + set DRIVERS_PATH=%APP_HOME%/drivers/ + if not exist %DRIVERS_PATH% mkdir "%DRIVERS_PATH%" +) + +for /R "%DRIVERS_PATH%" %%f in (*.jar) do ( + set CLASSPATH=%CLASSPATH%;%%f ) <% if ( System.properties['BADASS_CDS_ARCHIVE_FILE_WINDOWS'] ) { %> @@ -74,10 +78,10 @@ if "%ERRORLEVEL%"=="0" goto mainEnd :fail rem Set variable ${exitEnvironmentVar} if you need the _script_ return 
code instead of rem the _cmd.exe /c_ return code! -if not "" == "%${exitEnvironmentVar}%" exit 1 +if not "" == "%${exitEnvironmentVar}%" exit 1 exit /b 1 :mainEnd if "%OS%"=="Windows_NT" endlocal -:omega \ No newline at end of file +:omega diff --git a/build.gradle b/build.gradle index fc57e43f..be706384 100644 --- a/build.gradle +++ b/build.gradle @@ -9,7 +9,7 @@ repositories { allprojects { group = 'com.adaptivescale' - version = '2.5.5' + version = '2.6.0' sourceCompatibility = 11 targetCompatibility = 11 } diff --git a/cli/build.gradle b/cli/build.gradle index d65ccb72..6ae88a77 100644 --- a/cli/build.gradle +++ b/cli/build.gradle @@ -16,8 +16,8 @@ dependencies { implementation project(':queryhelper') implementation group: 'info.picocli', name: 'picocli', version: '4.6.3' - implementation group: 'org.slf4j', name: 'slf4j-simple', version: '2.0.5' - implementation group: 'org.apache.logging.log4j', name: 'log4j-api', version: '2.7' + implementation group: 'ch.qos.logback', name: 'logback-classic', version: '1.5.6' + implementation group: 'org.slf4j', name: 'slf4j-api', version: '2.0.9' implementation group: 'commons-io', name: 'commons-io', version: '2.11.0' implementation group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.13.3' implementation group: 'com.fasterxml.jackson.dataformat', name: 'jackson-dataformat-yaml', version: '2.13.3' //debug only diff --git a/cli/src/main/java/com/adaptivescale/rosetta/cli/Cli.java b/cli/src/main/java/com/adaptivescale/rosetta/cli/Cli.java index cab2ea3d..7ae79415 100644 --- a/cli/src/main/java/com/adaptivescale/rosetta/cli/Cli.java +++ b/cli/src/main/java/com/adaptivescale/rosetta/cli/Cli.java @@ -1,5 +1,7 @@ package com.adaptivescale.rosetta.cli; +import ch.qos.logback.classic.Level; +import ch.qos.logback.classic.Logger; import com.adaptivescale.rosetta.cli.helpers.DriverHelper; import com.adaptivescale.rosetta.cli.model.Config; import com.adaptivescale.rosetta.cli.outputs.DbtSqlModelOutput; @@ 
-21,33 +23,44 @@ import com.adaptivescale.rosetta.ddl.change.ChangeHandler; import com.adaptivescale.rosetta.ddl.change.model.Change; import com.adaptivescale.rosetta.ddl.utils.TemplateEngine; -import com.adaptivescale.rosetta.test.assertion.*; import com.adaptivescale.rosetta.test.assertion.AssertionSqlGenerator; -import com.adaptivescale.rosetta.test.assertion.generator.AssertionSqlGeneratorFactory; +import com.adaptivescale.rosetta.test.assertion.DefaultAssertTestEngine; import com.adaptivescale.rosetta.test.assertion.DefaultSqlExecution; +import com.adaptivescale.rosetta.test.assertion.generator.AssertionSqlGeneratorFactory; import com.adaptivescale.rosetta.diff.DiffFactory; import com.adaptivescale.rosetta.diff.Diff; import com.adaptivescale.rosetta.translator.Translator; import com.adaptivescale.rosetta.translator.TranslatorFactory; import com.adataptivescale.rosetta.source.core.SourceGeneratorFactory; -import com.adataptivescale.rosetta.source.core.interfaces.Generator; import com.adataptivescale.rosetta.source.dbt.DbtModelGenerator; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.dataformat.yaml.YAMLFactory; -import lombok.extern.slf4j.Slf4j; import org.apache.commons.io.FileUtils; import org.apache.commons.io.FilenameUtils; +import org.slf4j.LoggerFactory; import picocli.CommandLine; import queryhelper.pojo.GenericResponse; import queryhelper.service.AIService; -import java.io.*; +import java.io.BufferedReader; +import java.io.File; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.Paths; import java.text.SimpleDateFormat; -import java.util.*; +import java.util.AbstractMap; +import java.util.Arrays; +import java.util.Collection; +import java.util.Date; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.Set; import 
java.util.concurrent.Callable; import java.util.function.Consumer; @@ -55,15 +68,16 @@ import java.util.stream.Collectors; import java.util.stream.Stream; -import static com.adaptivescale.rosetta.cli.Constants.*; +import static com.adaptivescale.rosetta.cli.Constants.CONFIG_NAME; +import static com.adaptivescale.rosetta.cli.Constants.TEMPLATE_CONFIG_NAME; -@Slf4j @CommandLine.Command(name = "cli", mixinStandardHelpOptions = true, - version = "2.5.5", + version = "2.6.0", description = "Declarative Database Management - DDL Transpiler" ) class Cli implements Callable { + private static final Logger log = (Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME); public static final String DEFAULT_MODEL_YAML = "model.yaml"; public static final String DEFAULT_OUTPUT_DIRECTORY = "data"; @@ -78,6 +92,15 @@ class Cli implements Callable { description = "YAML config file. If none is supplied it will use main.conf in the current directory if it exists.") private Config config; + @CommandLine.Option(names = {"-v", "--verbose"}, scope = CommandLine.ScopeType.INHERIT, description = "Enable Verbose output") + public void setVerbose(boolean verbose) { + if (verbose) { + log.setLevel(Level.DEBUG); + } else { + log.setLevel(Level.INFO); + } + } + @Override public Void call() { throw new CommandLine.ParameterException(spec.commandLine(), "Missing required subcommand"); @@ -282,9 +305,15 @@ private void test( } } - @CommandLine.Command(name = "init", description = "Creates a sample config (main.conf) and model directory.", mixinStandardHelpOptions = true) - private void init(@CommandLine.Parameters(index = "0", description = "Project name.", defaultValue = "") - String projectName) throws IOException { + @CommandLine.Command( + name = "init", + description = "Creates a sample config (main.conf) and model directory.", + mixinStandardHelpOptions = true + ) + private void init( + @CommandLine.Parameters(index = "0", description = "Project name.", defaultValue = "") String projectName, 
+ @CommandLine.Option(names = "--skip-db-selection", description = "Skip database selection and driver download process.") boolean skipDBSelection + ) throws IOException { Path fileName = Paths.get(projectName, CONFIG_NAME); InputStream resourceAsStream = getClass().getResourceAsStream("/" + TEMPLATE_CONFIG_NAME); Path projectDirectory = Path.of(projectName); @@ -300,6 +329,58 @@ private void init(@CommandLine.Parameters(index = "0", description = "Project na if (!projectName.isEmpty()) { log.info("In order to start using the newly created project please change your working directory."); } + + if (skipDBSelection) { + log.info("Skipping database selection and driver download process."); + return; + } + + Path driversPath = Path.of(DEFAULT_DRIVERS_YAML); + + DriverHelper.printDrivers(driversPath); + System.out.println("Please select the source database from the list above by typing its number (or press Enter to skip):"); + String sourceDB = new BufferedReader(new InputStreamReader(System.in)).readLine().trim(); + if (sourceDB.isEmpty()) { + sourceDB = "skip"; + } + handleDriverDownload(sourceDB, driversPath, "source"); + + DriverHelper.printDrivers(driversPath); + System.out.println("Please select the target database from the list above by typing its number (or press Enter to skip):"); + String targetDB = new BufferedReader(new InputStreamReader(System.in)).readLine().trim(); + if (targetDB.isEmpty()) { + targetDB = "skip"; + } + handleDriverDownload(targetDB, driversPath, "target"); + } + + private void handleDriverDownload(String dbChoice, Path driversPath, String dbType) { + if ("skip".equalsIgnoreCase(dbChoice)) { + log.info("Skipped downloading the {} DB driver.", dbType); + return; + } + + Integer driverId = null; + + try { + driverId = Integer.parseInt(dbChoice); + } catch (NumberFormatException e) { + System.out.println("Invalid choice. 
Please select a valid option."); + return; + } + + List drivers = DriverHelper.getDrivers(driversPath); + if (drivers.isEmpty()) { + System.out.println("No drivers found in the specified YAML file."); + return; + } + + if (driverId < 1 || driverId > drivers.size()) { + System.out.println("Invalid choice. Please select a valid option."); + return; + } + + DriverHelper.getDriver(driversPath, driverId); } @CommandLine.Command(name = "dbt", description = "Extract dbt models chosen from connection config.", mixinStandardHelpOptions = true) @@ -430,9 +511,10 @@ private void diff(@CommandLine.Option(names = {"-s", "--source"}) String sourceN } @CommandLine.Command(name = "drivers", description = "Show available drivers for download", mixinStandardHelpOptions = true) - private void drivers(@CommandLine.Option(names = {"--list"}, description = "Used to list all available drivers") boolean isList, - @CommandLine.Option(names = {"-dl", "--download"}, description = "Used to download selected driver by index") boolean isDownload, - @CommandLine.Option(names = {"-f", "--file"}, defaultValue = DEFAULT_DRIVERS_YAML) String file, + private void drivers(@CommandLine.Option(names = {"--list"}, description = "Used to list all available drivers.") boolean isList, + @CommandLine.Option(names = {"--show"}, description = "Used to show downloaded drivers.") boolean isShow, + @CommandLine.Option(names = {"-dl", "--download"}, description = "Used to download selected driver by index.") boolean isDownload, + @CommandLine.Option(names = {"-f", "--file"}, description = "Used to change the drivers yaml file.", defaultValue = DEFAULT_DRIVERS_YAML) String file, @CommandLine.Parameters(index = "0", arity = "0..1") Integer driverId) { Path driversPath = Path.of(file); @@ -440,6 +522,15 @@ private void drivers(@CommandLine.Option(names = {"--list"}, description = "Used DriverHelper.printDrivers(driversPath); System.out.println("To download a driver use: rosetta drivers {index} --download"); + 
System.out.println("To set a custom drivers path use ROSETTA_DRIVERS environment variable."); + return; + } + + if (isShow) { + DriverHelper.printDownloadedDrivers(); + + System.out.println("To download a driver use: rosetta drivers {index} --download"); + System.out.println("To set a custom drivers path use ROSETTA_DRIVERS environment variable."); return; } diff --git a/cli/src/main/java/com/adaptivescale/rosetta/cli/helpers/DriverHelper.java b/cli/src/main/java/com/adaptivescale/rosetta/cli/helpers/DriverHelper.java index caea9451..2dc58cf7 100644 --- a/cli/src/main/java/com/adaptivescale/rosetta/cli/helpers/DriverHelper.java +++ b/cli/src/main/java/com/adaptivescale/rosetta/cli/helpers/DriverHelper.java @@ -6,7 +6,9 @@ import com.fasterxml.jackson.dataformat.yaml.YAMLFactory; import java.io.IOException; +import java.net.URISyntaxException; import java.net.URL; +import java.nio.file.DirectoryStream; import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.Paths; @@ -34,7 +36,7 @@ public static void printDrivers(Path path) { return; } - System.out.println("Downloadable Drivers:"); + System.out.println("Downloadable drivers:"); System.out.println("====================="); IntStream.range(0, drivers.size()).forEach(index -> { @@ -93,20 +95,7 @@ public static DriverInfo getDriver(Path path, Integer driverId) { */ private static void downloadDriver(DriverInfo driverInfo) { try { - // Attempt to get the ROSETTA_DRIVERS environment variable - String rosettaDriversPath = System.getenv("ROSETTA_DRIVERS"); - if (rosettaDriversPath == null) { - // Fall back to 'drivers' folder one level up if ROSETTA_DRIVERS is not set - rosettaDriversPath = Paths.get("..", "drivers").toString(); - } - - // Construct the destination directory path - Path rosettaPath = Paths.get(rosettaDriversPath); - - // Ensure the directory exists and is writable, or fall back - if (!Files.exists(rosettaPath) || !Files.isWritable(rosettaPath)) { - throw new IllegalArgumentException("No 
writable directory available for drivers"); - } + Path rosettaPath = resolveDriverDirectory(); // Open a connection to the URL of the driver URL url = new URL(driverInfo.getLink()); @@ -166,4 +155,93 @@ private static void unzipFile(Path zipFilePath, Path destDir) throws IOException } } } + + /** + * Prints all the downloaded .jar drivers in the specified directory, including those in subdirectories. + */ + public static void printDownloadedDrivers() { + try { + Path rosettaPath = resolveDriverDirectory(); + + // Print the list of downloaded .jar drivers + System.out.printf("Downloaded .jar drivers (%s):%n", rosettaPath.toRealPath()); + System.out.println("====================="); + boolean hasDrivers = listJarFiles(rosettaPath); + + if (!hasDrivers) { + System.out.println("No downloaded .jar drivers found."); + } + System.out.println("====================="); + + } catch (IOException e) { + System.out.println("Error reading the drivers directory: " + e.getMessage()); + } + } + + /** + * Recursively lists all .jar files in the specified directory and its subdirectories. + * + * @param directory The directory to search for .jar files. + * @return True if any .jar files are found, otherwise false. + * @throws IOException If an I/O error occurs. + */ + private static boolean listJarFiles(Path directory) throws IOException { + boolean hasDrivers = false; + + try (DirectoryStream stream = Files.newDirectoryStream(directory)) { + for (Path entry : stream) { + if (Files.isDirectory(entry)) { + // Recursively search in subdirectories + hasDrivers |= listJarFiles(entry); + } else if (entry.getFileName().toString().endsWith(".jar")) { + // Print the full path of the .jar file + System.out.println("File: " + entry.toRealPath()); + hasDrivers = true; + } + } + } + + return hasDrivers; + } + + /** + * Resolves the directory where drivers are stored. 
+ * + * @return Path to the drivers directory + * @throws RuntimeException if no valid directory is found + */ + private static Path resolveDriverDirectory() { + // Attempt to get the ROSETTA_DRIVERS environment variable + String rosettaDriversPath = System.getenv("ROSETTA_DRIVERS"); + Path rosettaPath = null; + + if (rosettaDriversPath != null) { + // Remove any trailing '/*' or '/' from the path + rosettaDriversPath = rosettaDriversPath.replaceAll("/\\*$", "").replaceAll("/$", ""); + // If ROSETTA_DRIVERS is set, use that path + rosettaPath = Paths.get(rosettaDriversPath); + } else { + try { + // Get the path to the executing JAR file + Path jarPath = Paths.get(DriverHelper.class.getProtectionDomain().getCodeSource().getLocation().toURI()); + Path jarDirectory = jarPath.getParent(); // Directory where the JAR file is located + + // First check the directory where the JAR file is located + rosettaPath = jarDirectory.getParent().resolve("drivers"); + if (!Files.exists(rosettaPath)) { + // Fail if neither path exists + throw new RuntimeException("No drivers directory found in any expected location, please set ROSETTA_DRIVERS to a directory."); + } + } catch (URISyntaxException e) { + throw new RuntimeException("Failed to locate the directory of the executing JAR file.", e); + } + } + + // Check if the final resolved path exists and is writable + if (!Files.exists(rosettaPath) || !Files.isWritable(rosettaPath)) { + throw new RuntimeException(String.format("ROSETTA_DRIVERS (%s) directory path not found, re-check your configuration!", rosettaPath.toAbsolutePath())); + } + + return rosettaPath; + } } diff --git a/cli/src/main/resources/logback.xml b/cli/src/main/resources/logback.xml new file mode 100644 index 00000000..3454e2aa --- /dev/null +++ b/cli/src/main/resources/logback.xml @@ -0,0 +1,12 @@ + + + + %d{yyyy-MM-dd HH:mm:ss} %-5level %logger{36} - %msg%n + + + + + + + + \ No newline at end of file diff --git a/docs/markdowns/apply.md b/docs/markdowns/apply.md new 
file mode 100644 index 00000000..925b0d48 --- /dev/null +++ b/docs/markdowns/apply.md @@ -0,0 +1,91 @@ +## Apply generated DDL to the target database + +### Command: apply +The apply command compares the current database state with the model defined in your Rosetta project. It generates the necessary DDL to align the database with the model and applies the changes to the database. If the git_auto_commit setting in main.conf is set to true, Rosetta will also automatically commit and push the updated model to the associated Git repository. + + rosetta [-c, --config CONFIG_FILE] apply [-h, --help] [-s, --source CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. +-m, --model MODEL_FILE (Optional) | The model file to use for apply. 
Default is `model.yaml` + + +Example: + +Actual Database (Current State) +```yaml +--- +safeMode: false +databaseType: "mysql" +operationLevel: database +tables: + - name: "actor" + type: "TABLE" + columns: + - name: "actor_id" + typeName: "SMALLINT UNSIGNED" + ordinalPosition: 0 + primaryKeySequenceId: 1 + columnDisplaySize: 5 + scale: 0 + precision: 5 + nullable: false + primaryKey: true + autoincrement: false + tests: + assertion: + - operator: '=' + value: 16 + expected: 1 +``` + +Expected Database (Target State) +```yaml +--- +safeMode: false +databaseType: "mysql" +operationLevel: database +tables: + - name: "actor" + type: "TABLE" + columns: + - name: "actor_id" + typeName: "SMALLINT UNSIGNED" + ordinalPosition: 0 + primaryKeySequenceId: 1 + columnDisplaySize: 5 + scale: 0 + precision: 5 + nullable: false + primaryKey: true + autoincrement: false + tests: + assertion: + - operator: '=' + value: 16 + expected: 1 + - name: "first_name" + typeName: "VARCHAR" + ordinalPosition: 0 + primaryKeySequenceId: 0 + columnDisplaySize: 45 + scale: 0 + precision: 45 + nullable: false + primaryKey: false + autoincrement: false + tests: + assertion: + - operator: '!=' + value: 'Michael' + expected: 1 +``` + +The apply command detects that the first_name column is missing in the actual database. It generates a DDL statement to alter the actor table and add the first_name column. + +**Outputs**: +- A snapshot of the updated database schema is saved in the source directory. +- The executed DDL is logged for reference. 
\ No newline at end of file diff --git a/docs/markdowns/compile.md new file mode 100644 index 00000000..870567d1 --- /dev/null +++ b/docs/markdowns/compile.md @@ -0,0 +1,34 @@ +## Compile DBML models into target DDL statements + +### Command: compile + +The `compile` command generates DDL (Data Definition Language) statements for a target database based on the DBML (Database Markup Language) extracted from a source database by the previous (`extract`) command. It builds schemas and tables in the target database using the extracted database schema. + + rosetta [-c, --config CONFIG_FILE] compile [-h, --help] [-t, --target CONNECTION_NAME] [-s, --source CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME (Optional) | The source connection name where models are generated. +-t, --target CONNECTION_NAME | The target connection name into which the source DBML is converted. +-d, --with-drop | Add queries to drop tables when generating DDL. + +Example output: +```sql +CREATE SCHEMA breathe; +CREATE TABLE breathe.profiles(id INTEGER not null AUTO_INCREMENT, name STRING not null); +``` + +##### Example Command: +Assuming `main.conf` is present in your working directory and configured for both source and target connections, a basic usage example is as follows: + + rosetta compile -s source_db_connection -t target_db_connection + +**This command:** + 1. Connects to `source_db_connection` to retrieve DBML data. + 2. Converts the DBML into DDL specific to `target_db_connection`. + +##### Additional Notes +- The `--with-drop` option should be used with caution, as it will delete existing tables in the target database. +- Ensure that the target connection name is correctly set in `main.conf` or passed directly as a parameter. 
\ No newline at end of file diff --git a/docs/markdowns/config.md b/docs/markdowns/config.md new file mode 100644 index 00000000..ce46477a --- /dev/null +++ b/docs/markdowns/config.md @@ -0,0 +1,90 @@ +## Manage RosettaDB configuration settings + +### YAML Config File + +Rosetta by default expects the YAML config file to be named `main.conf` and looks for it by default in the current folder. The configuration file can be overridden by using the `--config, -c` command line argument (see [Command Line Arguments](#command-line-arguments) below for more details). + +Here is the list of available configurations in the `main.conf` file: + +```yaml +connections: + # The name of the connection + - name: bigquery_prod + + # The name of the default database to use + databaseName: bigquery-public-data + + # The name of the default schema to use + schemaName: breathe + + # The type of the database + dbType: bigquery + + # The connection uri for the database + url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];... + + # The name of the database user + userName: user + + # The password of the database user + password: password + + # The name of tables to include which is optional + tables: + - table_one + - table_two +``` + +In the YAML config file you can also use environment variables. An example usage of environment variables in config file: + +``` +connections: + - name: snowflake_weather_prod + databaseName: SNOWFLAKE_SAMPLE_DATA + schemaName: WEATHER + dbType: snowflake + url: jdbc:snowflake://.snowflakecomputing.com/? + userName: ${USER} + password: ${PASSWORD} +``` + + + +### Using External Translator and Custom Attributes +RosettaDB supports custom translators and translation attributes, allowing users to define or extend database-specific configurations via external CSV files. 
+- External Translator: Users can specify a custom CSV file for translations by setting the EXTERNAL_TRANSLATION_FILE environment variable. This file allows adjustments in how database schemas are interpreted. +- Translation Attributes: Additional attributes like ordinalPosition, autoincrement, nullable, and primaryKey can be defined in a separate attributes CSV file. Set the EXTERNAL_TRANSLATION_ATTRIBUTE_FILE environment variable to the file’s location to apply these attributes. +- Indices: Rosetta supports index definitions in databases like Google Cloud Spanner, configured directly in model.yaml files to manage primary and secondary keys effectively. + +For detailed setup instructions and examples, refer [here](translation.md). + + +### Safety Operation +In `model.yaml` you can find the attribute `safeMode`, which is disabled (false) by default. If you want to prevent any DROP operation during the +`apply` command, set `safeMode: true`. + +### Operation level +In `model.yaml` you can find the attribute `operationLevel`, which is set to `schema` by default. If you want the +`apply` command to apply changes at the database level instead of a specific schema, set `operationLevel: database`. + +### Fallback Type +In `model.yaml` you can define the attribute `fallbackType` for columns with custom types that are not supported for translation or not included in the translation matrix. +If a given column type cannot be translated, the `fallbackType` will be used instead. `fallbackType` is optional.
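+
+Putting these attributes together, the top of a `model.yaml` might look like the following sketch (the `fallbackType` value is illustrative):
+
+```yaml
+---
+safeMode: true            # prevent any DROP operation during apply
+operationLevel: database  # apply changes at the database level instead of a single schema
+fallbackType: "VARCHAR"   # used when a column type cannot be translated
+databaseType: "mysql"
+tables:
+  - name: "actor"
+    type: "TABLE"
+```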
+ + +### Google Cloud Spanner JDBC Fix + +**Note:** If you face one of the following errors with Google Cloud Spanner JDBC + +``` +java.sql.SQLException: No suitable driver + +or + +java.lang.SecurityException: Invalid signature file digest for Manifest main attributes +``` + +you can fix it by running the following command where your driver is located: +``` +zip -d google-cloud-spanner-jdbc-2.6.2-single-jar-with-dependencies.jar 'META-INF/*.SF' 'META-INF/*.RSA' 'META-INF/*SF' +``` \ No newline at end of file diff --git a/docs/markdowns/dbt.md new file mode 100644 index 00000000..14586872 --- /dev/null +++ b/docs/markdowns/dbt.md @@ -0,0 +1,26 @@ +## Generate dbt models for analytics workflows + +### Command: dbt +The `dbt` command generates dbt models based on the DBML (Database Markup Language) extracted from a source database. This DBML should have been generated by the previous (`extract`) command, providing a foundation for creating structured data transformations within `dbt`. + + rosetta [-c, --config CONFIG_FILE] dbt [-h, --help] [-s, --source CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection name where models are generated. + +##### Example Command: +Here’s a basic example command that uses the dbt function: + + rosetta dbt -s source_db_connection + + +**This command will:** +1. Use source_db_connection to locate the DBML generated from the extract command. +2. Generate corresponding dbt models that reflect the structure of the source database. + +##### Additional Notes +- **Integration with dbt**: The generated `dbt` models allow for scalable and reusable SQL transformations, helping align your data structure with your analytics or ETL workflows. 
+- **Configuration**: Ensure that the configuration file (main.conf or specified config) contains accurate connection details for the source database, as it serves as the base for generating `dbt` models. \ No newline at end of file diff --git a/docs/markdowns/diff.md new file mode 100644 index 00000000..bc8e6e3b --- /dev/null +++ b/docs/markdowns/diff.md @@ -0,0 +1,39 @@ +## Compare and display differences between the DBML model and the database + +### Command: diff +The diff command shows the differences between the current local model and the state of the database. This can help identify any tables that have been added or removed, or columns that have been modified in the database schema. It’s a valuable tool for tracking schema changes and maintaining consistency between development and production environments. + + rosetta [-c, --config CONFIG_FILE] diff [-h, --help] [-s, --source CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. +-m, --model MODEL_FILE (Optional) | The model file to use for diff. Default is `model.yaml`. + + +##### Example Output: +When there are differences between the local model and the targeted database schema, diff provides a detailed report, highlighting table and column changes. Below is a sample output from the `diff` command: +``` +There are changes between local model and targeted source +Table Changed: Table 'actor' columns changed +Column Changed: Column 'actor_id' in table 'actor' changed 'Precision'. New value: '1', old value: '5' +Column Changed: Column 'actor_id' in table 'actor' changed 'Autoincrement'. New value: 'true', old value: 'false' +Column Changed: Column 'actor_id' in table 'actor' changed 'Primary key'. 
New value: 'false', old value: 'true' +Column Changed: Column 'actor_id' in table 'actor' changed 'Nullable'. New value: 'true', old value: 'false' +Table Added: Table 'address' +``` +##### Example Command: +To use the `diff` command with the default configuration file and model file, you might run: + + rosetta diff -s source_db_connection + +**In this example:** +1. The command compares the `source_db_connection` schema with the specified local model. +2. Differences are displayed, such as table and column changes. + +##### Additional Notes +- **Usage of `--model`**: When using a specific model file other than `model.yaml`, specify it with the `--model` parameter. +- **Table and Column Change Detection**: The output categorizes schema differences into table changes, column modifications, and new or removed tables. +- **Precision in Changes**: Each change specifies old and new values, helping identify unintended modifications or updates needed in the target database. \ No newline at end of file diff --git a/docs/markdowns/download_drivers.md new file mode 100644 index 00000000..05f7bd5f --- /dev/null +++ b/docs/markdowns/download_drivers.md @@ -0,0 +1,75 @@ +## Downloading Drivers +You need the JDBC drivers to connect to the sources/targets that you will use with the rosetta tool. 
+The JDBC drivers for the rosetta supported databases can be downloaded from the following URLs: + +- [BigQuery JDBC 4.2](https://storage.googleapis.com/simba-bq-release/jdbc/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip) +- [Snowflake JDBC 3.13.19](https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.19/snowflake-jdbc-3.13.19.jar) +- [Postgresql JDBC 42.3.7](https://jdbc.postgresql.org/download/postgresql-42.3.7.jar) +- [MySQL JDBC 8.0.30](https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip) +- [Kinetica JDBC 7.1.7.7](https://github.com/kineticadb/kinetica-client-jdbc/archive/refs/tags/v7.1.7.7.zip) +- [Google Cloud Spanner JDBC 2.6.2](https://search.maven.org/remotecontent?filepath=com/google/cloud/google-cloud-spanner-jdbc/2.6.2/google-cloud-spanner-jdbc-2.6.2-single-jar-with-dependencies.jar) +- [SQL Server JDBC 12.2.0](https://go.microsoft.com/fwlink/?linkid=2223050) +- [DB2 JDBC jcc4](https://repo1.maven.org/maven2/com/ibm/db2/jcc/db2jcc/db2jcc4/db2jcc-db2jcc4.jar) +- [Oracle JDBC 23.2.0.0](https://download.oracle.com/otn-pub/otn_software/jdbc/232-DeveloperRel/ojdbc11.jar) + +### Example connection string configurations for databases (replace the `<...>` placeholders with your own values) + +### BigQuery (service account authentication, OAuthType 0) +``` +url: jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<PROJECT_ID>;AdditionalProjects=bigquery-public-data;OAuthType=0;OAuthServiceAcctEmail=<SERVICE_ACCOUNT_EMAIL>;OAuthPvtKeyPath=<PATH_TO_SERVICE_ACCOUNT_KEY> +``` + +### BigQuery (pre-generated token authentication, OAuthType 2) +``` +url: jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=2;ProjectId=<PROJECT_ID>;OAuthAccessToken=<ACCESS_TOKEN>;OAuthRefreshToken=<REFRESH_TOKEN>;OAuthClientId=<CLIENT_ID>;OAuthClientSecret=<CLIENT_SECRET>; +``` + +### BigQuery (application default credentials authentication, OAuthType 3) +``` +url: jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=<PROJECT_ID>; +``` + +### Snowflake +``` +url: jdbc:snowflake://<HOST>:443/?db=<DATABASE>&user=<USER>&password=<PASSWORD> +``` + +### PostgreSQL +``` +url: jdbc:postgresql://<HOST>:5432/<DATABASE>?user=<USER>&password=<PASSWORD> +``` + +### MySQL +```
+url: jdbc:mysql://<USER>:<PASSWORD>@<HOST>:3306/<DATABASE> +``` + +### Kinetica +``` +url: jdbc:kinetica:URL=http://<HOST>:9191;CombinePrepareAndExecute=1 +``` + +### Google Cloud Spanner +``` +url: jdbc:cloudspanner:/projects/my-project/instances/my-instance/databases/my-db;credentials=/path/to/credentials.json +``` + +### Google Cloud Spanner (Emulator) +``` +url: jdbc:cloudspanner://localhost:9010/projects/test/instances/test/databases/test?autoConfigEmulator=true +``` + +### SQL Server +``` +url: jdbc:sqlserver://<HOST>:1433;databaseName=<DATABASE> +``` + +### DB2 +``` +url: jdbc:db2://<HOST>:50000; +``` + +### Oracle +``` +url: jdbc:oracle:thin:@<HOST>:1521:<SID> +``` \ No newline at end of file diff --git a/docs/markdowns/drivers.md b/docs/markdowns/drivers.md new file mode 100644 index 00000000..82eff5f3 --- /dev/null +++ b/docs/markdowns/drivers.md @@ -0,0 +1,24 @@ +## List and manage supported database JDBC drivers + +### Command: drivers +This command lists the drivers declared in a `drivers.yaml` file; by choosing a driver you can download it to the `ROSETTA_DRIVERS` directory, where it is automatically ready to use. + + rosetta drivers [-h, --help] [-f, --file] [--list] [-dl, --download] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-f, --file DRIVERS_FILE | YAML drivers file path. If none is supplied it will use drivers.yaml in the current directory and then fall back to our default one. +--list | Used to list all available drivers. +-dl, --download | Used to download the selected driver by index. +indexToDownload | Chooses which driver to download depending on the index of the driver.
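To make the list/download flow concrete, here is a minimal Python sketch of how `--list` and download-by-index could behave. This is an illustration, not RosettaDB's implementation; it assumes `indexToDownload` is 1-based and reuses the two drivers from the `drivers.yaml` example in this document.

```python
# Hypothetical sketch of the drivers --list / --download flow; not RosettaDB's code.
# The entries mirror the drivers.yaml example in this document.
drivers = [
    {"name": "MySQL 8.0.30",
     "link": "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip"},
    {"name": "Postgresql 42.3.7",
     "link": "https://jdbc.postgresql.org/download/postgresql-42.3.7.jar"},
]

def list_drivers(entries):
    # One numbered line per driver, matching an assumed 1-based indexToDownload.
    return [f"{i}: {e['name']}" for i, e in enumerate(entries, start=1)]

def pick_driver(entries, index_to_download):
    # Resolve a 1-based index to its download link; reject out-of-range indexes.
    if not 1 <= index_to_download <= len(entries):
        raise IndexError(f"no driver at index {index_to_download}")
    return entries[index_to_download - 1]["link"]
```

The real command would then fetch the chosen link into the `ROSETTA_DRIVERS` directory.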
+ + +***Example*** (drivers.yaml) + +```yaml +- name: MySQL 8.0.30 + link: https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip +- name: Postgresql 42.3.7 + link: https://jdbc.postgresql.org/download/postgresql-42.3.7.jar +``` \ No newline at end of file diff --git a/docs/markdowns/extract.md b/docs/markdowns/extract.md new file mode 100644 index 00000000..4f01c02a --- /dev/null +++ b/docs/markdowns/extract.md @@ -0,0 +1,48 @@ +## Extract schema metadata from a source database + +### Command: extract +This command extracts the schema from a database and generates declarative DBML models that can be used for conversion to alternate database targets. + + rosetta [-c, --config CONFIG_FILE] extract [-h, --help] [-s, --source CONNECTION_NAME] [-t, --convert-to CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection name to extract schema from. +-t, --convert-to CONNECTION_NAME (Optional) | The target connection name to which the source DBML is converted.
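Conceptually, extract walks JDBC metadata and shapes each column into the model structure shown in the example that follows. Below is a hedged Python sketch of that shaping step; the function name and argument list are hypothetical, but the output keys match the example output in this document.

```python
def column_entry(name, type_name, jdbc_data_type, precision, scale,
                 nullable, autoincrement, pk_sequence_id=0, primary_key=False):
    # Shape one column's JDBC metadata into the model-file structure used in
    # the extract example output. Illustrative only; not RosettaDB internals.
    return {
        "name": name,
        "typeName": type_name,
        "jdbcDataType": str(jdbc_data_type),   # stored as a string in the model
        "ordinalPosition": 0,
        "primaryKeySequenceId": pk_sequence_id,
        "columnDisplaySize": precision,
        "scale": scale,
        "precision": precision,
        "primaryKey": primary_key,
        "nullable": nullable,
        "autoincrement": autoincrement,
    }
```

A full extract would emit one such entry per column, grouped under its table, then serialize the result to YAML.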
+ +Example: +```yaml +--- +safeMode: false +databaseType: bigquery +operationLevel: database +tables: +- name: "profiles" + type: "TABLE" + schema: "breathe" + columns: + - name: "id" + typeName: "INT64" + jdbcDataType: "4" + ordinalPosition: 0 + primaryKeySequenceId: 1 + columnDisplaySize: 10 + scale: 0 + precision: 10 + primaryKey: false + nullable: false + autoincrement: true + - name: "name" + typeName: "STRING" + jdbcDataType: "12" + ordinalPosition: 0 + primaryKeySequenceId: 0 + columnDisplaySize: 255 + scale: 0 + precision: 255 + primaryKey: false + nullable: false + autoincrement: false +``` \ No newline at end of file diff --git a/docs/markdowns/generate.md b/docs/markdowns/generate.md new file mode 100644 index 00000000..d39f8ccc --- /dev/null +++ b/docs/markdowns/generate.md @@ -0,0 +1,31 @@ +## Generate Spark code for data transfers (Python or Scala) + +### Command: generate +This command generates a Spark Python file or a Spark Scala file. It first extracts the schema from the source database and reads the connection properties from the source connection, then creates a Python or Scala script that translates the schema and is ready to transfer data from source to target. + + rosetta [-c, --config CONFIG_FILE] generate [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME] [--pyspark] [--scala] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection name to extract schema from. +-t, --target CONNECTION_NAME | The target connection name where the data will be transferred. +--pyspark | Generates the Spark Python (PySpark) file. +--scala | Generates the Spark Scala file.
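To give a feel for what a generated transfer script roughly looks like, here is a hedged Python sketch that renders a minimal PySpark JDBC-transfer template for one table. The template and the placeholder names (`source_jdbc_url`, `target_jdbc_url`, `table`) are illustrative; the code RosettaDB actually emits differs.

```python
# Illustrative template of a per-table PySpark transfer script; not the
# actual output of `rosetta generate --pyspark`.
PYSPARK_TEMPLATE = '''\
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rosetta_transfer").getOrCreate()

# Read the table from the source over JDBC.
df = (spark.read.format("jdbc")
      .option("url", "{source_jdbc_url}")
      .option("dbtable", "{table}")
      .load())

# Write it to the target; mode("overwrite") replaces existing data.
(df.write.format("jdbc")
   .option("url", "{target_jdbc_url}")
   .option("dbtable", "{table}")
   .mode("overwrite")
   .save())
'''

def render_transfer_script(source_jdbc_url, target_jdbc_url, table):
    # Fill the template for a single table.
    return PYSPARK_TEMPLATE.format(source_jdbc_url=source_jdbc_url,
                                   target_jdbc_url=target_jdbc_url,
                                   table=table)
```

Running such a script requires the source and target JDBC drivers on the Spark driver classpath, as noted under Additional Notes.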
+ +##### Example Command: +Here’s a basic example command that uses the `generate` function: + + rosetta generate -s source_db_connection -t target_db_connection --pyspark + +**This command will:** + +1. Connect to the specified source and target databases using the connection details provided. +2. Extract the schema from the source. +3. Generate a PySpark or Scala script, depending on the selected flag (`--pyspark` or `--scala`), which is ready to transfer data from source to target. + +##### Additional Notes +- **JDBC Drivers**: Ensure you have the correct JDBC drivers for both the source and target databases. These drivers should be specified in the `spark.driver.extraClassPath`. +- **Database Configuration**: Modify the `source_jdbc_url`, `target_jdbc_url`, and other connection parameters as per your environment setup. +- **Mode Options**: The `mode("overwrite")` option in `.save()` will overwrite any existing data in the target table. Change it as needed (e.g., `append`, `ignore`, `error`). diff --git a/docs/markdowns/init.md b/docs/markdowns/init.md new file mode 100644 index 00000000..2a840add --- /dev/null +++ b/docs/markdowns/init.md @@ -0,0 +1,32 @@ +## Initialize a new RosettaDB project with required configuration files + +### Command: init +This command generates a project directory (if a project name is given), a default configuration file with example connections for `bigquery` and `snowflake`, and the model directory. + + rosetta init [PROJECT_NAME] + +Parameter | Description +--- | --- +(Optional) PROJECT_NAME | Project name (directory) where the configuration file and model directory will be created. + +Example: +```yaml +#example with 2 connections +connections: + - name: snowflake_weather_prod + databaseName: SNOWFLAKE_SAMPLE_DATA + schemaName: WEATHER + dbType: snowflake + url: jdbc:snowflake://<ACCOUNT>.snowflakecomputing.com/?
+ userName: bob + password: bobPassword + - name: bigquery_prod + databaseName: bigquery-public-data + schemaName: breathe + dbType: bigquery + url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType=[AuthValue];[Property1]=[Value1];[Property2]=[Value2];... + userName: user + password: password + tables: + - bigquery_table +``` \ No newline at end of file diff --git a/docs/markdowns/installation.md b/docs/markdowns/installation.md new file mode 100644 index 00000000..7fa330cf --- /dev/null +++ b/docs/markdowns/installation.md @@ -0,0 +1,98 @@ +## Installation Instructions + +### ROSETTA_DRIVERS Environment + +Set the ENV variable `ROSETTA_DRIVERS` to point to the location of your JDBC drivers. + +``` +export ROSETTA_DRIVERS=<path_to_jdbc_drivers> +``` + +example: + +``` +export ROSETTA_DRIVERS=/Users/adaptivescale/drivers/* +``` + +### rosetta binary + +1. Download the rosetta binary for the supported OS ([releases page](https://github.com/AdaptiveScale/rosetta/releases)). + ``` + rosetta-<version>-linux-x64.zip + rosetta-<version>-mac_aarch64.zip + rosetta-<version>-mac_x64.zip + rosetta-<version>-win_x64.zip + ``` +2. Unzip the downloaded file. +3. Run rosetta commands using `./rosetta`, which is located inside the `bin` directory. +4. Create a new project using the `rosetta init` command: + +``` + rosetta init database-migration +``` + +The `rosetta init` command will create a new rosetta project within the `database-migration` directory containing the `main.conf` (for configuring the connections to data sources). + +5.
Configure connections in `main.conf` + example: connections for postgres and mysql + +``` +# If your rosetta project is linked to a Git repo, during apply you can automatically commit/push the new version of your model.yaml +# The default value of git_auto_commit is false +git_auto_commit: false +connections: + - name: mysql + databaseName: sakila + schemaName: + dbType: mysql + url: jdbc:mysql://root:sakila@localhost:3306/sakila + userName: root + password: sakila + - name: pg + databaseName: postgres + schemaName: public + dbType: postgres + url: jdbc:postgresql://localhost:5432/postgres?user=postgres&password=sakila + userName: postgres + password: sakila +``` + +6. Extract the schema from postgres and translate it to mysql: + +``` + rosetta extract -s pg -t mysql +``` + +The extract command will create two directories, `pg` and `mysql`. The `pg` directory will contain the schema extracted +from the Postgres DB. The `mysql` directory will contain the translated YAML, which is ready to be used in the MySQL DB. + +7. Migrate the translated schema to MySQL DB: + +``` + rosetta apply -s mysql +``` + +The apply command will migrate the translated Postgres schema to MySQL. + +## RosettaDB CLI JAR and RosettaDB Source + +### Setting Up the CLI JAR (Optional) + +1. Download the rosetta CLI JAR ([releases page](https://github.com/AdaptiveScale/rosetta/releases)) +2. Create an alias command + +```bash +alias rosetta='java -cp "<path_to_cli_jar>:<path_to_jdbc_drivers>" com.adaptivescale.rosetta.cli.Main' +``` + +example: + +```bash +alias rosetta='java -cp "/Users/adaptivescale/cli-1.0.0.jar:/Users/adaptivescale/drivers/*" com.adaptivescale.rosetta.cli.Main' +``` + +**Note:** If you are using the **cli** JAR file, you need to specify the location of the JDBC drivers (directory). See the Getting Started section.
+ +### Build from the source (Optional) + +``` +gradle binary:runtimeZip +``` \ No newline at end of file diff --git a/docs/markdowns/query.md b/docs/markdowns/query.md new file mode 100644 index 00000000..3393f27d --- /dev/null +++ b/docs/markdowns/query.md @@ -0,0 +1,61 @@ +## Explore and query your data using AI-driven capabilities + +### Command: query +The query command allows you to use natural language commands to query your databases, transforming these commands into SQL SELECT statements. By leveraging AI and LLMs, specifically OpenAI models, it interprets user queries and generates the corresponding SQL queries. To use this command, provide your OpenAI API key and specify the OpenAI model to be used. The output is written to a CSV file. By default, at most 200 rows are returned; you can override this limit or remove it entirely. The default OpenAI model is gpt-3.5-turbo. + + rosetta [-c, --config CONFIG_FILE] query [-h, --help] [-s, --source CONNECTION_NAME] [-q, --query "Natural language QUERY"] [--output "Output DIRECTORY or FILE"] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. +-q, --query "Natural language QUERY" | Specifies the natural language query to be transformed into an SQL SELECT statement. +-l, --limit Response Row limit (Optional) | Limits the number of rows in the generated CSV file. If not specified, the default limit is set to 200 rows. +--no-limit (Optional) | Specifies that there should be no limit on the number of rows in the generated CSV file.
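The row-limit behaviour described above can be sketched in a few lines of Python. This is illustrative only; `write_csv` is a hypothetical helper, not part of RosettaDB.

```python
import csv
import io
from itertools import islice

def write_csv(rows, header, limit=200):
    # Write query results to CSV text, keeping at most `limit` rows.
    # limit=None models the --no-limit flag; the default of 200 mirrors
    # the documented default row cap. Sketch only, not RosettaDB code.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    limited = rows if limit is None else islice(rows, limit)
    for row in limited:
        writer.writerow(row)
    return buf.getvalue()
```

A `-l`/`--limit` value would simply replace the default of 200 here.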
+ + +**Example** (Setting the key and model): + +(Config file) +``` +openai_api_key: "sk-abcdefghijklmno1234567890" +openai_model: "gpt-4" +connections: + - name: mysql + databaseName: sakila + schemaName: + dbType: mysql + url: jdbc:mysql://root:sakila@localhost:3306/sakila + userName: root + password: sakila + - name: pg + databaseName: postgres + schemaName: public + dbType: postgres + url: jdbc:postgresql://localhost:5432/postgres?user=postgres&password=sakila + userName: postgres + password: sakila +``` + +***Example*** (Query) +``` + rosetta query -s mysql -q "Show me the top 10 customers by revenue." +``` +***CSV Output Example*** +```CSV +customer_name,total_revenue,location,email +John Doe,50000,New York,johndoe@example.com +Jane Smith,45000,Los Angeles,janesmith@example.com +David Johnson,40000,Chicago,davidjohnson@example.com +Emily Brown,35000,San Francisco,emilybrown@example.com +Michael Lee,30000,Miami,michaellee@example.com +Sarah Taylor,25000,Seattle,sarahtaylor@example.com +Robert Clark,20000,Boston,robertclark@example.com +Lisa Martinez,15000,Denver,lisamartinez@example.com +Christopher Anderson,10000,Austin,christopheranderson@example.com +Amanda Wilson,5000,Atlanta,amandawilson@example.com + +``` +**Note:** When a request would not generate a SELECT statement, the query is still generated but not executed; it is returned to the user to run on their own. + diff --git a/docs/markdowns/test.md b/docs/markdowns/test.md new file mode 100644 index 00000000..7dbe44da --- /dev/null +++ b/docs/markdowns/test.md @@ -0,0 +1,154 @@ +## Run data quality and validation tests against your database + +### Command: test +This command runs column tests defined as assertions. Each assertion is translated into a query, executed, and its result compared with the expected value.
Currently supported assertions are: `equals(=), not equals(!=), less than(<), more than(>), less than or equals(<=), more than or equals(>=), contains(in), is null, is not null, like, between`. Examples are shown below: + + rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME] + + rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection is used to specify which models and connections to use. +-t, --target CONNECTION_NAME (Optional) | The target connection is used to specify the target connection to use for testing the data. The source tests need to match the values from the target connection. + +**Note:** Value for BigQuery Array columns should be a comma-separated value ('a,b,c,d,e'). + +Example: +```yaml +--- +safeMode: false +databaseType: "mysql" +operationLevel: database +tables: + - name: "actor" + type: "TABLE" + columns: + - name: "actor_id" + typeName: "SMALLINT UNSIGNED" + ordinalPosition: 0 + primaryKeySequenceId: 1 + columnDisplaySize: 5 + scale: 0 + precision: 5 + nullable: false + primaryKey: true + autoincrement: false + tests: + assertion: + - operator: '=' + value: 16 + expected: 1 + - name: "first_name" + typeName: "VARCHAR" + ordinalPosition: 0 + primaryKeySequenceId: 0 + columnDisplaySize: 45 + scale: 0 + precision: 45 + nullable: false + primaryKey: false + autoincrement: false + tests: + assertion: + - operator: '!=' + value: 'Michael' + expected: 1 +``` + +When running the tests against a target connection, you don't have to specify the expected value.
+ +```yaml +--- +safeMode: false +databaseType: "mysql" +operationLevel: database +tables: + - name: "actor" + type: "TABLE" + columns: + - name: "actor_id" + typeName: "SMALLINT UNSIGNED" + ordinalPosition: 0 + primaryKeySequenceId: 1 + columnDisplaySize: 5 + scale: 0 + precision: 5 + nullable: false + primaryKey: true + autoincrement: false + tests: + assertion: + - operator: '=' + value: 16 + - name: "first_name" + typeName: "VARCHAR" + ordinalPosition: 0 + primaryKeySequenceId: 0 + columnDisplaySize: 45 + scale: 0 + precision: 45 + nullable: false + primaryKey: false + autoincrement: false + tests: + assertion: + - operator: '!=' + value: 'Michael' +``` + +If you need to override the test column query (e.g., for geospatial data), you can use `columnDef`. +```yaml +--- +safeMode: false +databaseType: "mysql" +operationLevel: database +tables: + - name: "actor" + type: "TABLE" + columns: + - name: "actor_id" + typeName: "SMALLINT UNSIGNED" + ordinalPosition: 0 + primaryKeySequenceId: 1 + columnDisplaySize: 5 + scale: 0 + precision: 5 + nullable: false + primaryKey: true + autoincrement: false + tests: + assertion: + - operator: '=' + value: 16 + expected: 1 + - name: "wkt" + typeName: "GEOMETRY" + ordinalPosition: 0 + primaryKeySequenceId: 0 + columnDisplaySize: 1000000000 + scale: 0 + precision: 1000000000 + columnProperties: [] + nullable: true + primaryKey: false + autoincrement: false + tests: + assertion: + - operator: '>' + value: 434747 + expected: 4 + columnDef: 'ST_AREA(wkt, 1)' +``` + +Output example: +```bash +Running tests for mysql. Found: 2 + +1 of 2, RUNNING test ('=') on column: 'actor_id' +1 of 2, FINISHED test on column: 'actor_id' (expected: '1' - actual: '1') ......................... [PASS in 0.288s] +2 of 2, RUNNING test ('!=') on column: 'first_name' +2 of 2, FINISHED test on column: 'first_name' (expected: '1' - actual: '219') .....................
[FAIL in 0.091s] +``` \ No newline at end of file diff --git a/docs/markdowns/translation.md b/docs/markdowns/translation.md new file mode 100644 index 00000000..814f5807 --- /dev/null +++ b/docs/markdowns/translation.md @@ -0,0 +1,98 @@ +## Using External Translator + +RosettaDB allows users to use their own translator. For the supported databases you can extend or create your version +of translation CSV file. To use an external translator you need to set the `EXTERNAL_TRANSLATION_FILE` ENV variable +to point to the external file. + +Set the ENV variable `EXTERNAL_TRANSLATION_FILE` to point to the location of your custom translator CSV file. + +``` +export EXTERNAL_TRANSLATION_FILE= +``` + +example: + +``` +export EXTERNAL_TRANSLATION_FILE=/Users/adaptivescale/translation.csv +``` + +Make sure you keep the same format as the CSV example given above. + +### Translation Attributes + +Rosetta uses an additional file to maintain translation specific attributes. +It stores translation_id, the attribute_name and attribute_value: + +``` +1;;302;;columnDisplaySize;;38 +2;;404;;columnDisplaySize;;30 +3;;434;;columnDisplaySize;;17 +``` + +The supported attribute names are: +- ordinalPosition +- autoincrement +- nullable +- primaryKey +- primaryKeySequenceId +- columnDisplaySize +- scale +- precision + +Set the ENV variable `EXTERNAL_TRANSLATION_ATTRIBUTE_FILE` to point to the location of your custom translation attribute CSV file. + +``` +export EXTERNAL_TRANSLATION_ATTRIBUTE_FILE= +``` + +example: + +``` +export EXTERNAL_TRANSLATION_ATTRIBUTE_FILE=/Users/adaptivescale/translation_attributes.csv +``` + +Make sure you keep the same format as the CSV example given above. + +### Indices (Index) + +Indices are supported in Google Cloud Spanner. 
An example of how they are represented in `model.yaml`: + +``` +tables: +- name: "ExampleTable" + type: "TABLE" + schema: "" + indices: + - name: "PRIMARY_KEY" + schema: "" + tableName: "ExampleTable" + columnNames: + - "Id" + - "UserId" + nonUnique: false + indexQualifier: "" + type: 1 + ascOrDesc: "A" + cardinality: -1 + - name: "IDX_ExampleTable_AddressId_299189FB00FDAFA5" + schema: "" + tableName: "ExampleTable" + columnNames: + - "AddressId" + nonUnique: true + indexQualifier: "" + type: 2 + ascOrDesc: "A" + cardinality: -1 + - name: "TestIndex" + schema: "" + tableName: "ExampleTable" + columnNames: + - "ClientId" + - "DisplayName" + nonUnique: true + indexQualifier: "" + type: 2 + ascOrDesc: "A" + cardinality: -1 +``` diff --git a/docs/markdowns/validate.md b/docs/markdowns/validate.md new file mode 100644 index 00000000..b16dbf33 --- /dev/null +++ b/docs/markdowns/validate.md @@ -0,0 +1,12 @@ +## Validate database connections + +### Command: validate +This command validates the configuration and tests whether rosetta can connect to the configured source. + + rosetta [-c, --config CONFIG_FILE] validate [-h, --help] [-s, --source CONNECTION_NAME] + +Parameter | Description +--- | --- +-h, --help | Show the help message and exit. +-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. +-s, --source CONNECTION_NAME | The source connection to validate. \ No newline at end of file
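As a rough illustration of what validating a connection configuration involves, the sketch below checks that a connection entry carries the fields the `main.conf` examples in this documentation use. The required-key set is an assumption, not RosettaDB's actual rule, and real validation additionally attempts a JDBC connection.

```python
# Hypothetical config check; the key set is inferred from the main.conf
# examples in this documentation, not from RosettaDB's validation code.
REQUIRED_KEYS = {"name", "dbType", "url", "userName", "password"}

def missing_keys(connection):
    # Return the required keys absent from one connection entry, sorted
    # so the report is deterministic.
    return sorted(REQUIRED_KEYS - connection.keys())
```

A validator would report the missing keys per connection and only then try to open the JDBC connection described by `url`.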