Existing Duck db file as a data source #2910

DingoEatingFuzz · 2023-08-12T23:05:26Z

Is your feature request related to a problem? Please describe.

I have existing duckdb files lying around (and also a program that takes a zip and emits a duck) that I would like to analyze with Rill.

Describe the solution you'd like

My ideal (albeit maybe naive) workflow would be:

Add data source
Local file
Select a duckdb file
Internally Rill Attaches the db
Each table is now a data source

Describe alternatives you've considered

As far as I can tell the only alternative is to use CSV/JSON/Parquet files, but this is not a realistic option. The major reason is that the value have going straight from zip to duck is codifying the data modeling of files in the zip--especially complex MAP(VARCHAR, STRUCT(...)) types for deeply-nested JSON.

Additional context

Rill looks amazing! Maybe I'm just holding it wrong with this feature request, but I really want the dashboarding features without necessarily using the data modeling features.

The text was updated successfully, but these errors were encountered:

begelundmuller · 2023-08-14T09:35:59Z

Hey, thanks for the request. The use case definitely makes sense to me.

Firstly, here's a way you might be able to get this working today – use rill start --db my_duckdb.db to connect Rill to your DuckDB database file (instead of the default stage.db file). Your existing tables will not show up in Rill, but you can reference them in model or dashboard definitions (make sure not to create a source or model that overlaps with any of the existing table names since they will then overwrite them).

Secondly, I'm thinking about how we can better support your use case. If we added support for the following, do you think that would address your use case?

Support an init.sql file where you can use ATTACH to attach existing databases
Show all pre-existing tables in the database in the sidebar

Third, do you have a hosting use case for your dashboards? If yes, can you share where you store/host the DuckDB .db files you want to connect to?

DingoEatingFuzz · 2023-08-14T17:05:27Z

Oooo, I'm going to try your suggesting of using --db today!

Support an init.sql file where you can use ATTACH to attach existing databases

I can see this being useful. It's still fuzzy in my mind how I'd like the overall workflow to go, but this meets the requirement of Person A writes init.sql and Person B clones a repo or whatever and launches Rill without needing to get into the weeds.

Show all pre-existing tables in the database in the sidebar

I would love this. Rill as a database client and also a dashboarding tool <3

Third, do you have a hosting use case for your dashboards? If yes, can you share where you store/host the DuckDB .db files you want to connect to?

Not yet! I'll probably try a few things while prototyping. Right now I think something low-tech would work best. Like a private S3 bucket. I think this would be a sweet spot between local analysis and full-bore data pipelines, and it would be hard for someone to screw up--worst case is they upload an invalid file; that still wouldn't corrupt existing data.

Thank you for your quick response!

DingoEatingFuzz · 2023-08-14T20:40:34Z

Coming back to say that the --db flag worked like a charm ✨

kmatt · 2023-08-22T22:13:55Z

@begelundmuller Would the --db flag allow use of a DuckDB file containing views that can be referenced by models?

This allows definition of data sources reading local data files, but with transforms in the database that Rill sees as tables.

(make sure not to create a source or model that overlaps with any of the existing table names since they will then overwrite them)

I'd suggest a change where Rill throws an error rather than overwriting DuckDB objects for this approach.

begelundmuller · 2023-08-28T09:28:12Z

@begelundmuller Would the --db flag allow use of a DuckDB file containing views that can be referenced buy models?

This allows definition of data sources reading local data files, but with transforms in the database that Rill sees as tables.

Hey @kmatt, yes it would allow that, but if it's a view that queries files directly, you might run into performance issues since Rill issues a lot of queries and therefore benefits from data being ingested into DuckDB's native format (we use CREATE TABLE ... AS (SELECT ...) for sources instead of CREATE VIEW ...).

However, in the last release (0.32), we actually added support for free-form DuckDB SQL sources – I think they should address your use case. The syntax is:

# sources/my_source.yaml
type: duckdb
sql: "SELECT * FROM read_parquet('./data/my_data.parquet')" # can be any valid DuckDB SELECT statement

(make sure not to create a source or model that overlaps with any of the existing table names since they will then overwrite them)

I'd suggest a change where Rill throws an error rather than overwriting DuckDB objects for this approach.

Unfortunately we can't easily do this. Since code file definitions and the underlying data can change in-between Rill sessions, we don't always know if a table in DuckDB was created by a previous version of Rill or by the user. We keep some metadata about ingested data in an internal table (rill.catalog) that could help with that, but there are some error conditions where it will go out of sync (DuckDB transactions could help here, but we have found them to be too unstable, so we avoid them as much as possible for the time being).

What if we instead added support for creating tables in a custom database schema? For example, we could add a flag rill start --duckdb-schema custom, where Rill would create views and tables as custom.my_source instead of main.my_source (main is the default schema in DuckDB). Would that be useful?

k-anshul · 2023-12-17T09:45:25Z

Hey

We just merged the changes that allows DuckDB data file to be used as a source. This will be available in the next release version but can now be used with the nightly as well (rill upgrade --nightly can be used to install the latest nightly version).
Here's a sample source YAML that uses a duckDB file as a data source.

type: duckdb
db: /path/to/src.db
sql: select * from table

DingoEatingFuzz added the Type:Feature New feature request label Aug 12, 2023

k-anshul linked a pull request Dec 15, 2023 that will close this issue

Runtime: unified duckdb connector for motherduck and external db files #3700

Merged

k-anshul closed this as completed in #3700 Dec 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Existing Duck db file as a data source #2910

Existing Duck db file as a data source #2910

DingoEatingFuzz commented Aug 12, 2023

begelundmuller commented Aug 14, 2023 •

edited

Loading

DingoEatingFuzz commented Aug 14, 2023

DingoEatingFuzz commented Aug 14, 2023

kmatt commented Aug 22, 2023 •

edited

Loading

begelundmuller commented Aug 28, 2023 •

edited

Loading

k-anshul commented Dec 17, 2023 •

edited

Loading

Existing Duck db file as a data source #2910

Existing Duck db file as a data source #2910

Comments

DingoEatingFuzz commented Aug 12, 2023

begelundmuller commented Aug 14, 2023 • edited Loading

DingoEatingFuzz commented Aug 14, 2023

DingoEatingFuzz commented Aug 14, 2023

kmatt commented Aug 22, 2023 • edited Loading

begelundmuller commented Aug 28, 2023 • edited Loading

k-anshul commented Dec 17, 2023 • edited Loading

begelundmuller commented Aug 14, 2023 •

edited

Loading

kmatt commented Aug 22, 2023 •

edited

Loading

begelundmuller commented Aug 28, 2023 •

edited

Loading

k-anshul commented Dec 17, 2023 •

edited

Loading