[Feature] Add multiple LLM API providers (#13)
This PR expands the LLM features to multiple selectable providers.

- Groq running Llama 3 70B
- Anthropic Claude 3 Opus
- OpenAI GPT-4 Turbo

It should be relatively simple to add more providers in the future.
Officially removing the 'experimental' tag from the LLM features.

GPT-4 Turbo is the clear best pick, so I made it the default setting.
gwenwindflower authored Apr 21, 2024
2 parents 04fcf0f + dbc0cdd commit bb2c128
Showing 19 changed files with 773 additions and 318 deletions.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,11 @@
# Go workspace file
go.work

# build directory
dist/

# Project specific
build
test_build
tbd

dist/
19 changes: 13 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ It's designed to be super fast and easy to use with a friendly TUI that fast for
### It's the **_easy button_** for dbt projects.

#### Quickstart

```bash
brew tap gwenwindflower/homebrew-tbd
brew install tbd
Expand All @@ -37,6 +38,7 @@ If you're new to dbt, [check out the wiki](https://github.com/gwenwindflower/tbd
- [x] DuckDB

If you don't have a cloud warehouse but want to spin up a dbt project with `tbd`, I recommend either:

- **BigQuery** — they have a generous free tier, authenticating with `gcloud` CLI is super easy, and `tbd` requires very few manual configurations. They also have a ton of great public datasets you can model.
- **DuckDB** — you can work completely locally and skip the cloud altogether. You will need to find some data, but DuckDB can _very_ easily ingest CSVs, JSON, or Parquet, so if you have some raw data you want to work with, this is a great option as well.

Expand All @@ -59,6 +61,7 @@ go install github.com/gwenwindflower/tbd@latest
That's it! It's a single binary with no dependency on `dbt` itself; for maximum speed, it operates directly against your warehouse, so you don't even need to have `dbt` installed to use it. That said, it _can_ leverage the profiles in your `~/.dbt/profiles.yml` file if you have them set up, so you can use the same connection information to save yourself some typing.

## 🔐 Warehouse-specific setup

At present, for security, `tbd` only supports SSO methods of authentication. Please check out the guides below for your target warehouse before using `tbd` to ensure a smooth experience.

### ❄️ Snowflake
Expand Down Expand Up @@ -103,13 +106,15 @@ your_build_dir/

### 🦙 LLM features

`tbd` has some neat alpha features that are still in development. One of these is the ability to generate documentation and tests for your sources via LLM. It uses [Groq](https://groq.com) running `llama3-70b-8192` to do its inference. It's not perfect, but it's pretty good! It requires setting an environment variable with your Groq API key beforehand that you'll then pass the name of.
`tbd` has some neat alpha features that infer documentation and tests for your columns. There are multiple supported LLMs via API: Groq running Llama 3 70B, Anthropic Claude 3 Opus, and OpenAI GPT-4 Turbo. They have very different rate limits (these are limitations in the API that `tbd` respects):

The biggest thing to flag is that while Groq is in free beta, they have a very low rate limit on their API: 30 requests per minute. The actual inference on Groq is _super_ fast, but for now I've had to rate limit the API calls so it will take a few minutes or quite awhile depending on your schema size. Once Groq is out of beta, I'll remove the rate limit, but you'll of course have to pay for the API calls via your Groq account.
- **Groq** — 30 requests per minute
- **Claude 3 Opus** — 5 requests per minute
- **GPT-4 Turbo** — 500 requests per minute

I will _definitely_ be adding other LLM providers in the future, probably Anthropic Claude 3 Opus as the next one so you can choose between maximum quality (Claude) or maximum speed (Groq, when I can remove the rate limit).
As you can see, if you have anything but a very smol schema, you should stick with OpenAI. When Groq ups their rate limit after they're out of beta, that will be the fastest option, but for now, OpenAI is the best bet. The good news is that GPT-4 Turbo is _really_ good at this task (honestly better than Claude Opus) and pretty dang fast! The results are great in my testing.

I'm going to experiment very soon with using structured output conformed to dbt's JSON schema and passing entire tables, rather than iterating through columns, and see how it does with that. If it works that will be significantly faster as it can churn out entire files quickly and the rate limit will be less of a factor.
I'm going to experiment very soon with using structured output conformed to dbt's JSON schema and passing entire tables, rather than iterating through columns, and see how it does with that. If it works that will be significantly faster as it can churn out entire files (and perhaps improve quality through having more context) and the rate limits will be less of a factor.

### 🌊 Example workflows

Expand All @@ -118,18 +123,20 @@ I'm going to experiment very soon with using structured output conformed to dbt'
## 😅 To Do

- [ ] Get to 100% test coverage
- [ ] Add Claude 3 Opus option
- [x] Add Claude 3 Opus option
- [x] Add OpenAI GPT-4 Turbo option
- [x] Add support for Snowflake
- [x] Add support for BigQuery
- [ ] Add support for Redshift
- [ ] Add support for Databricks
- [x] Add support for Postgres
- [x] Add support for DuckDB
- [x] Add support for MotherDuck
- [ ] Build on Linux
- [ ] Build on Windows

## 🤗 Contributing

I welcome Discussions, Issues, and PRs! This is pre-release software, and without folks using it and opening Issues or Discussions I won't be able to find the rough edges and smooth them out. So please, if you get stuck, open an Issue and let's figure out how to fix it!

If you're a dbt user and aren't familiar with Go, but interested in learning a bit of it, I'm also happy to help guide you through opening a PR, just let me know 💗.
92 changes: 49 additions & 43 deletions forms.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,15 @@ import (
"errors"
"fmt"
"strconv"
"strings"

"github.com/charmbracelet/huh"
"github.com/fatih/color"
)

type FormResponse struct {
Path string
Username string
Password string
Host string
BuildDir string
SslMode string
Database string
Expand All @@ -21,21 +22,23 @@ type FormResponse struct {
ProjectName string
Warehouse string
Account string
GroqKeyEnvVar string
Password string
DbtProfileName string
LlmKeyEnvVar string
DbtProfileOutput string
DbtProfileName string
Path string
Port string
Host string
Username string
Prefix string
Llm string
GenerateDescriptions bool
ScaffoldProject bool
CreateProfile bool
UseDbtProfile bool
Confirm bool
}

var not_empty = func(s string) error {
func notEmpty(s string) error {
s = strings.TrimSpace(s)
if len(s) == 0 {
return fmt.Errorf("cannot be empty, please enter a value")
}
Expand All @@ -55,17 +58,16 @@ func getProfileOptions(ps DbtProfiles) []huh.Option[string] {

func Forms(ps DbtProfiles) (FormResponse, error) {
dfr := FormResponse{
BuildDir: "build",
GroqKeyEnvVar: "GROQ_API_KEY",
Prefix: "stg",
Host: "localhost",
Port: "5432",
BuildDir: "build",
LlmKeyEnvVar: "OPENAI_API_KEY",
Prefix: "stg",
Host: "localhost",
Port: "5432",
}
pinkUnderline := color.New(color.FgMagenta).Add(color.Bold, color.Underline).SprintFunc()
greenBold := color.New(color.FgGreen).Add(color.Bold).SprintFunc()
yellowItalic := color.New(color.FgHiYellow).Add(color.Italic).SprintFunc()
greenBoldItalic := color.New(color.FgHiGreen).Add(color.Bold).SprintFunc()
redBold := color.New(color.FgHiRed).Add(color.Bold).SprintFunc()
err := huh.NewForm(
huh.NewGroup(
huh.NewNote().
Expand Down Expand Up @@ -94,14 +96,14 @@ https://github.com/gwenwindflower/tbd
Title("What *prefix* for your staging files?").
Value(&dfr.Prefix).
Placeholder("stg").
Validate(not_empty),
Validate(notEmpty),
),

huh.NewGroup(huh.NewInput().
Title("What is the *name* of your dbt project?").
Value(&dfr.ProjectName).
Placeholder("rivendell").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return !dfr.ScaffoldProject
}),
Expand All @@ -123,17 +125,17 @@ https://github.com/gwenwindflower/tbd
Title("Which *output* in that profile do you want to use?").
Value(&dfr.DbtProfileOutput).
Placeholder("dev").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What *schema* do you want to generate?").
Value(&dfr.Schema).
Placeholder("raw").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What *database* is that schema in?").
Value(&dfr.Database).
Placeholder("jaffle_shop").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return !dfr.UseDbtProfile
}),
Expand All @@ -157,22 +159,22 @@ https://github.com/gwenwindflower/tbd
Title("What is your username?").
Value(&dfr.Username).
Placeholder("[email protected]").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is your Snowflake account id?").
Value(&dfr.Account).
Placeholder("elfstone-consulting.us-west-1").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *schema* you want to generate?").
Value(&dfr.Schema).
Placeholder("minas-tirith").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What *database* is that schema in?").
Value(&dfr.Database).
Placeholder("gondor").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "snowflake"
}),
Expand All @@ -182,12 +184,12 @@ https://github.com/gwenwindflower/tbd
Title("What GCP *project id* do you want to generate?").
Value(&dfr.Project).
Placeholder("legolas_inc").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *dataset* you want to generate?").
Value(&dfr.Dataset).
Placeholder("mirkwood").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "bigquery"
}),
Expand All @@ -198,17 +200,17 @@ https://github.com/gwenwindflower/tbd
Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
Value(&dfr.Path).
Placeholder("/path/to/duckdb.db").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *database* you want to generate?").
Value(&dfr.Database).
Placeholder("gimli_corp").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *schema* you want to generate?").
Value(&dfr.Schema).
Placeholder("moria").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "duckdb"
}),
Expand All @@ -217,7 +219,7 @@ Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
huh.NewInput().
Title("What is your Postgres *host*?").
Value(&dfr.Host).
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is your Postgres *port*?").
Value(&dfr.Port).
Expand All @@ -232,22 +234,22 @@ Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
Title("What is your Postgres *username*?").
Value(&dfr.Username).
Placeholder("galadriel").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is your Postgres *password*?").
Value(&dfr.Password).
Validate(not_empty).
Validate(notEmpty).
EchoMode(huh.EchoModePassword),
huh.NewInput().
Title("What is the *database* you want to generate?").
Value(&dfr.Database).
Placeholder("lothlorien").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *schema* you want to generate?").
Value(&dfr.Schema).
Placeholder("mallorn_trees").
Validate(not_empty),
Validate(notEmpty),
huh.NewSelect[string]().
Title("What ssl mode do you want to use?").
Value(&dfr.SslMode).
Expand All @@ -258,32 +260,36 @@ Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
huh.NewOption("Verify-full", "verify-full"),
huh.NewOption("Prefer", "prefer"),
huh.NewOption("Allow", "allow")).
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "postgres"
}),

huh.NewGroup(
huh.NewNote().
Title(fmt.Sprintf("🤖 %s LLM generation 🦙✨", redBold("Experimental"))).
Description(fmt.Sprintf(`%s features via Groq.
Currently generates:
Title(fmt.Sprintf("🤖 %s LLM generation 🦙✨", yellowItalic("Optional"))).
Description(fmt.Sprintf(`Infers:
✴︎ column %s
✴︎ relevant %s
_Requires a_ %s _stored in an env var_:
Get one at https://groq.com.`, yellowItalic("Optional"), pinkUnderline("descriptions"), pinkUnderline("tests"), greenBoldItalic("Groq API key"))),
_Requires an_ %s _stored in an env var_.`, pinkUnderline("descriptions"), pinkUnderline("tests"), greenBoldItalic("API key"))),
huh.NewConfirm().
Title("Do you want to infer descriptions and tests?").
Value(&dfr.GenerateDescriptions),
),

huh.NewGroup(
huh.NewSelect[string]().
Options(
huh.NewOption("OpenAI", "openai"),
huh.NewOption("Groq", "groq"),
huh.NewOption("Anthropic", "anthropic"),
).Value(&dfr.Llm),
huh.NewInput().
Title("What env var holds your Groq key?").
Placeholder("GROQ_API_KEY").
Value(&dfr.GroqKeyEnvVar).
Validate(not_empty),
Title("What env var holds your LLM API key?").
Placeholder("OPENAI_API_KEY").
Value(&dfr.LlmKeyEnvVar).
Validate(notEmpty),
).WithHideFunc(func() bool {
return !dfr.GenerateDescriptions
}),
Expand All @@ -293,7 +299,7 @@ Get one at https://groq.com.`, yellowItalic("Optional"), pinkUnderline("descript
Title("What directory do you want to build into?\n Must be new or empty.").
Value(&dfr.BuildDir).
Placeholder("build").
Validate(not_empty),
Validate(notEmpty),
huh.NewConfirm().
Title("🚦Are you ready to do this thing?🚦").
Value(&dfr.Confirm),
Expand Down