[Feature] Add multiple LLM API providers (#13)
This PR expands the LLM features to multiple selectable providers.

- Groq running Llama 3 70B
- Anthropic Claude 3 Opus
- OpenAI GPT-4 Turbo

It should be relatively simple to add more providers in the future.
Officially removing the 'experimental' tag from the LLM features.

GPT-4 Turbo is the clear best pick, so I made it the default setting.
gwenwindflower authored Apr 21, 2024
2 parents 04fcf0f + dbc0cdd commit bb2c128
Showing 19 changed files with 773 additions and 318 deletions.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,11 @@
# Go workspace file
go.work

# build directory
dist/

# Project specific
build
test_build
tbd

dist/
19 changes: 13 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ It's designed to be super fast and easy to use with a friendly TUI that fast for
### It's the **_easy button_** for dbt projects.

#### Quickstart

```bash
brew tap gwenwindflower/homebrew-tbd
brew install tbd
Expand All @@ -37,6 +38,7 @@ If you're new to dbt, [check out the wiki](https://github.com/gwenwindflower/tbd
- [x] DuckDB

If you don't have a cloud warehouse but want to spin up a dbt project with `tbd`, I recommend either:

- **BigQuery** — they have a generous free tier, authenticating with `gcloud` CLI is super easy, and `tbd` requires very few manual configurations. They also have a ton of great public datasets you can model.
- **DuckDB** — you can work completely locally and skip the cloud altogether. You will need to find some data, but DuckDB can _very_ easily ingest CSVs, JSON, or Parquet, so if you have some raw data you want to work with, this is a great option as well.

Expand All @@ -59,6 +61,7 @@ go install github.com/gwenwindflower/tbd@latest
That's it! It's a single binary with no dependency on `dbt` itself; for maximum speed, it operates directly against your warehouse, so you don't even need to have `dbt` installed to use it. That said, it _can_ leverage the profiles in your `~/.dbt/profiles.yml` file if you have them set up, so you can use the same connection information to save yourself some typing.

## 🔐 Warehouse-specific setup

At present, for security, `tbd` only supports SSO methods of authentication. Please check out the guides below for your target warehouse before using `tbd` to ensure a smooth experience.

### ❄️ Snowflake
Expand Down Expand Up @@ -103,13 +106,15 @@ your_build_dir/

### 🦙 LLM features

`tbd` has some neat alpha features that are still in development. One of these is the ability to generate documentation and tests for your sources via LLM. It uses [Groq](https://groq.com) running `llama3-70b-8192` to do its inference. It's not perfect, but it's pretty good! It requires setting an environment variable with your Groq API key beforehand that you'll then pass the name of.
`tbd` has some neat alpha features that infer documentation and tests for your columns. There are multiple supported LLMs via API: Groq running Llama 3 70B, Anthropic Claude 3 Opus, and OpenAI GPT-4 Turbo. They have very different rate limits (these are limitations in the API that `tbd` respects):

The biggest thing to flag is that while Groq is in free beta, they have a very low rate limit on their API: 30 requests per minute. The actual inference on Groq is _super_ fast, but for now I've had to rate limit the API calls so it will take a few minutes or quite awhile depending on your schema size. Once Groq is out of beta, I'll remove the rate limit, but you'll of course have to pay for the API calls via your Groq account.
- **Groq** — 30 requests per minute
- **Claude 3 Opus** — 5 requests per minute
- **GPT-4 Turbo** — 500 requests per minute

I will _definitely_ be adding other LLM providers in the future, probably Anthropic Claude 3 Opus as the next one so you can choose between maximum quality (Claude) or maximum speed (Groq, when I can remove the rate limit).
As you can see, if you have anything but a very smol schema, you should stick with OpenAI. When Groq ups their rate limit after they're out of beta, that will be the fastest option, but for now, OpenAI is the best bet. The good news is that GPT-4 Turbo is _really_ good at this task (honestly better than Claude Opus) and pretty dang fast! The results are great in my testing.

I'm going to experiment very soon with using structured output conformed to dbt's JSON schema and passing entire tables, rather than iterating through columns, and see how it does with that. If it works that will be significantly faster as it can churn out entire files quickly and the rate limit will be less of a factor.
I'm going to experiment very soon with using structured output conformed to dbt's JSON schema and passing entire tables, rather than iterating through columns, and see how it does with that. If it works that will be significantly faster as it can churn out entire files (and perhaps improve quality through having more context) and the rate limits will be less of a factor.

### 🌊 Example workflows

Expand All @@ -118,18 +123,20 @@ I'm going to experiment very soon with using structured output conformed to dbt'
## 😅 To Do

- [ ] Get to 100% test coverage
- [ ] Add Claude 3 Opus option
- [x] Add Claude 3 Opus option
- [x] Add OpenAI GPT-4 Turbo option
- [x] Add support for Snowflake
- [x] Add support for BigQuery
- [ ] Add support for Redshift
- [ ] Add support for Databricks
- [x] Add support for Postgres
- [x] Add support for DuckDB
- [x] Add support for MotherDuck
- [ ] Build on Linux
- [ ] Build on Windows

## 🤗 Contributing

I welcome Discussions, Issues, and PRs! This is pre-release software, and without folks using it and opening Issues or Discussions I won't be able to find the rough edges and smooth them out. So please, if you get stuck, open an Issue and let's figure out how to fix it!

If you're a dbt user and aren't familiar with Go, but interested in learning a bit of it, I'm also happy to help guide you through opening a PR, just let me know 💗.
92 changes: 49 additions & 43 deletions forms.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,15 @@ import (
"errors"
"fmt"
"strconv"
"strings"

"github.com/charmbracelet/huh"
"github.com/fatih/color"
)

type FormResponse struct {
Path string
Username string
Password string
Host string
BuildDir string
SslMode string
Database string
Expand All @@ -21,21 +22,23 @@ type FormResponse struct {
ProjectName string
Warehouse string
Account string
GroqKeyEnvVar string
Password string
DbtProfileName string
LlmKeyEnvVar string
DbtProfileOutput string
DbtProfileName string
Path string
Port string
Host string
Username string
Prefix string
Llm string
GenerateDescriptions bool
ScaffoldProject bool
CreateProfile bool
UseDbtProfile bool
Confirm bool
}

var not_empty = func(s string) error {
func notEmpty(s string) error {
s = strings.TrimSpace(s)
if len(s) == 0 {
return fmt.Errorf("cannot be empty, please enter a value")
}
Expand All @@ -55,17 +58,16 @@ func getProfileOptions(ps DbtProfiles) []huh.Option[string] {

func Forms(ps DbtProfiles) (FormResponse, error) {
dfr := FormResponse{
BuildDir: "build",
GroqKeyEnvVar: "GROQ_API_KEY",
Prefix: "stg",
Host: "localhost",
Port: "5432",
BuildDir: "build",
LlmKeyEnvVar: "OPENAI_API_KEY",
Prefix: "stg",
Host: "localhost",
Port: "5432",
}
pinkUnderline := color.New(color.FgMagenta).Add(color.Bold, color.Underline).SprintFunc()
greenBold := color.New(color.FgGreen).Add(color.Bold).SprintFunc()
yellowItalic := color.New(color.FgHiYellow).Add(color.Italic).SprintFunc()
greenBoldItalic := color.New(color.FgHiGreen).Add(color.Bold).SprintFunc()
redBold := color.New(color.FgHiRed).Add(color.Bold).SprintFunc()
err := huh.NewForm(
huh.NewGroup(
huh.NewNote().
Expand Down Expand Up @@ -94,14 +96,14 @@ https://github.com/gwenwindflower/tbd
Title("What *prefix* for your staging files?").
Value(&dfr.Prefix).
Placeholder("stg").
Validate(not_empty),
Validate(notEmpty),
),

huh.NewGroup(huh.NewInput().
Title("What is the *name* of your dbt project?").
Value(&dfr.ProjectName).
Placeholder("rivendell").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return !dfr.ScaffoldProject
}),
Expand All @@ -123,17 +125,17 @@ https://github.com/gwenwindflower/tbd
Title("Which *output* in that profile do you want to use?").
Value(&dfr.DbtProfileOutput).
Placeholder("dev").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What *schema* do you want to generate?").
Value(&dfr.Schema).
Placeholder("raw").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What *database* is that schema in?").
Value(&dfr.Database).
Placeholder("jaffle_shop").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return !dfr.UseDbtProfile
}),
Expand All @@ -157,22 +159,22 @@ https://github.com/gwenwindflower/tbd
Title("What is your username?").
Value(&dfr.Username).
Placeholder("[email protected]").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is your Snowflake account id?").
Value(&dfr.Account).
Placeholder("elfstone-consulting.us-west-1").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *schema* you want to generate?").
Value(&dfr.Schema).
Placeholder("minas-tirith").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What *database* is that schema in?").
Value(&dfr.Database).
Placeholder("gondor").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "snowflake"
}),
Expand All @@ -182,12 +184,12 @@ https://github.com/gwenwindflower/tbd
Title("What GCP *project id* do you want to generate?").
Value(&dfr.Project).
Placeholder("legolas_inc").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *dataset* you want to generate?").
Value(&dfr.Dataset).
Placeholder("mirkwood").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "bigquery"
}),
Expand All @@ -198,17 +200,17 @@ https://github.com/gwenwindflower/tbd
Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
Value(&dfr.Path).
Placeholder("/path/to/duckdb.db").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *database* you want to generate?").
Value(&dfr.Database).
Placeholder("gimli_corp").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *schema* you want to generate?").
Value(&dfr.Schema).
Placeholder("moria").
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "duckdb"
}),
Expand All @@ -217,7 +219,7 @@ Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
huh.NewInput().
Title("What is your Postgres *host*?").
Value(&dfr.Host).
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is your Postgres *port*?").
Value(&dfr.Port).
Expand All @@ -232,22 +234,22 @@ Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
Title("What is your Postgres *username*?").
Value(&dfr.Username).
Placeholder("galadriel").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is your Postgres *password*?").
Value(&dfr.Password).
Validate(not_empty).
Validate(notEmpty).
EchoMode(huh.EchoModePassword),
huh.NewInput().
Title("What is the *database* you want to generate?").
Value(&dfr.Database).
Placeholder("lothlorien").
Validate(not_empty),
Validate(notEmpty),
huh.NewInput().
Title("What is the *schema* you want to generate?").
Value(&dfr.Schema).
Placeholder("mallorn_trees").
Validate(not_empty),
Validate(notEmpty),
huh.NewSelect[string]().
Title("What ssl mode do you want to use?").
Value(&dfr.SslMode).
Expand All @@ -258,32 +260,36 @@ Relative to pwd e.g. if db is in this dir -> cool_ducks.db`).
huh.NewOption("Verify-full", "verify-full"),
huh.NewOption("Prefer", "prefer"),
huh.NewOption("Allow", "allow")).
Validate(not_empty),
Validate(notEmpty),
).WithHideFunc(func() bool {
return dfr.Warehouse != "postgres"
}),

huh.NewGroup(
huh.NewNote().
Title(fmt.Sprintf("🤖 %s LLM generation 🦙✨", redBold("Experimental"))).
Description(fmt.Sprintf(`%s features via Groq.
Currently generates:
Title(fmt.Sprintf("🤖 %s LLM generation 🦙✨", yellowItalic("Optional"))).
Description(fmt.Sprintf(`Infers:
✴︎ column %s
✴︎ relevant %s
_Requires a_ %s _stored in an env var_:
Get one at https://groq.com.`, yellowItalic("Optional"), pinkUnderline("descriptions"), pinkUnderline("tests"), greenBoldItalic("Groq API key"))),
_Requires an_ %s _stored in an env var_.`, pinkUnderline("descriptions"), pinkUnderline("tests"), greenBoldItalic("API key"))),
huh.NewConfirm().
Title("Do you want to infer descriptions and tests?").
Value(&dfr.GenerateDescriptions),
),

huh.NewGroup(
huh.NewSelect[string]().
Options(
huh.NewOption("OpenAI", "openai"),
huh.NewOption("Groq", "groq"),
huh.NewOption("Anthropic", "anthropic"),
).Value(&dfr.Llm),
huh.NewInput().
Title("What env var holds your Groq key?").
Placeholder("GROQ_API_KEY").
Value(&dfr.GroqKeyEnvVar).
Validate(not_empty),
Title("What env var holds your LLM API key?").
Placeholder("OPENAI_API_KEY").
Value(&dfr.LlmKeyEnvVar).
Validate(notEmpty),
).WithHideFunc(func() bool {
return !dfr.GenerateDescriptions
}),
Expand All @@ -293,7 +299,7 @@ Get one at https://groq.com.`, yellowItalic("Optional"), pinkUnderline("descript
Title("What directory do you want to build into?\n Must be new or empty.").
Value(&dfr.BuildDir).
Placeholder("build").
Validate(not_empty),
Validate(notEmpty),
huh.NewConfirm().
Title("🚦Are you ready to do this thing?🚦").
Value(&dfr.Confirm),
Expand Down