
Merge pull request #10 from jondurbin/hydra
Multiple instructions, massively more configurable + experimental non-English language support
jondurbin authored Jul 18, 2023
2 parents 2d7e913 + 80c35e5 commit 813f4dd
Showing 32 changed files with 1,968 additions and 870 deletions.
95 changes: 26 additions & 69 deletions README.md
@@ -11,87 +11,37 @@
This updated implementation supports either the /v1/completions endpoint or the /v1/chat/completions endpoint.
* support for custom topics list, custom topic generation prompt, or completely random topics
* in-memory vector db (Chroma) for similarity comparison, which is much faster than calculating rouge score for each generated instruction (see the sketch just after this list)
* (seemingly) better prompts, which includes injection of random topics to relate the instructions to, which creates much more diverse synthetic instructions
* multi-threaded producer/consumer implementation for significantly faster runtimes (generally 150+ unique prompts per minute, more initially since there are fewer duplicates, decreasing over time), now with asyncio producers and a configurable batch size
* several "instructors", each targeting specific use-cases, such as Orca style reasoning/math, role playing, etc.
* tries to ensure the context, if provided, is relevant to the topic and contains all the information that would be necessary to respond to the instruction, and not just a link to an article/etc.
* generally speaking, this implementation tries to reduce some of the [noise](https://github.com/tloen/alpaca-lora/issues/65)
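
The in-memory similarity gate can be pictured in a few lines of chromadb. This is a hedged sketch using Chroma's default client and embedding function, not the repository's actual implementation; `is_too_similar` and `add_instruction` are illustrative helper names:

```python
import chromadb

client = chromadb.Client()  # in-memory by default
collection = client.create_collection("instructions")

def is_too_similar(instruction: str, min_score: float = 0.35) -> bool:
    """Reject an instruction if an existing one is within min_score distance."""
    if collection.count() == 0:
        return False
    result = collection.query(query_texts=[instruction], n_results=1)
    # Chroma returns distances: smaller means more similar.
    return result["distances"][0][0] <= min_score

def add_instruction(instruction: str) -> None:
    collection.add(documents=[instruction], ids=[str(collection.count())])
```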


## Generating instructions

See available options via:
```
airoboros generate-instructions --help
```
### NEW - 2023-07-18

To better accommodate the plethora of options, the configuration has been moved to a YAML config file.

Please create a copy of `example-config.yaml` and configure as desired.

Once you have the desired configuration, run:

```
airoboros generate-instructions --config-path /path/to/config.yaml
```

For reference, the flag-based CLI that the YAML configuration replaces (help as of 2023-05-12):

```
usage: self_instruct.py [-h] [--model MODEL] [--organization-id ORGANIZATION_ID] [--openai-api-key OPENAI_API_KEY] [--instruction-count INSTRUCTION_COUNT]
                        [--batch-size BATCH_SIZE] [--output-path OUTPUT_PATH] [--topics-path TOPICS_PATH] [--overwrite] [--append] [--uncensored]
                        [--bot-name BOT_NAME] [--prompt PROMPT] [--contextual-prompt CONTEXTUAL_PROMPT] [--topic-generation-prompt TOPIC_GENERATION_PROMPT]
                        [--topic-request-count TOPIC_REQUEST_COUNT] [--contextual-prompt-ratio CONTEXTUAL_PROMPT_RATIO] [--skip-instruction-re SKIP_INSTRUCTION_RE]
                        [--temperature TEMPERATURE] [--prompt-generation-temperature PROMPT_GENERATION_TEMPERATURE] [--top-p TOP_P]
                        [--frequency-penalty FREQUENCY_PENALTY] [--presence-penalty PRESENCE_PENALTY] [--max-usage-tokens MAX_USAGE_TOKENS]
                        [--concurrency CONCURRENCY] [--min-docsearch-score MIN_DOCSEARCH_SCORE]

options:
  -h, --help            show this help message and exit
  --model MODEL         OpenAI model/engine to use for prompt generation, which can be either part of the /v1/completions or /v1/chat/completions endpoints
  --organization-id ORGANIZATION_ID
                        organization ID to include in the request to OpenAI, defaults to organization ID tied to the API key
  --openai-api-key OPENAI_API_KEY
                        OpenAI API key to use, defaults to the OPENAI_API_KEY environment variable
  --instruction-count INSTRUCTION_COUNT
                        number of instructions to generate, not including the seed instructions
  --batch-size BATCH_SIZE
                        number of candidate instructions to (attempt to) generate per request
  --output-path OUTPUT_PATH
                        path to store all generated instructions in
  --topics-path TOPICS_PATH
                        path to a newline separated list of topics
  --overwrite           overwrite output path if it exists
  --append              append to output path if it exists
  --uncensored          try to produce uncensored responses, via role-play prompt
  --bot-name BOT_NAME   name of the bot, when using uncensored mode
  --prompt PROMPT       prompt prefix to use for generating non-contextual instructions
  --contextual-prompt CONTEXTUAL_PROMPT
                        prompt to use for generating contextual prompts
  --topic-generation-prompt TOPIC_GENERATION_PROMPT
                        prompt to use in generating random topics
  --topic-request-count TOPIC_REQUEST_COUNT
                        number of requests to perform in random topic generation
  --contextual-prompt-ratio CONTEXTUAL_PROMPT_RATIO
                        ratio of prompts that should be contextual, e.g. summarization of an article
  --skip-instruction-re SKIP_INSTRUCTION_RE
                        regular expression used to filter low-quality/unusable instructions
  --temperature TEMPERATURE
                        temperature parameter to use in OpenAI requests to generate responses
  --prompt-generation-temperature PROMPT_GENERATION_TEMPERATURE
                        temperature parameter to use in OpenAI requests when generating synthetic instructions
  --top-p TOP_P         top-p parameter to use in OpenAI requests
  --frequency-penalty FREQUENCY_PENALTY
                        frequency penalty to use in OpenAI requests
  --presence-penalty PRESENCE_PENALTY
                        presence penalty to use in OpenAI requests
  --max-usage-tokens MAX_USAGE_TOKENS
                        maximum token usage, calculated as sum of total_tokens from responses
  --concurrency CONCURRENCY
                        number of concurrent threads/requests to use
  --min-docsearch-score MIN_DOCSEARCH_SCORE
```
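
The YAML file itself is not shown in this diff, so the following is only a hedged sketch of a minimal configuration; the key names are inferred from the instructor code added by this commit, the values are illustrative, and `example-config.yaml` in the repository remains the authoritative reference.

```yaml
# Illustrative sketch only; see example-config.yaml for the real schema.
language: English
min_docsearch_score: 0.35        # similarity threshold for deduplication
instructors:
  coding:
    count: 1000                  # target number of coding instruction/response pairs
    batch_size: 10               # candidate tasks requested per API call
    coding_languages:            # the coding instructor requires at least one
      - python
      - javascript
    related_software:
      - postgresql
      - redis
    api_params:                  # per-instructor overrides of default API params
      temperature: 0.7
```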

## Generating topics

### NEW - 2023-07-18

Again, this is now all YAML configuration based! Please create a customized version of the YAML config file, then run:

```
airoboros generate-topics --config-path /path/to/config.yaml
```

You can override the `topic_prompt` string in the configuration to use a different topic generation prompt.

The sections below describe the previous flag-based options, which the YAML configuration replaces.

### Using custom topics:

If you want to use a specific set of topics for prompt generation, add `--topics-path /path/to/topics.txt` to the command. This file must be plain text, one topic per line.

### Using a custom topic generator:

If you want to use random topics, but want those topics to be somewhat related to a specific category or idea, try playing with `--topic-generation-prompt` (and probably `--topic-request-count`). By default, the topic generation prompt is just random, but you can try things like:

```
... --topic-generation-prompt "Give me a list of 100 significant historical battles." --topic-request-count 10
```

Since the returned topics may include duplicates, your final topic list is not guaranteed to contain 100 * 10 unique topics.
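
Translated into the new YAML config, that same customization might look like the following hedged sketch; `topic_prompt` is the key named above, while `topic_request_count` is a hypothetical carry-over of the old flag and may not exist in `example-config.yaml`:

```yaml
# Hedged sketch; check example-config.yaml for the authoritative key names.
topic_prompt: Give me a list of 100 significant historical battles.
topic_request_count: 10  # hypothetical key, mirroring the old --topic-request-count flag
```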


## Support the work
@@ -101,9 +51,12 @@
https://bmc.link/jondurbin
## Models (research use only):

### gpt-4 versions
* [airoboros-33b-gpt4](https://huggingface.co/jondurbin/airoboros-33b-gpt4)
* [airoboros-13b-gpt4-1.1](https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.1)
* [airoboros-7b-gpt4-1.1](https://huggingface.co/jondurbin/airoboros-7b-gpt4-1.1)
* [airoboros-65b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.4)
* [airoboros-33b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4)
* [airoboros-mpt-30b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-mpt-30b-gpt4-1p4-five-epochs)
* [airoboros-13b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.4)
* [airoboros-7b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-7b-gpt4-1.4)
* *older versions on HF as well*

### gpt-3.5-turbo versions
* [airoboros-gpt-3.5-turbo-100k-7b](https://huggingface.co/jondurbin/airoboros-gpt-3.5-turbo-100k-7b)
@@ -115,7 +68,11 @@
* [airoboros-gpt-3.5-turbo](https://huggingface.co/datasets/jondurbin/airoboros-uncensored)
* [airoboros-gpt4](https://huggingface.co/datasets/jondurbin/airoboros-gpt4)
* [airoboros-gpt4-1.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.1)
* [airoboros-gpt4-1.2](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.2)
* [airoboros-gpt4-1.3](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.3)
* [airoboros-gpt4-1.4](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4)
* [airoboros-gpt4-1.4.1 (recommended)](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1)


## Coming soon

12 changes: 12 additions & 0 deletions airoboros/exceptions.py
@@ -1,14 +1,26 @@
# Raised when the API reports a rate-limit violation.
class RateLimitError(RuntimeError):
    ...


# Raised on HTTP 429 (too many requests) responses.
class TooManyRequestsError(RuntimeError):
    ...


# Raised when a response is malformed or otherwise unusable.
class BadResponseError(RuntimeError):
    ...


# Raised when the configured maximum token usage budget is spent.
class TokensExhaustedError(RuntimeError):
    ...


# Raised when a prompt exceeds the model's context window.
class ContextLengthExceededError(RuntimeError):
    ...


# Raised when the upstream server reports that it is overloaded.
class ServerOverloadedError(RuntimeError):
    ...


# Raised on other server-side errors.
class ServerError(RuntimeError):
    ...
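
A hedged sketch of how these exceptions might gate a retry loop follows; `send_request` is a hypothetical coroutine standing in for the project's actual request wrapper, which is not part of this file:

```python
import asyncio

from airoboros.exceptions import (
    RateLimitError,
    ServerOverloadedError,
    TokensExhaustedError,
)

async def request_with_retries(send_request, prompt, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await send_request(prompt)
        except (RateLimitError, ServerOverloadedError):
            # Transient failures: back off exponentially and retry.
            await asyncio.sleep(2**attempt)
        except TokensExhaustedError:
            # Hard budget stop: never retry.
            raise
    return None
```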
Empty file.
122 changes: 122 additions & 0 deletions airoboros/instructors/coding.py
@@ -0,0 +1,122 @@
import asyncio
import os
import random
import re


async def generate(instructor):
"""Generator for coding training data."""
config = instructor.instructors.get("coding")
if not config:
return
target_count = config.get("count") or instructor.default_count

# Load the prompt template.
path = config.get("prompt_path", "coding.txt")
if not os.path.exists(path):
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "prompts", path)
with open(path) as infile:
template = infile.read()

# Additional configuration related to coding tasks.
related = config.get("related_software", [])
coding_languages = config.get("coding_languages", [])
if not coding_languages:
raise ValueError("At least one coding language must be configured.")

# API params, overriding defaults with this instructor's config.
api_params = {**instructor.api_params, **config.get("api_params", {})}

# Min similarity score.
min_score = config.get("min_docsearch_score") or instructor.min_docsearch_score

# Generate the instruction/response pairs until we reach the target count.
batch_size = config.get("batch_size", 10)
count = instructor.instructor_counts.get("coding", 0)
language_index = 0
language = config.get("language") or instructor.language
while count < target_count:
# Inject languages to use for this batch.
current_languages = []
for _ in range(batch_size):
current_languages.append(coding_languages[language_index])
language_index += 1
if language_index >= len(coding_languages):
language_index = 0
languages_str = "\n".join(
[
f" * task {idx + 1} should ask the user to use {language}"
for idx, language in enumerate(current_languages)
]
)
related_str = ""
if batch_size > 3:
related_str = f"One of the tasks should require interacting with {random.choice(related)}."

# Get a batch of instructions.
prompt = template.format(
batch_size=batch_size,
languages=languages_str,
related_software=related_str,
language=language,
)
response = await instructor.generate_response(prompt, **api_params)
if not response:
continue

# Parse instructions and generate responses.
futures = []
instructions = []
for instruction in re.findall(
r"(?:^|\n)TSK \d+\. (.*?)(?:$|(?=\nTSK \d+\. ))", response, re.DOTALL
):
if not instruction.strip() or await instructor.is_too_similar(
instruction, min_score=min_score
):
continue

# Optionally add plain formatting.
plain = False
if random.random() < 0.5:
plain = True

full_instruction = (
instruction
if not plain
else "\n".join(
[
instruction,
" ".join(
[
"Generate only the code, as a single, plain text output.",
"Do not include an intro sentence indicating what the code will do.",
"Do not include any instructions for usage, warnings about replacing certain values, etc.",
"Do not surround the code with backticks/markdown formatting.",
]
),
]
)
)
instructions.append(
instruction if not plain else instruction + " PLAINFORMAT"
)
futures.append(instructor.generate_response(full_instruction, **api_params))
if not futures:
continue
responses = await asyncio.gather(*futures)
for idx in range(len(futures)):
response = responses[idx]
if not response:
continue
if "PLAINFORMAT" in instructions[idx] and "```" in response:
response = re.split(r"```[^\n]*(?:$|[\r\n])", response)[1].strip()
if not response:
continue
yield {
"instruction": instructions[idx].strip(),
"response": response.strip(),
"category": "coding",
}
count += 1
if count >= target_count:
break
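
Downstream, the generator above would be consumed as an async iterator. This is a hedged sketch rather than the project's actual entry point; `instructor` stands in for the orchestrator object, whose construction is outside this diff:

```python
import asyncio
import json

from airoboros.instructors.coding import generate

async def dump_coding_instructions(instructor, path="instructions.jsonl"):
    # generate() is an async generator yielding dicts with
    # "instruction", "response", and "category" keys.
    with open(path, "a") as outfile:
        async for item in generate(instructor):
            outfile.write(json.dumps(item) + "\n")

# asyncio.run(dump_coding_instructions(instructor))  # needs a configured instructor
```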
