
Merge pull request #10 from jondurbin/hydra
Multiple instructions, massively more configurable + experimental non-English language support
jondurbin authored Jul 18, 2023
2 parents 2d7e913 + 80c35e5 commit 813f4dd
Showing 32 changed files with 1,968 additions and 870 deletions.
95 changes: 26 additions & 69 deletions README.md
@@ -11,87 +11,37 @@
This updated implementation supports either the /v1/completions endpoint or the /v1/chat/completions endpoint.
* support for custom topics list, custom topic generation prompt, or completely random topics
* in-memory vector db (Chroma) for similarity comparison, which is much faster than calculating rouge score for each generated instruction (see the sketch just after this list)
* (seemingly) better prompts, which includes injection of random topics to relate the instructions to, which creates much more diverse synthetic instructions
* multi-threaded producer/consumer implementation for significantly faster runtimes (generally 150+ unique prompts per minute, more initially since there are fewer duplicates, decreasing over time), now with asyncio producers and a configurable batch size
* several "instructors", each targeting specific use-cases, such as Orca style reasoning/math, role playing, etc.
* tries to ensure the context, if provided, is relevant to the topic and contains all the information that would be necessary to respond to the instruction, and not just a link to an article/etc.
* generally speaking, this implementation tries to reduce some of the [noise](https://github.com/tloen/alpaca-lora/issues/65)
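
The in-memory similarity gate can be pictured in a few lines of chromadb. This is a hedged sketch using Chroma's default client and embedding function, not the repository's actual implementation; `is_too_similar` and `add_instruction` are illustrative helper names:

```python
import chromadb

client = chromadb.Client()  # in-memory by default
collection = client.create_collection("instructions")

def is_too_similar(instruction: str, min_score: float = 0.35) -> bool:
    """Reject an instruction if an existing one is within min_score distance."""
    if collection.count() == 0:
        return False
    result = collection.query(query_texts=[instruction], n_results=1)
    # Chroma returns distances: smaller means more similar.
    return result["distances"][0][0] <= min_score

def add_instruction(instruction: str) -> None:
    collection.add(documents=[instruction], ids=[str(collection.count())])
```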


## Generating instructions

See available options via:
```
airoboros generate-instructions --help
```
### NEW - 2023-07-18

To better accommodate the plethora of options, the configuration has been moved to a YAML config file.

Please create a copy of `example-config.yaml` and configure as desired.

Once you have the desired configuration, run:

```
airoboros generate-instructions --config-path /path/to/config.yaml
```

For reference, the flag-based CLI that the YAML configuration replaces (help as of 2023-05-12):

```
usage: self_instruct.py [-h] [--model MODEL] [--organization-id ORGANIZATION_ID] [--openai-api-key OPENAI_API_KEY] [--instruction-count INSTRUCTION_COUNT]
                        [--batch-size BATCH_SIZE] [--output-path OUTPUT_PATH] [--topics-path TOPICS_PATH] [--overwrite] [--append] [--uncensored]
                        [--bot-name BOT_NAME] [--prompt PROMPT] [--contextual-prompt CONTEXTUAL_PROMPT] [--topic-generation-prompt TOPIC_GENERATION_PROMPT]
                        [--topic-request-count TOPIC_REQUEST_COUNT] [--contextual-prompt-ratio CONTEXTUAL_PROMPT_RATIO] [--skip-instruction-re SKIP_INSTRUCTION_RE]
                        [--temperature TEMPERATURE] [--prompt-generation-temperature PROMPT_GENERATION_TEMPERATURE] [--top-p TOP_P]
                        [--frequency-penalty FREQUENCY_PENALTY] [--presence-penalty PRESENCE_PENALTY] [--max-usage-tokens MAX_USAGE_TOKENS]
                        [--concurrency CONCURRENCY] [--min-docsearch-score MIN_DOCSEARCH_SCORE]

options:
  -h, --help            show this help message and exit
  --model MODEL         OpenAI model/engine to use for prompt generation, which can be either part of the /v1/completions or /v1/chat/completions endpoints
  --organization-id ORGANIZATION_ID
                        organization ID to include in the request to OpenAI, defaults to organization ID tied to the API key
  --openai-api-key OPENAI_API_KEY
                        OpenAI API key to use, defaults to the OPENAI_API_KEY environment variable
  --instruction-count INSTRUCTION_COUNT
                        number of instructions to generate, not including the seed instructions
  --batch-size BATCH_SIZE
                        number of candidate instructions to (attempt to) generate per request
  --output-path OUTPUT_PATH
                        path to store all generated instructions in
  --topics-path TOPICS_PATH
                        path to a newline separated list of topics
  --overwrite           overwrite output path if it exists
  --append              append to output path if it exists
  --uncensored          try to produce uncensored responses, via role-play prompt
  --bot-name BOT_NAME   name of the bot, when using uncensored mode
  --prompt PROMPT       prompt prefix to use for generating non-contextual instructions
  --contextual-prompt CONTEXTUAL_PROMPT
                        prompt to use for generating contextual prompts
  --topic-generation-prompt TOPIC_GENERATION_PROMPT
                        prompt to use in generating random topics
  --topic-request-count TOPIC_REQUEST_COUNT
                        number of requests to perform in random topic generation
  --contextual-prompt-ratio CONTEXTUAL_PROMPT_RATIO
                        ratio of prompts that should be contextual, e.g. summarization of an article
  --skip-instruction-re SKIP_INSTRUCTION_RE
                        regular expression used to filter low-quality/unusable instructions
  --temperature TEMPERATURE
                        temperature parameter to use in OpenAI requests to generate responses
  --prompt-generation-temperature PROMPT_GENERATION_TEMPERATURE
                        temperature parameter to use in OpenAI requests when generating synthetic instructions
  --top-p TOP_P         top-p parameter to use in OpenAI requests
  --frequency-penalty FREQUENCY_PENALTY
                        frequency penalty to use in OpenAI requests
  --presence-penalty PRESENCE_PENALTY
                        presence penalty to use in OpenAI requests
  --max-usage-tokens MAX_USAGE_TOKENS
                        maximum token usage, calculated as sum of total_tokens from responses
  --concurrency CONCURRENCY
                        number of concurrent threads/requests to use
  --min-docsearch-score MIN_DOCSEARCH_SCORE
```
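
The YAML file itself is not shown in this diff, so the following is only a hedged sketch of a minimal configuration; the key names are inferred from the instructor code added by this commit, the values are illustrative, and `example-config.yaml` in the repository remains the authoritative reference.

```yaml
# Illustrative sketch only; see example-config.yaml for the real schema.
language: English
min_docsearch_score: 0.35        # similarity threshold for deduplication
instructors:
  coding:
    count: 1000                  # target number of coding instruction/response pairs
    batch_size: 10               # candidate tasks requested per API call
    coding_languages:            # the coding instructor requires at least one
      - python
      - javascript
    related_software:
      - postgresql
      - redis
    api_params:                  # per-instructor overrides of default API params
      temperature: 0.7
```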

## Generating topics

### NEW - 2023-07-18

Again, this is now all YAML configuration based! Please create a customized version of the YAML config file, then run:

```
airoboros generate-topics --config-path /path/to/config.yaml
```

You can override the `topic_prompt` string in the configuration to use a different topic generation prompt.

The sections below describe the previous flag-based options, which the YAML configuration replaces.

### Using custom topics:

If you want to use a specific set of topics for prompt generation, add `--topics-path /path/to/topics.txt` to the command. This file must be plain text, one topic per line.

### Using a custom topic generator:

If you want to use random topics, but want those topics to be somewhat related to a specific category or idea, try playing with `--topic-generation-prompt` (and probably `--topic-request-count`). By default, the topic generation prompt is just random, but you can try things like:

```
... --topic-generation-prompt "Give me a list of 100 significant historical battles." --topic-request-count 10
```

Since the returned topics may include duplicates, your final topic list is not guaranteed to contain 100 * 10 unique topics.
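
Translated into the new YAML config, that same customization might look like the following hedged sketch; `topic_prompt` is the key named above, while `topic_request_count` is a hypothetical carry-over of the old flag and may not exist in `example-config.yaml`:

```yaml
# Hedged sketch; check example-config.yaml for the authoritative key names.
topic_prompt: Give me a list of 100 significant historical battles.
topic_request_count: 10  # hypothetical key, mirroring the old --topic-request-count flag
```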


## Support the work
@@ -101,9 +51,12 @@
https://bmc.link/jondurbin
## Models (research use only):

### gpt-4 versions
* [airoboros-33b-gpt4](https://huggingface.co/jondurbin/airoboros-33b-gpt4)
* [airoboros-13b-gpt4-1.1](https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.1)
* [airoboros-7b-gpt4-1.1](https://huggingface.co/jondurbin/airoboros-7b-gpt4-1.1)
* [airoboros-65b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.4)
* [airoboros-33b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4)
* [airoboros-mpt-30b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-mpt-30b-gpt4-1p4-five-epochs)
* [airoboros-13b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.4)
* [airoboros-7b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-7b-gpt4-1.4)
* *older versions on HF as well*

### gpt-3.5-turbo versions
* [airoboros-gpt-3.5-turbo-100k-7b](https://huggingface.co/jondurbin/airoboros-gpt-3.5-turbo-100k-7b)
@@ -115,7 +68,11 @@
* [airoboros-gpt-3.5-turbo](https://huggingface.co/datasets/jondurbin/airoboros-uncensored)
* [airoboros-gpt4](https://huggingface.co/datasets/jondurbin/airoboros-gpt4)
* [airoboros-gpt4-1.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.1)
* [airoboros-gpt4-1.2](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.2)
* [airoboros-gpt4-1.3](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.3)
* [airoboros-gpt4-1.4](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4)
* [airoboros-gpt4-1.4.1 (recommended)](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1)


## Coming soon

12 changes: 12 additions & 0 deletions airoboros/exceptions.py
@@ -1,14 +1,26 @@
# Raised when the API reports a rate-limit violation.
class RateLimitError(RuntimeError):
    ...


# Raised on HTTP 429 (too many requests) responses.
class TooManyRequestsError(RuntimeError):
    ...


# Raised when a response is malformed or otherwise unusable.
class BadResponseError(RuntimeError):
    ...


# Raised when the configured maximum token usage budget is spent.
class TokensExhaustedError(RuntimeError):
    ...


# Raised when a prompt exceeds the model's context window.
class ContextLengthExceededError(RuntimeError):
    ...


# Raised when the upstream server reports that it is overloaded.
class ServerOverloadedError(RuntimeError):
    ...


# Raised on other server-side errors.
class ServerError(RuntimeError):
    ...
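
A hedged sketch of how these exceptions might gate a retry loop follows; `send_request` is a hypothetical coroutine standing in for the project's actual request wrapper, which is not part of this file:

```python
import asyncio

from airoboros.exceptions import (
    RateLimitError,
    ServerOverloadedError,
    TokensExhaustedError,
)

async def request_with_retries(send_request, prompt, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await send_request(prompt)
        except (RateLimitError, ServerOverloadedError):
            # Transient failures: back off exponentially and retry.
            await asyncio.sleep(2**attempt)
        except TokensExhaustedError:
            # Hard budget stop: never retry.
            raise
    return None
```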
Empty file.
122 changes: 122 additions & 0 deletions airoboros/instructors/coding.py
@@ -0,0 +1,122 @@
import asyncio
import os
import random
import re


async def generate(instructor):
"""Generator for coding training data."""
config = instructor.instructors.get("coding")
if not config:
return
target_count = config.get("count") or instructor.default_count

# Load the prompt template.
path = config.get("prompt_path", "coding.txt")
if not os.path.exists(path):
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "prompts", path)
with open(path) as infile:
template = infile.read()

# Additional configuration related to coding tasks.
related = config.get("related_software", [])
coding_languages = config.get("coding_languages", [])
if not coding_languages:
raise ValueError("At least one coding language must be configured.")

# API params, overriding defaults with this instructor's config.
api_params = {**instructor.api_params, **config.get("api_params", {})}

# Min similarity score.
min_score = config.get("min_docsearch_score") or instructor.min_docsearch_score

# Generate the instruction/response pairs until we reach the target count.
batch_size = config.get("batch_size", 10)
count = instructor.instructor_counts.get("coding", 0)
language_index = 0
language = config.get("language") or instructor.language
while count < target_count:
# Inject languages to use for this batch.
current_languages = []
for _ in range(batch_size):
current_languages.append(coding_languages[language_index])
language_index += 1
if language_index >= len(coding_languages):
language_index = 0
languages_str = "\n".join(
[
f" * task {idx + 1} should ask the user to use {language}"
for idx, language in enumerate(current_languages)
]
)
related_str = ""
if batch_size > 3:
related_str = f"One of the tasks should require interacting with {random.choice(related)}."

# Get a batch of instructions.
prompt = template.format(
batch_size=batch_size,
languages=languages_str,
related_software=related_str,
language=language,
)
response = await instructor.generate_response(prompt, **api_params)
if not response:
continue

# Parse instructions and generate responses.
futures = []
instructions = []
for instruction in re.findall(
r"(?:^|\n)TSK \d+\. (.*?)(?:$|(?=\nTSK \d+\. ))", response, re.DOTALL
):
if not instruction.strip() or await instructor.is_too_similar(
instruction, min_score=min_score
):
continue

# Optionally add plain formatting.
plain = False
if random.random() < 0.5:
plain = True

full_instruction = (
instruction
if not plain
else "\n".join(
[
instruction,
" ".join(
[
"Generate only the code, as a single, plain text output.",
"Do not include an intro sentence indicating what the code will do.",
"Do not include any instructions for usage, warnings about replacing certain values, etc.",
"Do not surround the code with backticks/markdown formatting.",
]
),
]
)
)
instructions.append(
instruction if not plain else instruction + " PLAINFORMAT"
)
futures.append(instructor.generate_response(full_instruction, **api_params))
if not futures:
continue
responses = await asyncio.gather(*futures)
for idx in range(len(futures)):
response = responses[idx]
if not response:
continue
if "PLAINFORMAT" in instructions[idx] and "```" in response:
response = re.split(r"```[^\n]*(?:$|[\r\n])", response)[1].strip()
if not response:
continue
yield {
"instruction": instructions[idx].strip(),
"response": response.strip(),
"category": "coding",
}
count += 1
if count >= target_count:
break
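
Downstream, the generator above would be consumed as an async iterator. This is a hedged sketch rather than the project's actual entry point; `instructor` stands in for the orchestrator object, whose construction is outside this diff:

```python
import asyncio
import json

from airoboros.instructors.coding import generate

async def dump_coding_instructions(instructor, path="instructions.jsonl"):
    # generate() is an async generator yielding dicts with
    # "instruction", "response", and "category" keys.
    with open(path, "a") as outfile:
        async for item in generate(instructor):
            outfile.write(json.dumps(item) + "\n")

# asyncio.run(dump_coding_instructions(instructor))  # needs a configured instructor
```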
