docs: update context length limits to 16384 for chat finetunes
hemant-co committed Oct 3, 2024
1 parent fb85b13 commit 73bf834
Showing 4 changed files with 10 additions and 14 deletions.
@@ -63,8 +63,7 @@ To pass the validation tests Cohere performs on uploaded data, ensure that:

 - You have the proper roles. There are only three acceptable values for the `role` field: `System`, `Chatbot` or `User`. There should be at least one instance of `Chatbot` and `User` in each conversation. If your dataset includes other roles, an error will be thrown.
 - A preamble should be uploaded as the first message in the conversation, with `role: System`. All other messages with `role: System` will be treated as speakers in the conversation.
-- The "System" preamble message is not longer than 4096 tokens, which is half the maximum training sequence length.
-- Each turn in the conversation should be within the training context length of 8192 tokens to avoid being dropped from the dataset. We explain a turn in the "Chat Customization Best Practices" section below.
+- Each turn in the conversation should be within the training context length of 16384 tokens to avoid being dropped from the dataset. We explain a turn in the "Chat Customization Best Practices" section below.
 - Your data is encoded in UTF-8.

 ### Evaluation Datasets
@@ -126,7 +125,7 @@ A turn includes all messages up to the Chatbot speaker. The following conversati

 A few things to bear in mind:

-- The preamble is always kept within the context window. This means that the preamble and _all turns within the context window_ should be within 8192 tokens.
+- The preamble is always kept within the context window. This means that the preamble and _all turns within the context window_ should be within 16384 tokens.
 - To check how many tokens your data is, you can use the [co.tokenize() api](/reference/tokenize).
-- If any turns are above the context length of 8192 tokens, we will drop them from the training data.
+- If any turns are above the context length of 16384 tokens, we will drop them from the training data.
 - If an evaluation file is not uploaded, we will make our best effort to automatically split your uploaded conversations into an 80/20 split. In other words, if you upload a training dataset containing only the minimum of two conversations, we'll randomly put one of them in the training set, and the other in the evaluation set.
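The role and preamble requirements in the hunks above can be sketched as a minimal validation pass over one training conversation. The JSONL shape below (a top-level `messages` key, one conversation per line) is an assumption for illustration and is not confirmed by this diff; the role names and ordering rules come from the bullets above.

```python
import json

# One training conversation in the assumed JSONL shape: the preamble comes
# first with role "System", followed by at least one "User" and one
# "Chatbot" turn. The top-level "messages" key is an assumption.
conversation = {
    "messages": [
        {"role": "System", "content": "You are a concise support assistant."},
        {"role": "User", "content": "How do I reset my password?"},
        {"role": "Chatbot", "content": "Go to Settings > Security and choose 'Reset password'."},
    ]
}

ALLOWED_ROLES = {"System", "Chatbot", "User"}

def validate(conv):
    """Check the three rules from the docs: allowed roles only, at least one
    User and one Chatbot turn, and a System preamble as the first message."""
    roles = [m["role"] for m in conv["messages"]]
    assert set(roles) <= ALLOWED_ROLES, "only System, Chatbot, User are allowed"
    assert "User" in roles and "Chatbot" in roles, "need at least one User and one Chatbot turn"
    assert roles[0] == "System", "preamble should be the first message, with role: System"
    return True

line = json.dumps(conversation)  # one conversation per line in a .jsonl file
print(validate(conversation))  # True
```

Conversations that use any other role name, or lack a `User`/`Chatbot` turn, would trip the same kind of assertion the upload validation reports as an error.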
@@ -63,8 +63,7 @@ There are certain requirements for the data you use to fine-tune a model for Chat

 - There are only three acceptable values for the `role` field: `System`, `Chatbot` or `User`. There should be at least one instance of `Chatbot` and `User` in each conversation. If your dataset includes other roles, a validation error will be thrown.
 - A preamble should be uploaded as the first message in the conversation, with `role: System`. All other messages with `role: System` will be treated as speakers in the conversation.
-- Preambles should have a context length no longer than 4096 tokens.
-- What's more, each turn in the conversation should be within the context length of 4096 tokens to avoid being dropped from the dataset. We explain a turn in the ['Chat Customization Best Practices'](/docs/chat-preparing-the-data#:~:text=.await_validation()) section.
+- What's more, each turn in the conversation should be within the context length of 16384 tokens to avoid being dropped from the dataset. We explain a turn in the ['Chat Customization Best Practices'](/docs/chat-preparing-the-data#:~:text=.await_validation()) section.

 If you need more information, see ['Preparing the Data'](/docs/chat-preparing-the-data).

@@ -180,7 +179,7 @@ Below is a table of errors or warnings you may receive and how to fix them.
 | Error | 'extra speaker in example: \<extra_speaker_name> (line : X)' | This means that the uploaded training dataset has speakers which are not one of the allowed roles: `System`,`User` or `Chatbot` | Rename or remove the extra speaker and re-upload the dataset. |
 | Error | 'missing Chatbot in example' \nOR \n'missing User in example' | This means the uploaded training dataset is missing either `Chatbot` or `User` speaker, both of which are required. | Upload your dataset with required speakers `Chatbot` and `User` |
 | Warning | 'dataset has 0 valid eval rows. dataset will be auto-split' | This error is thrown when eval data was not uploaded, in which case the dataset will be auto-split with 80% going to training and 20% to evaluation. | None |
-| Warning | 'train dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' \nOR \n'eval dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' | This means the train and/or eval dataset has turns which exceed the context length of 4096 tokens, and will be dropped for training. The message specifies the conversation index x (which starts at 0), as well as the number of turns over the context length in that conversation, y. | If you do not want any turns dropped, consider shortening turns. |
+| Warning | 'train dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' \nOR \n'eval dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' | This means the train and/or eval dataset has turns which exceed the context length of 16384 tokens, and will be dropped for training. The message specifies the conversation index x (which starts at 0), as well as the number of turns over the context length in that conversation, y. | If you do not want any turns dropped, consider shortening turns. |

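The 'dataset will be auto-split' warning described in the table can be approximated with a short sketch. The exact split logic Cohere uses is not documented in this diff; what follows is an illustrative approximation of the stated behaviour (roughly 80% training, 20% evaluation, with a two-conversation dataset yielding one of each).

```python
import random

# Sketch of the auto-split applied when no eval file is uploaded:
# shuffle the conversations, send ~80% to training and the rest to eval.
def auto_split(conversations, train_fraction=0.8, seed=0):
    shuffled = list(conversations)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one conversation in each split, matching the note that
    # a minimal two-conversation dataset yields one train and one eval.
    cut = min(max(1, int(len(shuffled) * train_fraction)), len(shuffled) - 1)
    return shuffled[:cut], shuffled[cut:]

train, eval_set = auto_split([f"conv-{i}" for i in range(10)])
print(len(train), len(eval_set))  # 8 2
```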
@@ -62,8 +62,7 @@ To pass the validation tests Cohere performs on uploaded data, ensure that:

 - You have the proper roles. There are only three acceptable values for the `role` field: `System`, `Chatbot` or `User`. There should be at least one instance of `Chatbot` and `User` in each conversation. If your dataset includes other roles, an error will be thrown.
 - A preamble should be uploaded as the first message in the conversation, with `role: System`. All other messages with `role: System` will be treated as speakers in the conversation.
-- The "System" preamble message is not longer than 4096 tokens, which is half the maximum training sequence length.
-- Each turn in the conversation should be within the training context length of 8192 tokens to avoid being dropped from the dataset. We explain a turn in the "Chat Customization Best Practices" section below.
+- Each turn in the conversation should be within the training context length of 16384 tokens to avoid being dropped from the dataset. We explain a turn in the "Chat Customization Best Practices" section below.
 - Your data is encoded in UTF-8.

 ### Evaluation Datasets
@@ -125,7 +124,7 @@ A turn includes all messages up to the Chatbot speaker. The following conversati

 A few things to bear in mind:

-- The preamble is always kept within the context window. This means that the preamble and _all turns within the context window_ should be within 8192 tokens.
+- The preamble is always kept within the context window. This means that the preamble and _all turns within the context window_ should be within 16384 tokens.
 - To check how many tokens your data is, you can use the [Tokenize API](/reference/tokenize).
-- If any turns are above the context length of 8192 tokens, we will drop them from the training data.
+- If any turns are above the context length of 16384 tokens, we will drop them from the training data.
 - If an evaluation file is not uploaded, we will make our best effort to automatically split your uploaded conversations into an 80/20 split. In other words, if you upload a training dataset containing only the minimum of two conversations, we'll randomly put one of them in the training set, and the other in the evaluation set.
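Before uploading, turns can be pre-screened against the 16384-token limit the hunk above introduces. The authoritative count comes from the Tokenize API; the ~4-characters-per-token ratio below is only a coarse heuristic (an assumption, reasonable for English text) for flagging suspicious turns early.

```python
# Rough pre-check for the 16384-token turn limit. The real check should use
# Cohere's Tokenize API; this heuristic just flags turns worth inspecting.
CONTEXT_LENGTH = 16384
CHARS_PER_TOKEN = 4  # assumption: rough average for English text

def estimated_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def flag_long_turns(turns):
    """Return (index, estimated_tokens) for turns likely over the limit.

    Each element of `turns` is the concatenated text of all messages up to
    and including a Chatbot reply, matching the definition of a turn above.
    """
    flagged = []
    for i, turn_text in enumerate(turns):
        est = estimated_tokens(turn_text)
        if est > CONTEXT_LENGTH:
            flagged.append((i, est))
    return flagged

turns = ["short turn", "x" * 100_000]  # second turn is far over the limit
print(flag_long_turns(turns))  # [(1, 25000)]
```

Any turn this flags would otherwise be silently dropped from the training data, per the bullet above.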
@@ -65,8 +65,7 @@ There are certain requirements for the data you use to fine-tune a model for Chat

 - There are only three acceptable values for the `role` field: `System`, `Chatbot` or `User`. There should be at least one instance of `Chatbot` and `User` in each conversation. If your dataset includes other roles, a validation error will be thrown.
 - A preamble should be uploaded as the first message in the conversation, with `role: System`. All other messages with `role: System` will be treated as speakers in the conversation.
-- Preambles should have a context length no longer than 4096 tokens.
-- What's more, each turn in the conversation should be within the context length of 4096 tokens to avoid being dropped from the dataset. We explain a turn in the ['Chat Customization Best Practices'](/v2/docs/chat-preparing-the-data#chat-customization-best-practices) section.
+- What's more, each turn in the conversation should be within the context length of 16384 tokens to avoid being dropped from the dataset. We explain a turn in the ['Chat Customization Best Practices'](/v2/docs/chat-preparing-the-data#chat-customization-best-practices) section.

 If you need more information, see ['Preparing the Data'](/v2/docs/chat-preparing-the-data).

@@ -182,7 +181,7 @@ Below is a table of errors or warnings you may receive and how to fix them.
 | Error | 'extra speaker in example: \<extra_speaker_name> (line : X)' | This means that the uploaded training dataset has speakers which are not one of the allowed roles: `System`,`User` or `Chatbot` | Rename or remove the extra speaker and re-upload the dataset. |
 | Error | 'missing Chatbot in example' \nOR \n'missing User in example' | This means the uploaded training dataset is missing either `Chatbot` or `User` speaker, both of which are required. | Upload your dataset with required speakers `Chatbot` and `User` |
 | Warning | 'dataset has 0 valid eval rows. dataset will be auto-split' | This error is thrown when eval data was not uploaded, in which case the dataset will be auto-split with 80% going to training and 20% to evaluation. | None |
-| Warning | 'train dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' \nOR \n'eval dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' | This means the train and/or eval dataset has turns which exceed the context length of 4096 tokens, and will be dropped for training. The message specifies the conversation index x (which starts at 0), as well as the number of turns over the context length in that conversation, y. | If you do not want any turns dropped, consider shortening turns. |
+| Warning | 'train dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' \nOR \n'eval dataset has conversations with too many tokens. conversation number: number of turns with too many tokens is as follows, x:y' | This means the train and/or eval dataset has turns which exceed the context length of 16384 tokens, and will be dropped for training. The message specifies the conversation index x (which starts at 0), as well as the number of turns over the context length in that conversation, y. | If you do not want any turns dropped, consider shortening turns. |

