Fixed table with nowrap #22

Closed
82 changes: 73 additions & 9 deletions fern/pages/get-started/datasets.mdx
@@ -168,20 +168,84 @@ Datasets of type `chat-finetune-input`, for example, are expected to have a json
"role": "Chatbot",
"content": "Time magazines top 10 cover stories in the last 10 years were:\\n\\n1. Volodymyr Zelenskyy\\n2. Elon Musk\\n3. Martin Luther King Jr.\\n4. How Earth Survived\\n5. Her Lasting Impact\\n6. Nothing to See Here\\n7. Meltdown\\n8. Deal With It\\n9. The Top of America\\n10. Bitter Pill"
}
]
}
```

The following table describes the types of datasets supported by the Dataset API:

| Dataset Type | Description | Schema | Rules | Task Type | Status | File Types Supported | Are Metadata Fields Supported? | Sample File |
|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|---------------------------|--------------------------------|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `single-label-classification-finetune-input` | A file containing text and a single label (class) for each text | `text:string \nlabel:string` | You must include 40 valid train examples, \nwith five examples per label. A label cannot be present in all examples \nThere must be 24 valid evaluation examples. | Classification Fine-tuning | Supported | `csv` and `jsonl` | No | [Art classification file](https://drive.google.com/file/d/15-CchSiALUQwto4b-yAMWhdUqz8vfwQ1/view?usp=drive_link) |
| `multi-label-classification-finetune-input` | A file containing text and an array of label(s) (class) for each text | `text:string \nlabel:list[string]` | You must include 40 valid train examples, with five examples per label \nA label cannot be present in all examples. There must be 24 valid evaluation examples. | Classification Fine-tuning | Supported | `jsonl` | No | n/a |
| `reranker-finetune-input` | A file containing queries and an array of passages relevant to the query. There must also be "hard negatives", passages semantically similar but ultimately not relevant. | `query:string \nrelevant_passages:list[string] \nhard_negatives:list[string]` | There must be 256 train examples and at least 64 evaluation examples. There must be at least one relevant passage, with no overlap between relevant passage and hard negatives. | Rerank Fine-tuning | Supported | `jsonl` | No | [train_valid.json](https://drive.google.com/file/d/1CmXWfQRedVyWBDCsSkeF9g8gyqmpUA7C/view?usp=drive_link) |
| `chat-finetune-input` | A file containing conversations | `messages: list[Message]` \n \n`- Message - \n role: text \n content: text` | There must be two valid train examples and one valid evaluation example. | Chat Fine-tuning | In progress/not supported | `jsonl` | No | [train_celestial_fox.json](https://drive.google.com/file/d/19x6sOPXNWoZj9Jo989h09wd4IJ6Su9by/view?usp=drive_link) |
| `embed-input` | A file containing text to be embedded | `text:string` | None of the rows in the file can be empty. | Embed job | Supported | `csv` and `jsonl` | Yes | [embed_jobs_sample_data.jsonl](https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/embed_jobs_sample_data.jsonl) / [embed_jobs_sample_data.csv](https://github.com/cohere-ai/notebooks/blob/main/notebooks/data/embed_jobs_sample_data.csv) |


<table class="fern-table" style={{ 'white-space': 'nowrap', display: 'block', overflow: 'auto' }}>
<thead>
<tr>
<th>Dataset Type</th>
<th>Description</th>
<th>Schema</th>
<th>Rules</th>
<th>Task Type</th>
<th>Status</th>
<th>File Types Supported</th>
<th>Are Metadata Fields Supported?</th>
<th>Sample File</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>single-label-classification-finetune-input</code></td>
<td>A file containing text and a single label (class) for each text</td>
<td><code>text:string<br/>label:string</code></td>
<td>You must include 40 valid train examples, with five examples per label. <br/>A label cannot be present in all examples.<br/>There must be 24 valid evaluation examples.</td>
<td>Classification Fine-tuning</td>
<td>Supported</td>
<td><code>csv</code> and <code>jsonl</code></td>
<td>No</td>
<td><a href="https://drive.google.com/file/d/15-CchSiALUQwto4b-yAMWhdUqz8vfwQ1/view?usp=drive_link">Art classification file</a></td>
</tr>
<tr>
<td><code>multi-label-classification-finetune-input</code></td>
<td>A file containing text and an array of label(s) (class) for each text</td>
<td><code>text:string<br/>label:list[string]</code></td>
<td>You must include 40 valid train examples, with five examples per label.<br/>A label cannot be present in all examples.<br/>There must be 24 valid evaluation examples.</td>
<td>Classification Fine-tuning</td>
<td>Supported</td>
<td><code>jsonl</code></td>
<td>No</td>
<td>n/a</td>
</tr>
<tr>
<td><code>reranker-finetune-input</code></td>
<td>A file containing queries and an array of passages relevant to the query. There must also be "hard negatives", passages semantically similar but ultimately not relevant.</td>
<td><code>query:string<br/>relevant_passages:list[string]<br/>hard_negatives:list[string]</code></td>
<td>There must be 256 train examples and at least 64 evaluation examples.<br/>There must be at least one relevant passage, with no overlap between relevant passage and hard negatives.</td>
<td>Rerank Fine-tuning</td>
<td>Supported</td>
<td><code>jsonl</code></td>
<td>No</td>
<td><a href="https://drive.google.com/file/d/1CmXWfQRedVyWBDCsSkeF9g8gyqmpUA7C/view?usp=drive_link">train_valid.json</a></td>
</tr>
<tr>
<td><code>chat-finetune-input</code></td>
<td>A file containing conversations</td>
<td><code>messages: list[Message]<br/><br/>- Message -<br/>role: text<br/>content: text</code></td>
<td>There must be two valid train examples and one valid evaluation example.</td>
<td>Chat Fine-tuning</td>
<td>In progress/not supported</td>
<td><code>jsonl</code></td>
<td>No</td>
<td><a href="https://drive.google.com/file/d/19x6sOPXNWoZj9Jo989h09wd4IJ6Su9by/view?usp=drive_link">train_celestial_fox.json</a></td>
</tr>
<tr>
<td><code>embed-input</code></td>
<td>A file containing text to be embedded</td>
<td><code>text:string</code></td>
<td>None of the rows in the file can be empty.</td>
<td>Embed job</td>
<td>Supported</td>
<td><code>csv</code> and <code>jsonl</code></td>
<td>Yes</td>
<td><a href="https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/embed_jobs_sample_data.jsonl">embed_jobs_sample_data.jsonl</a> / <a href="https://github.com/cohere-ai/notebooks/blob/main/notebooks/data/embed_jobs_sample_data.csv">embed_jobs_sample_data.csv</a></td>
</tr>
</tbody>
</table>
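
As a rough illustration of the Rules column, the sketch below checks the training rows of a `single-label-classification-finetune-input` file locally before upload. It is not part of the Dataset API: the helper names (`parse_jsonl`, `is_valid_row`, `check_single_label_train`) are hypothetical, a "valid" row is assumed to mean non-empty `text` and `label` strings, and the counts of 40 and five are assumed to be minimums.

```python
import json
from collections import Counter

# Thresholds taken from the Rules column; treating them as minimums is an assumption.
MIN_TRAIN_EXAMPLES = 40
MIN_EXAMPLES_PER_LABEL = 5


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_valid_row(row):
    """Assumed validity rule: `text` and `label` are both non-empty strings."""
    return all(
        isinstance(row.get(key), str) and row[key].strip()
        for key in ("text", "label")
    )


def check_single_label_train(rows):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    valid = [row for row in rows if is_valid_row(row)]

    if len(valid) < MIN_TRAIN_EXAMPLES:
        problems.append(
            f"found {len(valid)} valid train examples, need {MIN_TRAIN_EXAMPLES}"
        )

    counts = Counter(row["label"] for row in valid)
    for label, count in sorted(counts.items()):
        if count < MIN_EXAMPLES_PER_LABEL:
            problems.append(
                f"label {label!r} has {count} examples, need {MIN_EXAMPLES_PER_LABEL}"
            )
        # A label that covers every valid row leaves nothing to distinguish it from.
        if count == len(valid):
            problems.append(f"label {label!r} is present in all examples")

    return problems
```

Calling `check_single_label_train(parse_jsonl(open("train.jsonl").read()))` on a local file (the file name is a placeholder) returns an empty list when none of these assumed checks fail. The service's own validation remains the source of truth.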

### Downloading a dataset
