Training a model from in-memory data #198

loicbarrault · 2020-03-18T17:23:19Z

Hi,
How could I change the code so that a model can be trained from in-memory data instead of from files?
Basically, replacing
tokenizer.train(["wiki.test.raw"], vocab_size=20000)
with
tokenizer.train(data_array, vocab_size=20000)
where data_array is, e.g., an array of sentences such as ["First sentence", "second sentence", ...].
Thanks for your work!
Best, Loic

Luckick · 2022-07-22T18:44:01Z

See https://huggingface.co/docs/tokenizers/training_from_memory for an example.
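For readers who want something concrete next to the link, here is a minimal sketch of training from a list of sentences with `Tokenizer.train_from_iterator`. It assumes a recent `tokenizers` release is installed. The BPE model, the `Whitespace` pre-tokenizer, the `[UNK]` special token, and the helper names `batch_iterator` and `train_from_memory` are illustrative choices, not something specified in this issue.

```python
from typing import Iterator, List, Sequence


def batch_iterator(data: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive batches of sentences from an in-memory sequence."""
    for start in range(0, len(data), batch_size):
        yield list(data[start:start + batch_size])


def train_from_memory(data: Sequence[str], vocab_size: int = 20000):
    """Train a BPE tokenizer on in-memory sentences (hypothetical helper)."""
    # Imported here so batch_iterator stays usable without the library.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Assumed setup: BPE model with whitespace pre-tokenization.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])

    # train_from_iterator accepts an iterator of strings or of batches of
    # strings; length is optional and only used for progress reporting.
    tokenizer.train_from_iterator(
        batch_iterator(data), trainer=trainer, length=len(data)
    )
    return tokenizer
```

With this in place, something like `train_from_memory(["First sentence", "second sentence"], vocab_size=20000)` would play the role of the `tokenizer.train(data_array, vocab_size=20000)` call asked about above. Batching is optional for a small list, but it keeps the same code usable for larger corpora or dataset objects.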

ksopyla mentioned this issue Jul 2, 2020

Enabling in-memory inputs for training a new tokenizer #88

Closed

n1t0 added the enhancement New feature or request label Oct 20, 2020

This was referenced Oct 20, 2020

tokenizer.train with strings not files #299

Closed

Training a Tokenizer from large texts read from memory #465

Closed

Feature Request: Train using a text iterator #478

Closed

fraboniface mentioned this issue Nov 4, 2020

Feature Request: make tokenizers trainable on arrow files #501

Closed

Narsil self-assigned this Nov 5, 2020

Narsil pinned this issue Nov 5, 2020

This was referenced Nov 10, 2020

[WIP] train from memory. #512

Closed

Feature Request: Example of how best to use with datasets #479

Closed

n1t0 mentioned this issue Nov 13, 2020

Training improvements #528

Closed

6 tasks

n1t0 mentioned this issue Nov 25, 2020

Ability to train from memory #544

Merged

1 task

n1t0 closed this as completed in #544 Nov 28, 2020
