tokenizer.train with strings not files #299

geg00 · 2020-06-09T22:48:04Z

I'm trying to train a tokenizer on a bunch of text files in Apache Tika JSON format.

I just want to train it with text not with file names.
file1 = json.load(open('json file.json', 'rb'))

The content is stored here
file1[0]['X-TIKA:content']

I want to send the content to train but
tokenizer.train only allows me to use a list of files not text.

I could extract the content and save it back to disk, but that's not really what I'm after.

Is there any way to feed strings to the tokenizer.train???
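
For reference, the extract-and-save workaround mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `tika_records_to_text_file` and the sample records are hypothetical, and the final `tokenizer.train` call is left commented out because it simply mirrors the file-based call discussed in this thread.

```python
import os
import tempfile


def tika_records_to_text_file(records, key="X-TIKA:content"):
    """Write the text of each Tika record to a single temporary .txt file.

    Hypothetical helper: returns the path of the file so it can be handed
    to a trainer that only accepts file paths.
    """
    tmp = tempfile.NamedTemporaryFile(
        mode="w", suffix=".txt", delete=False, encoding="utf-8"
    )
    with tmp:
        for record in records:
            text = record.get(key)
            if text:  # skip records that carry no extracted content
                tmp.write(text.strip() + "\n")
    return tmp.name


# Small stand-in for the list returned by json.load(...), so the sketch runs alone.
records = [
    {"X-TIKA:content": "first document text"},
    {"X-TIKA:content": "second document text"},
    {"Content-Type": "application/pdf"},  # no content key, will be skipped
]

path = tika_records_to_text_file(records)
print(os.path.getsize(path), "bytes written to", path)

# tokenizer.train([path], vocab_size=20000)  # file-based call, as in this thread
# os.remove(path)                            # clean up once training is done
```

This still touches the disk, so it does not answer the question; it only keeps the round trip short and avoids hand-managing intermediate files.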

sayakpaul · 2020-06-13T04:29:58Z

I have a similar use case: I am trying to pass in an array of sentences.

I am using ByteLevelBPETokenizer, and here's how I am trying to fit it to the data:

training_data = np.array(stories_df['sentence']).reshape(-1, 1)
tokenizer.train(training_data, vocab_size=20000)

Any directions would be really helpful.

n1t0 · 2020-06-15T12:36:32Z

This is not possible at the moment, but indeed we should clearly support this use case.

n1t0 · 2020-10-20T21:28:52Z

Duplicate of #198

klxu03 · 2024-05-01T23:47:33Z

Is this still not supported?
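
For readers landing here later: more recent releases of the tokenizers library provide a `train_from_iterator` method that accepts in-memory strings. The sketch below assumes an installed version that includes it. The helper names `iter_tika_content` and `train_on_strings` and the sample records are hypothetical and only serve the illustration.

```python
def iter_tika_content(records, key="X-TIKA:content"):
    """Yield the extracted text of each Tika record, skipping empty ones."""
    for record in records:
        text = record.get(key)
        if text:
            yield text


def train_on_strings(texts, vocab_size=20000):
    """Train a byte-level BPE tokenizer directly on an iterable of strings.

    Assumes a tokenizers release that provides train_from_iterator.
    """
    # Imported lazily so the generator above stays usable without the library.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(texts, vocab_size=vocab_size, min_frequency=1)
    return tokenizer


# Small stand-in for the list returned by json.load(...).
records = [
    {"X-TIKA:content": "first document text"},
    {"X-TIKA:content": "second document text"},
    {"Content-Type": "application/pdf"},  # no content key, will be skipped
]
texts = list(iter_tika_content(records))

# tokenizer = train_on_strings(texts)  # requires the tokenizers package
```

The same iterator approach should also cover the pandas case above, for example by passing `stories_df['sentence'].tolist()` instead of a reshaped NumPy array.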

n1t0 marked this as a duplicate of #198 Oct 20, 2020

n1t0 closed this as completed Oct 20, 2020
