tokenizer.train with strings not files #299

Closed
geg00 opened this issue Jun 9, 2020 · 4 comments

Comments

@geg00

geg00 commented Jun 9, 2020

I'm trying to train a tokenizer on a collection of text files stored in Apache Tika JSON format.

I want to train it on the text itself, not on file names.
file1 = json.load(open('json file.json', 'rb'))

The content is stored here:
file1[0]['X-TIKA:content']

I want to pass the content to train, but
tokenizer.train only accepts a list of files, not text.

I could extract the content and save it to new files, but I would rather avoid that.

Is there any way to feed strings to tokenizer.train?
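
The save-it-again workaround mentioned above can be sketched as follows. This is only a minimal illustration, assuming each Tika JSON file holds a list of dicts with an 'X-TIKA:content' key (as in file1[0]['X-TIKA:content']). The helper name tika_content_to_text_files and the doc_N.txt naming are hypothetical, not part of tokenizers.

```python
import json
import tempfile
from pathlib import Path


def tika_content_to_text_files(json_paths, out_dir):
    """Write the 'X-TIKA:content' text of each Tika JSON file to a .txt file.

    Hypothetical helper: returns the list of text-file paths as strings.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_paths = []
    for i, json_path in enumerate(json_paths):
        with open(json_path, "r", encoding="utf-8") as f:
            records = json.load(f)
        # Assumption: the file is a list of dicts; entries without content are skipped.
        contents = [r.get("X-TIKA:content") or "" for r in records]
        text_path = out_dir / f"doc_{i}.txt"
        text_path.write_text("\n".join(c for c in contents if c), encoding="utf-8")
        text_paths.append(str(text_path))
    return text_paths


# Usage sketch (requires the tokenizers package; not executed here):
# paths = tika_content_to_text_files(["json file.json"], tempfile.mkdtemp())
# tokenizer.train(files=paths, vocab_size=20000)
```

The text files live in a temporary directory, so they can be deleted once training has finished.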

@sayakpaul
Member

I have a similar use case. I am trying to pass in an array of sentences, which looks like this:

image

I am using ByteLevelBPETokenizer, and here's how I am trying to fit it to the data:

training_data = np.array(stories_df['sentence']).reshape(-1, 1)
tokenizer.train(training_data, vocab_size=20000)

Any directions would be really helpful.

@n1t0
Member

n1t0 commented Jun 15, 2020

This is not possible at the moment, but indeed we should clearly support this use case.

@n1t0
Member

n1t0 commented Oct 20, 2020

Duplicate of #198

n1t0 marked this as a duplicate of #198 Oct 20, 2020
n1t0 closed this as completed Oct 20, 2020

@klxu03

klxu03 commented May 1, 2024

Is this still not supported?
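
For later readers: as far as I know, tokenizers releases published after this issue was closed provide a train_from_iterator method that accepts an iterable of strings, so no files are needed. The sketch below assumes such a release is installed. The helper names iter_texts and train_on_strings are hypothetical and only there for illustration.

```python
def iter_texts(records, key="X-TIKA:content"):
    """Yield the non-empty text strings from a list of Tika-style dicts."""
    for record in records:
        text = record.get(key)
        if text:
            yield text


def train_on_strings(texts, vocab_size=20000):
    """Train a byte-level BPE tokenizer directly on an iterable of strings."""
    # Imported lazily so iter_texts works without the library installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    # Assumption: the installed tokenizers release provides train_from_iterator.
    tokenizer.train_from_iterator(texts, vocab_size=vocab_size)
    return tokenizer


# Usage sketches (not executed here):
# tokenizer = train_on_strings(iter_texts(file1))
# tokenizer = train_on_strings(stories_df['sentence'].tolist())
```

For the DataFrame case, a flat list of strings is passed instead of the reshaped 2-D NumPy array.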
