tokenizer.train with strings not files #299
Comments
I have a similar use case: I am trying to pass an array of sentences rather than a list of files. I am using
training_data = np.array(stories_df['sentence']).reshape(-1, 1)
tokenizer.train(training_data, vocab_size=20000)
Any directions would be really helpful.
This is not possible at the moment, but indeed we should clearly support this use case.
Duplicate of #198
Is this still not supported?
I'm trying to train a tokenizer on a set of text files in Apache Tika JSON format.
I want to train it on the text itself, not on file names.
file1 = json.load(open('json file.json', 'rb'))
The content is stored here
file1[0]['X-TIKA:content']
I want to pass this content to train, but
tokenizer.train only accepts a list of files, not text.
I could extract the content and save it to new files, but that isn't really what I'm after.
Is there any way to feed strings to tokenizer.train?
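For reference, the save-it-again workaround mentioned above can be kept fairly small. The following is only a sketch using the Python standard library: the helper names `extract_tika_content` and `write_texts_to_temp_file` are hypothetical, the Tika file is assumed to be a JSON list of records as in the snippet above, and the `tokenizer.train` call is left as a comment because its exact signature depends on the tokenizer class and library version.

```python
import json
import os
import tempfile


def extract_tika_content(json_path):
    """Return the non-empty 'X-TIKA:content' strings from a Tika JSON file.

    Assumes the file holds a list of records (hypothetical helper).
    """
    with open(json_path, "r", encoding="utf-8") as handle:
        records = json.load(handle)
    return [r["X-TIKA:content"] for r in records if r.get("X-TIKA:content")]


def write_texts_to_temp_file(texts, suffix=".txt"):
    """Write each string on its own line to a temp file and return the path.

    Hypothetical helper: the caller is responsible for deleting the file.
    """
    fd, path = tempfile.mkstemp(suffix=suffix, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for text in texts:
            # Collapse whitespace so each document occupies exactly one line.
            handle.write(" ".join(text.split()) + "\n")
    return path


# Intended use (not executed here; the train call is an assumption):
#   texts = extract_tika_content("json file.json")
#   path = write_texts_to_temp_file(texts)
#   tokenizer.train([path], vocab_size=20000)
#   os.remove(path)
```

Collapsing whitespace is a design choice that keeps one document per line; drop it if line breaks matter for your data. Newer releases of the library are understood to provide a `train_from_iterator` method that accepts strings directly, so check the current documentation before relying on temporary files.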