use tok as fallback tokenizer #11
base: master
Conversation
I'd prefer to avoid changing the default tokenization behavior. I can see adding tok as an optional alternative in the Encoder constructor, though.
print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
So using tok would change the default tokenization?
For context, in NLP tasks it can be important to retain the use of contractions, since it can be informative about other aspects of the author of the text.
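For illustration, here is roughly what the current NLTK-based default produces for a contraction (a minimal sketch; it assumes NLTK's word_tokenize with the punkt data downloaded, and the tok output is taken from the diff above):

from nltk.tokenize import word_tokenize  # current default; requires nltk.download('punkt')

print(word_tokenize("he didn't fall? inconceivable!"))
# ['he', 'did', "n't", 'fall', '?', 'inconceivable', '!']
# the "n't" token keeps the contraction, versus the diff above where the apostrophe is split out ("didn ' t")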
@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than compensated for by the increased generalization power of having it all normalized :)! That said, I'm up for further discussion. Meanwhile, how would you propose adding it as an option without also having the NLTK dependency?
@kootenpv I'm fine with people having different preferences on how they'd like those split up; my biggest concern here is changing the interface, which is a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose the tokenizer as an option in the Encoder constructor, though.
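Roughly like this (a hypothetical sketch only: the word_tokenizer keyword is not part of the current Encoder signature, and tok is assumed to expose a word_tokenize function):

from bpe import Encoder        # this package
from tok import word_tokenize  # kootenpv's tok; assumed API for this sketch

corpus = ["vizzini: he didn't fall? inconceivable!"]

# keep NLTK as the default, but accept any callable mapping a string to a list of tokens
encoder = Encoder(vocab_size=200, word_tokenizer=word_tokenize)
encoder.fit(corpus)
print(next(encoder.inverse_transform(encoder.transform(corpus))))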
tok makes for a better default tokenizer (faster and saner).
Targets #10