TokenGazetteer #106
Please give more information: which tokenizer are you using for the document, and which for the gazetteer list? The TokenGazetteer does not match strings but sequences of tokens, so an entry will only match if the token sequence for "A.P.L." is the same in the document and in the gazetteer list.
I'm using the ANNIE tokenizer through GateWorker like this:
and then I use the TokenGazetteer like this:
This works for "Apple" in my list but does not find matches for "Apple#", "Apple+" or "Apple/".
As I said, the TokenGazetteer matches sequences of tokens: the sequence of tokens in your document is whatever the ANNIE tokenizer produces, but the sequence of tokens for each entry in your gazetteer list is (by default) whatever splitting on whitespace produces, and the two differ for words with special characters, punctuation etc. The TokenGazetteer has a tokenizer parameter to specify a tokenizer to use instead of splitting on whitespace, but in your case that cannot be used directly, because your tokenizer is not a Python tokenizer but runs in Java GATE via the worker. So to make this work for you, we need a workaround where all entries in your gazetteer list first get tokenized by the ANNIE tokenizer as well and are then stored and used as an already-tokenized list. I will try to come up with a simple solution for this.
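To illustrate the mismatch without needing a running GATE instance, here is a small pure-Python sketch. The regex is only a rough approximation of what a word/punctuation tokenizer such as ANNIE produces, not its actual rules, and none of this uses the gatenlp API itself:

```python
import re

def annie_like_tokens(text):
    # Rough approximation of a word/punctuation tokenizer: runs of word
    # characters form one token, every other non-space character is a
    # token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def whitespace_tokens(text):
    # The TokenGazetteer's default for list entries: split on whitespace.
    return text.split()

entry = "A.P.L."
# In the document, the tokenizer splits the punctuation apart ...
print(annie_like_tokens(entry))   # ['A', '.', 'P', '.', 'L', '.']
# ... but the gazetteer entry stays a single "token", so the two token
# sequences never line up and the entry cannot match.
print(whitespace_tokens(entry))   # ['A.P.L.']
```

The same happens for "Apple#": the document side yields two tokens, `['Apple', '#']`, while the list side yields one.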
Here is a possible way to do this:

```python
from gatenlp.gateworker import GateWorker
from gatenlp.processing.annotator import Annotator
from gatenlp.processing.gazetteer import TokenGazetteer

class AnnieTokenizer(Annotator):
    def __init__(self, gateworker, tokeniserPR):
        self._gw = gateworker
        self._tok = tokeniserPR
        self._ctrl = gateworker.jvm.gate.Factory.createResource("gate.creole.SerialAnalyserController")
        self._ctrl.add(tokeniserPR)
        self._corpus = gateworker.newCorpus()
        self._ctrl.setCorpus(self._corpus)

    def __call__(self, doc):
        gdoc = self._gw.pdoc2gdoc(doc)
        self._corpus.add(gdoc)
        self._ctrl.execute()
        self._corpus.remove(gdoc)
        tmpdoc = self._gw.gdoc2pdoc(gdoc)
        # make sure we return the SAME document!
        outset = doc.annset()
        for ann in tmpdoc.annset().with_type("Token"):
            outset.add_ann(ann)
        return doc

gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
gdoc = gs.pdoc2gdoc(doc)
gcorp = gs.newCorpus()
gcorp.add(gdoc)
gpipe.setCorpus(gcorp)
gpipe.execute()
anniedoc = gs.gdoc2pdoc(gdoc)
# get the ANNIE tokenizer from the pipeline and wrap it in something
# usable for the token gazetteer
annietok = AnnieTokenizer(gs, gpipe.getPRs()[1])
# create the token gazetteer using the ANNIE tokenizer
# IMPORTANT: this must be done before the gateworker gets closed, as the
# gateworker is needed for creating the gazetteer instance
tgaz = TokenGazetteer(path + ".def", fmt="gate-def", tokenizer=annietok, annset="",
                      all=False, skip=True, outset=outset, outtype=detail)
gs.close()
# the gateworker should not be needed for just running the gazetteer
gazdoc = tgaz(doc)
```
Closing this as it is not a bug. The TokenGazetteer will get refactored and hopefully made easier to use in future versions; see #109
That should have been
Yes, but it still doesn't work. The outset is empty.
I tested the code here and it worked with the version of gatenlp I am using (the error above was not detected because gs was the same as gateworker when I ran it).
I ran `python -m pip install -U gatenlp[all]`, so I guess my version is the latest one.
The version you are using can be easily determined. For testing, please make sure that the tokenizer output is what you expect. As I said, the code I shared works for me, so if it does not work in your case, you need to find out where the problem may lie.
This should show a number of Tokens, e.g. "A", ".", "P", ... and "Apple", "#".
My version is '1.0.5+snapshot'. I tried to install the newer version from the GitHub code, but nothing worked with that version.
'1.0.5+snapshot' is a GitHub development version. Since that version gets updated regularly, it can be useful to update it. I am using the very latest GitHub version here, so it might be a good idea for you to try that one and report any problems, as that version is going to be part of what gets released next anyway.
The problem is that any tokenizer considers characters like . , + - / \ @ # % $ ! as separate Tokens, which is reasonable.
If you have such rather specific requirements, you will probably have to implement your own approach to tokenizing the gazetteer list and/or the document, or not use the token gazetteer at all. From what you have shared about your requirements so far, maybe you can implement your own algorithm to convert the original list of gazetteer strings into lists of tokens for the gazetteer; it would be perfectly valid to generate more than one gazetteer list entry per original string. This is something that often makes sense in other contexts too.
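One possible sketch of that idea: pre-tokenize each gazetteer string with the same kind of word/punctuation split the document tokenizer uses, optionally emitting several entries per string. The function names, the regex, and the "drop trailing punctuation" variant are all illustrative assumptions, not gatenlp API:

```python
import re

def tokenize_entry(entry):
    # Approximation of a word/punctuation tokenizer: keep runs of word
    # characters together, emit every other non-space character alone.
    return re.findall(r"\w+|[^\w\s]", entry)

def build_token_entries(strings):
    # Turn each original gazetteer string into a list of tokens;
    # generating several variants per original string is perfectly valid.
    entries = []
    for s in strings:
        toks = tokenize_entry(s)
        entries.append(toks)
        # Hypothetical variant: also match the entry without a trailing
        # punctuation token, e.g. both "A.P.L." and "A.P.L".
        if len(toks) > 1 and not toks[-1].isalnum():
            entries.append(toks[:-1])
    return entries

print(build_token_entries(["Apple", "A.P.L.", "App#"]))
# [['Apple'], ['A', '.', 'P', '.', 'L', '.'], ['A', '.', 'P', '.', 'L'],
#  ['App', '#'], ['App']]
```

The resulting token lists would then be fed to whatever gazetteer mechanism consumes pre-tokenized entries, so that both sides of the match use the same token sequences.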
The TokenGazetteer does not annotate strings from the .lst file which include characters like #%.,/-+*@!
For example, if my .lst file is:
```
Apple
A.P.L
App#
```
the TokenGazetteer only annotates Apple in the document and ignores A.P.L and App#.