Distributional Semantics #27

Open
k0105 opened this issue Jan 11, 2016 · 13 comments

Comments

@k0105
Contributor

k0105 commented Jan 11, 2016

Background

So, I've suggested integrating Distributional Semantics into the YodaQA pipeline by using JoBim Text, a framework developed by TU Darmstadt (in Germany) and IBM that is also used for domain adaptation in Watson. It provides a way to acquire domain knowledge in an unsupervised way. For instance, UMLS, a huge ontology in medicine and frankly one of the most extensive ontologies I've worked with, covers most concepts in medicine, but still misses quite a few relations. With distributional semantics this ontology can be "completed": the need for knowledge engineering is heavily reduced, and it even scales better than conventional triple stores.

Work already done

I've looked into JBT and can now generate models for my own corpora, which lets me compute similar terms, contexts and even labelled sense clusters based on Hearst patterns found by applying UIMA Ruta. Furthermore, Dr. Riedl, one of the main developers, has kindly agreed to provide their Wikipedia Stanford model, which saved us a lot of computation time. Additionally, we could use a web service they offer, which also features a Wikipedia trigram model.

Example

Let's for instance look up the word "exceptionally": The framework recognizes that "exceptionally#RB" is similar to terms like "extremely#RB", "extraordinarily#RB", "incredibly#RB", "exceedingly#RB", "remarkably#RB" etc. It can provide accurate counts for these terms and it can provide context to e.g. distinguish "cold", the disease, from "cold", the sensation. And finally we can group these interpretations, e.g. the trigram output actually distinguishes the meaning of the term as in "[extremely, unusually, incredibly, extraordinarily, ..." from the meaning "[attractive, intelligent, elegant, ...", which is quite clever imho.
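
To make the notation concrete: the terms in these examples carry a part-of-speech tag after a "#". Here is a minimal Java sketch for splitting such a term, assuming the "word#POS" shape seen above. JbtTerm is a hypothetical helper for illustration, not a class from JBT or YodaQA.

```java
/** Hypothetical holder for a JBT-style term such as "exceptionally#RB". */
final class JbtTerm {
    final String lemma;
    final String pos; // empty string when the term has no '#' suffix

    JbtTerm(String lemma, String pos) {
        this.lemma = lemma;
        this.pos = pos;
    }

    /** Splits at the last '#', so everything before it is kept as the lemma. */
    static JbtTerm parse(String raw) {
        int idx = raw.lastIndexOf('#');
        if (idx < 0) {
            return new JbtTerm(raw, "");
        }
        return new JbtTerm(raw.substring(0, idx), raw.substring(idx + 1));
    }
}
```

With this, JbtTerm.parse("exceptionally#RB") yields the lemma "exceptionally" and the tag "RB".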

What now?

So, there are a couple of things JBT can be used for. The most prominent example is TyCor: Expand the concept to infer type constraints and match those to the LAT. That's why I already asked whether it makes sense to add the functionality to cz.brmlab.yodaqa.analysis.tycor.LATMatchTyCor.java first.

But even more important than one particular use case might be to make JBT generally available to the pipeline. When we discussed scaling Yoda here: #21 (comment), Petr mentioned that he strives to encapsulate computationally intensive tasks behind REST interfaces (essentially microservices, I guess). Watson uses distributional semantics all over its pipeline, and some benefits might only become visible once the pipeline is extended to domain knowledge. Hence, I suggest making JBT available as another data backend just like Freebase, DBPedia and enwiki, before using it in any particular stage of the pipeline. We can then try using it in various places and see where we obtain better results. I would also write a detailed README, so people get up to speed quickly.

I started this thread a) to track progress and b) to ask for comments. Does anyone have additional ideas where or how to use JBT in YodaQA? Do my ideas make sense or can you think of a better approach?

Best wishes,
Joe

@pasky
Member

pasky commented Jan 11, 2016

Awesome, thanks a lot for starting this as a github issue now. We certainly have a lot to talk about here, let me try to sort that out a little:

JBT Provider

We need to implement a JBT RESTful microservice that does the heavy lifting, keeps stuff loaded in memory etc. I guess this should at least partially exist within JBT already, as they offer a web interface; if we can just reuse that, it's the best option. In cz.brmlab.yodaqa.provider, we'll probably need a subpackage .jbt or something with classes that provide access to this for the rest of YodaQA, possibly with some caching later on.

Overall, this should be reasonably trivial? It's fine by me to just implement whatever we need for initial usage within the pipeline; we don't have to cover all the features at once (that might not even be 100% desirable from a dead code perspective).
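
A rough sketch of what such a provider class with caching could look like. Everything here is an assumption for illustration: the class name, the similarTerms method, and the Function-based backend, which stands in for the actual REST call to the JBT service.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical provider: wraps a slow lookup (e.g. a REST call) with an in-memory cache. */
final class CachingJbtProvider {
    // Stand-in for the remote JBT lookup; the real endpoint is not modelled here.
    private final Function<String, List<String>> backend;
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    CachingJbtProvider(Function<String, List<String>> backend) {
        this.backend = backend;
    }

    /** Returns similar terms, asking the backend only on the first request per term. */
    List<String> similarTerms(String term) {
        return cache.computeIfAbsent(term, backend);
    }
}
```

Keeping the backend behind a plain function means the pipeline code does not care whether the terms come from a local service, the TU Darmstadt web service, or a test stub.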

JBT for LATs

Equally importantly, we want to scout for ways to use JBT in YodaQA. I completely agree with you that using JBT for smarter type coercion is the best first application! It should be easy to do, it's a pretty well-defined task and could have a nice impact.

I'll elaborate in a followup comment.

JBT - other usages

Other ideas for using JBT, sorted roughly by difficulty I guess:

  • Clue matching in properties - we now use the "propsel" word embedding based model to check how close a property is to the given LAT, it might be interesting to compare that to a JBT-based similarity model
  • Context-aware entity linking - if we say "who directed ender's game?", we want to pick "ender's game" the movie, not the book - I think JBT ought to help us with that? We could look at the abstract of the entity ("Ender's Game is a 2013 American science fiction action film based on the novel"); we already do that now for a naive word embedding based classifier
  • Query expansion - generate similar expressions to query esp. via solr ("What city in Australia has rain forests" we want to also look for "town" and "rainforest" or maybe "jungle" as alternatives)
  • Clue matching in sentences - for picking answer-yielding sentences from fulltext solr results, we now require literal occurrences of the clues, without allowing even for synonyms; we also use the in-sentence relationship between the answer and the matched clue as a feature for scoring answers. We could use JBT similarities, but maybe also contexts, to spot non-literal occurrences.
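
To illustrate the query expansion idea from the list above, here is a minimal sketch that turns clues plus a similarity lookup into a Lucene/Solr-style boolean query string. The class and method names are hypothetical, and the similarity map is hand-filled for the example rather than real JBT output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ORs each clue with a few similar terms, ANDs the clues together. */
final class QueryExpansionSketch {
    static String expand(List<String> clues, Map<String, List<String>> similar, int maxAlternatives) {
        List<String> parts = new ArrayList<>();
        for (String clue : clues) {
            List<String> alts = similar.getOrDefault(clue, Collections.<String>emptyList());
            List<String> group = new ArrayList<>();
            group.add(quote(clue));
            // Keep only the first few alternatives so the query stays small.
            for (int i = 0; i < alts.size() && i < maxAlternatives; i++) {
                group.add(quote(alts.get(i)));
            }
            parts.add(group.size() == 1 ? group.get(0) : "(" + String.join(" OR ", group) + ")");
        }
        return String.join(" AND ", parts);
    }

    // Wraps a term in double quotes, escaping any embedded quotes.
    private static String quote(String s) {
        return "\"" + s.replace("\"", "\\\"") + "\"";
    }
}
```

For the clues "city" and "Australia", with "town" registered as similar to "city", this produces ("city" OR "town") AND "Australia". Whether alternatives should be weighted lower than the literal clue is left open here.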

There are surely some much more sophisticated uses for JBT; these are just initial ideas, as I don't have a lot of experience with it yet.

@pasky
Member

pasky commented Jan 11, 2016

JBT for LATs

So, to recapitulate how LAT tycor works:

  • A bunch of LATs are generated from the question, and generalized question LATs (wordnet hypernymy based) are generated from these by the tycor.LATByWordnet annotator
  • A bunch of LATs are generated from the answer, and generalized answer LATs (wordnet hypernymy based) are generated from these by the tycor.LATByWordnet annotator
  • The generalization level of an LAT is tracked in the "specificity" attribute of the LAT
  • In tycor.LATMatchTyCor, the most specific (least general) match between question and answer LATs is sought
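
The matching step in the last bullet can be sketched roughly like this. It is not the actual LATMatchTyCor code: LATs are reduced to plain strings, and the specificity convention (0 for a LAT taken directly from the text, lower values for each generalization step, summed over both sides) is an assumption made for the example.

```java
import java.util.Map;

/** Hypothetical sketch of "find the most specific LAT shared by question and answer". */
final class LatMatchSketch {
    static final class Match {
        final String lat;
        final double specificity; // combined question + answer specificity

        Match(String lat, double specificity) {
            this.lat = lat;
            this.specificity = specificity;
        }
    }

    /** Returns the shared LAT with the highest combined specificity, or null if none is shared. */
    static Match bestMatch(Map<String, Double> questionLats, Map<String, Double> answerLats) {
        Match best = null;
        for (Map.Entry<String, Double> q : questionLats.entrySet()) {
            Double answerSpecificity = answerLats.get(q.getKey());
            if (answerSpecificity == null) {
                continue; // this LAT does not occur on the answer side
            }
            double combined = q.getValue() + answerSpecificity;
            if (best == null || combined > best.specificity) {
                best = new Match(q.getKey(), combined);
            }
        }
        return best;
    }
}
```

For a question LAT "novelist" generalized to "writer" and "person", and an answer LAT "poet" generalized the same way, the sketch picks "writer" over "person" because it needs fewer generalization steps.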

Now, the most obvious+easy way to add JBT to the mix is to create and employ an analog of tycor.LATByWordnet that would add the generalized LATs based on JBT. This should be quite straightforward, I guess.

However, of course there may be a lot of generalized LATs generated by JBT, quite a lot more than from Wordnet. (It'd be interesting to see how many we get for, say, nouns like "novelist" or "microplanet". What if we take the top 5?) In that case, we would want to instead create an analog to LATMatchTyCor that just looks at the LATs and internally crosschecks them without saving the full lists.

On the other hand, I don't really see a big harm in storing even 100 LATs in the CandidateAnswerCAS; if it stirs up some trouble, we can improve on that later. So maybe we could just prototype by implementing LATByJBT as an analog to LATByWordnet and adding it to the pipeline in the same places, and we should immediately see some action?
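
As a starting point for that prototype, the core of a LATByJBT-style step might look like the sketch below. It works on plain strings instead of UIMA annotations, takes the ranked JBT similar terms as an input list, and lowers specificity by one for every added term; all of these are simplifying assumptions, not existing YodaQA behavior.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical core of a LATByJBT annotator: add the top-N similar terms as generalized LATs. */
final class LatByJbtSketch {
    /**
     * @param lat         the original LAT text
     * @param specificity specificity of the original LAT
     * @param ranked      similar terms, best first (e.g. as returned by a JBT lookup)
     * @param topN        how many similar terms to keep at most
     * @return the original LAT plus up to topN generalized LATs, in insertion order
     */
    static Map<String, Double> generalize(String lat, double specificity, List<String> ranked, int topN) {
        Map<String, Double> out = new LinkedHashMap<>();
        out.put(lat, specificity);
        for (String term : ranked) {
            if (out.size() > topN) {
                break; // original LAT + topN additions already collected
            }
            // putIfAbsent keeps the original entry if JBT returns the LAT itself.
            out.putIfAbsent(term, specificity - 1.0);
        }
        return out;
    }
}
```

The topN parameter is the knob for the "what if we take top 5?" experiment; the resulting map could then be fed to the same matching logic as the Wordnet-based LATs.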

@k0105
Contributor Author

k0105 commented Jan 11, 2016

Thank you very much for your replies. As a result of our conversation: I will build a JBT REST interface that is also able to use TU Darmstadt's web service and document it in such a way that people can easily apply it to custom corpora. I will also have to finish a classifier and a new rule-based system for my project, but you can expect results in one to two weeks.

Hopefully right after that I will pick up your idea of LATByJBT (that I like a lot) to try an example. I will have to see how much time I have left - right now it seems like I can work 2 weeks on this in total. [Worst case: I have to take a break to finish my thesis and continue 6 weeks later. But the backend is definitely done before that so you could play with JBT in the meantime if you want.]

I'll report back as soon as the backend is done and let you know about my schedule for the remaining work.

@pasky
Member

pasky commented Jan 11, 2016 via email

@k0105
Contributor Author

k0105 commented Jan 13, 2016

Just for the record: I just sent you a prototype of the REST interface.

I will now focus on writing a paper (primarily - feel free to contact me any time), which should take approximately 4 weeks (with some related work) and after that I will be back on scaling and distributional semantics.

@vineetk1

JBT looks promising, and I would start using it when it is generally available for YodaQA. What will the data backend consist of? Will it have the Stanford Wikipedia model? Will it also have the Trigram model?

@k0105
Contributor Author

k0105 commented Jan 18, 2016

It already supports both. We are just discussing a minor detail about the return values, but the backend should be available very soon.

Update: The functionality is done, but we agreed on providing JSON return values as well. Since I just killed the development system, I'll have to set up my databases again (good way to verify the instructions), add this and then you'll find a dedicated brmson repository for the JoBim Text backend, probably by the end of this week. I'll post another reply then, so you'll get notified when it's done.

Update 2: I've just sent Petr the updated version of the REST service [18.1.'16, 20:20].

@k0105
Contributor Author

k0105 commented Jan 18, 2016

Note for later: #30 should have synergies.

@pasky
Member

pasky commented Jan 18, 2016

I haven't reviewed the code in detail or set up the endpoint yet (maybe I'll have to swap mysql/mariadb for sqlite along the way, maybe not), but to keep the momentum, I've already pushed this out as https://github.com/brmson/jobimservice! Thanks a lot for contributing this.

@k0105
Contributor Author

k0105 commented Jan 19, 2016

I don't think replacing MySQL with SQLite is feasible for several reasons:

  • This might complicate collaboration with JBT, since they primarily use MySQL and DCA (might not be too important, see PS).
  • MySQL/MariaDB has significant advantages for us, because it features network access and (multi-client!) concurrency models that an embedded DB like SQLite cannot offer. The publicly available models total about 200 GB; even if you're only interested in Wikipedia, that's about 7 GB, and embedding that kind of data is not really desirable imho.
  • There are a ton of minor features that I wouldn't regard as decisive for us, but that matter when considered in total: Partitioning (incl. sharding), replication, better memory usage for large datasets, availability of commercial support etc.
  • And the setup of MariaDB doesn't mean too much overhead: sudo apt-get install mariadb-server-10.0 on Debian or yum install mariadb-server mariadb followed by systemctl start mariadb and systemctl enable mariadb on CentOS and you're done.

PS:
Dr. Martin Riedl has pointed out that it is fairly easy to switch the API to databases that can be used via JDBC by simply adapting the SQL commands in the configuration. I can confirm this - I had to change some commands to switch from MySQL to MariaDB and that was - as expected - trivial. From a technical perspective it's easy to do, one just has to be sure that this is really the right solution. If anyone goes over these points and spots no problem for his/her scenario, switching to SQLite should be straightforward.
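
To show the general idea of keeping SQL in configuration so that only the config changes between databases, here is a small sketch using java.util.Properties and plain JDBC. This is not JBT's actual configuration format: the key name similarTerms, the class name, and any table or column names in the SQL are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch: SQL statements live in configuration, code only refers to them by name. */
final class ConfiguredQueries {
    private final Properties sql = new Properties();

    /** Loads name=statement pairs, e.g. one such text per database dialect. */
    ConfiguredQueries(String propertiesText) {
        try {
            sql.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String statementFor(String name) {
        String statement = sql.getProperty(name);
        if (statement == null) {
            throw new IllegalArgumentException("No SQL configured for " + name);
        }
        return statement;
    }

    /** Runs the configured "similarTerms" statement; expects one '?' parameter and terms in column 1. */
    List<String> similarTerms(Connection conn, String term) throws SQLException {
        List<String> out = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(statementFor("similarTerms"))) {
            ps.setString(1, term);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.add(rs.getString(1));
                }
            }
        }
        return out;
    }
}
```

Switching from MySQL/MariaDB to another JDBC database would then mean supplying a different properties text and JDBC URL, while the Java code stays the same.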

@pasky
Member

pasky commented Jan 20, 2016

For me, the motivation is that I already have a bunch of things in my existing MySQL instance and prefer all the YodaQA-related databases running standalone and on an SSD-backed store. But it's no big deal, and all your points make sense too! So I'll just import it into my MySQL/MariaDB instance, let's see how that goes.

@vineetk1

vineetk1 commented Mar 9, 2016

@jbauer180266 Thanks for your help in installing JoBim.
I have written a wiki page on How to install JoBimText

@k0105
Contributor Author

k0105 commented Jun 14, 2016

I should point out that my ensemble for distributional semantics now supports GloVe and word2vec besides JoBim Text. I won't have time to play with integrating it into Yoda for the next 2.5 months, but the functionality is there. Should make a nice paper. Hence, I will likely do it one day, but please feel free to steal it from me. If someone starts this before me, just please let me know so we don't duplicate efforts.
