Neural supertagging #54
Replies: 18 comments
-
Hey Olga! Exciting news; I will be interested to see what you do. In terms of code, anything I have would be more than 10 years old... Even if I can find it, I don't think you'd be able to make it run, since it would have depended on PET and the ERG as they were at the time. Most of what I did, though, did get merged into the ERG and PET, and as far as I know is still there?
Bec
-
Nice @olzama! Maybe at some point we can discuss how to adapt the Portuguese grammar to use supertagging, since we are now at a stage where such an adaptation should not break too many things.
-
You know, @becdridan, to be honest, anything you might have would be helpful, even if it doesn't just run magically after 10 years of not being touched. Of course, if you can't find it at all, then it's another matter! By the way, I was able to open your thesis the other day, but now I cannot. Do you know if the site is coming back?..
-
@danflick Could you comment on this? To use the ERG with supertagging, do I need a special command, or is it used by default now? (I am assuming not?)
-
Olga, it's great to hear that you'll be able to work on supertagging-inspired improvements in parsing efficiency. Both PET and ACE support the ubertagging additions to the ERG, and the wiki page UtTop explains how to invoke it with PET. The ACE parser uses the same trained model, but ubertagging is disabled in the standard release. To enable it, uncomment the following four lines in erg/ace/config.tdl, and then compile the grammar as usual:
;übertag-emission-path := "../ut/nanc_wsj_redwoods_noaffix.ex.gz".
;übertag-transition-path := "../ut/nanc_wsj_redwoods_noaffix.tx.gz".
;übertag-generic-map-path := "../ut/generics.cfg".
;übertag-whitelist-path := "../ut/whitelist.cfg".
The available model is sadly no longer so useful, since it was trained on an old version of the ERG (1214), and ubertagging is especially sensitive to grammar changes. In particular, the 2020 version of the grammar is likely to do badly with that model, since punctuation marks are now treated as separate tokens, but were not when that model was trained. As you'll know from Bec's thesis, training data is created by using the ERG to parse a relatively large corpus such as NANC (350 million words: https://catalog.ldc.upenn.edu/LDC95T21). This is somewhat resource-intensive, so has not been updated for either the 2018 or 2020 versions of the ERG. But if you want to start with a good baseline for comparison with any new methods you develop, it would be nice to update the training data and the model with the 2020 ERG.
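For anyone who wants to try this, a minimal sketch of that workflow using pyDelphin's ACE bindings might look as follows; the file paths and the test sentence are placeholders, and this assumes ACE is installed and the four übertag-* lines above have been uncommented:
# Sketch only: recompile the grammar so the ubertagging settings take
# effect, then parse as usual. Paths are placeholders; adjust them to
# your own ERG checkout.
from delphin import ace

ace.compile('erg/ace/config.tdl', 'erg-ut.dat')

# With the model enabled, ACE prunes unlikely lexical analyses
# before full parsing begins.
with ace.ACEParser('erg-ut.dat') as parser:
    response = parser.interact('Abrams arrived.')
    print(len(response['results']), 'reading(s)')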
-
Thank you, @danflick! I have a question: aren't "ubertagging" and "supertagging" different?.. I thought ubertagging had to do with MWEs, while supertagging was the general approach to reduce the search space?
-
You're right, Olga, that supertagging and ubertagging are distinct, but both have as their aim the reduction of the search space in parsing. Here is a helpful discussion of how they differ:
http://moin.delph-in.net/wiki/WeSearch/UberTagging
(I don't know why this old moin wiki WeSearch sub-page did not get migrated to the Github DELPH-IN wiki.)
Dan
-
It looks like that page was protected on the old wiki ... is there a good reason for that, or should it be migrated? (I can't see it; must have been UiO internal.)
-
Okay, here are the brief but helpful contents of that page, authored by Bec back in 2011:
Background
Among project members, we have half-jokingly started to use the term ubertagging to refer to the type of lexical disambiguation we want to investigate. The two key characteristics that (arguably) distinguish uebertagging from more 'classic' PoS and supertagging are (a) formalizing the tagging problem over a lattice (rather than a plain sequence) of observations; and (b) the use of very fine-grained morpho-syntactic lexical categories, applied to complete so-called lexical items (in the sense of Dridan, 2009), i.e. partial derivations that include the application of lexical rules.
Abstract Problem Definition
Imagine a lattice composed of four vertices, numbered 1 through 4, and four edges (A, B, D, and E), arranged as follows:
[1-2] A
[1-3] B
[2-4] D
[3-4] E
For our application to parsing with the ERG, the edges in the lattice will correspond to what we call lexical tokens (please see the ErgTokenization page, http://moin.delph-in.net/wiki/ErgTokenization, for details, in particular for the reasons why it is desirable to assume a lattice of 'observations' to be uebertagged).
Even without thinking about category assignments to the edges in the lattice, the above example already gives us a segmentation task, i.e. picking the most probable full path through the lattice.
Furthermore, assume an inventory of categories C1, C2, ..., Cn, which we want to attach as labels to the edges of the lattice. In application to parsing with the ERG, n will be on the order of several thousands, and there will typically be constraints on which categories can be attached to which edges.
What we would like to derive is a probabilistic model over the above problem space, to jointly solve the segmentation and labelling problem.
Possibly Relevant References
* Soong, F.K. and Huang, E.F. (1991). A tree-trellis based fast search for finding the N-best sentence hypotheses in continuous speech recognition. International Conference on Acoustics, Speech, and Signal Processing. doi:10.1109/ICASSP.1991.150437
* Nagata, M. (1994). A stochastic Japanese morphological analyzer using a forward-DP backward-A* N-best search algorithm. Proceedings of the 15th Conference on Computational Linguistics. http://acl.ldc.upenn.edu/C/C94/C94-1032.pdf
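To make the joint segmentation and labelling problem concrete, here is a toy sketch (not from the original page; the candidate categories and scores are invented) that enumerates labelled paths through the four-edge example lattice above and picks the highest-scoring one. A real implementation would use dynamic programming over the lattice rather than brute-force enumeration, which matters once n is in the thousands:
from itertools import product

# The example lattice: each edge maps to its (source, target) vertices.
edges = {'A': (1, 2), 'B': (1, 3), 'D': (2, 4), 'E': (3, 4)}
# Invented candidate categories and scores; in practice scores would
# come from a trained model over a much larger category inventory.
candidates = {'A': ['C1', 'C2'], 'B': ['C1'], 'D': ['C3'], 'E': ['C2', 'C3']}
score = {('A', 'C1'): 0.6, ('A', 'C2'): 0.4, ('B', 'C1'): 0.7,
         ('D', 'C3'): 0.9, ('E', 'C2'): 0.5, ('E', 'C3'): 0.5}

def paths(v, goal=4):
    # Enumerate all edge sequences from vertex v to the goal vertex.
    if v == goal:
        yield []
    for name, (src, tgt) in edges.items():
        if src == v:
            for rest in paths(tgt, goal):
                yield [name] + rest

# Jointly choose a segmentation (a path) and a labelling (categories).
best = max(((path, labels)
            for path in paths(1)
            for labels in product(*(candidates[e] for e in path))),
           key=lambda pl: sum(score[e, c] for e, c in zip(*pl)))
print(best)  # (['A', 'D'], ('C1', 'C3')) with these toy scores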
-
A recent pre-print on neural supertagging for CCG parsing: https://arxiv.org/pdf/2109.10044.pdf
-
So, @becdridan's thesis is about supertagging, right? And the old model integrated into ACE is ubertagging? This is what confuses me. Does it make sense for me to think about supertagging at all, or should I skip that and concern myself with ubertagging, if I want a meaningful baseline?
-
@olzama, from Bec's comments which Dan shared above, ubertagging is supertagging over a lattice, with a particularly fine-grained kind of tag. The second point is about what kind of supertags we want (but this is an important question for any kind of supertagging). The first point means: ubertagging = supertagging + segmentation. She also explains it that way in her 2013 paper. So you can't really "skip" supertagging, because it's an integral part of ubertagging.
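Concretely, in terms of the toy lattice sketch after Dan's wiki excerpt above (again a hypothetical illustration), classic supertagging falls out as the degenerate case where the lattice is a plain chain:
# One edge per token, hence a single path through the lattice:
# segmentation is trivial and only the labelling decision remains.
edges = {'w1': (1, 2), 'w2': (2, 3), 'w3': (3, 4)}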
-
thanks to dan for scraping that problem specification from the old wiki (this page had started out project-internally, and regrettably never was made public later on). i wrote down that specification back then in the light of multiple in-depth discussions involving many people inside DELPH-IN, including during a couple of mini-summits. re-reading it today, i believe it remains a precise characterization of the problem we actually need to solve. that wiki page was the starting point for the solution bec built (originally in PET) and documented in her EMNLP paper. it would be very interesting to see how neural architectures lend themselves to the underlying problem. as @guyemerson observes, ubertagging is joint segmentation and sequence labeling. but, furthermore, the solution space is highly constrained by the lattice that originates from lexical parsing. we are essentially looking for the best path through that lattice, or an n-best list of such paths. for experimentation in the constrained space, back then we put an undocumented option into [incr tsdb()] to generate the lattice in the form of lexical profiles, essentially a dump of the chart that results from parsing with only lexical rules. if need be, i could probably recover the necessary incantations to invoke that process.
-
I think what I meant to ask is: would it make any sense for me to focus on supertagging and not ubertagging? At the moment, I am trying to understand what to start with. The goal is better parsing, and I think what I am hearing is that ultimately our state of the art is ubertagging, not just supertagging. So that's what needs to be improved. Is that right?
-
@danflick That's what I intend to do (redo the baseline with the 2020 ERG). In her thesis, Bec just says: "The treebanks come from the 0902 release of the ERG". You said "a relatively large corpus such as NANC"; so was it NANC, or was it something else? I suppose if I am redoing the baseline, then there is no requirement to use the exact same corpus as Bec used. I wonder what criteria I should have in mind (apart from corpus size) when choosing one. It would be good to have something with mixed domains, I think (because presumably the ERG handles varied domains better than a statistical parser trained on a single domain)?
-
Could any of the @delph-in/participants comment on my post above, please? Or if my post isn't clear, let me know. Thanks!
-
I guess you already know about the relevant DELPH-IN wiki pages describing Rebecca Dridan's and Stephan Oepen's work on augmented supertagging within the WeSearch project at the University of Oslo (see WeScience and especially WeSearch_SuperTagging_Setup). You'll also find potentially useful scripts in the LOGON code repository (see LogonTop and LogonInstallation), for example in logon/uio/titan/README (see the section on ubertagger training). In working with the ERG, the ErgTokenization page may also be helpful. Good benchmarks are provided in Dridan's EMNLP 2013 paper (www.aclweb.org/anthology/D13-1120.pdf) and in her dissertation.
-
Oh yes, section 4 of the EMNLP paper it is! And in the screenshot above (that one is from Bec's dissertation), it looks like the "Tourism data" refers to the "LOGON tourism data that has been used for training (Ytrestøl et al. 2009)". That is not explained up front, but I was able to find the info by searching the whole document for mentions of "tourism". Thanks, @danflick!
-
@becdridan @oepen
It looks like I will undertake a project related to supertagging for HPSG in the near future (looks like I get to make this my main postdoc project! yay).
Do you have any repositories, code, etc. which I could use for a baseline?
Thank you!