Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with lexicon.randomWord("nns") #409

Closed
cqx931 opened this issue Apr 18, 2017 · 24 comments
Closed

Issues with lexicon.randomWord("nns") #409

cqx931 opened this issue Apr 18, 2017 · 24 comments
Assignees
Milestone

Comments

@cqx931
Copy link
Collaborator

cqx931 commented Apr 18, 2017

See also #408, #487

Proposed message:
All plural nouns have been removed from the latest rita dictionary. To random generate plural nouns, please use RiTa.pluralize(lexicon.randomWord("nn")) instead.

@cqx931 cqx931 self-assigned this Apr 18, 2017
@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

Why don't we just do this in the code?

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

hmm, yes, that sounds better.

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

We should also check which other parts-of-speech cause problems

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

Currently, only 'nns' returns zero result, as we remove them completely.
Other parts-of-speech that could be considerate for future changes are
vbn, vbd, vbg

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

Are you checking the full list? When I run the code below (in RiPos.java), I get other errors.
We may also need to update the POS list online and in Java

  public static void main(String[] args) {
    RiLexicon rl = new RiLexicon();
    for (int i = 0; i < PENN_TAGS.length; i++) {
      System.out.print(PENN_TAGS[i].toString()+": ");
      System.out.println(rl.randomWord(PENN_TAGS[i].toString()));
    }
  }
[INFO] RiTa.version [##version##]
[INFO] Loaded 28084(0) lexicon in 312 ms
cc: but
cd: one
dt: this
ex: there
fw: Exception in thread "main" java.util.NoSuchElementException: 0
	at rita.support.RandomIterator.next(RandomIterator.java:58)
	at rita.support.RandomIterator.next(RandomIterator.java:1)
	at rita.RiLexicon.randomWordByLength(RiLexicon.java:233)
	at rita.RiLexicon.randomWord(RiLexicon.java:170)
	at rita.support.RiPos.main(RiPos.java:207)

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

I believe these can also be removed from PosTagger as they are no longer used (at least from JS):

  N: ['n', 'NOUN_KEY'],
  V: ['v', 'VERB_KEY'],
  R: ['r', 'ADVERB_KEY'],
  A: ['a', 'ADJECTIVE_KEY'],

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

That is because some of the PENN tag is not in rita dictionary at all

fw: Foreign word
pdt: Predeterminer
pos: Possessive ending
rp: Particle
sym: Symbol

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

Isn't n, v, r, a for simplified WordNet tag set?

https://rednoise.org/rita/reference/RiTa/RiTa.getPosTags/index.php

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

yes, however, i dont think the constants are used anywhere, are they (at least not in js)?

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

Yes, they could be removed from PosTagger.
However, currently lexicon.randomWord("n") doesn't work.
If we support simplified WordNet tag in getPosTags, won't it also make sense to support WordNet tags for randomWord?

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

I think we don't need to support randomWord for the following three penn Tags, they are rare cases, and it doesn't make much sense to support them anyway.
fw: Foreign word
pos: Possessive ending ('s)
sym: Symbol

For rp - particle I think we can add the tag to a list of words in the dictionary:

Ex:

aboard about across along apart around aside at away back before behind by crop down ever fast for forth from go high i.e. in into just later low more off on open out over per pie raising start teeth that through under unto up up-pp upon whole with you
Reference: http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html
(Some of the words on this list are rather strange to me...Ex: teeth, you...)

I'm not sure what to do with predeterminer though, any thought on this?
pdt: Predeterminer

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

If we support simplified WordNet tag in getPosTags, won't it also make sense to support WordNet tags for randomWord?

Yes, I suppose so (though we can still delete the constants). Can you add a low-priority ticket for this to Java/JS?

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

I think we don't need to support randomWord for the following three penn Tags, they are rare cases, and it doesn't make much sense to support them anyway.
fw: Foreign word
pos: Possessive ending ('s)
sym: Symbol

Agreed, but we need to catch the error and output a message: 'No words in the lexicon with this POS'. Not that we also need to consider the 1000-word lexicon, where this might happen more often.

For rp - particle I think we can add the tag to a list of words in the dictionary:

Agreed. Lets use these:
aboard about across along apart around aside at away back before behind by down ever fast for from go high in into just later low more off on open out over per start that through under unto up upon with

I'm not sure what to do with predeterminer though, any thought on this?
pdt: Predeterminer

This is part of a phrase, so not relevant for a word. Should be treated as the first case above and get a warning (anytime we dont find the POS requested, should be this same warning)

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

I think we can also delete all the PosTagger Constants in RitaJS?

  CC: ['cc', 'Coordinating conjunction'],
  CD: ['cd', 'Cardinal number'],
  DT: ['dt', 'Determiner'],
  EX: ['ex', 'Existential there'],
  FW: ['fw', 'Foreign word'],
  IN: ['in', 'Preposition or subordinating conjunction'],
  JJ: ['jj', 'Adjective'],
  JJR: ['jjr', 'Adjective, comparative'],
  JJS: ['jjs', 'Adjective, superlative'],
  LS: ['ls', 'List item marker'],
  MD: ['md', 'Modal'],
  NN: ['nn', 'Noun, singular or mass'],
  NNS: ['nns', 'Noun, plural'],
  NNP: ['nnp', 'Proper noun, singular'],
  NNPS: ['nnps', 'Proper noun, plural'],
  PDT: ['pdt', 'Predeterminer'],
  POS: ['pos', 'Possessive ending'],
  PRP: ['prp', 'Personal pronoun'],
  PRP$: ['prp$', 'Possessive pronoun (prolog version PRP-S)'],
  RB: ['rb', 'Adverb'],
  RBR: ['rbr', 'Adverb, comparative'],
  RBS: ['rbs', 'Adverb, superlative'],
  RP: ['rp', 'Particle'],
  SYM: ['sym', 'Symbol'],
  TO: ['to', 'to'],
  UH: ['uh', 'Interjection'],
  VB: ['vb', 'Verb, base form'],
  VBD: ['vbd', 'Verb, past tense'],
  VBG: ['vbg', 'Verb, gerund or present participle'],
  VBN: ['vbn', 'Verb, past participle'],
  VBP: ['vbp', 'Verb, non-3rd person singular present'],
  VBZ: ['vbz', 'Verb, 3rd person singular present'],
  WDT: ['wdt', 'Wh-determiner'],
  WP: ['wp', 'Wh-pronoun'],
  WP$: ['wp$', 'Possessive wh-pronoun (prolog version WP-S)'],
  WRB: ['wrb', 'Wh-adverb'],

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

When I check randomWord, we actually have rp and pdt in the dictionary, but because we are only checking the first posTag from the posTag array of a word, rp and pdt are never identified.

Is there any particular reason that we only check the first posTag?

If I change the code to check the whole posTagArray instead of the first posTag only, it can pass the test for rp and pdt, however, it will fail this test, which makes sure that the matching tag for randomWord is always the first posTag from the array...

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

I think we can also delete all the PosTagger Constants in RitaJS?

It doesn't break anything after deleting them, so I think, yes...

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

Ok. I will make these changes in java...

dhowe pushed a commit that referenced this issue Apr 18, 2017
dhowe pushed a commit that referenced this issue Apr 18, 2017
@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

When I check randomWord, we actually have rp and pdt in the dictionary, but because we are only checking the first posTag from the posTag array of a word, rp and pdt are never identified.

Is there any particular reason that we only check the first posTag?

If I change the code to check the whole posTagArray instead of the first posTag only, it can pass the test for rp and pdt, however, it will fail this test, which makes sure that the matching tag for randomWord is always the first posTag from the array...

Lets leave as is (we want words that are clearly the pos requested), and just issue the warning for those cases (rp, pdt)

@dhowe
Copy link
Owner

dhowe commented Apr 18, 2017

Note: for getRandomWord() in Java, I throw an Exception with the following message: 'No words with pos='+pos+ ' found'

@cqx931
Copy link
Collaborator Author

cqx931 commented Apr 18, 2017

@cqx931 cqx931 closed this as completed Apr 19, 2017
@isohedral
Copy link

Hi! For the record, I just downloaded Rita for Processing for the first time, and the first thing I tried was getRandomWord( "nns" ). Instead of getting zero results, I get "negroes" over and over again. That's, uh, a bit weird? Needless to say, I'm glad I found this note about using pluralize().

@dhowe
Copy link
Owner

dhowe commented Jan 4, 2018

That is strange. It is quite unlikely that you would get the same result twice, let alone over and over. Can you include your test code?

I just downloaded the current RiTa from NPM, and ran a simple test, which gave me the following output (code follows):

0) wafers
1) enforcements
2) toasters
3) platinums
4) evaluations
5) squabblings
6) monkeys
7) providers
8) screeches
9) riverboats
10) diffusions
11) avengers
12) sacs
13) infinities
14) emigrations
15) banishments
16) dictums
17) shards
18) avails
19) grandeurs
var rita = require('rita');
  
for (var i = 0; i < 20; i++) {
  var rs = rita.randomWord( "nns" );
  console.log(i+') '+rs);
}
  

@isohedral
Copy link

Sorry, I should have specified that this is the Java version, running in Processing 3.3.6. Here's sample code:

import rita.*;

for( int idx = 0; idx < 10; ++idx ) {
  println( RiTa.randomWord( "nns" ) );
}

@dhowe dhowe added this to the Version 1.3 milestone Jan 4, 2018
@dhowe dhowe reopened this Jan 4, 2018
dhowe added a commit that referenced this issue Jan 4, 2018
@dhowe
Copy link
Owner

dhowe commented Jan 4, 2018

Confirmed and fixed in 1479f46. This will be in the next release (v1.3), scheduled for Jan 5, 2018.

@dhowe dhowe changed the title Add warning message for lexicon.randomWord("nns") Issues with lexicon.randomWord("nns") Jan 4, 2018
@dhowe dhowe modified the milestones: Version 1.3, Version 1.4 Jun 16, 2018
@dhowe dhowe closed this as completed Jun 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants