POS Tagger = processes a sequence of words and attaches a part of speech to each word
nltk.pos_tag(tokens) -- expects a list of tokens, not a raw string
- CC = coordinating conjunction
- RB = adverb
- IN = preposition
- NN = noun
- NNP = proper noun
- JJ = adjective
- VBP = present-tense verb
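A quick sketch of the tagger in use (assumes the punkt and averaged_perceptron_tagger NLTK data packages have been downloaded):

import nltk

tokens = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(tokens))
# [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
#  ('completely', 'RB'), ('different', 'JJ')]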
Taggers help when generating similar words, since you want the substitutes to be the same part of speech.
- Tagged tokens are often represented as tuples, e.g. ('dog', 'NN')
- Can find out what types of words are most common in different categories of text (e.g. news)
Nouns = used after a determiner or as the subject of a verb
- the woman who I saw yesterday --> determiner
- the woman sat down --> subject of a verb
Verbs = express a relation involving the referents of one or more noun phrases
- Rome fell --> Simple
- Dot com stocks suddenly fell like a stone --> with modifiers and adjuncts
Adjectives = describe nouns; used as modifiers (e.g. the large pizza) or predicates (e.g. the pizza is large)
Adverbs = modify verbs to specify the time, manner, place, or direction of the event described by the verb (e.g. the stocks fell quickly). They can also modify adjectives (e.g. Mary's teacher was really nice.)
Articles = determiners (the, a)
Modals = (should, may)
Personal pronouns = (she, they)
Searching for 3-word phrases using POS tagging
import nltk
from nltk.corpus import brown

def process(sentence):
    # scan each tagged sentence for verb + "to" + verb trigrams
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
To create a POS-tagger:
- Default tagger: tag everything with the most common POS (NN)
- Use a regular expression tagger
- Use a lookup tagger -- aka check against a previously tagged piece of text (useful for the 100 most common words)
- Unigram tagging -- a trained lookup tagger (uses two data sets: a training set and a test set); see the sketch below
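A minimal sketch chaining these taggers together with backoff (assumes the Brown corpus is downloaded; note that evaluate() has been renamed accuracy() in newer NLTK releases):

import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories='news')
size = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:size], tagged[size:]

default = nltk.DefaultTagger('NN')                 # default-tag everything as NN
regexp = nltk.RegexpTagger([(r'.*ing$', 'VBG'),    # gerunds
                            (r'.*ed$', 'VBD'),     # simple past
                            (r'.*s$', 'NNS')],     # plural nouns
                           backoff=default)
unigram = nltk.UnigramTagger(train_sents, backoff=regexp)  # trained lookup tagger
print(unigram.evaluate(test_sents))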
How to determine the category of a word (in general)?
- Morphological clues (suffixes, prefixes)
- Syntactic clues = the typical contexts in which words occur
- Semantic clues = the meaning of the word
open class = a POS that new words are typically added to (e.g. muggles, n00bs, added to noun class)
closed class = a POS that new words are rarely added to (e.g. above, along, below, between have not changed)
Classifiers: choosing the correct class label for a given input
- e.g. email spam filters
- topic of a news article
- classifying "bank" as a noun or verb
Supervised = when a classifier is built from a training corpus containing the correct label for each input
Steps in creating a classifier:
- Deciding what features are relevant and how to encode those features (this is most of the work in building a good classifier)
- Examine the likelihood ratios = how strongly each feature value favors one label over another in the training set (high ratios mark the most informative features); see the sketch below
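A minimal end-to-end sketch, using the classic name-gender example (assumes the names corpus is downloaded); show_most_informative_features() prints the likelihood ratios:

import random
import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}   # encode a single feature: the name's final letter

labeled = ([(n, 'male') for n in names.words('male.txt')] +
           [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled)
featuresets = [(gender_features(n), g) for (n, g) in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)   # feature values with the highest likelihood ratios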
Classifiers used for document classification (e.g. news, romance, horror)
Joint classifier model = examines a set of related inputs and assigns labels to all of them at once
Sequence classifier model = finds the most likely class label for the first input, then uses that answer to find the best label for the second input, and so on
- Shortcoming: once made, every decision is committed, and each decision influences the next (errors propagate forward)
Hidden Markov Models = assign scores to all possible sequences and then choose the sequence with the highest score (they employ probability distributions, which a greedy sequence classifier does not)
Ways to measure classifiers:
- Accuracy
- Precision = how many of the items we identified were actually relevant
- Recall = how many of the relevant items we identified
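A tiny worked example with hypothetical counts (tp = relevant items identified, fp = irrelevant items identified, fn = relevant items missed):

tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)   # 0.8: of the items we identified, the fraction that was relevant
recall = tp / (tp + fn)      # ~0.67: of the relevant items, the fraction we identified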
Confusion Matrix = table where each cell [i,j] indicates how often label j was predicted when the correct label was i
Cross Validation = perform multiple evaluations on different test sets and combine the scores
- To do this, subdivide the original corpus into N subsets called folds
- For each fold, we train the model using all but that fold, and then test on this fold
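A minimal sketch of 10-fold cross-validation, reusing the unigram tagger from earlier (assumes the Brown corpus is downloaded):

import nltk
from nltk.corpus import brown

sents = list(brown.tagged_sents(categories='news'))
N = 10
fold_size = len(sents) // N
scores = []
for i in range(N):
    test = sents[i * fold_size:(i + 1) * fold_size]               # the held-out fold
    train = sents[:i * fold_size] + sents[(i + 1) * fold_size:]   # all the other folds
    tagger = nltk.UnigramTagger(train)
    scores.append(tagger.evaluate(test))
print(sum(scores) / N)   # combine the per-fold scores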
Decision Trees = simple flowchart that selects labels for input values
- decision nodes = check feature value
- leaf nodes = assign labels
- root node = the flowchart's initial decision
- decision stump = a tree with a single node that classifies inputs based on a single feature
Information gain = how much more organized the input values become when we divide them up using a given feature (calculated via the entropy of their labels: entropy is high if the input values have widely varied labels and low if many inputs share the same label)
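A small sketch of the entropy calculation behind information gain:

import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the label distribution
    probs = [count / len(labels) for count in Counter(labels).values()]
    return -sum(p * math.log(p, 2) for p in probs)

print(entropy(['male', 'male', 'male', 'male']))      # 0.0: every input has the same label
print(entropy(['male', 'female', 'male', 'female']))  # 1.0: labels are maximally varied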
Naive Bayes: every feature gets a say in determining which label should be assigned to a given input.
- The classifier starts by looking at the prior probability of each label (aka the frequency of the label in the training set)
- The contribution from each feature is combined with the prior probability to produce a likelihood estimate for each label (see the sketch below)
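A by-hand sketch with hypothetical probabilities showing how the prior and per-feature contributions combine:

# Classify a document that has features f1 and f2 (all numbers are made up).
prior = {'news': 0.6, 'romance': 0.4}    # frequency of each label in the training set
p_f1 = {'news': 0.3, 'romance': 0.1}     # P(f1 | label)
p_f2 = {'news': 0.2, 'romance': 0.5}     # P(f2 | label)
scores = {label: prior[label] * p_f1[label] * p_f2[label] for label in prior}
print(scores)   # {'news': 0.036, 'romance': 0.02} -> pick 'news'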
Structured data = regular and predictable organization of entities and relationships
Named entity recognition = search for mentions of entities (locations, businesses, etc.)
Relation recognition = search for relationships between entities in the text
Chunking = technique used for entity recognition. Small chunks (individual tokens) are tagged with POS, while larger chunks group multiple tokens to identify multi-word entities (e.g. San Francisco). Chunks are defined using tag patterns.
- Tag patterns = sequences of POS tags that identify a chunk
- Noun phrase chunking = search for sequences of tokens that form noun phrases (e.g. a string of proper nouns) and chunk them together
Chink = a sequence of tokens that we do not want to include in a chunk
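A minimal chunking sketch with a chink rule, in the standard nltk.RegexpParser pattern style:

import nltk

grammar = r"""
  NP:
    {<.*>+}        # chunk every sequence of tokens...
    }<VBD|IN>+{    # ...then chink (remove) verbs and prepositions
"""
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))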