POS Tagger = processes a sequence of words and attaches a part of speech to each word
nltk.pos_tag(tokens) -- expects a list of tokens, not a raw string
- CC = coordinating conjunction
- RB = adverb
- IN = preposition
- NN = noun
- NNP = proper noun
- JJ = adjective
- VBP = present-tense verb
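A quick sketch of the tagger in use (assumes the punkt and averaged_perceptron_tagger NLTK data packages have been downloaded):

import nltk

tokens = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(tokens))
# [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
#  ('completely', 'RB'), ('different', 'JJ')]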
Taggers help when generating similar words, since you want the substitutes to be the same part of speech.
- Tagged tokens are often represented as tuples, e.g. ('dog', 'NN')
- Can find out what types of words are most common in different categories of text (e.g. news)
Nouns = used after a determiner or as the subject of a verb
- the woman who I saw yesterday --> determiner
- the woman sat down --> subject of a verb
Verbs = express a relation involving the referents of one or more noun phrases
- Rome fell --> Simple
- Dot com stocks suddenly fell like a stone --> with modifiers and adjuncts
Adjectives = describe nouns; used as modifiers (e.g. the large pizza) or predicates (e.g. the pizza is large)
Adverbs = modify verbs to specify the time, manner, place, or direction of the event described by the verb (e.g. the stocks fell quickly). They can also modify adjectives (e.g. Mary's teacher was really nice.)
Articles = determiners (the, a)
Modals = (should, may)
Personal pronouns = (she, they)
Searching for 3-word phrases using POS tagging
import nltk
from nltk.corpus import brown

def process(sentence):
    # scan each tagged sentence for verb + "to" + verb trigrams
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
To create a POS-tagger:
- Default tagger: tag everything with the most common POS (NN)
- Use a regular expression tagger
- Use a lookup tagger -- aka check against a previously tagged piece of text (useful for the 100 most common words)
- Unigram tagging -- a trained lookup tagger (uses two data sets: a training set and a test set); see the sketch below
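A minimal sketch chaining these taggers together with backoff (assumes the Brown corpus is downloaded; note that evaluate() has been renamed accuracy() in newer NLTK releases):

import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories='news')
size = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:size], tagged[size:]

default = nltk.DefaultTagger('NN')                 # default-tag everything as NN
regexp = nltk.RegexpTagger([(r'.*ing$', 'VBG'),    # gerunds
                            (r'.*ed$', 'VBD'),     # simple past
                            (r'.*s$', 'NNS')],     # plural nouns
                           backoff=default)
unigram = nltk.UnigramTagger(train_sents, backoff=regexp)  # trained lookup tagger
print(unigram.evaluate(test_sents))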
How to determine the category of a word (in general)?
- Morphological clues (suffixes, prefixes)
- Syntactic clues = the typical contexts in which words occur
- Semantic clues = the meaning of the word
open class = a POS that new words are typically added to (e.g. muggles, n00bs, added to noun class)
closed class = a POS that new words are rarely added to (e.g. above, along, below, between have not changed)
Classifiers: choosing the correct class label for a given input
- e.g. email spam filters
- topic of a news article
- classifying "bank" as a noun or verb
Supervised = when a classifier is built from a training corpus containing the correct label for each input
Steps in creating a classifier:
- Deciding what features are relevant and how to encode those features (this is most of the work in building a good classifier)
- Examine the likelihood ratios = how strongly each feature value favors one label over another in the training set (high ratios mark the most informative features); see the sketch below
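A minimal end-to-end sketch, using the classic name-gender example (assumes the names corpus is downloaded); show_most_informative_features() prints the likelihood ratios:

import random
import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}   # encode a single feature: the name's final letter

labeled = ([(n, 'male') for n in names.words('male.txt')] +
           [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled)
featuresets = [(gender_features(n), g) for (n, g) in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)   # feature values with the highest likelihood ratios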
Classifiers used for document classification (e.g. news, romance, horror)
Joint classifier model = examines a set of related inputs and assigns labels to all of them at once
Sequence classifier model = finds the most likely class label for the first input, then uses that answer to find the best label for the second input, and so on
- Shortcoming: once made, every decision is committed, and each decision influences the next (errors propagate forward)
Hidden Markov Models = assign scores to all possible sequences and then choose the sequence with the highest score (they employ probability distributions, which a greedy sequence classifier does not)
Ways to measure classifiers:
- Accuracy
- Precision = how many of the items we identified were actually relevant
- Recall = how many of the relevant items we identified
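A tiny worked example with hypothetical counts (tp = relevant items identified, fp = irrelevant items identified, fn = relevant items missed):

tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)   # 0.8: of the items we identified, the fraction that was relevant
recall = tp / (tp + fn)      # ~0.67: of the relevant items, the fraction we identified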
Confusion Matrix = table where each cell [i,j] indicates how often label j was predicted when the correct label was i
Cross Validation = perform multiple evaluations on different test sets and combine the scores
- To do this, subdivide the original corpus into N subsets called folds
- For each fold, we train the model using all but that fold, and then test on this fold
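A minimal sketch of 10-fold cross-validation, reusing the unigram tagger from earlier (assumes the Brown corpus is downloaded):

import nltk
from nltk.corpus import brown

sents = list(brown.tagged_sents(categories='news'))
N = 10
fold_size = len(sents) // N
scores = []
for i in range(N):
    test = sents[i * fold_size:(i + 1) * fold_size]               # the held-out fold
    train = sents[:i * fold_size] + sents[(i + 1) * fold_size:]   # all the other folds
    tagger = nltk.UnigramTagger(train)
    scores.append(tagger.evaluate(test))
print(sum(scores) / N)   # combine the per-fold scores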
Decision Trees = simple flowchart that selects labels for input values
- decision nodes = check feature value
- leaf nodes = assign labels
- root node = the flowchart's initial decision
- decision stump = a tree with a single node that classifies inputs based on a single feature
Information gain = how much more organized the input values become when we divide them up using a given feature (calculated via the entropy of their labels: entropy is high if the input values have widely varied labels and low if many inputs share the same label)
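A small sketch of the entropy calculation behind information gain:

import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the label distribution
    probs = [count / len(labels) for count in Counter(labels).values()]
    return -sum(p * math.log(p, 2) for p in probs)

print(entropy(['male', 'male', 'male', 'male']))      # 0.0: every input has the same label
print(entropy(['male', 'female', 'male', 'female']))  # 1.0: labels are maximally varied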
Naive Bayes: every feature gets a say in determining which label should be assigned to a given input.
- The classifier starts by looking at the prior probability of each label (aka the frequency of the label in the training set)
- The contribution from each feature is combined with the prior probability to produce a likelihood estimate for each label (see the sketch below)
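A by-hand sketch with hypothetical probabilities showing how the prior and per-feature contributions combine:

# Classify a document that has features f1 and f2 (all numbers are made up).
prior = {'news': 0.6, 'romance': 0.4}    # frequency of each label in the training set
p_f1 = {'news': 0.3, 'romance': 0.1}     # P(f1 | label)
p_f2 = {'news': 0.2, 'romance': 0.5}     # P(f2 | label)
scores = {label: prior[label] * p_f1[label] * p_f2[label] for label in prior}
print(scores)   # {'news': 0.036, 'romance': 0.02} -> pick 'news'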
Structured data = regular and predictable organization of entities and relationships
Named entity recognition = search for mentions of entities (locations, businesses, etc.)
Relation recognition = search for relationships between entities in the text
Chunking = technique used for entity recognition. Small chunks (individual tokens) are tagged with POS, while larger chunks group multiple tokens to identify multi-word entities (e.g. San Francisco). Chunks are defined using tag patterns.
- Tag patterns = sequences of POS tags that identify a chunk
- Noun phrase chunking = search for sequences of tokens that form noun phrases (e.g. a string of proper nouns) and chunk them together
Chink = a sequence of tokens that we do not want to include in a chunk
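A minimal chunking sketch with a chink rule, in the standard nltk.RegexpParser pattern style:

import nltk

grammar = r"""
  NP:
    {<.*>+}        # chunk every sequence of tokens...
    }<VBD|IN>+{    # ...then chink (remove) verbs and prepositions
"""
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))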