Chapter 5: Categorizing and Tagging Words

POS Tagger = processes a sequence of words and attaches a part of speech to each word

nltk.pos_tag(text)
  • CC = coordinating conjunction
  • RB = adverb
  • IN = preposition
  • NN = noun
  • NNP = proper noun
  • JJ = adjective
  • VBP = present-tense verb
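
A quick example of the tagger in action, from the NLTK book (note that `nltk.pos_tag` expects a list of tokens, not a raw string):

```python
import nltk

# requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
# [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
#  ('completely', 'RB'), ('different', 'JJ')]
```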

Taggers help when generating similar words, since replacement words should have the same part of speech.

  • Tagged tokens are often represented as tuples, e.g. ('dog', 'NN')
  • Can find out what types of words are most common in different categories of text (e.g. news), as in the sketch below
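
A short sketch of counting the most common tags in one Brown corpus category:

```python
import nltk
from nltk.corpus import brown

# frequency distribution over tags in the news category
tag_fd = nltk.FreqDist(tag for (word, tag) in brown.tagged_words(categories='news'))
print(tag_fd.most_common(5))
```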

Nouns = used after a determiner or as the subject of a verb

  • the woman who I saw yesterday --> after a determiner
  • the woman sat down --> subject of a verb

Verbs = express a relation involving the referents of one or more noun phrases

  • Rome fell --> Simple
  • Dot com stocks suddenly fell like a stone --> with modifiers and adjuncts

Adjectives = describe nouns; used as modifiers (e.g. the large pizza) or predicates (e.g. the pizza is large)

Adverbs = modify verbs to specify the time, manner, place, or direction of the event described by the verb (e.g. the stocks fell quickly). They can also modify adjectives (e.g. Mary's teacher was really nice.)

Articles = determiners (the, a)

Modals = (should, may)

Personal pronouns = (she, they)

Searching for 3-word phrases using POS tagging

```python
import nltk
from nltk.corpus import brown

def process(sentence):
    # print verb + "to" + verb trigrams, e.g. "combined to achieve"
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
```

To create a POS tagger (these are chained together in the sketch below):

  • Default tagger = tag everything with the most common POS (NN)
  • Regular expression tagger = assign tags based on word patterns (e.g. suffixes)
  • Lookup tagger = reference against a previously tagged piece of text (useful for the top 100 most common words)
  • Unigram tagging = a trained lookup tagger (uses two data sets, a training set and a test set)
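
A minimal sketch chaining these taggers together, assuming the Brown news category as data (the 4000-sentence split point is an arbitrary choice):

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
train_sents, test_sents = tagged_sents[:4000], tagged_sents[4000:]

# default tagger: tag everything as NN
t0 = nltk.DefaultTagger('NN')

# regexp tagger: guess from suffixes, backing off to the default tagger
patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'),
            (r'.*es$', 'VBZ'), (r'.*s$', 'NNS')]
t1 = nltk.RegexpTagger(patterns, backoff=t0)

# unigram (trained lookup) tagger, backing off to the regexp tagger
t2 = nltk.UnigramTagger(train_sents, backoff=t1)
print(t2.accuracy(test_sents))   # .evaluate() in older NLTK releases
```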

How to determine the category of a word (in general)?

  • Morphological clues (suffixes, prefixes)
  • Syntactic clues = the typical contexts in which words occur
  • Semantic clues = the meaning of the word

open class = a POS that new words are typically added to (e.g. muggles and n00bs added to the noun class)

closed class = a POS that new words are not often added to (e.g. the set of prepositions above, along, below, between has not changed)

Chapter 6: Learning to Classify Text

Classifiers: choosing the correct class label for a given input

  • e.g. email spam filters
  • identifying the topic of a news article
  • classifying "bank" as a noun or verb

Supervised = when a classifier is built from a training corpus containing the correct label for each input

Steps in creating a classifier:

  • Deciding what features are relevant and how to encode them (this is most of the work in building a good classifier)
  • Examining the likelihood ratios = how much more likely each feature value is under one label than another in the training set (see the sketch below)
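
A sketch of these steps using the name-gender example from the NLTK book; `show_most_informative_features` prints the likelihood ratios:

```python
import nltk
import random
from nltk.corpus import names

# feature encoding: here, just the final letter of the name
def gender_features(word):
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)   # prints likelihood ratios
```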

Classifiers used for document classification (e.g. news, romance, horror)

Joint classifier model = examines a set of related inputs and assigns labels to all of them at once.

Sequence classifier model = first finds the most likely class label for the first input, then uses this to find the best label for the second input, and so on.

  • Shortcoming: the sequence classifier is committed to every decision it makes, and one decision influences the next

Hidden Markov Models = assign scores to all possible tag sequences and then choose the sequence with the highest score (they employ probability distributions, which a greedy sequence classifier does not).
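
A minimal sketch of NLTK's supervised HMM tagger trained on a small Brown slice (the slice size is an arbitrary choice to keep training quick, and accuracy will be rough without smoothing):

```python
from nltk.corpus import brown
from nltk.tag import hmm

# train a supervised HMM tagger on Brown news sentences
train_sents = brown.tagged_sents(categories='news')[:2000]
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)
print(tagger.tag(['The', 'stocks', 'fell']))
```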

Ways to measure classifiers:

  • Accuracy
  • Precision = how many of the items we identified were actually relevant
  • Recall = how many of the relevant items we identified (formulas below)
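
In terms of true positives (TP), false positives (FP), and false negatives (FN):

    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)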

Confusion Matrix = table where each cell [i,j] indicates how often label j was predicted when the correct label was i
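
NLTK ships a `ConfusionMatrix` class; a small sketch with made-up gold and predicted tag sequences:

```python
import nltk

# hypothetical gold-standard and predicted tag sequences
ref    = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
print(nltk.ConfusionMatrix(ref, tagged))
```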

Cross Validation = perform multiple evaluations on different test sets and combine the scores

  • To do this, subdivide the original corpus into N subsets called folds
  • For each fold, we train the model using all the other folds, and then test on that fold (sketch below)
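
A minimal fold loop, assuming the `featuresets` list from the classifier sketch above:

```python
import nltk

# split featuresets into N folds; train on N-1 folds, test on the held-out fold
N = 10
fold_size = len(featuresets) // N
scores = []
for i in range(N):
    test_fold = featuresets[i * fold_size:(i + 1) * fold_size]
    train_folds = featuresets[:i * fold_size] + featuresets[(i + 1) * fold_size:]
    classifier = nltk.NaiveBayesClassifier.train(train_folds)
    scores.append(nltk.classify.accuracy(classifier, test_fold))
print(sum(scores) / N)   # combined (averaged) score
```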

Decision Trees = simple flowchart that selects labels for input values

  • decision nodes = check a feature value
  • leaf nodes = assign labels
  • root node = the flowchart's initial decision
  • decision stump = a tree with a single node that decides how to classify inputs based on a single feature
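
A sketch swapping NLTK's decision tree classifier into the earlier example (assumes `train_set` from the sketch above):

```python
import nltk

classifier = nltk.DecisionTreeClassifier.train(train_set)
print(classifier.pretty_format(depth=4))   # render the learned flowchart
```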

Information gain = how much more organized the input values become when we divide them up using a given feature. Measured by calculating the entropy of their labels: entropy will be high if the input values have widely varied labels, and low if many input values have the same label.
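
The entropy calculation, as in the NLTK book:

```python
import math
import nltk

def entropy(labels):
    # entropy of a list of labels, in bits
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(label) for label in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)

print(entropy(['male', 'male', 'female', 'female']))   # 1.0 (maximally mixed)
print(entropy(['male', 'male', 'male', 'male']))       # -0.0 (all the same label)
```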

Naive Bayes: every feature gets a say in determining which label should be assigned to a given input.

  • The classifier starts by looking at the prior probability of each label (aka the frequency of the label in the training set)
  • The contribution from each feature is combined with the prior probability to reach a likelihood estimate for each label
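
A sketch of inspecting those per-label probabilities directly, assuming the `classifier` and `gender_features` from the name-gender sketch above:

```python
# probability distribution over labels for a single input
dist = classifier.prob_classify(gender_features('Neo'))
for label in dist.samples():
    print(label, round(dist.prob(label), 3))
```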

Chapter 7: Extracting Information from Text

Structured data = regular and predictable organization of entities and relationships

Named entity recognition = search for mentions of entities (locations, businesses, etc.)

Relation recognition = search for relationships between entities in the text

Chunking = technique used for entity recognition. Tokens are first tagged with POS, and then grouped into larger chunks that identify multi-word entities (e.g. San Francisco). Chunkers use tag patterns to find the chunks.

  • Tag patterns = a sequence of POS tags (e.g. `<DT>?<JJ>*<NN>`) that helps identify a chunk
  • Noun phrase chunking = search for a sequence of words that form a noun phrase (e.g. a run of proper nouns) and chunk them together (sketch below)
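
The simple NP-chunker from the NLTK book, built from a single tag pattern:

```python
import nltk

# grammar with one tag pattern: optional determiner, any adjectives, a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))
```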

Chink = a sequence of tokens that we do not want to include in a chunk
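
Chinking expressed in the same grammar notation, also from the NLTK book: chunk everything first, then carve out the chinks:

```python
import nltk

grammar = r"""
  NP:
    {<.*>+}          # chunk everything
    }<VBD|IN>+{      # chink sequences of VBD and IN
  """
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))
```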