
Frequently Asked Questions

zme1 edited this page Feb 12, 2018 · 10 revisions

FAQ

Topics

Questions are sorted below by topic, in the order covered in the course. If you have a question not covered here, make sure to ask it on the Issues board.

XML

Regular Expressions

Answers

How do I deal with overlapping elements?

Here are three approaches:

  • Fragmentation. If a theme shows up in several lines of a song, you may be able to tag it separately in each line. This is sensible if you don't need to track continuity, that is, whether the references to, say, "love" in lines 2 and 3 should be considered one long reference or two short ones.
  • Discontinuous virtual elements. If you do care about those connections, you can tag the references or themes in the lines separately, but unite them with a shared attribute value. For example, two elements might each carry an attribute called "group" with the same numerical value. All elements that have the same value for the "group" attribute would be considered to function as a single discontinuous virtual unit.
  • Milestones. Milestones are empty elements that stand in for start and end tags where ranges overlap. In the Text Encoding Initiative (TEI), which we'll look at later, people sometimes need to tag paragraphs, but also pages, and since page breaks can fall in the middle of paragraphs, it isn't possible to use start and end tags for both page ranges and paragraph ranges. The milestone approach might tag all of the paragraphs with start and end tags, and then use empty <pb/> (= page break) tags to show where one page ends and the next begins. Empty elements can have attributes, so you can even record a page number, if you need to.

If you're thinking that none of these is particularly satisfactory, welcome to the club! There are markup languages under development that permit overlapping start and end tags, but none is sufficiently mature for real production use. In my own work, I've used all three of the strategies above, depending on the nature of the data and my research focus.
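To make the second and third approaches concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. Everything in the sample data is an assumption made for illustration: the lyrics are invented, and the element names (song, line, theme, text, p) are hypothetical. Only the empty pb element with an n attribute follows TEI practice. The first function gathers elements that share a "group" value into one virtual unit; the second uses pb milestones to assign text to pages even though a page break falls inside a paragraph.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical song markup: two <theme> fragments share group="1".
SONG = """<song>
  <line n="1">I woke up to the rain</line>
  <line n="2">and thought of <theme type="love" group="1">your hand</theme></line>
  <line n="3"><theme type="love" group="1">in mine</theme> on the way home</line>
  <line n="4"><theme type="love" group="2">I miss you</theme> still</line>
</song>"""

# Hypothetical prose markup: the page break falls inside the second <p>.
BOOK = ('<text><p>First paragraph, page one.</p>'
        '<p>This paragraph starts on page one <pb n="2"/>'
        'and ends on page two.</p></text>')


def virtual_units(xml_text):
    """Collect the text of <theme> fragments, keyed by their shared group value."""
    units = defaultdict(list)
    for theme in ET.fromstring(xml_text).iter("theme"):
        units[theme.get("group")].append("".join(theme.itertext()))
    return dict(units)


def text_by_page(xml_text, first_page="1"):
    """Assign running text to pages, switching pages at each <pb/> milestone."""
    pages = defaultdict(str)
    current = first_page

    def walk(elem):
        nonlocal current
        if elem.tag == "pb":
            current = elem.get("n")      # milestone: a new page starts here
        elif elem.text:
            pages[current] += elem.text
        for child in elem:
            walk(child)
            if child.tail:               # text that follows the child element
                pages[current] += child.tail

    walk(ET.fromstring(xml_text))
    return dict(pages)


units = virtual_units(SONG)
pages = text_by_page(BOOK)
print(units)
print(pages)
```

In this sketch the two fragments with group "1" come back together as one unit even though they sit in different lines, and the second paragraph's text is divided between pages 1 and 2 without breaking the paragraph element itself.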


















What do we mean by "greedy" and "lazy" repetition?

In regular expressions, a repetition indicator is described as "greedy" when it matches as much text as it possibly can. For example, suppose you searched for quotations by typing ".+" (quotation marks included) in your "Find" window with "dot matches all" checked, and you had a sample text like the one below:

"Hey, how's it going?" he asked. "I'm doing just fine! How are you?"

The greedy expression starts at the first quotation mark and won't stop until the very last one, so the single match contains both quotations as well as the text between them that isn't quoted. If you were to try this on a novel, for example, there would be only one match, and it would extend from the first quotation mark in the entire work all the way to the last.
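The greedy behavior can be reproduced outside the "Find" window. The following is a minimal sketch using Python's built-in re module, on the assumption that the re.DOTALL flag plays the same role as the "dot matches all" checkbox; the variable names are illustrative.

```python
import re

# The sample text from above, containing two separate quotations.
text = '"Hey, how\'s it going?" he asked. "I\'m doing just fine! How are you?"'

# Greedy: .+ grabs as much as it can, so the match runs from the first
# quotation mark to the very last one.
greedy_matches = re.findall(r'".+"', text, flags=re.DOTALL)

print(len(greedy_matches))
print(greedy_matches[0])
```

There is only one match here, and it is the entire sample text, including the unquoted words "he asked." in the middle.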

The way to combat overly greedy expressions is to make them lazy. That is to say, instead of using a greedy expression to capture as much as possible, use a lazy expression that will capture as little as possible. In regex, it is the repetition indicator (such as + or *) that is greedy, not the dot itself, and the way to make it lazy is to put a question mark immediately after it. A lazy expression to match individual quotations, like the ones above, will look like this:

".+?"

The question mark after the repetition indicator tells the computer to start at the first quotation mark that it finds and continue only until it reaches the next one. This feature is especially helpful when tagging any string of text that can occupy more than a single line, which is extremely common in many of the documents we work with!
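Here is a minimal sketch of the lazy version, again using Python's re module and assuming re.DOTALL stands in for "dot matches all". The second sample string is invented to show a quotation that runs over a line break.

```python
import re

text = '"Hey, how\'s it going?" he asked. "I\'m doing just fine! How are you?"'

# Lazy: .+? stops at the first closing quotation mark it can reach,
# so each quotation is matched on its own.
lazy_matches = re.findall(r'".+?"', text, flags=re.DOTALL)
print(lazy_matches)

# An invented example where one quotation spans two lines. With DOTALL,
# the dot also matches the newline, so the quotation is still found whole.
multiline = 'She said, "This quote\nruns over two lines." Then: "Short."'
multiline_matches = re.findall(r'".+?"', multiline, flags=re.DOTALL)
print(multiline_matches)
```

The first search returns two matches, one per quotation, and the second shows that the lazy pattern still captures a quotation that is split across lines.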