TXM and the Victorian Women Writers Project

Serge Heiden edited this page Feb 9, 2015 · 3 revisions

[Taken from a public email shared on TEI-L on 10/28/2013, which I hope to make more readable with Serge's help]

Dear Michelle,

Thanks for sharing and licensing the corpus. This gives us an opportunity to test our exploration tools on fresh TEI P5 materials (the plain text files are left for later perusal). As an experiment, I did a quick & dirty import of the 9,570,731-word 'vwwp_tei' P5 corpus[1] into TXM, and the result is sufficiently readable to be shared (thanks to Alexei Lavrentiev for his help)[2].

If you want to browse and analyze the VWWP corpus locally with the desktop version of TXM (for Windows, Mac OS X or Linux), you can download the TXM bundled binary version of the VWWP corpus from the TXM demo portal here: http://partage-fichiers.ens-lyon.fr/ddvk44pr6o (available for one month, same license as the original files)[3]. Note that the first version of the VWWP binary corpus is outdated.

  • To use the file, first install the TXM software from Sourceforge: https://sourceforge.net/projects/txm
  • Then load the corpus into TXM with the "File > Load" command pointing to the file directly (don't unzip the file)[4].

If you don't want to install the desktop version of TXM, you can try it directly from a web browser through our online TXM demo portal[5].

For an overview, here are the most frequent lemmas of the corpus (considering only nouns and adjectives), each followed by its number of occurrences. Some sample words link directly to their concordance in the TXM demo portal; there, double-click on a concordance line to read the occurrence highlighted in the original text edition. The lemmas are: man 18075, little 15090, good 14202, day 13342, time 12961, life 11750, thing 11676, hand 11288, woman 10896, eye 10849, old 10832, own 10150, great 10145, child 8376, way 8158, heart 8143, face 7237, young 7059, year 6777, last 6775, such 6681, world 6588, word 6575, night 6426, love 6297, nothing 6232, people 6137, many 6054, head 5934, room 5686, house 5648, first 5566, light 5543, long 5397, poor 5330, God 5294, friend 5095, voice 5052, home 4784 etc.
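For readers who want to reproduce this kind of ranking outside TXM, here is a minimal shell sketch. It assumes tab-separated TreeTagger-style input (word, part-of-speech tag, lemma, one token per line) and assumes that noun and adjective tags start with N or J, as in Penn-style English tagsets. The function name `lemma_freq` is hypothetical, and this is not how TXM itself computes its counts.

```shell
# Count lemma frequencies for nouns and adjectives only.
# Assumption: stdin holds "word<TAB>pos<TAB>lemma" lines, and noun/adjective
# tags begin with N or J (for example NN, NNS, JJ).
lemma_freq() {
  awk -F '\t' '
    $2 ~ /^[NJ]/ { count[$3]++ }                      # keep nouns and adjectives
    END { for (l in count) print l, count[l] }        # emit "lemma count"
  ' | sort -k2,2nr -k1,1                              # most frequent first, ties by lemma
}
```

A call such as `lemma_freq < tagged.txt | head -40` (with `tagged.txt` standing for any file in that format) would list the forty most frequent lemmas.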

All the best, Serge

PS. If you don't know TXM, you may be interested in its leaflet (in English!): http://sourceforge.net/projects/txm/files/documentation/TXM%20Leaftlet%20EN.pdf/download

___ Notes

  1. I didn't understand how to download the 199 files of the corpus from Github at once. So if someone is interested, here are the two (quick & dirty level 2) Bash command lines I used to get the files:

     wget https://github.com/iulibdcs/tei_text/tree/master/vwwp_tei -O - | grep VAB | sed -e 's/.*title="//g' -e 's/.xml".*//g' > text-names.txt
     for f in $(cat text-names.txt); do wget "https://raw.github.com/iulibdcs/tei_text/master/vwwp_tei/$f.xml"; done

  2. The XML TEI P5 files have been imported with the unmodified 'XML-TEI BFM' import module (specifically designed for the semantics of the 'Base de français médiéval' XML TEI P5 encoding guidelines) plus a standard TEI pre-filter stylesheet available from the TXM XSL library called "txm-filter-teip5-teibfm.xsl" (see https://sourceforge.net/projects/txm/files/library/xsl). The name of that stylesheet means something like "given some TEI P5 encoded files, let's see what we can do to import them into TXM with the BFM TEI semantics import module". Since no text metadata specification was provided, the only metadata we could retrieve consistently from the files were the text author names (from the TEI headers). We could also import some text titles and original dates from the TEI headers. All available tags were used to build text structures. Word tokenization is OK except for words written with '-' characters. Each word has been automatically tagged with a part of speech and a lemma by TreeTagger. Just hover your mouse over a word in a text edition page to see the results, or use the search engine to look for and count those word properties. Each text edition has been paginated and numbered according to the original tags. Some foreign words have been given a specific character style (green color). Please note that a thorough import of any TEI corpus into TXM for analysis needs fine tuning of the XSL pre-filters and import scripts.

  3. Please feel free to host that file yourself, on GitHub for example, if you find that more appropriate (and useful).

  4. This may take a few minutes. You need to have about 3.5 GB of disk space available to load this corpus.

  5. We have taken the liberty of hosting the 'vwwp_tei' corpus in our public demo portal for the experiment, like the other public corpora we host. Please tell me if this is not appropriate so I can remove it from there.
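The name-extraction step of note 1 can be sketched a little more defensively as two small shell functions. This is only an illustration under stated assumptions: it assumes each file appears in the saved listing page as `title="VAB….xml"`, which may no longer match the HTML that GitHub serves today, and the function names `extract_names` and `raw_url` are hypothetical.

```shell
# Print the text identifiers found in a listing page read from stdin.
# Assumption: each file is marked up as title="VABxxxx.xml".
extract_names() {
  grep -o 'title="VAB[^"]*\.xml"' |
    sed -e 's/^title="//' -e 's/\.xml"$//' |   # strip the attribute wrapper and extension
    sort -u                                    # drop duplicate identifiers
}

# Build the raw download URL used in note 1 for one identifier.
raw_url() {
  printf 'https://raw.github.com/iulibdcs/tei_text/master/vwwp_tei/%s.xml\n' "$1"
}

# Possible usage (requires network access, so it is left commented out):
# extract_names < listing.html | while read -r name; do wget "$(raw_url "$name")"; done
```

Matching only the `title` attribute and anchoring the `sed` patterns avoids picking up unrelated text on lines that happen to contain "VAB". Cloning the whole repository with `git clone` is another way to get all files at once, if installing git is an option.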

-- Dr. Serge Heiden, [email protected], http://textometrie.ens-lyon.fr ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883