jamesknox/becorpus
README.txt
April 2009
M.Conway
------------------------------

--- Introduction -------------

As source documents cannot be redistributed for copyright reasons, we
have provided a Perl script that downloads as many documents as
possible, and then merges each downloaded document with its associated
event frame. Note that event frames for all files are provided in the
"./events" subdirectory. The event corpus consists of 200 documents.

--- Modules Required ---------

The downloading script requires a number of modules that may or may not
be installed on your system. The modules are:

    LWP::Simple
    Encode
    File::Find
    File::Basename
    Perl6::Slurp

(If necessary, modules can be installed using the command:

    sudo cpan MODULE_NAME
)

Note that the script is unlikely to work with Windows (it has been
tested with Mac OS and Linux).

--- Downloading the Corpus ---

First, change to the "event_corpus" directory (i.e. the top level
directory of the zip file). Then run:

    perl download_corpus.pl

This takes a couple of minutes to run and should provide a progress
report. If the script will not run, check that all the necessary
modules are installed.

Next, to find out how successful the "download_corpus.pl" command was,
use:

    perl summary.pl

This reports the number of documents successfully downloaded, how many
could not be downloaded, and how many documents were empty.
For example:

    SUMMARY OF DOCUMENTS DOWNLOADED
    -------------------------------
    Number of successful downloads: 170
    Number of unavailable documents: 23
    Number of empty documents: 7

Finally, we merge the event frames and documents using the command:

    perl merge_events.pl

--- Directory Structure ------

When this command has been run, the top level directory should consist
of:

DIRECTORIES

    ./urls                    contains text files containing urls and file names
    ./raw_html                raw downloaded html files
    ./clean_html              downloaded files with html stripped
    ./events                  event frames
    ./merged_files            downloaded files merged with their associated event frames
    ./information             provides details of the sources and topics of documents
    ./pre_processing_scripts

LOG FILES

    download_corpus_DOWNLOADED.log
    download_corpus_DOWNLOAD_UNAVAILABLE.log
    download_corpus_EMPTYFILE.log

PERL SCRIPTS

    download_corpus.pl
    merge_events.pl
    summary.pl
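
--- Checking the Setup (sketch) ---

The checks described above (are the required modules installed, and
does the top level directory contain the expected subdirectories after
the scripts have run) can be sketched as a small shell helper. This is
an illustrative sketch only and is not part of the distributed corpus
scripts: the function names "missing_modules" and "missing_dirs" are
hypothetical, and the module and directory names are copied from the
lists in this README.

```shell
#!/bin/sh
# Modules listed under "Modules Required" in this README.
REQUIRED_MODULES="LWP::Simple Encode File::Find File::Basename Perl6::Slurp"

# Print the name of each module (given as arguments) that perl cannot load.
# Hypothetical helper; not one of the corpus scripts.
missing_modules() {
    for module in "$@"; do
        # "perl -MNAME -e 1" exits non-zero when NAME cannot be loaded.
        if ! perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "$module"
        fi
    done
}

# Print each directory from the "Directory Structure" list that is
# absent under the given top level directory.
# Hypothetical helper; not one of the corpus scripts.
missing_dirs() {
    top="$1"
    for dir in urls raw_html clean_html events merged_files \
               information pre_processing_scripts; do
        [ -d "$top/$dir" ] || echo "./$dir"
    done
}
```

After sourcing the functions, one possible usage from the top level
directory is:

    missing_modules $REQUIRED_MODULES
    missing_dirs .

Each command prints nothing when everything it checks is present. Any
module name printed by the first command can then be installed with
"sudo cpan MODULE_NAME" as described above.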
Automatically exported from code.google.com/p/becorpus