Skip to content

Latest commit

 

History

History
38 lines (31 loc) · 1.35 KB

Getting_Started.md

File metadata and controls

38 lines (31 loc) · 1.35 KB

Getting Started OCR-Parser

Requirements

  • command line (shell) access
  • Python
  • Standard *nix utilities (grep, find, ls, etc.)

OCR Files

  • OCR output of the image files returned from digitization
    • these should be flat text files, no XML, rtf, or other encoding
  • Files must be accessible to the program
    • Locally stored
    • Accessible via a network file system

##Usage

  • Project File organization

    • _data - some sample test data, can be ignored
    • _regex - executable python scripts and libraries for parsing the OCR text files
    • _scripts - shell scripts to automate running the program and set up the environment
    • _templates - example CSV files from our site
    • _utility - utility programs and scripts to prepare and organize files
  • Running the program

    • _scripts/parse_OCR_file.py
      • accepts a list of files to STDIN to be processed (needs more definition)
      • obtains startup information from shell variables
        • Parse_Limit - # of lines in the input file to parse for metadata
        • CSV_output_file - filename of the spreadsheet file produced by the program
          • it will append to an existing file
    • _scripts / run_parse_OCR.sh
    • sets up the environment and runs the program
    • needs to be tweaked for your environment
    • chmod +x run_parse_OCR.sh (to permit execution of the script)