This repository has been archived by the owner on Mar 6, 2019. It is now read-only.

Updates for speed and python 3 compatibility #7

Open: wants to merge 7 commits into base: master

Conversation

dexteradeus

While using this system, I made several changes that I think could be useful to others.

  1. Updated some scripts to improve Python 2/3 compatibility
  2. Fixed a formatting bug in the training file output so that crf_learn can run with multiple threads
  3. Refactored CRF file generation to support multiple worker processes
  4. Updated roundtrip.sh to accept counts as command-line options and to use all system cores both when generating data files and when running crf_learn

On my 8-core system, I saw a 7.5x reduction in the time needed to run roundtrip.sh with the provided dataset.

- Prevent extra newlines from being included in the output
- Updates to make script Python 2/3 compatible
- Replace generate data for loop with a pool of worker processes for
  processing each line of the training input file in a separate process
- Move writing the output data to a separate process which reads from
  the queue filled by the worker processes
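
The worker-pool and writer-process arrangement described in the commit notes above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the code from this pull request: `generate`, `_format_line`, `_process_line`, `_writer`, and the `start_method` parameter are hypothetical names, and `_format_line` is a placeholder for the real per-line feature generation. Output order is not guaranteed, because workers push results to the queue as they finish.

```python
import multiprocessing as mp

_queue = None  # set inside each worker process by _init_worker


def _init_worker(queue):
    # Runs once per worker; stores the shared queue in a module global.
    global _queue
    _queue = queue


def _format_line(line):
    # Hypothetical stand-in for the real per-line feature generation.
    return line.strip().upper()


def _process_line(line):
    # Each worker pushes its finished result onto the shared queue.
    _queue.put(_format_line(line))


def _writer(queue, out_path):
    # Single consumer: the only process that touches the output file.
    with open(out_path, "w") as out:
        for item in iter(queue.get, None):  # None is the stop sentinel
            out.write(item + "\n")


def generate(lines, out_path, workers=None, start_method=None):
    # start_method=None uses the platform default start method.
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    writer = ctx.Process(target=_writer, args=(queue, out_path))
    writer.start()
    pool = ctx.Pool(workers, initializer=_init_worker, initargs=(queue,))
    try:
        pool.map(_process_line, lines)
    finally:
        pool.close()
        pool.join()      # workers flush their queue buffers on exit
        queue.put(None)  # tell the writer that no more items will arrive
        writer.join()
```

Keeping a single writer means the output file is never written by two processes at once, which is one way to avoid interleaved or stray newlines in the generated data. When calling `generate` from a script on a platform that does not fork, the call should sit under an `if __name__ == "__main__":` guard.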
@walkerdb

I can confirm this works when set up correctly. On Macs, the code that gets the processor count (line 4 of roundtrip.sh) fails, but it is easy to hardcode a number instead.
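
One possible way to avoid hardcoding, assuming Python is available wherever roundtrip.sh runs, is to ask Python for the core count, since `multiprocessing.cpu_count()` works on both Linux and macOS. The sketch below is only an illustration; `core_count` and its `default` fallback are hypothetical and not part of this pull request.

```python
import multiprocessing


def core_count(default=1):
    # cpu_count() raises NotImplementedError if the count cannot be determined.
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return default


print(core_count())
```

The printed value could then be captured by the shell script in place of the platform-specific lookup.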

@maugch

maugch commented Jul 22, 2017

There is an error in roundtrip.sh: line 42 should use input_file instead of iput_file.

It also does not seem to generate test data: def _generate_data_worker is never called.
Tested on Ubuntu 16.04.

3 participants