
Commit

Minor fixes: unused parameter, factor optional components into a central header.

git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3849 1f5c12ca-751b-0410-a591-d2e778427230
heafield committed Jan 26, 2011
1 parent 93caa3d · commit d79c756
Showing 6 changed files with 26 additions and 22 deletions.
kenlm/README (5 additions, 7 deletions)
@@ -1,24 +1,22 @@
Language model inference code by Kenneth Heafield <infer at kheafield.com>
-The official website is http://kheafield.com/code/mt/infer.html . If you're a decoder developer, please download the latest version from there instead of copying from another decoder.
+The official website is http://kheafield.com/code/mt/infer.html . If you're a decoder developer, please download the latest version from there instead of copying from Moses.

This documentation is directed at decoder developers.

Currently, it loads an ARPA file in 2/3 the time SRI takes and uses 6.5 GB when SRI takes 11 GB. These are compared to the default SRI build (i.e. without their smaller structures). I'm working on optimizing this even further.

Binary format via mmap is supported. Run ./build_binary to make one, then pass the binary file name instead.
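
A minimal sketch of loading such a binary model from C++, assuming the lm/model.hh interface described below; the file name is hypothetical, and the constructor is expected to detect whether it is given an ARPA or a binary file:

    #include "lm/model.hh"

    // Loads either format; build_binary output is mmapped rather than parsed.
    lm::ngram::Model model("file.binary");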

Currently, it assumes POSIX APIs for errno, strerror_r, open, close, mmap, munmap, ftruncate, fstat, and read. This is tested on Linux and the non-UNIX Mac OS X. I welcome submissions porting (via #ifdef) to other systems (e.g. Windows) but proudly have no machine on which to test it.

A brief note to Mac OS X users: your gcc is too old to recognize the pack pragma. The warning effectively means that, on 64-bit machines, the model will use 16 bytes instead of 12 bytes per n-gram of maximum order (those of lower order are already 16 bytes) in the probing and sorted models. The trie is not impacted by this.

-It does not depend on Boost or ICU. However, if you use Boost and/or ICU in the rest of your code, you should define HAVE_BOOST and/or HAVE_ICU in util/string_piece.hh. Defining HAVE_BOOST will let you hash StringPiece. Defining HAVE_ICU will use ICU's StringPiece to prevent a conflict with the one provided here. By the way, ICU's StringPiece is buggy and I reported this bug: http://bugs.icu-project.org/trac/ticket/7924 .
+It does not depend on Boost or ICU. However, if you use Boost and/or ICU in the rest of your code, you should define HAVE_BOOST and/or HAVE_ICU in util/have.hh. Defining HAVE_BOOST will let you hash StringPiece. Defining HAVE_ICU will use ICU's StringPiece to prevent a conflict with the one provided here.
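
As an illustration of what these macros enable, a short sketch assuming HAVE_BOOST is defined and that string_piece.hh supplies a hash_value overload for StringPiece in that case:

    #include "util/string_piece.hh"

    #include <boost/unordered_map.hpp>

    // hash_value(StringPiece) lets StringPiece key Boost hash containers.
    boost::unordered_map<StringPiece, int> counts;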

The recommended way to use this:
Copy the code and distribute with your decoder.
-Set HAVE_ICU and HAVE_BOOST at the top of util/string_piece.hh as instructed above.
+Set HAVE_ICU and HAVE_BOOST at the top of util/have.hh as instructed above.
Look at compile.sh and reimplement using your build system.
-Use either the interface in lm/ngram.hh or lm/virtual_interface.hh.
-Interface documentation is in comments of lm/virtual_interface.hh (including for lm/ngram.hh).
+Use either the interface in lm/model.hh or lm/virtual_interface.hh.
+Interface documentation is in comments of lm/virtual_interface.hh (including for lm/model.hh).

I recommend copying the code and distributing it with your decoder. However, please send improvements to me so that they can be integrated into the package.
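
To make the recommended interface concrete, a hedged sketch of word-by-word scoring through lm/model.hh; the exact names should be checked against the lm/virtual_interface.hh comments, and the file name and sentence are made up:

    #include "lm/model.hh"

    #include <iostream>

    int main() {
      lm::ngram::Model model("file.arpa");
      const lm::ngram::Model::Vocabulary &vocab = model.GetVocabulary();
      lm::ngram::State state(model.BeginSentenceState()), out_state;
      const char *words[] = {"this", "is", "a", "sentence"};
      float total = 0.0;
      // Each Score call returns a log10 probability and advances the state.
      for (unsigned int i = 0; i < sizeof(words) / sizeof(*words); ++i) {
        total += model.Score(state, vocab.Index(words[i]), out_state);
        state = out_state;
      }
      total += model.Score(state, vocab.EndSentence(), out_state);
      std::cout << "Total: " << total << std::endl;
      return 0;
    }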

kenlm/lm/search_trie.cc (7 additions, 8 deletions)
@@ -11,6 +11,7 @@
#include "lm/word_index.hh"
#include "util/ersatz_progress.hh"
#include "util/file_piece.hh"
#include "util/have.hh"
#include "util/proxy_iterator.hh"
#include "util/scoped.hh"

@@ -20,7 +21,6 @@
#include <cstdio>
#include <deque>
#include <limits>
-//#include <parallel/algorithm>
#include <vector>

#include <sys/mman.h>
@@ -170,7 +170,7 @@ template <class Proxy> class CompareRecords : public std::binary_function<const
return Compare(reinterpret_cast<const WordIndex*>(first.data()), second.Indices());
}
bool operator()(const std::string &first, const std::string &second) const {
-return Compare(reinterpret_cast<const WordIndex*>(first.data()), reinterpret_cast<const WordIndex*>(first.data()));
+return Compare(reinterpret_cast<const WordIndex*>(first.data()), reinterpret_cast<const WordIndex*>(second.data()));
}

private:
@@ -384,7 +384,6 @@ void WriteContextFile(uint8_t *begin, uint8_t *end, const std::string &ngram_fil
PartialIter context_begin(PartialViewProxy(begin + sizeof(WordIndex), entry_size, context_size));
PartialIter context_end(PartialViewProxy(end + sizeof(WordIndex), entry_size, context_size));

-// TODO: __gnu_parallel::sort here.
std::sort(context_begin, context_end, CompareRecords<PartialViewProxy>(order - 1));

std::string name(ngram_file_name + kContextSuffix);
@@ -502,7 +501,7 @@ void ConvertToSorted(util::FilePiece &f, const SortedVocabulary &vocab, const st
}
// Sort full records by full n-gram.
EntryProxy proxy_begin(begin, entry_size), proxy_end(out_end, entry_size);
-// TODO: __gnu_parallel::sort here.
+// Tried __gnu_parallel::sort here but it took too much memory.
std::sort(NGramIter(proxy_begin), NGramIter(proxy_end), CompareRecords<EntryProxy>(order));
files.push_back(DiskFlush(begin, out_end, file_prefix, batch, order, weights_size));
WriteContextFile(begin, out_end, files.back(), entry_size, order);
@@ -533,7 +532,7 @@ void ConvertToSorted(util::FilePiece &f, const SortedVocabulary &vocab, const st
}
}

-void ARPAToSortedFiles(util::FilePiece &f, const std::vector<uint64_t> &counts, std::size_t buffer, const std::string &file_prefix, SortedVocabulary &vocab) {
+void ARPAToSortedFiles(util::FilePiece &f, const std::vector<uint64_t> &counts, size_t buffer, const std::string &file_prefix, SortedVocabulary &vocab) {
{
std::string unigram_name = file_prefix + "unigrams";
util::scoped_fd unigram_file;
@@ -544,10 +543,10 @@ void ARPAToSortedFiles(util::FilePiece &f, const std::vector<uint64_t> &counts,
// Only use as much buffer as we need.
size_t buffer_use = 0;
for (unsigned int order = 2; order < counts.size(); ++order) {
-buffer_use = std::max(buffer_use, (size_t)((sizeof(WordIndex) * order + 2 * sizeof(float)) * counts[order - 1]));
+buffer_use = std::max<size_t>(buffer_use, static_cast<size_t>((sizeof(WordIndex) * order + 2 * sizeof(float)) * counts[order - 1]));
}
-buffer_use = std::max(buffer_use, (size_t)((sizeof(WordIndex) * counts.size() + sizeof(float)) * counts.back()));
-buffer = std::min((size_t)buffer, buffer_use);
+buffer_use = std::max<size_t>(buffer_use, static_cast<size_t>((sizeof(WordIndex) * counts.size() + sizeof(float)) * counts.back()));
+buffer = std::min<size_t>(buffer, buffer_use);
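
To make the sizing concrete, a worked instance with made-up counts for a trigram model, assuming sizeof(WordIndex) == 4 and sizeof(float) == 4:

    // counts = {1000, 500, 200}: the loop covers order 2 only, giving
    // (4*2 + 2*4) * 500 = 8000 bytes; the top order stores one float,
    // so it contributes (4*3 + 4) * 200 = 3200 bytes. Thus
    // buffer_use = max(8000, 3200) = 8000 and buffer is clamped to 8000.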

util::scoped_memory mem;
mem.reset(malloc(buffer), buffer, util::scoped_memory::MALLOC_ALLOCATED);
kenlm/util/exception.cc (1 addition, 1 deletion)
@@ -16,7 +16,7 @@ const char *HandleStrerror(int ret, const char *buf) {
}

// The GNU version.
-const char *HandleStrerror(const char *ret, const char *buf) {
+const char *HandleStrerror(const char *ret, const char * /*buf*/) {
return ret;
}
} // namespace
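
For context, a sketch of the call site this overload pair supports; the snippet is an assumption about usage, not part of this diff. XSI strerror_r returns int while the GNU version returns char *, so overload resolution picks the matching handler:

    #include <cerrno>
    #include <cstring>

    char buf[200];
    buf[0] = '\0';
    // Compiles against either strerror_r signature without #ifdefs.
    const char *message = HandleStrerror(strerror_r(errno, buf, sizeof(buf)), buf);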
kenlm/util/file_piece.hh (1 addition, 2 deletions)
@@ -3,6 +3,7 @@

#include "util/ersatz_progress.hh"
#include "util/exception.hh"
#include "util/have.hh"
#include "util/mmap.hh"
#include "util/scoped.hh"
#include "util/string_piece.hh"
@@ -11,8 +12,6 @@

#include <cstddef>

-#define HAVE_ZLIB

namespace util {

class EndOfFileException : public Exception {
kenlm/util/have.hh (new file, 11 additions)
@@ -0,0 +1,11 @@
+/* This ties kenlm's config into Moses's build system. If you are using kenlm
+ * outside Moses, see http://kheafield.com/code/kenlm/developers/ .
+ */
+#ifndef UTIL_HAVE__
+#define UTIL_HAVE__
+
+#include "../config.h"
+
+#define HAVE_ZLIB
+
+#endif // UTIL_HAVE__
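
For kenlm used outside Moses (per the header's own comment), a hypothetical standalone have.hh might drop the Moses-specific ../config.h include and define only what the build provides; a sketch, not part of the commit:

    #ifndef UTIL_HAVE__
    #define UTIL_HAVE__

    /* Uncomment the optional components your build provides. */
    /* #define HAVE_BOOST */
    /* #define HAVE_ICU */
    #define HAVE_ZLIB

    #endif // UTIL_HAVE__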
kenlm/util/string_piece.hh (1 addition, 4 deletions)
@@ -48,10 +48,7 @@
#ifndef BASE_STRING_PIECE_H__
#define BASE_STRING_PIECE_H__

-//Uncomment this line if you use ICU in your code.
-//#define HAVE_ICU
-//Uncomment this line if you want boost hashing for your StringPieces.
-//#define HAVE_BOOST
+#include "util/have.hh"

#ifdef HAVE_BOOST
#include <boost/functional/hash/hash.hpp>
