Reduce resources to load language models #121

Open
pemistahl opened this issue Nov 5, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@pemistahl
Owner

pemistahl commented Nov 5, 2022

Currently, the language models are parsed from JSON files and loaded into simple maps at runtime. Even though lookups in the maps are pretty fast, the maps consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures that need less memory, similar to what NumPy offers for Python.

One promising candidate could be ndarray.
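To make the memory concern concrete, here is a minimal sketch of one possible compact layout. It uses only the standard library rather than ndarray, and it is not how Lingua stores its models today. The `CompactModel` name, the `f32` frequencies, and the `HashMap<String, f64>` input shape are all assumptions made for illustration. The idea is to replace one heap allocation per key with a single contiguous key buffer, an offset table, and a value array, and to look keys up by binary search.

```rust
use std::collections::HashMap;

/// Hypothetical read-only n-gram table: all keys live in one contiguous buffer.
pub struct CompactModel {
    keys: String,      // all n-grams concatenated in sorted order
    offsets: Vec<u32>, // offsets[i]..offsets[i + 1] delimits key i inside `keys`
    values: Vec<f32>,  // values[i] is the frequency stored for key i
}

impl CompactModel {
    /// Builds the compact table from an ordinary map (assumed input shape).
    pub fn from_map(map: &HashMap<String, f64>) -> Self {
        let mut entries: Vec<(&str, f64)> =
            map.iter().map(|(k, v)| (k.as_str(), *v)).collect();
        // Sort by key so that lookups can use binary search.
        entries.sort_unstable_by(|a, b| a.0.cmp(b.0));

        let mut keys = String::new();
        let mut offsets = Vec::with_capacity(entries.len() + 1);
        let mut values = Vec::with_capacity(entries.len());
        offsets.push(0);
        for (ngram, freq) in entries {
            keys.push_str(ngram);
            offsets.push(keys.len() as u32);
            values.push(freq as f32); // assumption: f32 precision is sufficient
        }
        CompactModel { keys, offsets, values }
    }

    /// Returns the i-th key as a slice of the shared buffer.
    fn key(&self, i: usize) -> &str {
        &self.keys[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }

    /// Binary search over the sorted keys.
    pub fn get(&self, ngram: &str) -> Option<f32> {
        let (mut lo, mut hi) = (0, self.values.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            match self.key(mid).cmp(ngram) {
                std::cmp::Ordering::Less => lo = mid + 1,
                std::cmp::Ordering::Greater => hi = mid,
                std::cmp::Ordering::Equal => return Some(self.values[mid]),
            }
        }
        None
    }
}
```

The trade-off is O(log n) lookups instead of the map's expected O(1), in exchange for dropping the per-entry `String` header and hash-table overhead. Whether that is a net win for language detection would need measuring.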

@pemistahl pemistahl added the enhancement New feature or request label Nov 5, 2022
@pemistahl pemistahl added this to the Lingua 1.5.0 milestone Nov 5, 2022
@ghost

ghost commented Dec 17, 2022

Which files? If you need the processing in Python or JavaScript (Node), I can work on a Google Protocol Buffers format. I'm fairly sure the persisted model would be much lighter; whether the processing would be fast, I don't know.
Anyway, I'm glad to help.
I'm happy that you provide a JS binding as well; I'm looking for fast language detection that runs on Node.
Thanks

@ghost

ghost commented Dec 17, 2022

I know this only goes halfway, since you were asking for a better structure to reduce processing time. But for a big model in memory, here is a solution:

I changed the format slightly, from a regular Map<string: string> to Map<number[]: string[]>. I guess you treat the data that way anyway, so hopefully this is not a problem.
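For illustration, the format change could look roughly like the Rust sketch below. It assumes that each key of the original map is a fraction string such as "1/10" and that each value is a space-separated list of ngrams; the thread does not spell this out, so treat it as a guess. The `Pair` struct and the `regroup` function are hypothetical names, not part of Lingua or of the linked repository.

```rust
use std::collections::BTreeMap;

/// Hypothetical grouped entry: one fraction and all n-grams that share it.
#[derive(Debug, PartialEq)]
pub struct Pair {
    pub fraction: (u32, u32), // (numerator, denominator)
    pub ngrams: Vec<String>,
}

/// Converts the assumed string-to-string layout into numeric keys and n-gram lists.
pub fn regroup(raw: &BTreeMap<String, String>) -> Result<Vec<Pair>, String> {
    let mut pairs = Vec::with_capacity(raw.len());
    for (fraction, joined) in raw {
        // Assumption: keys look like "numerator/denominator".
        let (num, den) = fraction
            .split_once('/')
            .ok_or_else(|| format!("not a fraction: {fraction}"))?;
        let num: u32 = num.trim().parse().map_err(|e| format!("{fraction}: {e}"))?;
        let den: u32 = den.trim().parse().map_err(|e| format!("{fraction}: {e}"))?;
        pairs.push(Pair {
            fraction: (num, den),
            // Assumption: values are whitespace-separated n-grams.
            ngrams: joined.split_whitespace().map(str::to_owned).collect(),
        });
    }
    Ok(pairs)
}
```

Storing the fraction as two integers avoids re-parsing the key string on every load, which is presumably what makes the numeric form friendlier to a binary encoding.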

Here is a working example in JavaScript/Node: https://github.com/bacloud23/lingua-rs-bigrams

So here is how it goes:

  • You encode the JSON once into a lightweight binary file and persist it; it can be decoded back later.
  • With protocol buffers, you can load the binary file using the defined format and decode the entire object (it would be much lighter).
  • But if you work with the values of the ngrams key iteratively and not cumulatively (I guess so), you can (I guess) load one Pair at a time inside a loop. I think that comes with a processing cost, though (if it is possible at all).
  • You can do the same in Rust or in Python with the same proto model and the new encoding.

Drawback: new protobufjs dependency.
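The encode-once, decode-one-pair-at-a-time workflow described above can be sketched in Rust without any dependency. Note that this is a hand-rolled, length-prefixed binary format standing in for Protocol Buffers, not the protobuf wire format and not the code from the linked repository. The `Record` alias and the `write_record` / `read_record` functions are hypothetical names, and the one-byte length prefix assumes every ngram is shorter than 256 bytes.

```rust
use std::io::{self, Read, Write};

/// One grouped entry: (numerator, denominator) plus the n-grams sharing that fraction.
pub type Record = ((u32, u32), Vec<String>);

/// Writes one record: numerator, denominator, n-gram count (little-endian u32 each),
/// then every n-gram as a one-byte length followed by its UTF-8 bytes.
pub fn write_record<W: Write>(out: &mut W, record: &Record) -> io::Result<()> {
    let ((num, den), ngrams) = record;
    out.write_all(&num.to_le_bytes())?;
    out.write_all(&den.to_le_bytes())?;
    out.write_all(&(ngrams.len() as u32).to_le_bytes())?;
    for ngram in ngrams {
        out.write_all(&[ngram.len() as u8])?; // assumes n-grams shorter than 256 bytes
        out.write_all(ngram.as_bytes())?;
    }
    Ok(())
}

fn read_u32<R: Read>(input: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    input.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

/// Reads the next record, or returns Ok(None) when the input ends before a new
/// record starts. In this sketch a truncated first field also counts as the end.
pub fn read_record<R: Read>(input: &mut R) -> io::Result<Option<Record>> {
    let mut first = [0u8; 4];
    match input.read_exact(&mut first) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let num = u32::from_le_bytes(first);
    let den = read_u32(input)?;
    let count = read_u32(input)?;
    let mut ngrams = Vec::new();
    for _ in 0..count {
        let mut len = [0u8; 1];
        input.read_exact(&mut len)?;
        let mut bytes = vec![0u8; len[0] as usize];
        input.read_exact(&mut bytes)?;
        let ngram = String::from_utf8(bytes)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        ngrams.push(ngram);
    }
    Ok(Some(((num, den), ngrams)))
}
```

Calling `read_record` in a `while let Some(record) = read_record(&mut reader)?` loop keeps only one record in memory at a time, which matches the "one Pair at a time" idea. The cost is that every pass over the model decodes it again.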

@getreu

getreu commented Apr 27, 2023

@ghost: By how much does your solution reduce the binary size?

@pemistahl pemistahl modified the milestones: Lingua 1.5.0, Lingua 1.6.0 May 29, 2023
@pemistahl pemistahl modified the milestones: Lingua 1.6.0, Lingua 1.7.0 Oct 30, 2023
@pemistahl pemistahl removed this from the Lingua 1.7.0 milestone Aug 14, 2024