-
I like the idea of moving the inference libs to dynamic libraries. This will improve linkage time and reduce app size (since we'll only load what we're going to use), and the update of an inference lib can be done standalone and faster. However, we should check whether it is OK for mobile apps to ship dynamic library plugins on their app stores. Probably yes, but there are always some requirements, e.g. same code-signing, licensing, etc.
-
One question here is what to do with the demos... Perhaps this would mean creating a separate demo library?
-
Another question is how to manage Dummy. Dummy is used for testing, so it makes sense for it to remain in this repo, but on the other hand, if it does, it won't fully represent an inference lib.
-
Closing this as mostly resolved. Further clarifications are in #177, and issues will be opened on how to handle actual plugins (not plibs) on restrictive target platforms.
-
Currently, to use an inference lib with the local API you have to link with it and then add its `ModelLoader` to the `ModelFactory`. This is acceptable if C++ is the only language we support, but it becomes highly unsustainable when there are multiple wrappers.
A third-party inference lib would have to provide a load interface for all wrappers, and if we also consider third-party wrappers, this is close to impossible. A rough sketch of the current wiring follows.
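A minimal illustration of the current "push"-style wiring, assuming a factory/loader pair roughly like the SDK's `ModelFactory` and `ModelLoader`. All names and signatures here are placeholders, not the real API:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Placeholder for the local API's loader interface.
struct ModelLoader {
    virtual ~ModelLoader() = default;
    virtual void loadModel(const std::string& path) = 0;
};

// Placeholder for the local API's factory: loaders are registered by name.
struct ModelFactory {
    std::unordered_map<std::string, std::unique_ptr<ModelLoader>> loaders;
    void addLoader(std::string name, std::unique_ptr<ModelLoader> loader) {
        loaders[std::move(name)] = std::move(loader);
    }
};

// An inference lib (say, a llama.cpp wrapper) provides its own loader...
struct LlamaLoader final : ModelLoader {
    void loadModel(const std::string& /*path*/) override { /* open model, etc. */ }
};

int main() {
    ModelFactory factory;
    // ...and the consumer must link against the lib and push the loader in by hand.
    // Every language wrapper would have to repeat this for every inference lib.
    factory.addLoader("llama.cpp", std::make_unique<LlamaLoader>());
}
```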
So instead of the current "push"-like interface, we should consider inference lib plugins with a single `extern "C"` entry point to be loaded on demand. Then loading inference libs would be a matter of finding the plugin on disk, and inference libs and wrappers would be truly detached.

This is also a path to a solution for a couple of problems mentioned in #5: some inference libs (i.e. llama.cpp and whisper.cpp) might not be written in a way that supports multiple backends. For example, currently you cannot have a llama.cpp binary with both CUDA and Vulkan compute. With plugins one could have both and choose the compute backend based on a plugin.

Note that just using plugins is not enough on its own for this. It's a step, but not the only step. Allowing multiple instances of a compute lib has very significant build implications. So significant, even, that dropping C++ in favor of C for the Local inference API may become a consideration.
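A minimal sketch of what the plugin and host sides could look like, assuming a POSIX host (`dlopen`/`dlsym`) and a hypothetical entry-point symbol `ac_local_create_loader`; the symbol name, signature, and plugin file names are illustrative assumptions, not a finalized interface:

```cpp
// --- plugin side (built as e.g. aclp-llama-cuda.so; names are hypothetical) ---
// extern "C" ModelLoader* ac_local_create_loader() {
//     return new LlamaLoader();
// }

// --- host side: find the plugin on disk and load it on demand ---
#include <dlfcn.h>
#include <cstdio>

struct ModelLoader;  // opaque to the host until the plugin is loaded
using CreateLoaderFn = ModelLoader* (*)();

ModelLoader* loadPlugin(const char* path) {
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return nullptr;
    }
    // The only contract between host and plugin is this single C symbol,
    // so wrappers in any language can load plugins the same way.
    auto create = reinterpret_cast<CreateLoaderFn>(
        dlsym(handle, "ac_local_create_loader"));
    if (!create) {
        std::fprintf(stderr, "missing entry point: %s\n", dlerror());
        dlclose(handle);
        return nullptr;
    }
    return create();
}

int main() {
    // e.g. pick the CUDA or Vulkan build of the same inference lib at runtime
    loadPlugin("./aclp-llama-cuda.so");
}
```

With something like this, choosing CUDA vs Vulkan becomes a matter of which plugin file the host picks, rather than which backend the inference lib was compiled with.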
There is also an alternative, though, and it is adequate if the SDK's focus is servers and edge is a byproduct: do nothing. Assume that every consumer of the local SDK builds it themselves. Thus adding new inference libs (or new wrappers) will be handled on their end, in their build system. We can document how to do it, but not do anything more than that to accommodate them.