My initial experimentation with various models under Toolio was rather scattershot. For example, Hermes-2-Theta-Llama-3-8B was one of the first I tried, because I've always rated Nous Research fine-tunes highly for all sorts of purposes. Then yesterday I came across the BigCodeBench Leaderboard, and, no surprise, Theta ranks well there, though in its 70B incarnation. Anyway, I thought I'd start this ongoing discussion of LLMs that 1) work with MLX and 2) are good for tool-calling, based on various ratings & experiences.
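For anyone who wants to try one of these models under Toolio, here's a minimal sketch of the kind of tool-calling request involved, assuming a Toolio server is running locally with an OpenAI-style endpoint. The base URL, port, model id, and the `get_weather` tool are placeholders of mine rather than anything from Toolio's docs, so adjust to your setup:

```python
# Hypothetical sketch, not Toolio's documented API: the base URL, port,
# model id, and the get_weather tool below are placeholder assumptions.
from openai import OpenAI

# Point the standard OpenAI client at a locally running Toolio server,
# assuming it exposes an OpenAI-style /v1 endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# A toy tool definition in the OpenAI function-calling format
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mlx-community/Hermes-2-Theta-Llama-3-8B-4bit",  # assumed MLX build of the model
    messages=[{"role": "user", "content": "What's the weather in Boston right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the call shows up on the returned message
print(resp.choices[0].message)
```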
From the HF release post:
We just released the 🌸 BigCodeBench: testing LLMs on more realistic and harder coding tasks involving tool usage. 🛠
While benchmarks like HumanEval are saturating, even GPT-4o or DeepSeekCoder-v2 land at just around 50%, while humans get 97%!
A few highlights 🚀:
🛠 tasks utilize diverse function calls from 139 popular Python libraries.
🤓 complex, user-oriented instructions for each task
📊 includes verified examples and high test coverage
🙋‍♂️ comes in a standard function-completion form as well as an instruction version
Note: DeepSeekCoder-v2 is 236B params.
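If you want to poke at the benchmark tasks themselves rather than just the leaderboard, something along these lines should pull them down from the Hugging Face Hub. The dataset id and field names are my guesses from the release post, so double-check against the BigCodeBench repo:

```python
# Hypothetical sketch: the dataset id and field names are assumptions
# based on the release post; check the BigCodeBench repo for the canonical ones.
from datasets import load_dataset

ds = load_dataset("bigcode/bigcodebench")  # assumed Hub id
print(ds)  # lists the available splits and their columns

# Peek at one task from whichever split comes first
first_split = next(iter(ds.values()))
example = first_split[0]
print(sorted(example.keys()))
# The instruction-style prompt field name is an assumption:
print(str(example.get("instruct_prompt", ""))[:500])
```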