eval_mmlu? #136
Comments
Bug fixed.
Hi, sorry for the delay. What was the problem again?
Hi, I'd like to evaluate the performance of HQQ+ on MMLU. Today I printed the pred_answer list of eval_mmlu() after running eval_wikitext2() and found that it contained many '\n' entries. I thought I had solved the problem, but I still get a bad MMLU score (0.0342), while the perplexity is 5.601, which seems normal. Could you share your solution?
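(A quick way to see how badly the parsing is off is to tally the predictions. This is only an illustrative sketch; pred_answers stands for the per-question prediction list collected inside eval_mmlu() and is not a variable name from the hqq+ script.)

```python
from collections import Counter

# pred_answers: list of per-question predictions gathered inside eval_mmlu()
# (name is illustrative). A healthy run should be dominated by 'A'-'D';
# many '\n', '[' or other characters indicate the answer extraction is off.
print(Counter(pred_answers).most_common(10))
```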
Oh I see, which model exactly? Is this an issue with MMLU only, or with the other metrics as well?
Llama-2-7b-hf. Only MMLU at present; the ppl is normal.
Did you train the model yourself? How did you train it (what dataset, etc.)?
Yes, but I just ran hqq_plus.py and did not change any params. So the dataset is dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
The key to the question is the eval_mmlu() function: I printed the elements of answers and saw "[", when it should have been "A", "B", "C", or "D" (https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/run_mmlu_llama.py, line 172).
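(For comparison, a minimal sketch of a more forgiving answer extraction; this is not the code used in run_mmlu_llama.py, which compares the raw leading character, but it shows how outputs such as "[A]" or "\nA" could still be scored.)

```python
import re

def extract_choice(generated):
    """Return the first standalone A/B/C/D found in the model output, or None.

    Tolerates leading newlines, brackets, or prefixes such as "Answer: A".
    """
    match = re.search(r"\b([ABCD])\b", generated)
    return match.group(1) if match else None

# Outputs like these would otherwise be mis-scored by a first-character check:
assert extract_choice("\nA") == "A"
assert extract_choice("[B]") == "B"
assert extract_choice("The answer is C.") == "C"
```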
Sounds like it's overfitting to wikitext; that might explain why the ppl is good but MMLU is bad. You need proper instruct data (and the instruct model, not the base model) to get good MMLU performance, which is outside the scope of the hqq+ script.
OK, I am going to try another dataset or an instruct model, thank you :)
It's normally a mix of datasets, not just one. It's an art to figure out the right datasets :D
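(As an illustration of mixing sources with Hugging Face datasets: the dataset names, columns, and weights below are placeholders chosen for the example, not the mix used by the hqq+ authors.)

```python
from datasets import load_dataset, interleave_datasets

# Placeholder sources -- swap in whatever instruct data suits your model.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Reduce every source to a single 'text' column so they can be interleaved.
dolly = dolly.map(
    lambda ex: {"text": ex["instruction"] + "\n" + ex["response"]},
    remove_columns=dolly.column_names,
)
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# Sample ~70% instruct data, ~30% wikitext (weights are arbitrary placeholders).
mixed = interleave_datasets([dolly, wiki], probabilities=[0.7, 0.3], seed=42)
print(mixed[0]["text"][:200])
```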