Thanks to the Great Ms. Freax for designing the Athenas-Oracle project.
Based on it, we have made improvements and designed PaperHelper for machine learning scientists. By combining RAG Fusion and RAFT (RAG fine-tuning; the backend was fine-tuned with the GPT-4-1106-Preview API on the 52,000-paper MLArxivPapers corpus and the ArxivQA dataset), it effectively reduces hallucinations and improves retrieval relevance. We implemented an end-to-end pipeline with parallel generation that provides useful information to paper readers based on references ranked by relevance, and we incorporated structural relationships to represent the extracted information.
In short, everything is designed to help a machine learning researcher read papers more efficiently and to surface the most reliable references from a paper's citations!
The assistant uses three tools: search, gather evidence, and answer question. These tools let it find and parse relevant full-text research papers, identify the specific sections of a paper that help answer the question, summarize those sections in the context of the question (called evidence), and then generate an answer based on that evidence. Because it is an agent, the LLM orchestrating the tools can adjust the input to paper searches, gather evidence with different phrases, and assess whether an answer is complete.
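A minimal sketch of what such an agent loop can look like, assuming hypothetical helpers (`llm_choose_action`, `find_and_parse_papers`, `rank_sections`, `summarize`, `generate_answer`) that are illustrative and not the actual PaperHelper API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    papers: list = field(default_factory=list)     # parsed full-text papers
    evidence: list = field(default_factory=list)   # question-conditioned summaries

def search(state, query):
    """Find and parse full-text papers matching the (possibly rephrased) query."""
    state.papers.extend(find_and_parse_papers(query))            # hypothetical helper

def gather_evidence(state, phrase):
    """Select paper sections relevant to the phrase and summarize them as evidence."""
    for section in rank_sections(state.papers, phrase):          # hypothetical helper
        state.evidence.append(summarize(section, state.question))

def answer(state):
    """Generate an answer grounded in the collected evidence."""
    return generate_answer(state.question, state.evidence)       # hypothetical helper

def run_agent(question, max_steps=8):
    state = AgentState(question)
    for _ in range(max_steps):
        # The orchestrating LLM decides which tool to call next and with what input.
        action, arg = llm_choose_action(state)                   # hypothetical helper
        if action == "search":
            search(state, arg)
        elif action == "gather_evidence":
            gather_evidence(state, arg)
        elif action == "answer":
            return answer(state)
    return answer(state)
```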
Basic RAG simply splits the search prompt into keywords in a crude manner, and may produce hallucinations because it never truly understands the user's intent.
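RAG Fusion addresses this by having the LLM rewrite the user's question into several query variants, retrieving results for each variant, and merging the ranked lists, commonly with reciprocal rank fusion. A minimal sketch of that idea, where `generate_query_variants` and `retrieve` are assumed helpers rather than functions from this codebase:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists, rewarding documents that appear
    near the top of many lists (RRF score = sum of 1 / (k + rank))."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def rag_fusion_search(question, n_variants=4, top_k=8):
    # Hypothetical helpers: an LLM call that rewrites the question, and a
    # retriever that returns document ids in ranked order for one query.
    variants = generate_query_variants(question, n_variants)
    ranked_lists = [retrieve(q, top_k) for q in [question, *variants]]
    return reciprocal_rank_fusion(ranked_lists)[:top_k]
```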
Our system also integrates the RAFT method. RAFT improves an LLM's performance on specific RAG tasks by exploiting the core idea that if the LLM can "learn" the documents in advance, its RAG performance improves.
We fine-tuned the model through the OpenAI API on 52,000 domain-specific papers from the field of machine learning to augment PaperHelper's knowledge of the machine learning domain, thereby helping machine learning scientists read papers more efficiently and accurately.
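As a rough illustration of what a RAFT-style training record can look like (following the general RAFT recipe of pairing a question with one relevant "oracle" passage, several distractor passages, and an answer grounded in the oracle; the field contents and file name are placeholders, not the exact dataset schema used here):

```python
import json

# One RAFT-style fine-tuning record in OpenAI chat-format JSONL: the prompt
# mixes the oracle passage with distractors, and the target answer cites
# only the oracle passage.
record = {
    "messages": [
        {"role": "system",
         "content": "Answer the question using only the relevant context passages."},
        {"role": "user",
         "content": (
             "Question: What does the paper propose to reduce hallucinations?\n"
             "Context 1 (oracle): <passage from the cited paper>\n"
             "Context 2 (distractor): <passage from an unrelated paper>\n"
             "Context 3 (distractor): <passage from an unrelated paper>"
         )},
        {"role": "assistant",
         "content": "According to Context 1, the paper proposes ..."},
    ]
}

with open("raft_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```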
With RAFT in place, we can extract the reference section at the end of an article more efficiently. First, we use RAG to traverse all the references in the article. Then, drawing on the LLM's knowledge, we refine this information with a top-k ranking to identify the literature most relevant to the article.
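A minimal sketch of the top-k step, assuming the references have already been extracted and embedded (the caller supplies the embeddings; this is not the project's exact pipeline):

```python
import numpy as np

def top_k_references(paper_embedding, reference_embeddings, references, k=10):
    """Rank extracted references by cosine similarity to the paper embedding
    and keep the k most relevant ones."""
    paper = paper_embedding / np.linalg.norm(paper_embedding)
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    scores = refs @ paper
    order = np.argsort(scores)[::-1][:k]
    return [(references[i], float(scores[i])) for i in order]
```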
Through the RAFT method, the model integrates up-to-date knowledge, so readers can explore academic papers further based on current information rather than outdated or misleading content.
Use the following commands step by step:
- Clone the Repository
git clone https://github.com/JerryYin777/PaperHelper.git
- Install Dependencies
cd PaperHelper
pip install -r requirements.txt
- Set OpenAI API Key
cd .streamlit
touch secrets.toml  # add OPENAI_API_KEY = "sk-yourapikeyhere" to this file
- Start PaperHelper
streamlit run app.py
Note:
- Set allow_dangerous_deserialization: bool = True first; you can find it in faiss.py (see the sketch after these notes).
- Embed your PDF in the application first (click the button); otherwise you may get the error Exception: Directory index does not exist.
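For context, this flag corresponds to the allow_dangerous_deserialization argument that LangChain's FAISS wrapper requires when loading a pickled index from disk. A minimal sketch of that call (the index path is a placeholder, and the import paths may differ slightly depending on your LangChain version):

```python
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
# Loading a locally pickled FAISS index requires opting in to pickle
# deserialization; only enable this for index files you created yourself.
index = FAISS.load_local(
    "index_dir",                      # placeholder path to the saved index
    embeddings,
    allow_dangerous_deserialization=True,
)
```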