Tip
This program is designed for efficient retrieval and evaluation on AICUP-2024 competition datasets, and is especially useful for document-based question answering. It includes BM25- and FAISS-based retrieval methods along with optional preprocessing and scoring tools, and supports settings such as custom dictionaries and weighted scoring for fine-tuning retrieval accuracy.
For a comprehensive overview of the program's logic and algorithms, refer to the utils/README.md file.
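As background, the BM25 component ranks documents with the standard Okapi BM25 formula. The following is a minimal sketch only, not the project's actual implementation (see utils/README.md for that); the `k1`/`b` defaults mirror the values in config.yaml:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=0.5, b=0.7):
    """Score one tokenized document against a query with Okapi BM25.

    corpus is a list of tokenized documents; k1/b defaults follow config.yaml.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score
```

Documents containing more of the query terms (relative to their length) receive higher scores; the real system additionally mixes this score with FAISS similarity and the other weighted components.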
Warning
Ensure that your CUDA version and GPU are compatible with the dependencies in the requirements.txt
file. For GPU-accelerated tasks, make sure you have the appropriate CUDA version installed and that your GPU drivers are up to date.
- Install required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set the question range in `bm25_retrieve_v2.py`: modify the `RANGE` variable to specify which questions to process. For example:

  ```python
  RANGE = range(0, 150)  # Process questions 0 to 149 (150 questions in total)
  ```

  This setting controls which subset of questions will be processed during retrieval.
- Open `retrieve_v2.sh` and check the file paths for the following variables:

  - `--question_path`: path to the file containing the questions to retrieve.
  - `--source_path`: path to the folder containing the source documents or dataset.
  - `--output_path`: path where the retrieval results will be saved.
  - `--load_path`: path to a custom dictionary or resource file, such as a frequency dictionary.

  Example `retrieve_v2.sh` file:

  ```bash
  #!/bin/bash
  # ... check word2vec/wiki.zh.bin
  python3 bm25_retrieve_v2.py \
      --question_path ./CompetitionDataset/dataset/preliminary/questions_example.json \
      --source_path ./CompetitionDataset/reference \
      --output_path ./CompetitionDataset/dataset/preliminary/pred_retrieve.json \
      --load_path ./custom_dicts/with_frequency
  ```
- Once you've verified the file paths, open your terminal, navigate to the directory containing the `retrieve_v2.sh` script, and run:

  ```bash
  ./retrieve_v2.sh
  ```

  This starts the retrieval process, and the results will be saved to the file specified in `--output_path`.

- After the script finishes running, check the location specified in `--output_path` to view the retrieval results.
Tip
You can find detailed logs in the logs/ directory:

- `retrieve_YYYY-MM-DD_HH-MM-SS.log`: contains retrieval process logs and results
- `chunk_YYYY-MM-DD_HH-MM-SS.log`: contains word segmentation results and chunking details
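The chunking recorded in these logs can be pictured as a sliding window over the document text. Below is a simplified sketch, assuming the `chunk_size`/`overlap` semantics from config.yaml; the project's actual chunker may differ:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks.

    Sketch of the behavior controlled by chunk_size/overlap in config.yaml:
    each chunk starts (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters of context.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # last window reached the end
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.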
Note
If you have a ground_truths.json file, you can also run `python3 score.py` to evaluate the retrieval results. The evaluation results will be saved in `logs/score_YYYY-MM-DD_HH-MM-SS.log`.
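The exact metric score.py computes isn't documented here; as an illustration, a Precision@1-style check over the two JSON files might look like the sketch below. The field names (`ground_truths`, `answers`, `qid`, `retrieve`) are assumptions about the JSON layout, not confirmed by this README:

```python
import json

def precision_at_1(ground_truth_path, prediction_path):
    """Fraction of questions whose predicted document matches the ground truth.

    Assumes both files map each question id ("qid") to one document id
    ("retrieve"); adjust the keys to the real schema if it differs.
    """
    with open(ground_truth_path, encoding="utf-8") as f:
        truths = {g["qid"]: g["retrieve"] for g in json.load(f)["ground_truths"]}
    with open(prediction_path, encoding="utf-8") as f:
        preds = {a["qid"]: a["retrieve"] for a in json.load(f)["answers"]}
    hits = sum(1 for qid, doc in truths.items() if preds.get(qid) == doc)
    return hits / len(truths) if truths else 0.0
```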
- If you want to experiment with different parameters, you can modify the settings in `config.yaml`. This file contains various configurable parameters that control the retrieval behavior:

  ```yaml
  # Core parameters
  load_all_data: false  # Whether to load all data at once (not developed yet)
  bm25_k1: 0.5          # BM25 k1 parameter
  bm25_b: 0.7           # BM25 b parameter
  chunk_size: 500       # Size of text chunks
  overlap: 100          # Overlap between chunks

  # Scoring method weights
  base_weights:
    bm25: 0.20        # Weight for BM25 score
    faiss: 0.3        # Weight for FAISS similarity
    importance: 0.0   # Weight for term importance
    semantic: 0.1     # Weight for semantic matching
    coverage: 0.1     # Weight for query coverage
    position: 0.1     # Weight for position scoring
    density: 0.15     # Weight for term density
    context: 0.05     # Weight for context similarity
  ```
Adjust these parameters based on your specific needs and run the retrieval script again to see the effects.
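The base_weights are most naturally combined as a weighted sum of the per-method scores. Here is a minimal sketch of that combination, assuming each component score is already normalized to [0, 1] (the project's weighted_scorer.py may normalize differently):

```python
def combined_score(component_scores, weights):
    """Weighted sum of per-component relevance scores.

    component_scores maps method name -> score (assumed in [0, 1]);
    weights mirrors the base_weights section of config.yaml.
    Methods missing from either dict contribute nothing.
    """
    return sum(weights.get(name, 0.0) * score
               for name, score in component_scores.items())

# Weights copied from the config.yaml example above.
weights = {
    "bm25": 0.20, "faiss": 0.3, "importance": 0.0, "semantic": 0.1,
    "coverage": 0.1, "position": 0.1, "density": 0.15, "context": 0.05,
}
```

Raising one weight (e.g. `faiss`) shifts the ranking toward that signal, which is why the weights are the first thing to tune when experimenting.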
Important
Some parameters in `config.yaml` are still under development and may need further tuning. Check back later for recommended settings.
Tip
- You may want to add common words or stop words. For details, refer to custom_dicts/README.md.
- If you want to train your own Word2Vec model, check out word2vec/README.md for more information.
If you want to preprocess the data using OCR and text extraction, you can use the tools in the LangChain_ORC/ directory:
```bash
cd LangChain_ORC
./preprocess.sh
```
This will:
- Extract text from PDF files in the reference directory
- Format FAQ documents for better retrieval
- Generate preprocessed files in a structured format
See LangChain_ORC/README.md for detailed preprocessing instructions and options.
Note
Data preprocessing is optional. The retrieval system will work with raw files, but preprocessing may improve results for certain document types.
Run retrieve_v2.sh

- Build and run the container:

  ```bash
  docker compose up -d --build
  ```

- Run retrieve_v2.sh:

  ```bash
  docker exec baseline conda run -n baseline bash /app/retrieve_v2.sh
  ```
Run retrieve_v3.sh

Part 1: Run the OCR preprocessing

- Enter the LangChain_ORC/ directory:

  ```bash
  cd LangChain_ORC
  ```

- Build the Docker image and run the OCR process:

  ```bash
  docker compose up -d --build
  ```

Part 2: Run retrieve_v3.sh

- Build and run the container:

  ```bash
  docker compose up -d --build
  ```

- Run retrieve_v3.sh:

  ```bash
  docker exec baseline conda run -n baseline bash /app/retrieve_v3.sh
  ```
```
AICUP-2024/                              # Project folder
├── Baseline/                            # Official example folder
│   ├── README.md                        # Introduction
│   ├── README.pdf                       # PDF introduction
│   ├── bm25_retrieve.py                 # Example program
│   ├── bm25_with_faiss.py               # FAISS test
│   ├── langchain_test.py                # LangChain test
│   ├── new_bm25_retrieve.py             # New BM25 test program
│   └── retrieve-test.py                 # Retrieval test
├── CompetitionDataset/                  # Dataset
│   ├── dataset/
│   │   └── preliminary/
│   │       ├── ground_truths_example.json  # Example file with ground-truth answers
│   │       ├── pred_retrieve.json          # Example file with predicted retrieval results
│   │       └── questions_example.json      # Example file with questions
│   └── reference/                       # Reference data folder
│       ├── faq/                         # FAQ data
│       │   └── pid_map_content.json
│       ├── finance/                     # Finance-related data
│       │   ├── 0.pdf
│       │   ├── 1.pdf
│       │   └── ...
│       └── insurance/                   # Insurance-related data
│           ├── 0.pdf
│           ├── 1.pdf
│           └── ...
├── LangChain_ORC/                       # OCR for reading PDFs
│   ├── docker-compose.yaml              # Docker
│   ├── dockerfile                       # Docker build
│   ├── faq_format.py                    # FAQ formatting
│   ├── pdf_extractor.py                 # PDF formatting
│   └── preprocess.sh                    # Preprocessing main script
├── custom_dicts/                        # Custom dictionaries
│   ├── origin_dict/                     # Original dictionaries
│   │   ├── common_use_dict.txt          # Common words
│   │   └── insurance_dict.txt           # Insurance terms
│   ├── with_freqency/                   # Dictionaries with word frequencies
│   │   ├── common_use_dict_with_frequency.txt  # Common words with frequencies
│   │   └── insurance_dict_with_frequency.txt   # Insurance terms with frequencies
│   ├── add_dict_frequency.py            # Add word frequencies to the original dictionaries
│   ├── dict.txt.big                     # Traditional Chinese dictionary
│   └── stopwords.txt                    # Stop words
├── word2vec/                            # Word2Vec word-embedding models
│   ├── corpusSegDone.txt                # Segmented corpus
│   ├── load_pretrain.py                 # Load a pretrained Word2Vec model
│   ├── model.bin                        # Self-trained model
│   ├── segment_corpus.log               # Corpus segmentation and word-frequency overview
│   ├── segment_corpus.py                # Segment the corpus
│   ├── train_word2vec.py                # Train a Word2Vec model
│   ├── transfer_vec2bin.py              # Convert a model from .vec to .bin
│   └── wiki.zh.bin                      # wiki.zh pretrained model
├── utils/                               # Utilities
│   ├── rag_processor/                   # All RAG functionality (modularization not yet complete)
│   │   ├── readers/                     # File-reading utilities
│   │   │   ├── __init__.py              # Exports the file-reading module
│   │   │   ├── document_loader.py       # Document loader
│   │   │   ├── json_reader.py           # Read JSON
│   │   │   └── pdf_reader.py            # Read PDF
│   │   ├── retrieval_system/            # Retrieval utilities
│   │   │   ├── __pycache__
│   │   │   ├── __init.py                # Exports the retrieval module
│   │   │   ├── bm25_retrival.py         # BM25 (not yet modularized)
│   │   │   ├── context_similarity.py    # Compute context similarity
│   │   │   ├── faiss_retrieval.py       # FAISS vector retrieval
│   │   │   ├── position_score.py        # Compute position score
│   │   │   ├── query_coverage.py        # Compute query-term coverage
│   │   │   ├── reranker.py              # Vector-search reranking (unfinished)
│   │   │   ├── retrieval_system.py      # Integrated retriever
│   │   │   ├── semantic_similarity.py   # Compute semantic similarity
│   │   │   ├── term_density.py          # Compute term density
│   │   │   └── term_importance.py       # Compute term importance
│   │   ├── scoring/                     # Scoring utilities
│   │   │   ├── __pycache__
│   │   │   ├── __init__                 # Exports the scoring module
│   │   │   ├── base_scorer.py           # Base scorer (unfinished)
│   │   │   ├── bm25_scorer.py           # BM25 main scorer
│   │   │   └── weighted_scorer.py       # Multi-dimensional weighted scorer
│   │   ├── __init__.py                  # Exports the core components of the RAG system, providing a unified external interface
│   │   ├── config.py                    # Read and store the config file
│   │   ├── document_processor.py        # Document processor
│   │   ├── document_score_calculator.py # Document scorer
│   │   ├── query_processor.py           # Query processor
│   │   └── resource_loader.py           # Resource loader
│   ├── RAGProcessor.py                  # RAG retriever
│   ├── README.md                        # Explanation of the integrated retrieval approach
│   ├── __init__.py
│   └── env.py                           # Check environment variables
├── logs/                                # Run logs
│   ├── retrieve_xxxx-xx-xx_xx-xx-xx.log # Retrieval status
│   ├── chunk_xxxx-xx-xx_xx-xx-xx.log    # Document chunking status
│   └── score_xxxx-xx-xx_xx-xx-xx.log    # Evaluation results
├── .env                                 # Environment variables
├── .gitignore
├── README.md                            # Usage instructions
├── bm25_retrieve_v2.py                  # Main program (reads PDFs directly)
├── bm25_retrieve_v3.py                  # Main program (loads preprocessed JSON; json_reader not fully updated, not yet merged with v2)
├── config.yaml                          # Main-program runtime parameters (some parameters not yet finalized)
├── docker-compose.yaml                  # Docker Compose
├── dockerfile                           # Docker build
├── requirements.txt                     # Modules to install
├── retrieve.log                         # Retrieval status
├── retrieve_v2.sh                       # Run the main program (reads PDFs directly)
├── retrieve_v3.sh                       # Run the main program (loads preprocessed JSON; json_reader not fully updated, not yet merged with v2)
├── score.log                            # Evaluation results
└── score.py                             # Evaluate run results
```