Hsun1128/AICUP-2024

AICUP-2024

AICUP Team Logo

Tip

This program is designed for efficient retrieval and evaluation tasks, especially useful for document-based question answering. It includes BM25 and FAISS-based retrieval methods, along with optional preprocessing and scoring tools, for AICUP-2024 competition datasets. This project supports various settings, such as custom dictionaries and weighted scoring, enabling fine-tuning for optimized retrieval accuracy.

🚀 Program Description

For a comprehensive overview of the program's logic and algorithms, refer to the utils/README.md file.

📦 Installation and Start

Warning

Ensure that your CUDA version and GPU are compatible with the dependencies in the requirements.txt file. For GPU-accelerated tasks, make sure you have the appropriate CUDA version installed and that your GPU drivers are up to date.

  1. Install required dependencies:

    pip install -r requirements.txt
  2. Set the question range in bm25_retrieve_v2.py:

    Modify the RANGE variable to specify which questions to process. For example:

    RANGE = range(0, 150)  # Process questions 0 through 149 (150 questions in total)

    This setting controls which subset of questions will be processed during retrieval.
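    For illustration, this is roughly how a RANGE setting subsets the question list. This is a hedged sketch with placeholder data; the actual loop and field names live in bm25_retrieve_v2.py.

    ```python
    # Hypothetical sketch: RANGE selects which question indices are processed.
    RANGE = range(0, 150)  # questions 0..149

    questions = [{"qid": i} for i in range(300)]  # placeholder question list
    selected = [q for i, q in enumerate(questions) if i in RANGE]
    print(len(selected))  # 150
    ```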

  3. Open retrieve_v2.sh and check the file paths for the following variables:

    • --question_path: Path to the file containing the questions to retrieve.
    • --source_path: Path to the folder containing the source documents or dataset.
    • --output_path: Path where the retrieval results will be saved.
    • --load_path: Path to a custom dictionary or resource file, such as a frequency dictionary.

    Example of retrieve_v2.sh file:

    #!/bin/bash
    # ... check word2vec/wiki.zh.bin
    python3 bm25_retrieve_v2.py \
      --question_path ./CompetitionDataset/dataset/preliminary/questions_example.json \
      --source_path ./CompetitionDataset/reference \
      --output_path ./CompetitionDataset/dataset/preliminary/pred_retrieve.json \
      --load_path ./custom_dicts/with_frequency
  4. Once you've verified the file paths, open a terminal, navigate to the directory containing retrieve_v2.sh, and run:

    ./retrieve_v2.sh

    This will start the retrieval process, and the results will be saved to the file specified in --output_path.

  5. After the script finishes running, you can check the output at the location specified in the --output_path to view the retrieval results.

Tip

You can find detailed logs in the logs/ directory:

  • retrieve_YYYY-MM-DD_HH-MM-SS.log: Contains retrieval process logs and results
  • chunk_YYYY-MM-DD_HH-MM-SS.log: Contains word segmentation results and chunking details

Note

If you have a ground_truths.json file, you can also run python3 score.py to evaluate the retrieval results. The evaluation results will be saved in logs/score_YYYY-MM-DD_HH-MM-SS.log.
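The evaluation above amounts to comparing predicted retrievals against the ground truth per question. The sketch below is a hedged illustration of that idea, assuming the common AICUP JSON layout where each entry maps a question id ("qid") to a retrieved document id ("retrieve"); check score.py for the actual schema.

```python
# Hypothetical sketch of a precision@1-style check between the prediction
# file and ground_truths.json. Field names are assumptions, not confirmed.
import json

def precision_at_1(pred_path: str, truth_path: str) -> float:
    with open(pred_path, encoding="utf-8") as f:
        preds = {a["qid"]: a["retrieve"] for a in json.load(f)["answers"]}
    with open(truth_path, encoding="utf-8") as f:
        truths = {g["qid"]: g["retrieve"] for g in json.load(f)["ground_truths"]}
    # Fraction of questions whose top retrieved document matches the truth.
    hits = sum(1 for qid, doc in truths.items() if preds.get(qid) == doc)
    return hits / len(truths)
```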

  6. If you want to experiment with different parameters, you can modify the settings in config.yaml:

    The config.yaml file contains various configurable parameters that control the retrieval behavior:

    # Core parameters
    load_all_data: false       # Whether to load all data at once (not developed yet)
    bm25_k1: 0.5               # BM25 k1 parameter
    bm25_b: 0.7                # BM25 b parameter
    chunk_size: 500            # Size of text chunks
    overlap: 100               # Overlap between chunks
    
    # Scoring method weights
    base_weights:
      bm25: 0.20              # Weight for BM25 score
      faiss: 0.3              # Weight for FAISS similarity
      importance: 0.0         # Weight for term importance
      semantic: 0.1           # Weight for semantic matching
      coverage: 0.1           # Weight for query coverage
      position: 0.1           # Weight for position scoring
      density: 0.15           # Weight for term density
      context: 0.05           # Weight for context similarity

    Adjust these parameters based on your specific needs and run the retrieval script again to see the effects.
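    The base_weights above suggest a weighted-sum combination of the per-method scores. The following is only a sketch of that idea, assuming each method's score has been normalized to [0, 1]; the real logic lives in utils/rag_processor (e.g. weighted_scorer.py) and may differ.

    ```python
    # Hypothetical sketch: combine per-method scores with the config weights.
    from typing import Dict

    BASE_WEIGHTS = {
        "bm25": 0.20, "faiss": 0.3, "importance": 0.0, "semantic": 0.1,
        "coverage": 0.1, "position": 0.1, "density": 0.15, "context": 0.05,
    }

    def weighted_score(scores: Dict[str, float],
                       weights: Dict[str, float] = BASE_WEIGHTS) -> float:
        """Weighted sum of normalized per-method scores; missing methods count as 0."""
        return sum(weights[name] * scores.get(name, 0.0) for name in weights)

    # Example: a chunk that scores highly on BM25 and FAISS.
    doc_scores = {"bm25": 0.9, "faiss": 0.8, "semantic": 0.5, "coverage": 0.6,
                  "position": 0.4, "density": 0.7, "context": 0.3}
    print(round(weighted_score(doc_scores), 3))  # 0.69
    ```

    Raising a weight (say, faiss) shifts the ranking toward that signal, which is why the weights should sum to roughly 1 if you want scores to stay comparable across runs.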

Important

Some parameters in config.yaml are still under development and may require further tuning; recommended settings will be documented once they are finalized.

Tip

🌟 You may want to add common words or stop words. For details, refer to custom_dicts/README.md.
🧠 If you want to train your own Word2Vec model, check out word2vec/README.md for more information.

🔄 [Optional] Data Preprocessing

If you want to preprocess the data using OCR and text extraction, you can use the tools in the LangChain_ORC/ directory:

cd LangChain_ORC
./preprocess.sh

This will:

  • Extract text from PDF files in the reference directory
  • Format FAQ documents for better retrieval
  • Generate preprocessed files in a structured format
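The FAQ-formatting step, for example, amounts to flattening faq/pid_map_content.json into one retrievable text blob per document id. This sketch is illustrative only: the field names ("question", "answers") are assumptions, and faq_format.py is the authoritative implementation.

```python
# Hypothetical sketch of flattening the FAQ pid-to-content map for retrieval.
import json

def flatten_faq(pid_map: dict) -> dict:
    """Join each pid's question/answer entries into a single text block."""
    flattened = {}
    for pid, entries in pid_map.items():
        parts = []
        for entry in entries:
            parts.append(entry["question"])
            parts.extend(entry["answers"])
        flattened[pid] = "\n".join(parts)
    return flattened
```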

See LangChain_ORC/README.md for detailed preprocessing instructions and options.

Note

Data preprocessing is optional. The retrieval system will work with raw files, but preprocessing may improve results for certain document types.

๐Ÿณ Quick Start with Docker

  1. Build and run the container:
    docker compose up -d --build
  2. Run retrieve_v2.sh inside the container:
    docker exec baseline conda run -n baseline bash /app/retrieve_v2.sh

Part 1: Run the preprocessing

  1. Enter the LangChain_ORC/ directory:
    cd LangChain_ORC
  2. Build the container and run the OCR process:
    docker compose up -d --build

Part 2: Run retrieve_v3.sh

  1. Build and run the container:
    docker compose up -d --build
  2. Run retrieve_v3.sh inside the container:
    docker exec baseline conda run -n baseline bash /app/retrieve_v3.sh

📂 Data Structure

AICUP-2024/     # project folder
├─ Baseline/                 # official example folder
│   ├─ README.md               # introduction
│   ├─ README.pdf              # PDF introduction
│   ├─ bm25_retrieve.py        # sample program
│   ├─ bm25_with_faiss.py      # FAISS experiment
│   ├─ langchain_test.py       # LangChain experiment
│   ├─ new_bm25_retrieve.py    # new BM25 experiment
│   └─ retrieve-test.py        # retrieval experiment
├─ CompetitionDataset/                        # dataset
│   ├─ dataset/
│   │   └─ preliminary/
│   │        ├─ ground_truths_example.json      # sample file with ground-truth answers
│   │        ├─ pred_retrieve.json              # sample file of predicted retrieval results
│   │        └─ questions_example.json          # sample file containing the questions
│   └─ reference/                      # reference documents folder
│        ├─ faq/                         # FAQ data
│        │   └─ pid_map_content.json
│        ├─ finance/                     # finance-related documents
│        │   ├─ 0.pdf
│        │   ├─ 1.pdf
│        │   └─ ...
│        └─ insurance/                   # insurance-related documents
│             ├─ 0.pdf
│             ├─ 1.pdf
│             └─ ...
├─ LangChain_ORC/             # OCR-based PDF reading
│   ├─ docker-compose.yaml      # Docker Compose file
│   ├─ dockerfile               # Docker image definition
│   ├─ faq_format.py            # FAQ formatting
│   ├─ pdf_extractor.py         # PDF extraction
│   └─ preprocess.sh            # main preprocessing script
├─ custom_dicts/                               # custom dictionaries
│   ├─ origin_dict/                              # original dictionaries
│   │   ├─ common_use_dict.txt                     # common words
│   │   └─ insurance_dict.txt                      # insurance terms
│   ├─ with_freqency/                            # dictionaries with word frequencies
│   │   ├─ common_use_dict_with_frequency.txt      # common words with frequencies
│   │   └─ insurance_dict_with_frequency.txt       # insurance terms with frequencies
│   ├─ add_dict_frequency.py                     # adds frequencies to the original dictionaries
│   ├─ dict.txt.big                              # Traditional Chinese dictionary
│   └─ stopwords.txt                             # stop words
├─ word2vec/                 # Word2Vec embedding models
│   ├─ corpusSegDone.txt       # segmented corpus
│   ├─ load_pretrain.py        # loads a pretrained Word2Vec model
│   ├─ model.bin               # self-trained model
│   ├─ segment_corpus.log      # corpus segmentation word-frequency preview
│   ├─ segment_corpus.py       # segments the corpus
│   ├─ train_word2vec.py       # trains the Word2Vec model
│   ├─ transfer_vec2bin.py     # converts a model from .vec to .bin
│   └─ wiki.zh.bin             # wiki.zh pretrained model
├─ utils/                                      # utilities
│   ├─ rag_processor/                            # all RAG functionality (not yet fully modularized)
│   │   ├─ readers/                                # file-reading tools
│   │   │   ├─ __init__.py                           # exports the document-reading modules
│   │   │   ├─ document_loader.py                    # document loader
│   │   │   ├─ json_reader.py                        # reads JSON
│   │   │   └─ pdf_reader.py                         # reads PDF
│   │   ├─ retrieval_system/                       # retrieval tools
│   │   │   ├─ __pycache__
│   │   │   ├─ __init.py                             # exports the retrieval modules
│   │   │   ├─ bm25_retrival.py                      # BM25 (not yet separated)
│   │   │   ├─ context_similarity.py                 # computes context similarity
│   │   │   ├─ faiss_retrieval.py                    # FAISS vector retrieval
│   │   │   ├─ position_score.py                     # computes position scores
│   │   │   ├─ query_coverage.py                     # computes query term coverage
│   │   │   ├─ reranker.py                           # vector-search reranking (unfinished)
│   │   │   ├─ retrieval_system.py                   # integrated retriever
│   │   │   ├─ semantic_similarity.py                # computes semantic similarity
│   │   │   ├─ term_density.py                       # computes term density
│   │   │   └─ term_importance.py                    # computes term importance
│   │   ├─ scoring/                                # scoring tools
│   │   │   ├─ __pycache__
│   │   │   ├─ __init__                              # exports the scoring modules
│   │   │   ├─ base_scorer.py                        # base scorer (unfinished)
│   │   │   ├─ bm25_scorer.py                        # primary BM25 scoring
│   │   │   └─ weighted_scorer.py                    # multi-dimensional weighted scoring
│   │   ├─ __init__.py                             # exports the core components of each RAG processing module as a unified external interface
│   │   ├─ config.py                               # reads and saves the config file
│   │   ├─ document_processor.py                   # document processor
│   │   ├─ document_score_calculator.py            # document scorer
│   │   ├─ query_processor.py                      # query processor
│   │   └─ resource_loader.py                      # resource loader
│   ├─ RAGProcessor.py                           # RAG retriever
│   ├─ README.md                                 # description of the full retrieval approach
│   ├─ __init__.py
│   └─ env.py                                    # checks environment variables
├─ logs/                                    # run logs
│   ├─ retrieve_xxxx-xx-xx_xx-xx-xx.log       # retrieval logs
│   ├─ chunk_xxxx-xx-xx_xx-xx-xx.log          # document chunking logs
│   └─ score_xxxx-xx-xx_xx-xx-xx.log          # scoring results
├─ .env                   # environment variables
├─ .gitignore
├─ README.md              # usage documentation
├─ bm25_retrieve_v2.py    # main program (reads PDFs directly)
├─ bm25_retrieve_v3.py    # main program (loads preprocessed JSON; json_reader unfinished, not yet merged with v2)
├─ config.yaml            # runtime parameters for the main program (some not yet finalized)
├─ docker-compose.yaml    # Docker Compose file
├─ dockerfile             # Docker image definition
├─ requirements.txt       # required modules
├─ retrieve.log           # retrieval logs
├─ retrieve_v2.sh         # runs the main program (reads PDFs directly)
├─ retrieve_v3.sh         # runs the main program (loads preprocessed JSON; json_reader unfinished, not yet merged with v2)
├─ score.log              # scoring results
└─ score.py               # evaluates run results
