This project demonstrates efficient vector database querying using GPU acceleration, applied to a large-scale dataset of Wikipedia articles. The process involves embedding generation, clustering, and GPU-accelerated approximate nearest neighbor search to retrieve relevant articles for input queries.
We use the Wikipedia plain-text dataset from November 2020. You can download it from Kaggle:
Plain Text Wikipedia 202011 Dataset
(The preprocessing below has already been done and saved on CUDA5 under `/scratch/pvg2018/`. If you don't have access, email [email protected], [email protected], or [email protected].)
- Embedding Generation: Use `embedding.py` to generate vector embeddings for the Wikipedia articles.
- Clustering: Apply K-means clustering with `cluster.py` to group the embeddings into 128 clusters. Save the clustered data in a designated folder.
- Data Conversion: Convert the `.npy` files to `.bin` format using `convert_npy_bin.py` to ensure compatibility with C++ (a sketch of these preprocessing steps follows this list).
- Query Processing: Use the pre-generated embeddings in the `queries_data` folder or create custom ones (see the Queries subsection for more info).
- Approximate Nearest Neighbor Search: Compile and execute `IVF.cpp` to find the closest matching article to the query.
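The repository's scripts contain the authoritative implementations; as a rough sketch of the preprocessing pipeline, assuming a sentence-transformers embedding model and scikit-learn's K-means (the model name, file layout, and raw-float32 `.bin` format below are assumptions, not the repo's actual choices):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Embedding model is an assumption; check embedding.py for the one actually used.
model = SentenceTransformer("all-MiniLM-L6-v2")
articles = [f"placeholder article {i}" for i in range(256)]  # the Wikipedia dump in practice
embeddings = model.encode(articles).astype(np.float32)       # shape: (n_articles, dim)
np.save("embeddings.npy", embeddings)

# K-means into 128 clusters, as in cluster.py.
kmeans = KMeans(n_clusters=128, random_state=0).fit(embeddings)
np.save("centroids.npy", kmeans.cluster_centers_.astype(np.float32))

# Per-cluster .bin files: raw float32 with no header, so the C++ side can read
# the values directly (mirroring convert_npy_bin.py; filenames are illustrative).
for c in range(kmeans.n_clusters):
    embeddings[kmeans.labels_ == c].tofile(f"cluster_{c}.bin")
```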
The following arguments need to be passed to the executable (a NumPy sketch of the two-stage search they control follows this list):

- `--n_probe`: Value from 1 to 128 denoting how many top clusters the coarse search selects for the fine search to scan.
- `--mode`: Which CUDA kernel runs, either `Atomic` or `NonAtomic`; these are the two kernel variants we use to compute the coarse and fine searches.
- `--sequential_fine_search`: `true` or `false`. With `true` (sequential search), each cluster is handled by a separate kernel; with `false`, all the clusters are combined into one and a single kernel handles them all.
- `--use_cuda_coarse`: `true` or `false`. Whether the coarse search (finding the top `n_probe` cluster centroids) runs on the GPU or the CPU.
- `--use_cuda_fine`: `true` or `false`. Whether the fine search (finding the top-k closest elements in the top `n_probe` clusters) runs on the GPU or the CPU.
- `--threadsperBlock`: Any multiple of 32; the number of threads per CUDA block.
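These flags configure a standard IVF (inverted file) two-stage search. A minimal NumPy sketch of the logic that the CUDA kernels parallelize (function names and the 384-dim toy data are illustrative, not the repo's actual API):

```python
import numpy as np

def cos_sim(q, X):
    """Cosine similarity of query q (dim,) against each row of X (n, dim)."""
    return (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-8)

def ivf_search(query, centroids, clusters, n_probe=30):
    # Coarse search: rank all 128 centroids, keep the n_probe best.
    top = np.argsort(-cos_sim(query, centroids))[:n_probe]
    # Fine search: exhaustively scan every vector in the probed clusters.
    best_score, best_id = -np.inf, None
    for c in top:
        sims = cos_sim(query, clusters[c])
        i = int(np.argmax(sims))
        if sims[i] > best_score:
            best_score, best_id = sims[i], (int(c), i)
    return best_id, best_score

# Toy usage with random data standing in for the real embeddings.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(128, 384)).astype(np.float32)
clusters = [rng.normal(size=(50, 384)).astype(np.float32) for _ in range(128)]
query = rng.normal(size=384).astype(np.float32)
print(ivf_search(query, centroids, clusters, n_probe=30))
```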
We have pre-generated embeddings for the following queries, saved as `.bin` files in the `queries_data` folder:

- "What is learning rate in gradient descent?" (`query1.bin`)
- "What is Microbial biogeography?" (`query2.bin`)
- "Give me details about The Arch of Cabanes." (`query3.bin`)
- "Give me details about the history of the Taj Mahal." (`query4.bin`)
- "Tell me something about the labelling used on aid packages created and sent under the Marshall Plan." (`query5.bin`)
You can use these queries by updating the query number in the code where the query file is specified.
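If you want to inspect one of these files, a quick check (assuming the `.bin` files are raw little-endian float32 vectors with no header, consistent with the `.npy` to `.bin` conversion above):

```python
import numpy as np

# Assumption: raw float32 values, no header.
q = np.fromfile("queries_data/query1.bin", dtype=np.float32)
print(q.shape, q[:5])  # embedding dimension and a few leading values
```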
If you want to process a new question:

- Run the `test.py` script with the following command (a sketch of what such a script plausibly does follows this list):
  ```bash
  python test.py --query "Your new question here"
  ```
- This will generate a `.bin` file for your query and save it in the `queries_data` folder.
- Update the query number in the code to point to the new query file.
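For reference, a minimal sketch of what a script like `test.py` might do, assuming the same sentence-transformers model as the article embeddings (the model name and output filename are assumptions):

```python
import argparse
import numpy as np
from sentence_transformers import SentenceTransformer

parser = argparse.ArgumentParser()
parser.add_argument("--query", required=True)
args = parser.parse_args()

# Must match the model used to embed the articles; the name here is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode([args.query])[0].astype(np.float32)

# Write a raw float32 .bin, the format IVF.cpp reads; the filename is illustrative.
out_path = "queries_data/query_custom.bin"
vec.tofile(out_path)
print(f"Saved {out_path} ({vec.shape[0]} dims)")
```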
In your `IVF.cpp` code, make sure the path to the query file points to the desired query (e.g., `query1.bin` for the first pre-uploaded query, or your new query file):

```cpp
std::string query_path = "queries_data/query1.bin"; // Update the file as needed
```
To compile and run the program on a CUDA-enabled machine:

```bash
ssh cuda5.cims.nyu.edu
git clone https://github.com/PranavGrandhi/GPU_Accelerated_Vector_Indexing.git
cd GPU_Accelerated_Vector_Indexing
module load cuda-12.4
nvcc IVF.cpp cosine_similarity.cu -o IVF
./IVF --n_probe=30 --mode=Atomic --sequential_fine_search=false --use_cuda_coarse=true --use_cuda_fine=true --threadsperBlock=128 --print_results=true
```

Update the flag values as needed.
Upon execution, the program displays the title and content of the Wikipedia article most relevant to the input query, whether one of the pre-generated queries (e.g., "What is learning rate in gradient descent?") or a custom one.
We can use the `run_multiple_configs.sh` script to run specific configurations multiple times and compute the average CPU/GPU running time for each configuration.
To execute an experiment:
- Prepare the Configuration File: Each experiment has a corresponding `.txt` file listing all parameter configurations (e.g., `experiment1_config.txt`).
- Run the Script: Use the following command:
  ```bash
  ./run_multiple_configs.sh <configuration_file> [<num_runs>]
  ```
  Replace `<configuration_file>` with the name of the experiment configuration file (e.g., `experiment1_config.txt`) and `<num_runs>` with the number of runs for averaging (default is 5).
- Output: The script will display the configuration details, individual run times, and the average time for each configuration.
For example:

```bash
./run_multiple_configs.sh experiment1_config.txt 10
```
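The configuration file format isn't spelled out here; given the flags above, a plausible (hypothetical) `experiment1_config.txt` would hold one flag set per line:

```
--n_probe=30 --mode=Atomic --sequential_fine_search=false --use_cuda_coarse=true --use_cuda_fine=true --threadsperBlock=128
--n_probe=64 --mode=NonAtomic --sequential_fine_search=true --use_cuda_coarse=true --use_cuda_fine=false --threadsperBlock=256
```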
For more details, refer to the source code and comments within each script.