This is the search engine Tübingo for the course Modern Search Engines in the summer term 2024, built by:
- Daniel Flat
- Lenard Rommel
- Veronika Smilga
- Lilli Diederichs
This project runs on Python 3.11.0. To set it up and install the required packages, follow the instructions below:
Ensure Python 3.11.0 is installed on your system. For Ubuntu:
sudo apt install python3.11
For macOS:
brew install [email protected]
Make sure you have Poetry installed to manage the project dependencies. You can install Poetry via pip:
pip install poetry
To install the project dependencies, run the following command:
poetry install
This command reads the pyproject.toml file and installs all specified dependencies into a virtual environment managed by Poetry.
Note: When adding a package to a specific dependency group, make sure that the package is NOT listed anywhere else in pyproject.toml. Duplicate entries can cause dependency resolution issues that are hard to debug.
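For illustration, dependency groups in pyproject.toml look roughly like this (a sketch with placeholder packages, not the project's actual file):

```toml
[tool.poetry.dependencies]
python = "^3.11"
requests = "^2.31"   # placeholder package

[tool.poetry.group.dev.dependencies]
pytest = "^8.0"      # placeholder package; must not also appear in any other section
```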
Then download the spaCy language model used by the project:
python -m spacy download en_core_web_sm
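To verify that the model was installed correctly, you can load it from Python. This is only a quick sanity check (the example sentence is arbitrary):

```python
import spacy

# Load the small English pipeline installed above.
nlp = spacy.load("en_core_web_sm")

# Process an example sentence and print each token with its part-of-speech tag.
doc = nlp("Tübingen has a beautiful old town.")
for token in doc:
    print(token.text, token.pos_)
```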
Collaborators and reviewers should make sure that Git LFS (Git Large File Storage) is installed on their machine. See the Git LFS documentation (https://git-lfs.com) for details.
For macOS:
brew install git-lfs
With Git LFS we track dump.sql (the index for our search engine), because the file might be too large for a regular repository. We used
git lfs track ./data/dump.sql
so that we can still push this file.
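Running git lfs track records the pattern in a .gitattributes file at the repository root, roughly like this (the exact pattern mirrors the path passed to git lfs track):
data/dump.sql filter=lfs diff=lfs merge=lfs -text
Make sure .gitattributes is committed as well. Fresh clones may additionally need git lfs install and git lfs pull to fetch the actual file contents.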
To activate the Poetry environment, run the following command:
poetry shell
For our search engine we use a shared database that contains our index. It is required for running and testing the search engine. Make sure to start it with:
docker compose down
docker compose up --build db
After that, wait a few seconds until you see the message:
LOG: database system is ready to accept connections
If you want to try out how the database works, you can experiment with 001_Flat_db_example_connection.ipynb. A minimal connection example is also sketched below.
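If you prefer a plain script over the notebook, a connection can be opened roughly as follows. This is a sketch, not code from the repository: it assumes the database is reachable on localhost:5432 with user user and database search_engine_db (matching the pg_dump command further below), that psycopg2 is installed, and it uses a placeholder password — check docker-compose.yml for the actual credentials.

```python
import psycopg2

# Connection parameters are assumptions; the real values are defined in docker-compose.yml.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="user",
    password="<your-password>",   # placeholder, not the real credential
    dbname="search_engine_db",
)

with conn, conn.cursor() as cur:
    # List the tables in the public schema to verify that the dump was loaded.
    cur.execute(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)

conn.close()
```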
To run the project, execute the following command:
python Main.py
This command runs the Main.py file, which is the entry point of the project.
If you want to run the Crawler, please go to 006_Nika_crawling_with_checkpointing.ipynb and run it. All instructions are provided in the notebook.
For the project we all want to be on the same page, so we share one common PostgreSQL database. We also try to stay in sync with our data, which is why there always exists a dump.sql. This file makes it possible to always get the latest data when calling docker compose up --build db.
When you make progress for the team, you should update dump.sql. To do that, follow these steps:
1. Make sure the Docker container is running (see the chapter "Running the Project" above).
2. Look up the container ID by running:
docker ps
This lists all your running containers. Look for the container whose image name is project_mse-db. An ID might look like this: c946285e9b4f
3. Overwrite dump.sql by executing the following command (an optional Python helper that automates steps 2 and 3 is sketched after this list):
docker exec -t your_container_name_or_id pg_dump -U user search_engine_db > ./db/dump.sql
For example, with the container ID c946285e9b4f the command looks like this:
docker exec -t c946285e9b4f pg_dump -U user search_engine_db > ./db/dump.sql
4. Push the updated dump.sql using git:
git add db/dump.sql
git commit -m "Update dump.sql"
git push
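The helper mentioned in step 3 could look roughly like this. It is a hypothetical convenience script, not part of the repository; it assumes the Docker CLI is available, that the database container was started from the project_mse-db image, and it relies on the same user and database name as the pg_dump command above.

```python
import subprocess
from pathlib import Path


def find_container_id(image: str = "project_mse-db") -> str:
    """Return the ID of the first running container created from `image`."""
    out = subprocess.run(
        ["docker", "ps", "--filter", f"ancestor={image}", "--format", "{{.ID}}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if not out:
        raise RuntimeError(f"No running container found for image {image}")
    return out.splitlines()[0]


def update_dump(target: Path = Path("db/dump.sql")) -> None:
    """Run pg_dump inside the database container and overwrite db/dump.sql."""
    container_id = find_container_id()
    dump = subprocess.run(
        ["docker", "exec", container_id, "pg_dump", "-U", "user", "search_engine_db"],
        capture_output=True, text=True, check=True,
    ).stdout
    target.write_text(dump, encoding="utf-8")
    print(f"Wrote {len(dump)} characters to {target}")


if __name__ == "__main__":
    update_dump()
```

After running it, commit and push db/dump.sql as in step 4.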