Adds LLM support to obtain the repository URL through LLM providers.
LLM providers can be accessed through third-party APIs (such as OpenAI), or through the Generative AI Hub in SAP AI Core.
lauraschauer committed Jun 7, 2024
1 parent c520045 commit 375158d
Showing 19 changed files with 908 additions and 164 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -41,6 +41,7 @@ prospector/install_fastext.sh
prospector/nvd.ipynb
prospector/data/nvd.pkl
prospector/data/nvd.csv
prospector/data_sources/reports
.vscode/settings.json
prospector/cov_html/*
prospector/client/cli/cov_html/*
@@ -51,6 +52,7 @@ prospector/.coverage
**/cov_html
prospector/cov_html
.coverage
prospector/.venv
prospector/prospector.code-workspace
prospector/requests-cache.sqlite
prospector/prospector-report.html
98 changes: 84 additions & 14 deletions prospector/README.md
@@ -5,18 +5,29 @@ currently under development: the instructions below are intended for development

:exclamation: Please note that **Windows is not supported** while WSL and WSL2 are fine.

## Description
## Table of Contents

1. [Description](#description)
2. [Quick Setup & Run](#setup--run)
3. [Development Setup](#development-setup)
4. [Contributing](#contributing)
5. [History](#history)

## 📖 Description

Prospector is a tool to reduce the effort needed to find security fixes for
*known* vulnerabilities in open source software repositories.

Given an advisory expressed in natural language, Prospector processes the commits found in the target source code repository, ranks them based on a set of predefined rules, and produces a report that the user can inspect to determine which commits to retain as the actual fix.

## Setup & Run
## ⚡️ Quick Setup & Run

Prerequisites:

:warning: The tool requires Docker and Docker-compose, as it employs Docker containers for certain functionalities. Make sure you have Docker installed and running before proceeding with the setup and usage of Prospector.
* Docker (make sure you have Docker installed and running before proceeding with the setup)
* Docker-compose

To quickly set up Prospector:
To quickly set up Prospector, follow these steps. This will run Prospector in its containerised version. If you wish to debug or run Prospector's components individually, follow the steps below at [Development Setup](#development-setup).

1. Clone the project KB repository
```
@@ -44,7 +55,52 @@ To quickly set up Prospector:
By default, Prospector saves the results in a HTML file named *prospector-report.html*.
Open this file in a web browser to view what Prospector was able to find!
## Development Setup
### 🤖 LLM Support
To use Prospector with LLM support, set the `use_llm_<...>` parameters in `config.yaml`. Additionally, you must specify the parameters required for API access to the LLM. These parameters vary depending on your choice of provider; follow the instructions that fit your setup:
<details><summary><b>Use SAP AI CORE SDK</b></summary>
You will need the following parameters in `config.yaml`:
```yaml
llm_service:
  type: sap
  model_name: <model_name>
```

`<model_name>` refers to the model names available in the Generative AI Hub in SAP AI Core. [Here](https://github.tools.sap/I343697/generative-ai-hub-readme#1-supported-models) you can find an overview of available models.

In `.env`, you must set the deployment URL as an environment variable following this naming convention:
```yaml
<model_name (in capitals, and - changed to _)>_URL
```

</details>
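
The naming convention for the deployment URL variable can be sketched in Python. This is only an illustration of the rule stated above (upper-case the model name, replace `-` with `_`, append `_URL`); the helper name `deployment_url_env_var` is hypothetical and is not taken from Prospector's code.

```python
import os


def deployment_url_env_var(model_name: str) -> str:
    """Build the environment variable name for a model's deployment URL.

    Hypothetical helper: it applies only the rule described in the README
    (capitals, '-' changed to '_', '_URL' suffix).
    """
    return model_name.upper().replace("-", "_") + "_URL"


if __name__ == "__main__":
    name = deployment_url_env_var("gpt-4-turbo")
    print(name)  # GPT_4_TURBO_URL
    # Look the variable up the same way a loaded .env file would expose it.
    print(os.environ.get(name, "<not set>"))
```

For the sample configuration's `gpt-4-turbo`, the `.env` entry would therefore be named `GPT_4_TURBO_URL`.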

<details><summary><b>Use personal third party provider</b></summary>

Implemented third party providers are **OpenAI**, **Google** and **Mistral**.

1. You will need the following parameters in `config.yaml`:
```yaml
llm_service:
  type: third_party
  model_name: <model_name>
```
`<model_name>` refers to the model names available, for example `gpt-4o` for OpenAI. You can find a list of available models here:
1. [OpenAI](https://platform.openai.com/docs/models)
2. [Google](https://ai.google.dev/gemini-api/docs/models/gemini)
3. [Mistral](https://docs.mistral.ai/getting-started/models/)

2. Make sure to add your provider's API key to your `.env` file as `[OPENAI|GOOGLE|MISTRAL]_API_KEY`.

</details>
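
As a rough sketch of the `[OPENAI|GOOGLE|MISTRAL]_API_KEY` convention, the key lookup could look like the following. The function `api_key_for` and the `SUPPORTED_PROVIDERS` tuple are hypothetical names used for illustration; how Prospector actually maps a model to its provider is not shown in this section.

```python
import os
from typing import Mapping, Optional

# Providers named in the README; the tuple itself is an illustrative assumption.
SUPPORTED_PROVIDERS = ("OPENAI", "GOOGLE", "MISTRAL")


def api_key_for(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key for a provider from `<PROVIDER>_API_KEY`."""
    env = os.environ if env is None else env
    provider = provider.upper()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider}")
    key = env.get(f"{provider}_API_KEY")
    if not key:
        # Fail early with the exact variable name the user has to set.
        raise KeyError(f"{provider}_API_KEY is not set")
    return key
```

Passing a plain dictionary as `env` makes the lookup easy to exercise without touching the real environment.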

## 👩‍💻 Development Setup

Following these steps allows you to run Prospector's components individually: [Backend database and worker containers](#starting-the-backend-database-and-the-job-workers), [RESTful Server](#starting-the-restful-server) for API endpoints, [Prospector CLI](#running-the-cli-version) and [Tests](#testing).

Prerequisites:

@@ -53,6 +109,8 @@ Prerequisites:
* gcc g++ libffi-dev python3-dev libpq-dev
* Docker & Docker-compose

### General

You can set up everything and install the dependencies by running:
```
make setup
@@ -81,11 +139,13 @@ your editor so that autoformatting is enforced "on save". The pre-commit hook en
black is run prior to committing anyway, but the auto-formatting might save you some time
and avoid frustration.

If you use VSCode, this can be achieved by pasting these lines in your configuration file:
If you use VSCode, this can be achieved by installing the Black Formatter extension and pasting these lines in your configuration file:

```
"python.formatting.provider": "black",
"editor.formatOnSave": true,
```json
"[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true,
}
```

### Starting the backend database and the job workers
@@ -94,17 +154,23 @@ If you run the client without running the backend you will get a warning and hav

You can then start the necessary containers with the following command:

`make docker-setup`
```bash
make docker-setup
```

This also starts a convenient DB administration tool at http://localhost:8080

If you wish to clean up Docker to run a fresh version of the backend, you can run:

`make docker-clean`
```bash
make docker-clean
```

### Starting the RESTful server

`uvicorn api.main:app --reload`
```bash
uvicorn service.main:app --reload
```

Note that it requires `POSTGRES_USER`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DBNAME` to be set in the `.env` file.

@@ -113,7 +179,9 @@ You might also want to take a look at `http://127.0.0.1:8000/docs`.

*Alternatively*, you can execute the RESTful server explicitly with:

`python api/main.py`
```bash
python api/main.py
```

which is equivalent but more convenient for debugging.

@@ -127,11 +195,13 @@ Prospector makes use of `pytest`.

:exclamation: **NOTE:** before using it please make sure to have running instances of the backend and the database.

## 🤝 Contributing

If you find a bug, please open an issue. If you can also fix the bug, please
create a pull request (make sure it includes a test case that passes with your correction
but fails without it).

## History
## 🕰️ History

The high-level structure of Prospector follows the approach of its
predecessor FixFinder, which is described in:
25 changes: 22 additions & 3 deletions prospector/cli/main.py
@@ -7,6 +7,7 @@

from dotenv import load_dotenv

import llm.llm_operations as llm
from util.http import ping_backend

path_root = os.getcwd()
@@ -32,10 +33,12 @@ def main(argv): # noqa: C901
with ConsoleWriter("Initialization") as console:
config = get_configuration(argv)
if not config:
logger.error("No configuration file found. Cannot proceed.")
logger.error(
"No configuration file found, or error in configuration file. Cannot proceed."
)

console.print(
"No configuration file found.",
"No configuration file found, or error in configuration file. Check logs.",
status=MessageStatus.ERROR,
)
return
@@ -51,6 +54,16 @@ def main(argv): # noqa: C901
)
return

if not config.repository and not config.use_llm_repository_url:
    logger.error(
        "Either provide the repository URL or allow LLM usage to obtain it."
    )
    console.print(
        "Either provide the repository URL or allow LLM usage to obtain it.",
        status=MessageStatus.ERROR,
    )
    sys.exit(1)

# if config.ping:
# return ping_backend(backend, get_level() < logging.INFO)

@@ -63,6 +76,12 @@ def main(argv): # noqa: C901

logger.debug("Vulnerability ID: " + config.vuln_id)

# whether to use LLM support
if not config.repository:
    config.repository = llm.get_repository_url(
        llm_config=config.llm, vuln_id=config.vuln_id
    )

results, advisory_record = prospector(
vulnerability_id=config.vuln_id,
repository_url=config.repository,
@@ -88,7 +107,7 @@ def main(argv): # noqa: C901
)

execution_time = execution_statistics["core"]["execution time"][0]
ConsoleWriter.print(f"Execution time: {execution_time:.3f}s")
ConsoleWriter.print(f"Execution time: {execution_time:.3f}s\n")

return
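
The repository-resolution behaviour added to `main` in the hunks above can be summarised as a standalone sketch. The function `resolve_repository` and the `lookup` callable are hypothetical stand-ins for the checks in `main` and for `llm.get_repository_url`; the committed code logs the error and calls `sys.exit(1)` instead of raising.

```python
from typing import Callable, Optional


def resolve_repository(
    repository: Optional[str],
    use_llm_repository_url: bool,
    lookup: Callable[[], str],
) -> str:
    """Return the repository URL, asking the LLM only when none was given."""
    if repository:
        # An explicitly configured URL always wins; the LLM is not consulted.
        return repository
    if not use_llm_repository_url:
        raise ValueError(
            "Either provide the repository URL or allow LLM usage to obtain it."
        )
    # Stand-in for llm.get_repository_url(llm_config=..., vuln_id=...).
    return lookup()
```

Keeping the lookup behind a callable means the LLM is only invoked on the one path that needs it.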

13 changes: 9 additions & 4 deletions prospector/config-sample.yaml
@@ -1,5 +1,3 @@


# Whether to preprocess only the repository's commits or fully run prospector
preprocess_only: False

@@ -12,7 +10,7 @@ fetch_references: False
use_nvd: True

# The NVD API token
nvd_token: Null
# nvd_token: <your_nvd_api_token>

# Whether to use a backend or not: "always", "never", "optional"
use_backend: optional
@@ -30,6 +28,13 @@ database:

redis_url: redis://redis:6379/0

# LLM Usage (check README for help)
llm_service:
  type: sap
  model_name: gpt-4-turbo

use_llm_repository_url: True # whether to use LLMs to obtain the repository URL

# Report file format: "html", "json", "console" or "all"
# and the file name
report:
@@ -43,4 +48,4 @@ log_level: INFO
git_cache: /tmp/gitcache

# The GitHub API token
github_token: Null
# github_token: <your_api_token>
14 changes: 11 additions & 3 deletions prospector/core/prospector.py
@@ -1,6 +1,7 @@
# flake8: noqa

import logging
import os
import re
import sys
import time
@@ -36,7 +37,7 @@
ONE_YEAR = 365 * SECS_PER_DAY

MAX_CANDIDATES = 2000
DEFAULT_BACKEND = "http://localhost:8000"
DEFAULT_BACKEND = "http://backend:8000"


core_statistics = execution_statistics.sub_collection("core")
@@ -157,7 +158,14 @@ def prospector( # noqa: C901
exc_info=get_level() < logging.WARNING,
)
if use_backend == "always":
print("Backend not reachable: aborting")
if backend_address == "http://localhost:8000" and os.path.exists(
    "/.dockerenv"
):
    print(
        "The backend address should be 'http://backend:8000' when running the containerised version of Prospector: aborting"
    )
else:
    print("Backend not reachable: aborting")
sys.exit(1)
print("Backend not reachable: continuing")

@@ -227,7 +235,7 @@ def preprocess_commits(commits: List[RawCommit], timer: ExecutionTimer) -> List[


def filter(commits: Dict[str, RawCommit]) -> Dict[str, RawCommit]:
with ConsoleWriter("\nCandidate filtering\n") as console:
with ConsoleWriter("\nCandidate filtering") as console:
commits, rejected = filter_commits(commits)
if rejected > 0:
console.print(f"Dropped {rejected} candidates")
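
The container check introduced in the `prospector()` hunk above relies on the `/.dockerenv` file, which Docker creates inside containers. A minimal sketch of that decision, separated from the I/O so it can be exercised directly, might look like this; the function names `running_in_docker` and `backend_unreachable_message` are hypothetical and do not appear in the commit.

```python
import os


def running_in_docker() -> bool:
    """Heuristic used by the commit: Docker creates /.dockerenv in containers."""
    return os.path.exists("/.dockerenv")


def backend_unreachable_message(backend_address: str, in_container: bool) -> str:
    """Pick the abort message for an unreachable backend (illustrative helper)."""
    if backend_address == "http://localhost:8000" and in_container:
        # Inside the container, 'localhost' is the container itself, so the
        # backend has to be addressed by its service name instead.
        return (
            "The backend address should be 'http://backend:8000' when running "
            "the containerised version of Prospector: aborting"
        )
    return "Backend not reachable: aborting"


if __name__ == "__main__":
    print(backend_unreachable_message("http://localhost:8000", running_in_docker()))
```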
31 changes: 16 additions & 15 deletions prospector/datamodel/nlp.py
@@ -139,23 +139,24 @@ def extract_ghissue_references(repository: str, text: str) -> Dict[str, str]:
id = result.group(1)
url = f"{repository}/issues/{id}"
content = fetch_url(url=url, extract_text=False)
gh_ref_data = content.find_all(
attrs={
"class": ["comment-body", "markdown-title"],
},
recursive=False,
)
# TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules.
gh_ref_data.extend(
content.find_all(
if content is not None:
gh_ref_data = content.find_all(
attrs={
"id": re.compile(r"ref-issue|ref-pullrequest"),
}
"class": ["comment-body", "markdown-title"],
},
recursive=False,
)
# TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules.
gh_ref_data.extend(
content.find_all(
attrs={
"id": re.compile(r"ref-issue|ref-pullrequest"),
}
)
)
refs[id] = " ".join(
[" ".join(block.get_text().split()) for block in gh_ref_data]
)
)
refs[id] = " ".join(
[" ".join(block.get_text().split()) for block in gh_ref_data]
)

return refs
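
The change to `extract_ghissue_references` wraps the parsing in an `if content is not None:` guard, which suggests `fetch_url` can return `None` when a page cannot be retrieved. The pattern can be sketched without any HTML parsing as follows; `collect_issue_text` and the `fetch` callable are hypothetical simplifications, and plain strings stand in for the parsed page.

```python
from typing import Callable, Dict, Iterable, Optional


def collect_issue_text(
    issue_ids: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Collect whitespace-normalised text per issue, skipping failed fetches."""
    refs: Dict[str, str] = {}
    for issue_id in issue_ids:
        content = fetch(issue_id)
        if content is not None:
            # Same normalisation idea as the commit: collapse runs of whitespace.
            refs[issue_id] = " ".join(content.split())
    return refs
```

An issue whose page could not be fetched is simply absent from the result instead of raising an `AttributeError` on `None`.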
