Adds LLM support to obtain the repository URL of the repository connected to the supplied CVE.

lauraschauer committed Jun 4, 2024
1 parent c520045 commit d3cdfa6
Showing 19 changed files with 733 additions and 167 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -51,6 +51,7 @@ prospector/.coverage
**/cov_html
prospector/cov_html
.coverage
prospector/.venv
prospector/prospector.code-workspace
prospector/requests-cache.sqlite
prospector/prospector-report.html
88 changes: 74 additions & 14 deletions prospector/README.md
@@ -5,18 +5,29 @@ currently under development: the instructions below are intended for development

:exclamation: Please note that **Windows is not supported** while WSL and WSL2 are fine.

## Description
## Table of Contents

1. [Description](#description)
2. [Quick Setup & Run](#setup--run)
3. [Development Setup](#development-setup)
4. [Contributing](#contributing)
5. [History](#history)

## 📖 Description

Prospector is a tool to reduce the effort needed to find security fixes for
*known* vulnerabilities in open source software repositories.

Given an advisory expressed in natural language, Prospector processes the commits found in the target source code repository, ranks them based on a set of predefined rules, and produces a report that the user can inspect to determine which commits to retain as the actual fix.
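The ranking idea can be sketched roughly as follows (a minimal illustration with hypothetical rule functions and toy data, not Prospector's actual rule set):

```python
# Toy sketch of rule-based commit ranking (hypothetical rules, not
# Prospector's actual implementation).
def rank_commits(commits, advisory_keywords):
    """Score each commit against simple rules and sort best-first."""

    def score(commit):
        s = 0
        msg = commit["message"].lower()
        # Rule: message mentions a keyword extracted from the advisory
        s += sum(2 for kw in advisory_keywords if kw in msg)
        # Rule: message references a CVE identifier
        if "cve-" in msg:
            s += 5
        return s

    return sorted(commits, key=score, reverse=True)


commits = [
    {"id": "a1", "message": "Refactor build scripts"},
    {"id": "b2", "message": "Fix CVE-2021-0001: sanitize user input"},
]
print([c["id"] for c in rank_commits(commits, ["sanitize", "input"])])
# → ['b2', 'a1']
```

The real tool applies many such rules to commit metadata and diffs; the report then presents the ranked candidates for manual inspection.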

## Setup & Run
## ⚡️ Quick Setup & Run

Prerequisites:

:warning: The tool requires Docker and Docker-compose, as it employs Docker containers for certain functionalities. Make sure you have Docker installed and running before proceeding with the setup and usage of Prospector.
* Docker (make sure you have Docker installed and running before proceeding with the setup)
* Docker-compose

To quickly set up Prospector:
To quickly set up Prospector, follow these steps. This runs Prospector in its containerised version. If you wish to debug or run Prospector's components individually, follow the steps in [Development Setup](#development-setup) instead.

1. Clone the project KB repository
```
git clone https://github.com/SAP/project-kb
```
@@ -44,7 +55,42 @@ To quickly set up Prospector:
By default, Prospector saves the results in an HTML file named *prospector-report.html*.
Open this file in a web browser to view what Prospector was able to find!
## Development Setup
### 🤖 LLM Support
To use Prospector with LLM support, use the `--use-llm` flag or set the `use_llm` parameter in `config.yaml`. Additionally, you must specify the required parameters in `config.yaml`. These parameters vary depending on how you access the LLMs; follow whichever option fits your needs:
<details><summary><b>Use SAP AI CORE SDK</b></summary>

You will need the following parameters in `config.yaml`:
```yaml
llm_service:
  type: sap
  model_type: <deployment_id>
```

`<deployment_id>` refers to the model names of the Generative AI Hub in SAP AI Core. [Here](https://github.tools.sap/I343697/generative-ai-hub-readme) you can find an overview of available models.

</details>

<details><summary><b>Use personal OpenAI account</b></summary>

1. You will need the following parameters in `config.yaml`:
```yaml
llm_service:
  type: openai
  model_type: <model>
```
`<model>` refers to the model names available on OpenAI, for example `gpt-4o`. You can find a list of them [here](https://platform.openai.com/docs/models).

2. Make sure to add your OpenAI API key to your `.env` file as `OPENAI_API_KEY`.

</details>

## 👩‍💻 Development Setup

Following these steps allows you to run Prospector's components individually: [Backend database and worker containers](#starting-the-backend-database-and-the-job-workers), [RESTful Server](#starting-the-restful-server) for API endpoints, [Prospector CLI](#running-the-cli-version) and [Tests](#testing).

Prerequisites:

@@ -53,6 +99,8 @@ Prerequisites:
* gcc g++ libffi-dev python3-dev libpq-dev
* Docker & Docker-compose

### General

You can set up everything and install the dependencies by running:
```
make setup
```
@@ -81,11 +129,13 @@ your editor so that autoformatting is enforced "on save". The pre-commit hook ensures
black is run prior to committing anyway, but the auto-formatting might save you some time
and avoid frustration.

If you use VSCode, this can be achieved by pasting these lines in your configuration file:
If you use VSCode, this can be achieved by installing the Black Formatter extension and pasting these lines in your configuration file:

```
"python.formatting.provider": "black",
"editor.formatOnSave": true,
```
```json
"[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true
}
```

### Starting the backend database and the job workers
@@ -94,17 +144,23 @@ If you run the client without running the backend you will get a warning and hav

You can then start the necessary containers with the following command:

`make docker-setup`
```bash
make docker-setup
```

This also starts a convenient DB administration tool at http://localhost:8080

If you wish to cleanup docker to run a fresh version of the backend you can run:

`make docker-clean`
```bash
make docker-clean
```

### Starting the RESTful server

`uvicorn api.main:app --reload`
```bash
uvicorn service.main:app --reload
```

Note that it requires `POSTGRES_USER`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DBNAME` to be set in the `.env` file.
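For example, a minimal `.env` sketch (variable names taken from the note above; all values are placeholders you must adapt to your database):

```
POSTGRES_USER=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DBNAME=postgres
```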

@@ -113,7 +169,9 @@ You might also want to take a look at `http://127.0.0.1:8000/docs`.

*Alternatively*, you can execute the RESTful server explicitly with:

`python api/main.py`
```bash
python api/main.py
```

which is equivalent but more convenient for debugging.

@@ -127,11 +185,13 @@ Prospector makes use of `pytest`.

:exclamation: **NOTE:** before using it please make sure to have running instances of the backend and the database.

## 🤝 Contributing

If you find a bug, please open an issue. If you can also fix the bug, please
create a pull request (make sure it includes a test case that passes with your correction
but fails without it).

## History
## 🕰️ History

The high-level structure of Prospector follows the approach of its
predecessor FixFinder, which is described in:
12 changes: 12 additions & 0 deletions prospector/cli/console.py
@@ -1,3 +1,5 @@
import os
from contextlib import contextmanager, redirect_stderr, redirect_stdout
from enum import Enum
from typing import Optional

@@ -46,3 +48,13 @@ def print(note: str, status: Optional[MessageStatus] = None):
@staticmethod
def print_(status: MessageStatus):
print(f"[{status.value}{status.name}{Style.RESET_ALL}]", end="\n")


# Context Manager to suppress llm-commons output
# Credit to: https://stackoverflow.com/questions/60324614/suppress-output-on-library-import-in-python
@contextmanager
def suppress_stdout():
"""A context manager that redirects stdout to devnull"""
with open(os.devnull, "w") as fnull:
with redirect_stdout(fnull) as out, redirect_stderr(fnull) as err:
yield (out, err)
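For reference, the context manager above can be exercised like this (a self-contained sketch; `suppress_stdout` is reproduced verbatim from the snippet):

```python
import io
import os
from contextlib import contextmanager, redirect_stderr, redirect_stdout


@contextmanager
def suppress_stdout():
    """A context manager that redirects stdout and stderr to devnull"""
    with open(os.devnull, "w") as fnull:
        with redirect_stdout(fnull) as out, redirect_stderr(fnull) as err:
            yield (out, err)


# Capture what reaches stdout to show the suppression in action.
buf = io.StringIO()
with redirect_stdout(buf):
    with suppress_stdout():
        print("swallowed")  # sent to devnull, never reaches buf
    print("visible")  # stdout restored, so this is captured

assert buf.getvalue() == "visible\n"
```

This is how the CLI silences noisy output emitted by the LLM library on import.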
13 changes: 10 additions & 3 deletions prospector/cli/main.py
@@ -7,6 +7,7 @@

from dotenv import load_dotenv

import llm.llm_provider as llm
from util.http import ping_backend

path_root = os.getcwd()
@@ -32,10 +33,12 @@ def main(argv):  # noqa: C901
with ConsoleWriter("Initialization") as console:
config = get_configuration(argv)
if not config:
logger.error("No configuration file found. Cannot proceed.")
logger.error(
"No configuration file found, or error in configuration file. Cannot proceed."
)

console.print(
"No configuration file found.",
"No configuration file found, or error in configuration file. Check logs.",
status=MessageStatus.ERROR,
)
return
@@ -63,6 +66,10 @@ def main(argv):  # noqa: C901

logger.debug("Vulnerability ID: " + config.vuln_id)

# Use LLM support to obtain the repository URL if none was provided
if config.use_llm and not config.repository:
config.repository = llm.invoke(llm_config=config.llm, vuln_id=config.vuln_id)

results, advisory_record = prospector(
vulnerability_id=config.vuln_id,
repository_url=config.repository,
@@ -88,7 +95,7 @@ def main(argv):  # noqa: C901
)

execution_time = execution_statistics["core"]["execution time"][0]
ConsoleWriter.print(f"Execution time: {execution_time:.3f}s")
ConsoleWriter.print(f"Execution time: {execution_time:.3f}s\n")

return

12 changes: 8 additions & 4 deletions prospector/config-sample.yaml
@@ -1,5 +1,3 @@


# Whether to preprocess only the repository's commits or fully run prospector
preprocess_only: False

@@ -12,7 +10,7 @@ fetch_references: False
use_nvd: True

# The NVD API token
nvd_token: Null
# nvd_token: <your_nvd_api_token>

# Whether to use a backend or not: "always", "never", "optional"
use_backend: optional
@@ -30,6 +28,12 @@ database:

redis_url: redis://redis:6379/0

# LLM Usage (check README for help)
use_llm: False
llm_service:
  type: sap
  model_type: gpt-4-turbo

# Report file format: "html", "json", "console" or "all"
# and the file name
report:
@@ -43,4 +47,4 @@ log_level: INFO
git_cache: /tmp/gitcache

# The GitHub API token
github_token: Null
# github_token: <your_api_token>
14 changes: 11 additions & 3 deletions prospector/core/prospector.py
@@ -1,6 +1,7 @@
# flake8: noqa

import logging
import os
import re
import sys
import time
@@ -36,7 +37,7 @@
ONE_YEAR = 365 * SECS_PER_DAY

MAX_CANDIDATES = 2000
DEFAULT_BACKEND = "http://localhost:8000"
DEFAULT_BACKEND = "http://backend:8000"


core_statistics = execution_statistics.sub_collection("core")
@@ -157,7 +158,14 @@ def prospector(  # noqa: C901
exc_info=get_level() < logging.WARNING,
)
if use_backend == "always":
print("Backend not reachable: aborting")
if backend_address == "http://localhost:8000" and os.path.exists(
    "/.dockerenv"
):
    print(
        "The backend address should be 'http://backend:8000' when running the containerised version of Prospector: aborting"
    )
else:
    print("Backend not reachable: aborting")
sys.exit(1)
print("Backend not reachable: continuing")

@@ -227,7 +235,7 @@ def preprocess_commits(commits: List[RawCommit], timer: ExecutionTimer) -> List[


def filter(commits: Dict[str, RawCommit]) -> Dict[str, RawCommit]:
with ConsoleWriter("\nCandidate filtering\n") as console:
with ConsoleWriter("\nCandidate filtering") as console:
commits, rejected = filter_commits(commits)
if rejected > 0:
console.print(f"Dropped {rejected} candidates")
31 changes: 16 additions & 15 deletions prospector/datamodel/nlp.py
@@ -139,23 +139,24 @@ def extract_ghissue_references(repository: str, text: str) -> Dict[str, str]:
id = result.group(1)
url = f"{repository}/issues/{id}"
content = fetch_url(url=url, extract_text=False)
gh_ref_data = content.find_all(
    attrs={
        "class": ["comment-body", "markdown-title"],
    },
    recursive=False,
)
# TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules.
gh_ref_data.extend(
    content.find_all(
        attrs={
            "id": re.compile(r"ref-issue|ref-pullrequest"),
        }
    )
)
refs[id] = " ".join(
    [" ".join(block.get_text().split()) for block in gh_ref_data]
)
if content is not None:
    gh_ref_data = content.find_all(
        attrs={
            "class": ["comment-body", "markdown-title"],
        },
        recursive=False,
    )
    # TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules.
    gh_ref_data.extend(
        content.find_all(
            attrs={
                "id": re.compile(r"ref-issue|ref-pullrequest"),
            }
        )
    )
    refs[id] = " ".join(
        [" ".join(block.get_text().split()) for block in gh_ref_data]
    )

return refs
