Adds LLM support to obtain the repository URL of the repository connected to the supplied CVE.

lauraschauer committed Jun 4, 2024
1 parent c520045 commit d3cdfa6
Showing 19 changed files with 733 additions and 167 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -51,6 +51,7 @@ prospector/.coverage
**/cov_html
prospector/cov_html
.coverage
prospector/.venv
prospector/prospector.code-workspace
prospector/requests-cache.sqlite
prospector/prospector-report.html
88 changes: 74 additions & 14 deletions prospector/README.md
@@ -5,18 +5,29 @@ currently under development: the instructions below are intended for development

:exclamation: Please note that **Windows is not supported** while WSL and WSL2 are fine.

## Description
## Table of Contents

1. [Description](#description)
2. [Quick Setup & Run](#setup--run)
3. [Development Setup](#development-setup)
4. [Contributing](#contributing)
5. [History](#history)

## 📖 Description

Prospector is a tool to reduce the effort needed to find security fixes for
*known* vulnerabilities in open source software repositories.

Given an advisory expressed in natural language, Prospector processes the commits found in the target source code repository, ranks them based on a set of predefined rules, and produces a report that the user can inspect to determine which commits to retain as the actual fix.
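The ranking idea can be sketched roughly as follows (a minimal illustration with hypothetical rule functions and toy data, not Prospector's actual rule set):

```python
# Toy sketch of rule-based commit ranking (hypothetical rules, not
# Prospector's actual implementation).
def rank_commits(commits, advisory_keywords):
    """Score each commit against simple rules and sort best-first."""

    def score(commit):
        s = 0
        msg = commit["message"].lower()
        # Rule: message mentions a keyword extracted from the advisory
        s += sum(2 for kw in advisory_keywords if kw in msg)
        # Rule: message references a CVE identifier
        if "cve-" in msg:
            s += 5
        return s

    return sorted(commits, key=score, reverse=True)


commits = [
    {"id": "a1", "message": "Refactor build scripts"},
    {"id": "b2", "message": "Fix CVE-2021-0001: sanitize user input"},
]
print([c["id"] for c in rank_commits(commits, ["sanitize", "input"])])
# → ['b2', 'a1']
```

The real tool applies many such rules to commit metadata and diffs; the report then presents the ranked candidates for manual inspection.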

## Setup & Run
## ⚡️ Quick Setup & Run

Prerequisites:

:warning: The tool requires Docker and Docker-compose, as it employs Docker containers for certain functionalities. Make sure you have Docker installed and running before proceeding with the setup and usage of Prospector.
* Docker (make sure you have Docker installed and running before proceeding with the setup)
* Docker-compose

To quickly set up Prospector:
To quickly set up Prospector, follow these steps. This runs Prospector in its containerised version. If you wish to debug or run Prospector's components individually, follow the steps in [Development Setup](#development-setup) instead.

1. Clone the project KB repository
```
git clone https://github.com/SAP/project-kb
```
@@ -44,7 +55,42 @@ To quickly set up Prospector:
By default, Prospector saves the results in an HTML file named *prospector-report.html*.
Open this file in a web browser to view what Prospector was able to find!
## Development Setup
### 🤖 LLM Support
To use Prospector with LLM support, use the `--use-llm` flag or set the `use_llm` parameter in `config.yaml`. Additionally, you must specify the required parameters in `config.yaml`. These parameters vary depending on how you access the LLMs; follow whichever option fits your needs:
<details><summary><b>Use SAP AI CORE SDK</b></summary>

You will need the following parameters in `config.yaml`:
```yaml
llm_service:
  type: sap
  model_type: <deployment_id>
```

`<deployment_id>` refers to the model names of the Generative AI Hub in SAP AI Core. [Here](https://github.tools.sap/I343697/generative-ai-hub-readme) you can find an overview of available models.

</details>

<details><summary><b>Use personal OpenAI account</b></summary>

1. You will need the following parameters in `config.yaml`:
```yaml
llm_service:
  type: openai
  model_type: <model>
```
`<model>` refers to the model names available on OpenAI, for example `gpt-4o`. You can find a list of them [here](https://platform.openai.com/docs/models).

2. Make sure to add your OpenAI API key to your `.env` file as `OPENAI_API_KEY`.

</details>

## 👩‍💻 Development Setup

Following these steps allows you to run Prospector's components individually: [Backend database and worker containers](#starting-the-backend-database-and-the-job-workers), [RESTful Server](#starting-the-restful-server) for API endpoints, [Prospector CLI](#running-the-cli-version) and [Tests](#testing).

Prerequisites:

@@ -53,6 +99,8 @@ Prerequisites:
* gcc g++ libffi-dev python3-dev libpq-dev
* Docker & Docker-compose

### General

You can set up everything and install the dependencies by running:
```
make setup
```
@@ -81,11 +129,13 @@ your editor so that autoformatting is enforced "on save". The pre-commit hook ensures
black is run prior to committing anyway, but the auto-formatting might save you some time
and avoid frustration.

If you use VSCode, this can be achieved by pasting these lines in your configuration file:
If you use VSCode, this can be achieved by installing the Black Formatter extension and pasting these lines in your configuration file:

```
"python.formatting.provider": "black",
"editor.formatOnSave": true,
```
```json
"[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true
}
```

### Starting the backend database and the job workers
@@ -94,17 +144,23 @@ If you run the client without running the backend you will get a warning and hav

You can then start the necessary containers with the following command:

`make docker-setup`
```bash
make docker-setup
```

This also starts a convenient DB administration tool at http://localhost:8080

If you wish to cleanup docker to run a fresh version of the backend you can run:

`make docker-clean`
```bash
make docker-clean
```

### Starting the RESTful server

`uvicorn api.main:app --reload`
```bash
uvicorn service.main:app --reload
```

Note that it requires `POSTGRES_USER`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DBNAME` to be set in the `.env` file.
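For example, a minimal `.env` sketch (variable names taken from the note above; all values are placeholders you must adapt to your database):

```
POSTGRES_USER=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DBNAME=postgres
```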

@@ -113,7 +169,9 @@ You might also want to take a look at `http://127.0.0.1:8000/docs`.

*Alternatively*, you can execute the RESTful server explicitly with:

`python api/main.py`
```bash
python api/main.py
```

which is equivalent but more convenient for debugging.

@@ -127,11 +185,13 @@ Prospector makes use of `pytest`.

:exclamation: **NOTE:** before using it please make sure to have running instances of the backend and the database.

## 🤝 Contributing

If you find a bug, please open an issue. If you can also fix the bug, please
create a pull request (make sure it includes a test case that passes with your correction
but fails without it).

## History
## 🕰️ History

The high-level structure of Prospector follows the approach of its
predecessor FixFinder, which is described in:
12 changes: 12 additions & 0 deletions prospector/cli/console.py
@@ -1,3 +1,5 @@
import os
from contextlib import contextmanager, redirect_stderr, redirect_stdout
from enum import Enum
from typing import Optional

@@ -46,3 +48,13 @@ def print(note: str, status: Optional[MessageStatus] = None):
@staticmethod
def print_(status: MessageStatus):
print(f"[{status.value}{status.name}{Style.RESET_ALL}]", end="\n")


# Context Manager to suppress llm-commons output
# Credit to: https://stackoverflow.com/questions/60324614/suppress-output-on-library-import-in-python
@contextmanager
def suppress_stdout():
"""A context manager that redirects stdout to devnull"""
with open(os.devnull, "w") as fnull:
with redirect_stdout(fnull) as out, redirect_stderr(fnull) as err:
yield (out, err)
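For reference, the context manager above can be exercised like this (a self-contained sketch; `suppress_stdout` is reproduced verbatim from the snippet):

```python
import io
import os
from contextlib import contextmanager, redirect_stderr, redirect_stdout


@contextmanager
def suppress_stdout():
    """A context manager that redirects stdout and stderr to devnull"""
    with open(os.devnull, "w") as fnull:
        with redirect_stdout(fnull) as out, redirect_stderr(fnull) as err:
            yield (out, err)


# Capture what reaches stdout to show the suppression in action.
buf = io.StringIO()
with redirect_stdout(buf):
    with suppress_stdout():
        print("swallowed")  # sent to devnull, never reaches buf
    print("visible")  # stdout restored, so this is captured

assert buf.getvalue() == "visible\n"
```

This is how the CLI silences noisy output emitted by the LLM library on import.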
13 changes: 10 additions & 3 deletions prospector/cli/main.py
@@ -7,6 +7,7 @@

from dotenv import load_dotenv

import llm.llm_provider as llm
from util.http import ping_backend

path_root = os.getcwd()
@@ -32,10 +33,12 @@ def main(argv):  # noqa: C901
with ConsoleWriter("Initialization") as console:
config = get_configuration(argv)
if not config:
logger.error("No configuration file found. Cannot proceed.")
logger.error(
"No configuration file found, or error in configuration file. Cannot proceed."
)

console.print(
"No configuration file found.",
"No configuration file found, or error in configuration file. Check logs.",
status=MessageStatus.ERROR,
)
return
@@ -63,6 +66,10 @@ def main(argv):  # noqa: C901

logger.debug("Vulnerability ID: " + config.vuln_id)

# Use LLM support to obtain the repository URL if none was provided
if config.use_llm and not config.repository:
config.repository = llm.invoke(llm_config=config.llm, vuln_id=config.vuln_id)

results, advisory_record = prospector(
vulnerability_id=config.vuln_id,
repository_url=config.repository,
@@ -88,7 +95,7 @@ def main(argv):  # noqa: C901
)

execution_time = execution_statistics["core"]["execution time"][0]
ConsoleWriter.print(f"Execution time: {execution_time:.3f}s")
ConsoleWriter.print(f"Execution time: {execution_time:.3f}s\n")

return

12 changes: 8 additions & 4 deletions prospector/config-sample.yaml
@@ -1,5 +1,3 @@


# Whether to preprocess only the repository's commits or fully run prospector
preprocess_only: False

@@ -12,7 +10,7 @@ fetch_references: False
use_nvd: True

# The NVD API token
nvd_token: Null
# nvd_token: <your_nvd_api_token>

# Whether to use a backend or not: "always", "never", "optional"
use_backend: optional
@@ -30,6 +28,12 @@ database:

redis_url: redis://redis:6379/0

# LLM Usage (check README for help)
use_llm: False
llm_service:
  type: sap
  model_type: gpt-4-turbo

# Report file format: "html", "json", "console" or "all"
# and the file name
report:
@@ -43,4 +47,4 @@ log_level: INFO
git_cache: /tmp/gitcache

# The GitHub API token
github_token: Null
# github_token: <your_api_token>
14 changes: 11 additions & 3 deletions prospector/core/prospector.py
@@ -1,6 +1,7 @@
# flake8: noqa

import logging
import os
import re
import sys
import time
@@ -36,7 +37,7 @@
ONE_YEAR = 365 * SECS_PER_DAY

MAX_CANDIDATES = 2000
DEFAULT_BACKEND = "http://localhost:8000"
DEFAULT_BACKEND = "http://backend:8000"


core_statistics = execution_statistics.sub_collection("core")
@@ -157,7 +158,14 @@ def prospector(  # noqa: C901
exc_info=get_level() < logging.WARNING,
)
if use_backend == "always":
print("Backend not reachable: aborting")
if backend_address == "http://localhost:8000" and os.path.exists(
    "/.dockerenv"
):
    print(
        "The backend address should be 'http://backend:8000' when running the containerised version of Prospector: aborting"
    )
else:
    print("Backend not reachable: aborting")
sys.exit(1)
print("Backend not reachable: continuing")

@@ -227,7 +235,7 @@ def preprocess_commits(commits: List[RawCommit], timer: ExecutionTimer) -> List[


def filter(commits: Dict[str, RawCommit]) -> Dict[str, RawCommit]:
with ConsoleWriter("\nCandidate filtering\n") as console:
with ConsoleWriter("\nCandidate filtering") as console:
commits, rejected = filter_commits(commits)
if rejected > 0:
console.print(f"Dropped {rejected} candidates")
31 changes: 16 additions & 15 deletions prospector/datamodel/nlp.py
@@ -139,23 +139,24 @@ def extract_ghissue_references(repository: str, text: str) -> Dict[str, str]:
id = result.group(1)
url = f"{repository}/issues/{id}"
content = fetch_url(url=url, extract_text=False)
gh_ref_data = content.find_all(
    attrs={
        "class": ["comment-body", "markdown-title"],
    },
    recursive=False,
)
# TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules.
gh_ref_data.extend(
    content.find_all(
        attrs={
            "id": re.compile(r"ref-issue|ref-pullrequest"),
        }
    )
)
refs[id] = " ".join(
    [" ".join(block.get_text().split()) for block in gh_ref_data]
)
if content is not None:
    gh_ref_data = content.find_all(
        attrs={
            "class": ["comment-body", "markdown-title"],
        },
        recursive=False,
    )
    # TODO: when an issue/pr is referenced somewhere, the page contains also the "message" of that reference (e.g. a commit). This may lead to unwanted detection of certain rules.
    gh_ref_data.extend(
        content.find_all(
            attrs={
                "id": re.compile(r"ref-issue|ref-pullrequest"),
            }
        )
    )
    refs[id] = " ".join(
        [" ".join(block.get_text().split()) for block in gh_ref_data]
    )

return refs
