monarch-initiative · leokim-l · Nov 14, 2024 · Nov 12, 2024 · Nov 13, 2024 · Nov 13, 2024
diff --git a/README.md b/README.md
@@ -1,79 +1,35 @@
-# MALCO
+# pheval.llm 
 
-Multilingual Analysis of LLMs for Clinical Observations
+![Contributors](https://img.shields.io/github/contributors/monarch-initiative/pheval.llm?style=plastic)
+![Stars](https://img.shields.io/github/stars/monarch-initiative/pheval.llm)
+![Licence](https://img.shields.io/github/license/monarch-initiative/pheval.llm)
+![Issues](https://img.shields.io/github/issues/monarch-initiative/pheval.llm)
 
-Built using the PhEval runner template (see instructions below).
+## Evaluate LLMs' capability at performing differential diagnosis for rare genetic diseases through medical-vignette-like prompts created with [phenopacket2prompt](https://github.com/monarch-initiative/phenopacket2prompt). 
 
-# Usage
-Let us start by documenting how to run the current version in a new folder. This has to be changed!
-```shell
-poetry install
-poetry shell
-mkdir myinputdirectory
-mkdir myoutputdirectory
-cp -r /path/to/promptdir myinputdirectory/
-cp inputdir/config.yaml myinputdirectory
-pheval run -i myinputdirectory -r "malcorunner" -o myoutputdirectory -t tests
-```
+### Description
+To systematically assess an LLM's ability to perform differential diagnosis, we use prompts programmatically created with [phenopacket2prompt](https://github.com/monarch-initiative/phenopacket2prompt), thereby avoiding any patient privacy issues. The original data are phenopackets located at [phenopacket-store](https://github.com/monarch-initiative/phenopacket-store/). We also developed a programmatic approach for scoring and grounding results, made possible by the ontological structure of the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/).
+
+Two main analyses are carried out:
+- A benchmark of some OpenAI GPT models against a state-of-the-art tool for differential diagnosis, [Exomiser](https://github.com/exomiser/Exomiser). The bottom line: Exomiser [clearly outperforms the LLMs](https://github.com/monarch-initiative/pheval.llm/blob/short_letter/notebooks/plot_exomiser_o1MINI_o1PREVIEW_4o.ipynb).
+- A comparison of gpt-4o's ability to carry out differential diagnosis when prompted in different languages. 
 
-## Template Runner for PhEval
+Formerly MALCO, Multilingual Analysis of LLMs for Clinical Observations.
+Built using the [PhEval](https://github.com/monarch-initiative/pheval) runner template.
 
-This serves as a template repository designed for crafting a personalised PhEval runner. Presently, the runner executes a mock predictor found in `src/pheval_template/run/fake_predictor.py`. Nevertheless, the primary objective is to leverage this repository as a starting point to develop your own runner for your tool, allowing you to customise and override existing methods effortlessly, given that it already encompasses all the necessary setup for integration with PhEval. There are exemplary methods throughout the runner to provide an idea on how things could be implemented.
 
-## Installation
+# Usage
+Before starting a run, edit the [run parameters](inputdir/run_parameters.csv) as follows:
+- The first line contains a non-empty, comma-separated list of (supported) language codes in which to prompt, each enclosed in double quotation marks.
+- The second line contains a non-empty, comma-separated list of (supported) model names to prompt, each enclosed in double quotation marks.
+- The third line contains two comma-separated binary entries, 0 (false) or 1 (true). Setting the first to 1 runs the prompting and grounding, i.e. the run step; setting the second to 1 runs the scoring and the rest of the analysis, i.e. the post-processing step.
 
-```bash
-git clone https://github.com/yaseminbridges/pheval.template.git
-cd pheval.template
+At this point, one can install and run the code with:
+```shell
 poetry install
 poetry shell
+mkdir outputdirectory
+cp -r /path/to/promptdir inputdir/
+pheval run -i inputdir -r "malcorunner" -o outputdirectory -t tests
 ```
 
-## Configuring a run with the template runner
-
-A `config.yaml` should be located in the input directory and formatted like so:
-
-```yaml
-tool: template
-tool_version: 1.0.0
-variant_analysis: False
-gene_analysis: True
-disease_analysis: False
-tool_specific_configuration_options:
-```
-
-The testdata directory should include the subdirectory named `phenopackets` - which should contain phenopackets.
-
-## Run command
-
-```bash
-pheval run --input-dir /path/to/input_dir \
---runner templatephevalrunner \
---output-dir /path/to/output_dir \
---testdata-dir /path/to/testdata_dir
-```
-
-## Benchmark
-
-You can benchmark the run with the `pheval-utils benchmark` command:
-
-```bash
-pheval-utils benchmark --directory /path/to/output_directoy \
---phenopacket-dir /path/to/phenopacket_dir \
---output-prefix OUTPUT_PREFIX \
---gene-analysis \
---plot-type bar_cumulative
-```
-
-The path provided to the `--directory` parameter should be the same as the one provided to the `--output-dir` in the `pheval run` command
-
-## Personalising to your own tool
-
-If overriding this template to create your own runner implementation. There are key files that should change to fit with your runner implementation.
-
-1. The name of the Runner class in `src/pheval_template/runner.py` should be changed.
-2. Once the name of the Runner class has been customised, line 15 in `pyproject.toml` should also be changed to match the class name, then run `poetry lock` and `poetry install`
-
-The runner you give on the CLI will then change to the name of the runner class.
-
-You should also remove the `src/pheval_template/run/fake_predictor.py` and implement the running of your own tool. Methods in the post-processing can also be altered to process your own tools output.
diff --git a/docs/analysis.md b/docs/analysis.md
@@ -0,0 +1,7 @@
+# Scoring
+To fairly score clinically accurate diagnoses, given that we use only phenotypic data, we need to match the grounded answers of an LLM (or of Exomiser) to the correct diagnosis recorded in the phenopacket, which is an OMIM identifier. This is illustrated in the image below.
+![figure](images/mondo_grouping.png)
+
+# Statistics
+
+# More TBD
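The matching described in the Scoring section of `docs/analysis.md` above could look roughly like the sketch below. This is not the project's implementation. It assumes that a prediction counts as correct either when the predicted term maps directly to the true OMIM identifier, or when the predicted term is a grouping term with a descendant that does. The function names, the `TOY:` terms, and the toy OMIM identifiers are all hypothetical placeholders.

```python
def descendants(term, children):
    """Return all terms below `term` in a toy parent -> children hierarchy."""
    seen, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def score(predicted, true_omim, children, omim_xrefs):
    """Classify a grounded prediction against the phenopacket's OMIM identifier.

    Assumed rule (not confirmed by the source): a direct cross-reference is an
    exact match, a cross-reference on any descendant is a grouping match.
    """
    if true_omim in omim_xrefs.get(predicted, ()):
        return "exact"
    for term in descendants(predicted, children):
        if true_omim in omim_xrefs.get(term, ()):
            return "grouping"
    return "miss"


# Toy hierarchy and cross-references, made up for illustration only.
children = {"TOY:group": ["TOY:subtype_a", "TOY:subtype_b"]}
omim_xrefs = {
    "TOY:subtype_a": {"OMIM:toy-1"},
    "TOY:subtype_b": {"OMIM:toy-2"},
    "TOY:other": {"OMIM:toy-3"},
}

exact = score("TOY:subtype_a", "OMIM:toy-1", children, omim_xrefs)
grouping = score("TOY:group", "OMIM:toy-2", children, omim_xrefs)
miss = score("TOY:other", "OMIM:toy-1", children, omim_xrefs)
```

Whether a grouping match should count as fully correct is a design decision that this sketch leaves open.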
diff --git a/docs/images/mondo_grouping.png b/docs/images/mondo_grouping.png
diff --git a/docs/images/ppkt2score.png b/docs/images/ppkt2score.png
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,10 @@
+# Welcome to pheval.llm, formerly MALCO
+
+To systematically assess an LLM's ability to perform differential diagnosis, we use prompts programmatically created with [phenopacket2prompt](https://github.com/monarch-initiative/phenopacket2prompt), thereby avoiding any patient privacy issues. The original data are phenopackets located at [phenopacket-store](https://github.com/monarch-initiative/phenopacket-store/). We also developed a programmatic approach for scoring and grounding results, made possible by the ontological structure of the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/).
+
+Two main analyses are carried out:
+- A benchmark of some OpenAI GPT models against a state-of-the-art tool for differential diagnosis, [Exomiser](https://github.com/exomiser/Exomiser). The bottom line: Exomiser [clearly outperforms the LLMs](https://github.com/monarch-initiative/pheval.llm/blob/short_letter/notebooks/plot_exomiser_o1MINI_o1PREVIEW_4o.ipynb).
+- A comparison of gpt-4o's ability to carry out differential diagnosis when prompted in different languages. 
+
+## Project layout
+The steps we take are described in the figure below: ![figure](images/ppkt2score.png)
diff --git a/docs/layout.md b/docs/layout.md
@@ -0,0 +1,7 @@
+The first part of the code does:
+
+### Prepare step
+
+### Run step
+
+### Post process step
diff --git a/docs/reference.md b/docs/reference.md
@@ -0,0 +1,3 @@
+The grounding happens via
+
+::: src.malco.post_process.mondo_score_utils
diff --git a/docs/run.md b/docs/run.md
@@ -0,0 +1,10 @@
+# Grounding
+Since LLMs, as of November 2024, show little ability to precisely and reliably return the unique identifier of an entity in a database, we need to work around this issue. To transform a natural-language disease name such as "cystic fibrosis" into its corresponding [OMIM identifier OMIM:219700](https://omim.org/entry/219700), we use the following approach:
+
+<!--- Add links to files as soon as they are merged--->
+1. First, we try exact lexical matching between the LLM's reply and the OMIM disease labels.
+2. Then we run [CurateGPT](https://github.com/monarch-initiative/curategpt) on the remaining replies that have not been grounded.
+
+Note that we ground to Mondo.
+
+# OntoGPT
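The first grounding step in `docs/run.md` above, exact lexical matching, can be sketched as follows. This is a minimal illustration and not the project's code. The function names are hypothetical, and the case and whitespace normalization is an assumption about what "exact" tolerates. Names that fail to match are only collected here. In the described pipeline they would then go to CurateGPT, which this sketch does not call.

```python
def normalize(name):
    """Lower-case and collapse whitespace (an assumed, lenient notion of 'exact')."""
    return " ".join(name.lower().split())


def ground_exact(reply_names, omim_labels):
    """Split disease names into exactly matched ones and ones left ungrounded.

    reply_names: disease names extracted from an LLM reply.
    omim_labels: mapping from disease label to OMIM identifier.
    """
    index = {normalize(label): omim_id for label, omim_id in omim_labels.items()}
    grounded, remaining = {}, []
    for name in reply_names:
        omim_id = index.get(normalize(name))
        if omim_id is None:
            remaining.append(name)  # candidate for the second grounding step
        else:
            grounded[name] = omim_id
    return grounded, remaining


# The cystic fibrosis identifier is the one quoted in the text above.
labels = {"Cystic fibrosis": "OMIM:219700"}
grounded, remaining = ground_exact(
    ["cystic fibrosis", "Some unlisted syndrome"], labels
)
```

Keeping the unmatched names in a separate list makes the fallback step explicit and easy to swap out.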
diff --git a/docs/run_parameters.csv b/docs/run_parameters.csv
@@ -0,0 +1,3 @@
+"en"
+"gpt-4","gpt-3.5-turbo","gpt-4o","gpt-4-turbo"
+0,1
diff --git a/docs/setup.md b/docs/setup.md
@@ -0,0 +1,16 @@
+Before starting a run, edit the [run parameters](inputdir/run_parameters.csv) as follows:
+
+- The first line contains a non-empty, comma-separated list of (supported) language codes in which to prompt, each enclosed in double quotation marks.
+- The second line contains a non-empty, comma-separated list of (supported) model names to prompt, each enclosed in double quotation marks.
+- The third line contains two comma-separated binary entries, 0 (false) or 1 (true). Setting the first to 1 runs the prompting and grounding, i.e. the run step; setting the second to 1 runs the scoring and the rest of the analysis, i.e. the post-processing step.
+
+At this point one can install and run the code by doing:
+```shell
+poetry install
+poetry shell
+mkdir outputdirectory
+cp -r /path/to/promptdir inputdir/
+pheval run -i inputdir -r "malcorunner" -o outputdirectory -t tests
+```
+
+As an example, the [input file](https://github.com/monarch-initiative/pheval.llm/tree/main/docs/run_parameters.csv) will execute only the post_process step for English and for the models gpt-4, gpt-3.5-turbo, gpt-4o, and gpt-4-turbo, since the run flag is 0 and the post_process flag is 1.
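The three-line parameter format described in `docs/setup.md` above could be read with a small helper like the one below. This is a sketch built on Python's standard `csv` module, not the runner's own parser. The function name and the dictionary keys are hypothetical.

```python
import csv
import io


def parse_run_parameters(text):
    """Parse the three-line run parameters format into a dictionary."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) != 3:
        raise ValueError("expected exactly three non-empty lines")
    languages, models, flags = rows
    flags = [flag.strip() for flag in flags]
    if len(flags) != 2 or any(flag not in ("0", "1") for flag in flags):
        raise ValueError("third line must hold two entries, each 0 or 1")
    return {
        "languages": tuple(languages),
        "models": tuple(models),
        "do_run_step": flags[0] == "1",
        "do_post_process": flags[1] == "1",
    }


# Contents of the example run_parameters.csv shown in this diff.
example = '"en"\n"gpt-4","gpt-3.5-turbo","gpt-4o","gpt-4-turbo"\n0,1\n'
params = parse_run_parameters(example)
```

For the example file, `params["do_run_step"]` is `False` and `params["do_post_process"]` is `True`, matching the behaviour described above.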
diff --git a/inputdir/run_parameters.csv b/inputdir/run_parameters.csv
@@ -0,0 +1,3 @@
+"en"
+"gpt-4","gpt-3.5-turbo","gpt-4o","gpt-4-turbo"
+0,1