(draft) En analysis #35

Open
wants to merge 70 commits into
base: main
70 commits
2d89b7b
Make outdir if it doesn't exist
Jul 1, 2024
afbfbd9
Only English for this analysis
Jul 1, 2024
41be8e6
Add input and output from July run of malco
Jul 1, 2024
14ca7d1
Remove parallelization, add plot for top-k accuracy
Jul 1, 2024
9e5709d
Hard code path to results file
Jul 1, 2024
42bab8d
Add text mined results
Jul 2, 2024
c151b25
Correct order of args when calling ontogpt
Jul 3, 2024
af2a2db
Update ontogpt
Jul 6, 2024
183e700
Add results for (all 5,269) phenopackets
Jul 6, 2024
54a7bbe
Remove incomplete results
Jul 8, 2024
58a5339
Update script to make fig 2
Jul 8, 2024
37e524b
Notebook to calculate unique HPO terms and terms per case
Jul 8, 2024
ad2180b
NB to make fig 2
Jul 8, 2024
8e5d3b8
Change y axis on fig
Jul 8, 2024
f78eedc
Update summary statistics nb
Jul 10, 2024
6415c20
Add code to calculate % at rank 1, 3, 10
Jul 10, 2024
ad54d66
Use R instead of python for plot
Jul 10, 2024
211f902
Tweaks to R plot
Jul 10, 2024
640a5d0
Add JAMA color theme
Jul 10, 2024
5b87556
Commit text_mined phenopackets and prompts
Jul 11, 2024
55b3a84
Remove text-mined stuff
Jul 15, 2024
ce3efcd
Change plot to only show phenopacket (not text-mined) cases
Jul 15, 2024
18c4cd8
Tweaks to plot
Jul 15, 2024
678bd3f
Update fig 2
Jul 16, 2024
7b64f8d
Update notebook
Jul 26, 2024
739f346
mini script looking at how many times in published output the groundi…
leokim-l Aug 5, 2024
26b81e6
Add code to process exomiser results
Aug 29, 2024
fcbc65e
Parallelize processing exomiser results to get time down to hours ins…
Aug 30, 2024
8cf281b
Preliminary Exomiser results (not apples to apples with 4o data, sinc…
Sep 17, 2024
912d65c
Add stuff to gitignore
Sep 17, 2024
c6223e0
Script to make hits at n for exomiser
Sep 17, 2024
52e2ed9
Add stuff to .gitignore
Sep 17, 2024
06a70fb
Update nb to correcting rank Exomiser predictions
Sep 19, 2024
a3aefc0
Update notebook to finalize and analyze Exomiser results
Sep 20, 2024
57b5626
Merge branch 'short_letter' of https://github.com/monarch-initiative/…
Sep 20, 2024
0a094fe
Rename exomiser nb to be more descriptive
Sep 20, 2024
80ebfda
Working on grounding o1 results
Sep 22, 2024
75aee6c
Ground with Mondo
Sep 22, 2024
022b9a3
Add plots comparing the models
Sep 22, 2024
9d7713d
Fix bug writing out correct ID to output files
Sep 23, 2024
c922de0
Update o1 analysis nb
Sep 24, 2024
808dd1b
Add better grounding. Now o1 responses are about 98.4% grounded
Sep 24, 2024
1ede4f9
Add better grounding. Now o1 responses are about 98.4% grounded
Sep 24, 2024
e2732f6
Refactor grounding, no inexact matching, because it grounds spuriously
Sep 25, 2024
d8e8b47
Updated nb, died during run
Sep 26, 2024
fcf3969
Rerun analysis and plotting of o1 with exact match and split on ()
Sep 26, 2024
e534b0b
Working on better grounding for o1
Sep 26, 2024
8498a33
Update nb and results using OAK then curategpt for grounding
Sep 27, 2024
aa18420
Fix typo
Sep 27, 2024
bc06de8
Fix typo
Sep 27, 2024
43cb5cc
Add comment about how to create mondo collection using curategpt
Sep 27, 2024
60f1444
Add nb to (re)process gpt-4o
Sep 30, 2024
0bfdd4b
Notebook to make and analyze GPT-4o results (WIP)
Oct 1, 2024
eb236c2
Add gpt 4o analysis
Oct 1, 2024
861856e
Add o4 results to plot
Oct 1, 2024
3ef9ee6
Comment about 4o version
Oct 1, 2024
60fd23a
added prompts which hopefully will be the final ground thruth version…
leokim-l Oct 2, 2024
e95a15a
Writing nb to calculate stats for what diseases were found / not foun…
Oct 10, 2024
62b5361
Add o1 mini analysis, rename plot_exomiser_o1_4turbo.ipynb -> plot_ex…
Oct 15, 2024
7f03bd8
Add o1 mini notebook
Oct 16, 2024
1bf35bf
Add WIP mcnemars analysis
Oct 16, 2024
d2f213f
Add o1 preview results
Oct 16, 2024
187e121
Tweak legend
Oct 16, 2024
689cdbd
Tweak legend
Oct 16, 2024
aeb9f6a
Tweak figure
Oct 24, 2024
7b06b4e
Update plot_exomiser_o1MINI_o1PREVIEW_4o.ipynb
jmcmurry Oct 26, 2024
dc43c19
Merge pull request #53 from monarch-initiative/jmcmurry-patch-1
jmcmurry Oct 26, 2024
73a4b5d
Remove new (unused) prompts dir
Oct 29, 2024
db3a395
Tweak colors
Oct 29, 2024
a9395b9
Tidy up notebook dir
Oct 29, 2024
The diff you're trying to view is too large. We only load the first 3000 changed files.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -8,3 +8,6 @@ outputdir
inputdir/all_phenopackets.zip
inputdir/phenopacket-store/
.openai_cache.db
supplemental_data/
outputdir_all_2024_07_04/
stagedb
File renamed without changes.
Binary file added data/all/prompts/.DS_Store
Binary file not shown.
5,213 changes: 5,213 additions & 0 deletions data/all/prompts/correct_results.tsv

Large diffs are not rendered by default.

5,213 changes: 5,213 additions & 0 deletions data/all/prompts/correct_results.tsv.malformed

Large diffs are not rendered by default.

18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10571775_KSN_II_1_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset was not specified.
She presented with Growth delay, Hypokalemia, Hyperchloremic metabolic acidosis, Alkaline urine, Distal renal tubular acidosis, Elliptocytosis, Impaired urinary acidification, and Decreased serum bicarbonate concentration. However, the following features were excluded: Elevated circulating creatinine concentration.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10571775_YAT_II_1_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a male. Disease onset was not specified.
He presented with Growth delay, Hypokalemia, Hyperchloremic metabolic acidosis, Alkaline urine, Distal renal tubular acidosis, Elliptocytosis, Impaired urinary acidification, and Decreased serum bicarbonate concentration. However, the following features were excluded: Elevated circulating creatinine concentration.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_11_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 46-year, 0-month old.
She presented with Atrial fibrillation, Elevated circulating creatine kinase concentration, and Second degree atrioventricular block. However, the following features were excluded: Dilated cardiomyopathy, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_13_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a male. Disease onset occurred when the proband was 43-year, 0-month old.
He presented with Atrial fibrillation, Sinus bradycardia, and First degree atrioventricular block. However, the following features were excluded: Dilated cardiomyopathy, Elevated circulating creatine kinase concentration, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_14_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 40-year, 0-month old.
She presented with First degree atrioventricular block. However, the following features were excluded: Atrial fibrillation, Dilated cardiomyopathy, Elevated circulating creatine kinase concentration, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_15_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 39-year, 0-month old.
She presented with Third degree atrioventricular block. However, the following features were excluded: Atrial fibrillation, Dilated cardiomyopathy, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_1_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a male. Disease onset occurred when the proband was 51-year, 0-month old.
He presented with Dilated cardiomyopathy and Third degree atrioventricular block. However, the following features were excluded: Atrial fibrillation, Elevated circulating creatine kinase concentration, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_5_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 39-year, 0-month old.
She was found not to have the following features: Atrial fibrillation, Dilated cardiomyopathy, Elevated circulating creatine kinase concentration, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_8__en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 42-year, 0-month old.
She presented with Atrial fibrillation, Dilated cardiomyopathy, Thromboembolism, Sinus bradycardia, and Abnormal atrioventricular conduction. However, the following features were excluded: Elevated circulating creatine kinase concentration, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_A_III_9_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a male. Disease onset occurred when the proband was 40-year, 0-month old.
He presented with Dilated cardiomyopathy, Congestive heart failure, and Third degree atrioventricular block. However, the following features were excluded: Atrial fibrillation, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_B_III_11_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 53-year, 0-month old.
She presented with Atrial fibrillation, Dilated cardiomyopathy, Stroke, and Abnormal atrioventricular conduction. However, the following features were excluded: Elevated circulating creatine kinase concentration, Sudden cardiac death, Progeroid facial appearance, and Lipodystrophy.
18 changes: 18 additions & 0 deletions data/all/prompts/en/PMID_10580070_B_III_13_en-prompt.txt
@@ -0,0 +1,18 @@
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts.
I am going to give you part of a medical case. In this case, you are “Dr. GPT-4”, an AI language model who is providing
a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known
today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases
when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or
very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with
a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be
specified with disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is
Cystic fibrosis, provide this:

1. Branchiooculofacial syndrome
2. Cystic fibrosis

This list should provide as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Here is the case:

The proband was a female. Disease onset occurred when the proband was 39-year, 0-month old.
She presented with Atrial fibrillation, Dilated cardiomyopathy, Sudden cardiac death, Congestive heart failure, and Abnormal atrioventricular conduction. However, the following features were excluded: Progeroid facial appearance and Lipodystrophy.
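Each prompt file in this diff asks for the differential as a numbered list ("1. Branchiooculofacial syndrome", "2. Cystic fibrosis"). As a rough sketch of how such a reply could be read back into ranked disease names, assuming the model follows the requested format, something like the following would work. `parse_differential` is a hypothetical helper and is not taken from this PR; grounding the names to ontology IDs is a separate step that is not shown.

```python
import re
from typing import List

# Matches lines such as "1. Cystic fibrosis" or "2) Cystic fibrosis".
_ITEM = re.compile(r"^\s*(\d+)[.)]\s*(.+?)\s*$")


def parse_differential(reply: str) -> List[str]:
    """Return disease names in the order they are listed in the reply.

    Lines that are not numbered list items (blank lines, preamble,
    commentary) are skipped.
    """
    names = []
    for line in reply.splitlines():
        match = _ITEM.match(line)
        if match:
            names.append(match.group(2))
    return names


# The example list given in the prompt itself.
reply = "1. Branchiooculofacial syndrome\n2. Cystic fibrosis\n"
print(parse_differential(reply))
```

The sketch trusts the order of appearance rather than the printed numbers, so a reply that skips or repeats a number still yields a usable ranking.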