Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update KEGG API #540

Merged
merged 5 commits into from
Mar 1, 2023
Merged

update KEGG API #540

merged 5 commits into from
Mar 1, 2023

Conversation

huerqiang
Copy link
Contributor

Fix #539
老师,我这次提交修改了kEGG pathway的API url

kegg_list <- function(db, species = NULL) {
    if (db == "pathway") {
        url <- paste("https://rest.kegg.jp/list", db, species, sep="/")
    } else {
        ## module do not need species
        url <- paste("https://rest.kegg.jp/list", db, sep="/")
    }
    
    kegg_rest(url)
}

在KEGG的内容更新后,原有的keggmodule2name.df[,1] %<>% gsub("md:", "", .)等语句便不再需要了,我这次将它们注释掉了。
现在KEGG的download to gson file 和 enrichment 均可以正常使用。
Example:
test_KEGG.pdf

@GuangchuangYu GuangchuangYu merged commit fa231bd into YuLab-SMU:master Mar 1, 2023
@huerqiang huerqiang deleted the KEGGapi branch March 1, 2023 12:38
@guidohooiveld
Copy link

guidohooiveld commented Mar 3, 2023

Hi, I am not 100% sure whether it relates to this update, or whether KEGG changed something internally, but could the organism be removed from the gene set descriptions? E.g. - Homo sapiens (human). This was the default output until very recently... See below. Unfortunately I did not track the package versions, so I don't know when, and how, exactly the organism name was added to the set descriptions. This change has also been implicitly noticed in this post: #542 (comment)

Thanks!

## generate input:
library(clusterProfiler)
data(geneList, package="DOSE")
 
## Select the 1st 150 genes as being an example set of upregulated genes
upreg.genes <- names(geneList)[1:150]

"Old" output; no organism added to set descriptions (copied from my post here):

> res1 <- enrichKEGG(gene=upreg.genes, organism = "hsa", keyType = "ncbi-geneid",
+     pvalueCutoff = 0.05, pAdjustMethod = "BH", universe=NULL, minGSSize = 10,
+     maxGSSize = 500, qvalueCutoff = 0.2)
Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/hsa"...
Warning message:
In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs
> 
> as.data.frame(res1)[1:3, ]
               ID                                                   Description
hsa04110 hsa04110                                                    Cell cycle
hsa04657 hsa04657                                       IL-17 signaling pathway
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
         GeneRatio  BgRatio       pvalue     p.adjust       qvalue
hsa04110     16/64 127/6451 4.302610e-14 5.464315e-12 5.163132e-12
hsa04657      9/64  94/6451 2.863277e-07 1.818181e-05 1.717966e-05
hsa04061      8/64 100/6451 5.441530e-06 2.303581e-04 2.176612e-04
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04657                             4312/6280/6279/6278/3627/2921/6364/8061/4318
hsa04061                                 3627/10563/6373/4283/6362/6355/2921/6364
         Count
hsa04110    16
hsa04657     9
hsa04061     8
> 

Current output, when using identical call, organism is added to descriptions. (Also more genes have been annotated to sets).

> as.data.frame(res1)[1:3, ]
               ID                                    Description GeneRatio
hsa04110 hsa04110              Cell cycle - Homo sapiens (human)     16/73
hsa04657 hsa04657 IL-17 signaling pathway - Homo sapiens (human)      9/73
hsa04114 hsa04114          Oocyte meiosis - Homo sapiens (human)      9/73
          BgRatio       pvalue     p.adjust       qvalue
hsa04110 127/8275 9.139892e-15 1.380124e-12 1.279585e-12
hsa04657  94/8275 1.140666e-07 8.612029e-06 7.984663e-06
hsa04114 131/8275 1.955475e-06 8.468563e-05 7.851648e-05
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04657                             4312/6280/6279/6278/3627/2921/6364/8061/4318
hsa04114                               991/9133/983/4085/51806/6790/891/9232/5347
         Count
hsa04110    16
hsa04657     9
hsa04114     9
>

@ksinghal28
Copy link

ksinghal28 commented Mar 3, 2023

Hi, I'm not sure if this is related to the comment and PR above, but I noticed a change when running the gseKEGG function.
When I was running it 2 weeks ago, my gene list with the following structure ran without any issues-

> head(kegg_gene_list, 10)
    12346     17748    217593     17750 105246823     15510 100504069     22042     17975     15519 
1.4887187 1.2145344 0.9816236 0.9398387 0.9026313 0.7862467 0.7230369 0.6888081 0.6491812 0.6275808

However today when I ran the same command, I got the following error-

preparing geneSet collections...
--> Expected input gene ID: 
Error in check_gene_id(geneList, geneSets) : 
  --> No gene can be mapped....

(Nothing in my input changed)

For more context, my data is from a mouse single-cell experiment.
After getting a list of genes and their fold change values (stored in the dataframe called markers), I used the following code to run it-
(the variable organism corresponds to org.Mm.eg.db )

 markers <- tibble::rownames_to_column(markers, "gene")
  temp_original_gene_list <- markers$avg_log2FC
  names(temp_original_gene_list) <- markers$gene
  ids <- bitr(names(temp_original_gene_list), fromType = "SYMBOL", toType = "ENTREZID", OrgDb = organism)
  
  # remove duplicate IDS 
  dedup_ids = ids[!duplicated(ids[c("SYMBOL")]),]
  # Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above
  df2 = markers[markers$gene %in% dedup_ids$SYMBOL,]
  # Create a new column in df2 with the corresponding ENTREZ IDs
  df2$Y = dedup_ids$ENTREZID
  # Create a vector of the gene unuiverse
  kegg_gene_list <- df2$avg_log2FC
  # Name vector with ENTREZ ids
  names(kegg_gene_list) <- df2$Y
  # omit any NA values 
  kegg_gene_list<-na.omit(kegg_gene_list)
  # sort the list in decreasing order (required for clusterProfiler)
  kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
  
  temp_kegg <- gseKEGG(geneList=kegg_gene_list,
                       keyType = "ncbi-geneid",
                    minGSSize = 120, 
                    pvalueCutoff = 0.05, 
                    organism = "mmu",
                    eps = 0)

Also, a quick comment about the error message, previous error messages online seem to have more information about the expected input gene IDs, but I didn't see any for my error message.

Please let me know if there's any other information I can provide to help debug this.
Thank you!

@ksinghal28
Copy link

Update- thanks to @guidohooiveld 's previous comment, I came across this message in the thread suggesting to run the devtools version

That seems to have done the trick and fixed my issue!

@ZZhenQi
Copy link

ZZhenQi commented Mar 5, 2023

Hi, I am not 100% sure whether it relates to this update, or whether KEGG changed something internally, but could the organism be removed from the gene set descriptions? E.g. - Homo sapiens (human). This was the default output until very recently... See below. Unfortunately I did not track the package versions, so I don't know when, and how, exactly the organism name was added to the set descriptions. This change has also been implicitly noticed in this post: #542 (comment)

Thanks!

## generate input:
library(clusterProfiler)
data(geneList, package="DOSE")
 
## Select the 1st 150 genes as being an example set of upregulated genes
upreg.genes <- names(geneList)[1:150]

"Old" output; no organism added to set descriptions (copied from my post here):

> res1 <- enrichKEGG(gene=upreg.genes, organism = "hsa", keyType = "ncbi-geneid",
+     pvalueCutoff = 0.05, pAdjustMethod = "BH", universe=NULL, minGSSize = 10,
+     maxGSSize = 500, qvalueCutoff = 0.2)
Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/hsa"...
Warning message:
In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs
> 
> as.data.frame(res1)[1:3, ]
               ID                                                   Description
hsa04110 hsa04110                                                    Cell cycle
hsa04657 hsa04657                                       IL-17 signaling pathway
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
         GeneRatio  BgRatio       pvalue     p.adjust       qvalue
hsa04110     16/64 127/6451 4.302610e-14 5.464315e-12 5.163132e-12
hsa04657      9/64  94/6451 2.863277e-07 1.818181e-05 1.717966e-05
hsa04061      8/64 100/6451 5.441530e-06 2.303581e-04 2.176612e-04
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04657                             4312/6280/6279/6278/3627/2921/6364/8061/4318
hsa04061                                 3627/10563/6373/4283/6362/6355/2921/6364
         Count
hsa04110    16
hsa04657     9
hsa04061     8
> 

Current output, when using identical call, organism is added to descriptions. (Also more genes have been annotated to sets).

> as.data.frame(res1)[1:3, ]
               ID                                    Description GeneRatio
hsa04110 hsa04110              Cell cycle - Homo sapiens (human)     16/73
hsa04657 hsa04657 IL-17 signaling pathway - Homo sapiens (human)      9/73
hsa04114 hsa04114          Oocyte meiosis - Homo sapiens (human)      9/73
          BgRatio       pvalue     p.adjust       qvalue
hsa04110 127/8275 9.139892e-15 1.380124e-12 1.279585e-12
hsa04657  94/8275 1.140666e-07 8.612029e-06 7.984663e-06
hsa04114 131/8275 1.955475e-06 8.468563e-05 7.851648e-05
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04657                             4312/6280/6279/6278/3627/2921/6364/8061/4318
hsa04114                               991/9133/983/4085/51806/6790/891/9232/5347
         Count
hsa04110    16
hsa04657     9
hsa04114     9
>

add following code can solve the problem:
kk@result$Description= gsub(' - Homo sapiens \(human\)','',kk@result$Description)

@fransiss
Copy link

Hi I am still having the issue

preparing geneSet collections...
--> Expected input gene ID:
Error in check_gene_id(geneList, geneSets) :
--> No gene can be mapped....

when I try to run gseKEGG and I tried all the possible conversion of mi 20k genes into Gene ID and KEGG ID
also tried to devtools::install_github("YuLab-SMU/clusterProfiler") install

but nothing worked,
has someone the stil same problem as well ?

thanks

@huerqiang
Copy link
Contributor Author

@fransiss Please provide your code.

@zapaterc
Copy link

I get the same error message...

" --> No gene can be mapped...."

I also have tried to re-install ClusterProfilerfrom the devtools and doesnt work still

flybaseToID <- AnnotationDbi::select(org.Dm.eg.db,
keys = myResDF$FLYBASE,
keytype = "FLYBASE",
columns="FLYBASECG")

myResDF_IDs <-merge(flybaseToID,myResDF)
myResDF_IDs$FLYBASECG = paste0('Dmel_', myResDF_IDs$FLYBASECG)
myResDF_IDs_ranked <- myResDF_IDs %>% mutate(rank = rank(stat, ties.method = "random")) %>%
arrange(desc(rank))
write.csv(myResDF_IDs_ranked,"EnrichmentAnalysis/AllChromosomes/Stranded/GSEA_Log2FC0/myResDF_IDs_ranked.csv", row.names = FALSE)

forKEGG <- myResDF_IDs_ranked$rank
names(forKEGG) <- myResDF_IDs_ranked$FLYBASECG
forKEGG <- forKEGG[order(forKEGG, decreasing = T)]
forKEGG_NoDups <- forKEGG[!duplicated(names(forKEGG))]
KEGG_tmodEnriched <- gseKEGG(
forKEGG_NoDups,
organism = "dme",
keyType = "kegg",
eps = 0,
scoreType = "pos")

@guidohooiveld
Copy link

Without also knowing your input it is impossible to give some help...

So please show the first 10 lines of input of myResDF$FLYBASE as well as forKEGG_NoDups.
Also include the output from str(myResDF$FLYBASE) and str(forKEGG_NoDups).

@zapaterc
Copy link

zapaterc commented Jul 15, 2024

The gene annotation that I've been using is that for FLYBASE (FBgn00XXXXX), however, my understanding is that gseKEGG(keyType = "kegg") wants a specific annotation (that's what the second paragraph of the codeis for): "Dmel_CGXXXX"

> myResDF$FLYBASE[1:10]
 [1] "FBgn0010423" "FBgn0032538" "FBgn0050029" "FBgn0032282" "FBgn0033600" "FBgn0284248" "FBgn0032280"
 [8] "FBgn0003375" "FBgn0003377" "FBgn0003060"
> forKEGG_NoDups[1:10]
 Dmel_CG3588 Dmel_CG17285  Dmel_CG7592  Dmel_CG3763  Dmel_CG6806 Dmel_CR11538 Dmel_CG34166 Dmel_CG33486 
       16212        16211        16210        16209        16208        16207        16206        16205 
 Dmel_CG9739  Dmel_CG9080 
       16204        16203 
> str(myResDF$FLYBASE)
 chr [1:16212] "FBgn0010423" "FBgn0032538" "FBgn0050029" "FBgn0032282" "FBgn0033600" "FBgn0284248" ...
> str(forKEGG_NoDups)
 Named int [1:16193] 16212 16211 16210 16209 16208 16207 16206 16205 16204 16203 ...
 - attr(*, "names")= chr [1:16193] "Dmel_CG3588" "Dmel_CG17285" "Dmel_CG7592" "Dmel_CG3763" ...

@guidohooiveld
Copy link

Would you mind editing or reposting your previous post as text (and no screenshots)? This will allow copy/pasting the ids (preventing making mistakes).

To generate code boxes:
1st: paste the text copied from the R console here on GitHub, then select the text and click the code button on top of the GitHub text box (=5th button from the left; <>).

@guidohooiveld
Copy link

I had some time left...

So your input are flybaseids.

Let's generate some sample data that is based in flybaseids. This is to mimic your dataset.
Note that the central identifier in org.Dm.eg.db are entrezid!

> ## load required libraries.
> library(clusterProfiler)
> library(org.Dm.eg.db)
> library(enrichplot)
> 
> ## extract all keys present in the OrgDb, and convert into flybaseids.
> ## note that for GSEA a per-gene metric is required that is used for ranking.
> ## fake the fold changes (= ranking metric) so they range from -15 to 15 for all flybaseids.
> ## this is thus your starting point (= geneList)
> 
> all.flybase.ids <- select(org.Dm.eg.db, keys = keys(org.Dm.eg.db),
+                           keytype = "ENTREZID",
+                           columns = c("FLYBASE") )$FLYBASE
'select()' returned 1:1 mapping between keys and columns
> 
> ## check
> length(all.flybase.ids)
[1] 25079
> 
> head(all.flybase.ids)
[1] "FBgn0040373" "FBgn0040372" "FBgn0261446" "FBgn0000316" "FBgn0005427"
[6] "FBgn0040370"
> 
> ## generate ranked list (based on flybaseids)
> ## generate input data
> geneList <- runif(n = length(all.flybase.ids), min = -15, max = 15)
> names(geneList) <- base::sample( all.flybase.ids, length(all.flybase.ids))
> ## for GSEA, sort on FC
> geneList <- sort(geneList, decreasing = TRUE)
> 
> ## check
> length(geneList)
[1] 25079
> 
> head(geneList)
FBgn0039765 FBgn0011283 FBgn0034667 FBgn0266787 FBgn0002805 FBgn0265932 
   14.99653    14.99473    14.99202    14.99193    14.99174    14.99161 
> 
> tail(geneList)
FBgn0265536 FBgn0002160 FBgn0034491 FBgn0025692 FBgn0004937 FBgn0000470 
  -14.98919   -14.98924   -14.99190   -14.99276   -14.99491   -14.99612 
> 
>

We are all set now, so let's get started!

> ## convert all flybaseids into keggids.
> ## according to you / zapaterc: keggids are 'Dme_ pasted' to corresponding flybasecg
> ##
> ## from inside-to-outside of code
> ## convert input data to keggids (function mapIds)
> ## remove all flybaseids that don't have a flybasecg annotation (function Filter + Negate)
> ## unlist
> 
> flyid2kegg <- unlist ( Filter( Negate(anyNA),
+                         mapIds(org.Dm.eg.db, keys=all.flybase.ids, column="FLYBASECG", keytype="FLYBASE")  ) )
'select()' returned 1:1 mapping between keys and columns
> 
> ## paste 'Dme_ pasted' to corresponding flybasecg
> ## note the use of [] !
> flyid2kegg[] <-  paste("Dmel",  flyid2kegg, sep="_")
> 
> ## check
> length(flyid2kegg)
[1] 17872
> head(flyid2kegg)
   FBgn0040373    FBgn0040372    FBgn0261446    FBgn0000316    FBgn0005427 
 "Dmel_CG3038"  "Dmel_CG2995" "Dmel_CG13377"  "Dmel_CG2945"  "Dmel_CG3114" 
   FBgn0040370 
"Dmel_CG13375" 
> 
> 
> ## convert input data (geneList) to keggids,
> names(geneList) <- flyid2kegg[names(geneList)]
> ## remove flybaseids without corresponding flybasecgid
> ## note that 25079 - 17872 = 7207 flybaseids do NOT have a corresponding flybasegcid.
> ## these 7202 genes are not recognized in KEGG, and will thus be removed.
> geneList <- geneList[!is.na(names(geneList))]
> 
> ## check
> length(geneList)
[1] 17872
> head(geneList)
 Dmel_CG9688  Dmel_CG6641 Dmel_CG13493 Dmel_CR45252 Dmel_CR44721  Dmel_CG3523 
    14.99653     14.99473     14.99202     14.99193     14.99161     14.98965 
> tail(geneList)
Dmel_CG42797  Dmel_CG6055  Dmel_CG2047 Dmel_CR44386 Dmel_CG11055  Dmel_CG3814 
   -14.98838    -14.98844    -14.98886    -14.98919    -14.99190    -14.99276 
> 
> 
> ## run gseKEGG with keggids
> ## note that NO significance cutoff is applied!
> KEGG_tmodEnriched <- gseKEGG(
+                      geneList,
+                      organism = "dme",
+                      keyType = "kegg",
+                      eps = 0,
+                      scoreType = "pos",
+                      pvalueCutoff = 1) 
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
> 
> ## further process output.
> KEGG_tmodEnriched <- pairwise_termsim(KEGG_tmodEnriched)
> 
> ## clean up output by removing name of organism
> KEGG_tmodEnriched@result$Description <- gsub(pattern = " - Drosophila melanogaster (fruit fly)", replacement = "", KEGG_tmodEnriched@result$Description, fixed = TRUE)
> 
> ## check
> KEGG_tmodEnriched
#
# Gene Set Enrichment Analysis
#
#...@organism    dme 
#...@setType     KEGG 
#...@keytype     kegg 
#...@geneList    Named num [1:17872] 15 15 15 15 15 ...
 - attr(*, "names")= chr [1:17872] "Dmel_CG9688" "Dmel_CG6641" "Dmel_CG13493" "Dmel_CR45252" ...
#...nPerm        
#...pvalues adjusted by 'BH' with cutoff <1 
#...121 enriched terms found
'data.frame':   121 obs. of  11 variables:
 $ ID             : chr  "dme00071" "dme00010" "dme00280" "dme03250" ...
 $ Description    : chr  "Fatty acid degradation" "Glycolysis / Gluconeogenesis" "Valine, leucine and isoleucine degradation" "Viral life cycle - HIV-1" ...
 $ setSize        : int  33 55 32 33 13 21 32 24 40 29 ...
 $ enrichmentScore: num  0.435 0.37 0.401 0.385 0.518 ...
 $ NES            : num  2.37 2.24 2.18 2.1 2.26 ...
 $ pvalue         : num  0.00216 0.00298 0.01099 0.01113 0.02689 ...
 $ p.adjust       : num  0.18 0.18 0.337 0.337 0.501 ...
 $ qvalue         : num  0.179 0.179 0.334 0.334 0.497 ...
 $ rank           : num  3661 4496 2728 4758 1366 ...
 $ leading_edge   : chr  "tags=42%, list=20%, signal=34%" "tags=45%, list=25%, signal=34%" "tags=31%, list=15%, signal=27%" "tags=52%, list=27%, signal=38%" ...
 $ core_enrichment: chr  "Dmel_CG8732/Dmel_CG4600/Dmel_CG6598/Dmel_CG4500/Dmel_CG13890/Dmel_CG3902/Dmel_CG12262/Dmel_CG3961/Dmel_CG5009/D"| __truncated__ "Dmel_CG5261/Dmel_CG10202/Dmel_CG6598/Dmel_CG17725/Dmel_CG2767/Dmel_CG32849/Dmel_CG5165/Dmel_CG10996/Dmel_CG9961"| __truncated__ "Dmel_CG17691/Dmel_CG4311/Dmel_CG4600/Dmel_CG18155/Dmel_CG3902/Dmel_CG12262/Dmel_CG15093/Dmel_CG3752/Dmel_CG8199/Dmel_CG3267" "Dmel_CG7185/Dmel_CG8055/Dmel_CG7626/Dmel_CG5994/Dmel_CG18734/Dmel_CG12876/Dmel_CG5179/Dmel_CG1404/Dmel_CG4913/D"| __truncated__ ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> as.data.frame(KEGG_tmodEnriched)[1:5,]
               ID                                Description setSize
dme00071 dme00071                     Fatty acid degradation      33
dme00010 dme00010               Glycolysis / Gluconeogenesis      55
dme00280 dme00280 Valine, leucine and isoleucine degradation      32
dme03250 dme03250                   Viral life cycle - HIV-1      33
dme00061 dme00061                    Fatty acid biosynthesis      13
         enrichmentScore      NES      pvalue  p.adjust    qvalue rank
dme00071       0.4352322 2.373871 0.002164406 0.1801904 0.1787012 3661
dme00010       0.3695726 2.241283 0.002978354 0.1801904 0.1787012 4496
dme00280       0.4013693 2.178607 0.010985174 0.3368226 0.3340389 2728
dme03250       0.3854747 2.102481 0.011134630 0.3368226 0.3340389 4758
dme00061       0.5179240 2.264565 0.026885051 0.5007849 0.4966462 1366
                           leading_edge
dme00071 tags=42%, list=20%, signal=34%
dme00010 tags=45%, list=25%, signal=34%
dme00280 tags=31%, list=15%, signal=27%
dme03250 tags=52%, list=27%, signal=38%
dme00061  tags=46%, list=8%, signal=43%
                                                                                                                                                                                                                                                                                                                core_enrichment
dme00071                                                                                                                                              Dmel_CG8732/Dmel_CG4600/Dmel_CG6598/Dmel_CG4500/Dmel_CG13890/Dmel_CG3902/Dmel_CG12262/Dmel_CG3961/Dmel_CG5009/Dmel_CG7461/Dmel_CG3752/Dmel_CG2107/Dmel_CG9629/Dmel_CG4592
dme00010 Dmel_CG5261/Dmel_CG10202/Dmel_CG6598/Dmel_CG17725/Dmel_CG2767/Dmel_CG32849/Dmel_CG5165/Dmel_CG10996/Dmel_CG9961/Dmel_CG3752/Dmel_CG11876/Dmel_CG7059/Dmel_CG5432/Dmel_CG11249/Dmel_CG8251/Dmel_CG1721/Dmel_CG12055/Dmel_CG9629/Dmel_CG6058/Dmel_CG15400/Dmel_CG11140/Dmel_CG32444/Dmel_CG9390/Dmel_CG7024/Dmel_CG10467
dme00280                                                                                                                                                                                            Dmel_CG17691/Dmel_CG4311/Dmel_CG4600/Dmel_CG18155/Dmel_CG3902/Dmel_CG12262/Dmel_CG15093/Dmel_CG3752/Dmel_CG8199/Dmel_CG3267
dme03250                                                                                                        Dmel_CG7185/Dmel_CG8055/Dmel_CG7626/Dmel_CG5994/Dmel_CG18734/Dmel_CG12876/Dmel_CG5179/Dmel_CG1404/Dmel_CG4913/Dmel_CG9670/Dmel_CG17051/Dmel_CG4107/Dmel_CG6605/Dmel_CG7815/Dmel_CG2848/Dmel_CG5874/Dmel_CG32217
dme00061                                                                                                                                                                                                                                              Dmel_CG3523/Dmel_CG8732/Dmel_CG16935/Dmel_CG18155/Dmel_CG4500/Dmel_CG3961
> 

That's it!

Yet.....: when continuing with visualization, it may be useful to convert the keggid back into flybaseids or symbols.

Note that keggid are thus not available in the OrgDb, so the function setReadable() can not be used! Therefore you need to manually create the mappings!
See also: YuLab-SMU/enrichplot#275 (comment)

> ## first, create mappings kegg2flyid and kegg2symbol
> flyid2symbol <- unlist ( Filter( Negate(anyNA),
+                      mapIds(org.Dm.eg.db, keys=all.flybase.ids, column="SYMBOL", keytype="FLYBASE")  ) )
'select()' returned 1:1 mapping between keys and columns
> ## then reverse flyid2kegg mapping
> kegg2flyid <- flyid2kegg; names(kegg2flyid) <- flyid2kegg; kegg2flyid[] <- names(flyid2kegg)
> ## finally, convert kegg2id into kegg2symbol
> kegg2symbol <- flyid2symbol[kegg2flyid]; names(kegg2symbol) <- names(kegg2flyid)
> 
> 
> 
> ## continue with extracting all genes from GSEA result
> gc <- geneInCategory(KEGG_tmodEnriched)  ## function geneInCategory is part of DOSE
> genes <- names(KEGG_tmodEnriched@geneList)  ## slot geneList in results contains input
> 
> ## Then, either A), when converting into flybaseids is wanted
> gn <- kegg2flyid[genes]
> 
> ## or, B), when converting into symbols is wanted
> gn <- kegg2symbol[genes]
> 
> 
> ## continue after either A) or B)
> gc <- lapply(gc, function(i) gn[i])
> 
> res <- KEGG_tmodEnriched@result
> gc <- gc[as.character(res$ID)]
> 
> geneID <- sapply(gc, paste0, collapse="/")
> res$core_enrichment <- unlist(geneID)
> 
> ## complete the output that is added to the returned object
> KEGG_tmodEnriched@gene2Symbol <- gn
> KEGG_tmodEnriched@readable <- TRUE
> KEGG_tmodEnriched@result <- res
> KEGG_tmodEnriched@keytype <- "KEGG"
>

A) after converting into flybaseids:

> ## check A) after converting into flybaseids
> KEGG_tmodEnriched
#
# Gene Set Enrichment Analysis
#
#...@organism    dme 
#...@setType     KEGG 
#...@keytype     KEGG 
#...@geneList    Named num [1:17872] 15 15 15 15 15 ...
 - attr(*, "names")= chr [1:17872] "Dmel_CG9688" "Dmel_CG6641" "Dmel_CG13493" "Dmel_CR45252" ...
#...nPerm        
#...pvalues adjusted by 'BH' with cutoff <1 
#...121 enriched terms found
'data.frame':   121 obs. of  11 variables:
 $ ID             : chr  "dme00071" "dme00010" "dme00280" "dme03250" ...
 $ Description    : chr  "Fatty acid degradation" "Glycolysis / Gluconeogenesis" "Valine, leucine and isoleucine degradation" "Viral life cycle - HIV-1" ...
 $ setSize        : int  33 55 32 33 13 21 32 24 40 29 ...
 $ enrichmentScore: num  0.435 0.37 0.401 0.385 0.518 ...
 $ NES            : num  2.37 2.24 2.18 2.1 2.26 ...
 $ pvalue         : num  0.00216 0.00298 0.01099 0.01113 0.02689 ...
 $ p.adjust       : num  0.18 0.18 0.337 0.337 0.501 ...
 $ qvalue         : num  0.179 0.179 0.334 0.334 0.497 ...
 $ rank           : num  3661 4496 2728 4758 1366 ...
 $ leading_edge   : chr  "tags=42%, list=20%, signal=34%" "tags=45%, list=25%, signal=34%" "tags=31%, list=15%, signal=27%" "tags=52%, list=27%, signal=38%" ...
 $ core_enrichment: chr  "FBgn0263120/FBgn0040064/FBgn0011768/FBgn0286723/FBgn0035169/FBgn0036824/FBgn0035811/FBgn0036821/FBgn0027572/FBg"| __truncated__ "FBgn0283658/FBgn0033969/FBgn0011768/FBgn0003067/FBgn0037537/FBgn0042710/FBgn0003076/FBgn0030525/FBgn0031451/FBg"| __truncated__ "FBgn0039993/FBgn0010611/FBgn0040064/FBgn0029945/FBgn0036824/FBgn0035811/FBgn0034390/FBgn0012036/FBgn0037709/FBgn0042083" "FBgn0035872/FBgn0086656/FBgn0040273/FBgn0017430/FBgn0004598/FBgn0086346/FBgn0019949/FBgn0020255/FBgn0026441/FBg"| __truncated__ ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> as.data.frame(KEGG_tmodEnriched)[1:5,]
               ID                                Description setSize
dme00071 dme00071                     Fatty acid degradation      33
dme00010 dme00010               Glycolysis / Gluconeogenesis      55
dme00280 dme00280 Valine, leucine and isoleucine degradation      32
dme03250 dme03250                   Viral life cycle - HIV-1      33
dme00061 dme00061                    Fatty acid biosynthesis      13
         enrichmentScore      NES      pvalue  p.adjust    qvalue rank
dme00071       0.4352322 2.373871 0.002164406 0.1801904 0.1787012 3661
dme00010       0.3695726 2.241283 0.002978354 0.1801904 0.1787012 4496
dme00280       0.4013693 2.178607 0.010985174 0.3368226 0.3340389 2728
dme03250       0.3854747 2.102481 0.011134630 0.3368226 0.3340389 4758
dme00061       0.5179240 2.264565 0.026885051 0.5007849 0.4966462 1366
                           leading_edge
dme00071 tags=42%, list=20%, signal=34%
dme00010 tags=45%, list=25%, signal=34%
dme00280 tags=31%, list=15%, signal=27%
dme03250 tags=52%, list=27%, signal=38%
dme00061  tags=46%, list=8%, signal=43%
                                                                                                                                                                                                                                                                                                     core_enrichment
dme00071                                                                                                                                     FBgn0263120/FBgn0040064/FBgn0011768/FBgn0286723/FBgn0035169/FBgn0036824/FBgn0035811/FBgn0036821/FBgn0027572/FBgn0034432/FBgn0012036/FBgn0035383/FBgn0036857/FBgn0032162
dme00010 FBgn0283658/FBgn0033969/FBgn0011768/FBgn0003067/FBgn0037537/FBgn0042710/FBgn0003076/FBgn0030525/FBgn0031451/FBgn0012036/FBgn0039635/FBgn0038957/FBgn0039425/FBgn0037115/FBgn0003074/FBgn0014869/FBgn0001091/FBgn0036857/FBgn0000064/FBgn0031463/FBgn0010548/FBgn0043783/FBgn0012034/FBgn0029722/FBgn0035679
dme00280                                                                                                                                                                                     FBgn0039993/FBgn0010611/FBgn0040064/FBgn0029945/FBgn0036824/FBgn0035811/FBgn0034390/FBgn0012036/FBgn0037709/FBgn0042083
dme03250                                                                                                 FBgn0035872/FBgn0086656/FBgn0040273/FBgn0017430/FBgn0004598/FBgn0086346/FBgn0019949/FBgn0020255/FBgn0026441/FBgn0028380/FBgn0015379/FBgn0020388/FBgn0000183/FBgn0036497/FBgn0031456/FBgn0038872/FBgn0014037
dme00061                                                                                                                                                                                                                                     FBgn0283427/FBgn0263120/FBgn0033883/FBgn0029945/FBgn0286723/FBgn0036821
> 
> 
> ## prepare dotplot
> p <- dotplot(KEGG_tmodEnriched, font.size=8, showCategory=8, title =("GSEA results"), split=".sign") + facet_grid(.~.sign)
> print(p)
> 
> ## prepare cnetplot
> ## note that for cnetplot geneList used as input 
> p <- cnetplot(KEGG_tmodEnriched, showCategory=5, cex.params = list(gene_label = 0.8),
+               color.params = list(foldChange = geneList, edge = FALSE) )
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
> 
> print(p)
>

Flybaseid-based dotplot below
image

Flybaseid-based cnetplot below
image

B) after converting into flybaseids:

> ## check B) after converting into flybaseids
> KEGG_tmodEnriched
#
# Gene Set Enrichment Analysis
#
#...@organism    dme 
#...@setType     KEGG 
#...@keytype     KEGG 
#...@geneList    Named num [1:17872] 15 15 15 15 15 ...
 - attr(*, "names")= chr [1:17872] "Dmel_CG9688" "Dmel_CG6641" "Dmel_CG13493" "Dmel_CR45252" ...
#...nPerm        
#...pvalues adjusted by 'BH' with cutoff <1 
#...121 enriched terms found
'data.frame':   121 obs. of  11 variables:
 $ ID             : chr  "dme00071" "dme00010" "dme00280" "dme03250" ...
 $ Description    : chr  "Fatty acid degradation" "Glycolysis / Gluconeogenesis" "Valine, leucine and isoleucine degradation" "Viral life cycle - HIV-1" ...
 $ setSize        : int  33 55 32 33 13 21 32 24 40 29 ...
 $ enrichmentScore: num  0.435 0.37 0.401 0.385 0.518 ...
 $ NES            : num  2.37 2.24 2.18 2.1 2.26 ...
 $ pvalue         : num  0.00216 0.00298 0.01099 0.01113 0.02689 ...
 $ p.adjust       : num  0.18 0.18 0.337 0.337 0.501 ...
 $ qvalue         : num  0.179 0.179 0.334 0.334 0.497 ...
 $ rank           : num  3661 4496 2728 4758 1366 ...
 $ leading_edge   : chr  "tags=42%, list=20%, signal=34%" "tags=45%, list=25%, signal=34%" "tags=31%, list=15%, signal=27%" "tags=52%, list=27%, signal=38%" ...
 $ core_enrichment: chr  "Acsl/yip2/Fdh/hll/Dci/CG3902/Mcad/CG3961/ACOX1/Acadvl/Aldh/CPT2/Aldh7A1/CG4592" "muc/Pgm2b/Fdh/Pepck1/CG2767/Hex-t2/Pgm1/CG10996/CG9961/Aldh/Pdhb/CG7059/Ald2/CG11249/Pgi/Pglym78/Gapdh1/Aldh7A1"| __truncated__ "BckdhB/Hmgs/yip2/Acsf3/CG3902/Mcad/CG15093/Aldh/BckdhA/Mccc2" "Cpsf6/shrb/Spt5/Nelf-E/Fur2/ALiX/Cdk9/Ran/ear/fal/dod/Gcn5/BicD/Ran-like/Tnpo-SR/Nelf-A/Su(Tpl)" ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> as.data.frame(KEGG_tmodEnriched)[1:5,]
               ID                                Description setSize
dme00071 dme00071                     Fatty acid degradation      33
dme00010 dme00010               Glycolysis / Gluconeogenesis      55
dme00280 dme00280 Valine, leucine and isoleucine degradation      32
dme03250 dme03250                   Viral life cycle - HIV-1      33
dme00061 dme00061                    Fatty acid biosynthesis      13
         enrichmentScore      NES      pvalue  p.adjust    qvalue rank
dme00071       0.4352322 2.373871 0.002164406 0.1801904 0.1787012 3661
dme00010       0.3695726 2.241283 0.002978354 0.1801904 0.1787012 4496
dme00280       0.4013693 2.178607 0.010985174 0.3368226 0.3340389 2728
dme03250       0.3854747 2.102481 0.011134630 0.3368226 0.3340389 4758
dme00061       0.5179240 2.264565 0.026885051 0.5007849 0.4966462 1366
                           leading_edge
dme00071 tags=42%, list=20%, signal=34%
dme00010 tags=45%, list=25%, signal=34%
dme00280 tags=31%, list=15%, signal=27%
dme03250 tags=52%, list=27%, signal=38%
dme00061  tags=46%, list=8%, signal=43%
                                                                                                                                                         core_enrichment
dme00071                                                                                  Acsl/yip2/Fdh/hll/Dci/CG3902/Mcad/CG3961/ACOX1/Acadvl/Aldh/CPT2/Aldh7A1/CG4592
dme00010 muc/Pgm2b/Fdh/Pepck1/CG2767/Hex-t2/Pgm1/CG10996/CG9961/Aldh/Pdhb/CG7059/Ald2/CG11249/Pgi/Pglym78/Gapdh1/Aldh7A1/Ald1/G6P/Aldh-III/CG32444/AcCoAS/CG7024/CG10467
dme00280                                                                                                    BckdhB/Hmgs/yip2/Acsf3/CG3902/Mcad/CG15093/Aldh/BckdhA/Mccc2
dme03250                                                                 Cpsf6/shrb/Spt5/Nelf-E/Fur2/ALiX/Cdk9/Ran/ear/fal/dod/Gcn5/BicD/Ran-like/Tnpo-SR/Nelf-A/Su(Tpl)
dme00061                                                                                                                             FASN1/Acsl/CG16935/Acsf3/hll/CG3961
> 
> 
> ## prepare dotplot
> p <- dotplot(KEGG_tmodEnriched, font.size=8, showCategory=8, title =("GSEA results"), split=".sign") + facet_grid(.~.sign)
> print(p)
> 
> 
> 
> ## prepare cnetplot
> ## note that for cnetplot geneList used as input 
> p <- cnetplot(KEGG_tmodEnriched, showCategory=5, cex.params = list(gene_label = 0.8),
+               color.params = list(foldChange = geneList, edge = FALSE) )
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
> 
> print(p)
> 

Flybaseid-based dotplot below
image

Flybaseid-based cnetplot below
image

@zapaterc
Copy link

zapaterc commented Jul 15, 2024

Hi Guido, thank you SO much for your help! however, even when trying your code the issue persists.

preparing geneSet collections...
--> Expected input gene ID: 
Error in check_gene_id(geneList, geneSets) : 
  --> No gene can be mapped....

Not sure how to proceed, as I have tried to re-install clusterProfiler from the YuLab Github direct download and am still getting the error message. Again thanks so much for spending the time helping out!

@guidohooiveld
Copy link

In the post you changed you posted:

> str(forKEGG_NoDups)
 Named int [1:16193] 16212 16211 16210 16209 16208 16207 16206 16205 16204 16203 ...
 - attr(*, "names")= chr [1:16193] "Dmel_CG3588" "Dmel_CG17285" "Dmel_CG7592" "Dmel_CG3763" ...

Based on that output:

  • What do the number represent? Thus 16212, etc? It should be the log2FC or something like that, but it seems to be the entrezid?
  • Also: your object forKEGG_NoDups is a named int(eger) vector, but it rather should be a named num(eric) vector! Thus integer versus numeric! What happens if you use as input for gseKEGG the object as.numeric(forKEGG_NoDups)? Note: for now you ignore the exact 'biological meaning' of the number...
  • Finally, if you are comfortable with it, feel free to share / attached the file myResDF_IDs_ranked.csv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

download_KEGG produces no obs for KEGGPATHID2EXTID
7 participants