
knowledge-collaboratory: meta_knowledge_graph/ endpoint in production does not respond (504 Gateway Time-out) #593

Closed
isbluis opened this issue Oct 19, 2023 · 23 comments
Labels: bug (Something isn't working)


isbluis commented Oct 19, 2023

ARAX is currently unable to expand to knowledge-collaboratory in production, since the published endpoint (https://collaboratory-api.transltr.io/meta_knowledge_graph) times out and does not return a valid meta_knowledge_graph.

To repro: wget https://collaboratory-api.transltr.io/meta_knowledge_graph

Or try via the OpenAPI page:
https://smart-api.info/ui/89054eff6ee6d91641d278d9ffdb3993#/trapi/Get_the_meta_knowledge_graph_of_the_Nanopublication_network_meta_knowledge_graph_get
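For anyone who prefers to script the check, a probe along these lines (just a sketch using the Python requests library; the 60 s ceiling is an arbitrary choice for illustration, not an ARAX setting) reports the status code and elapsed time:

```python
import time

import requests

URL = "https://collaboratory-api.transltr.io/meta_knowledge_graph"

# Probe the endpoint once with a generous client-side timeout
# (60 s is an arbitrary ceiling for illustration, not an ARAX setting).
start = time.monotonic()
try:
    response = requests.get(URL, timeout=60)
    elapsed = time.monotonic() - start
    print(f"HTTP {response.status_code} after {elapsed:.1f} s")
except requests.exceptions.Timeout:
    print("request timed out after 60 s")
```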

isbluis added the bug label Oct 19, 2023
@sstemann

@CaseyTa is this something your team can take a look at?


CaseyTa commented Oct 19, 2023

@vemonet Could you please check deployments to Test and Prod? I see CI is responsive for /meta_knowledge_graph, /query, and /health, but both Test and Prod are failing with 504 gateway timeouts.


isbluis commented Oct 19, 2023

Hi @CaseyTa -- I notice that the /query endpoint in production has returned a 504 error for about 25% of ARAX's calls over the past month or so -- maybe a related issue?

[screenshot: ARAX KP test results showing 504 Gateway Time-out responses from the production /query endpoint]

(the above is taken from: https://arax.ncats.io/devLM/kptest/ with "production" selected)


vemonet commented Oct 20, 2023

Hi, I recently fixed this time-out issue in dev and ITRB CI, as @CaseyTa mentioned, but I don't think the fix has been pushed to test and prod yet.

I am making the request right now


CaseyTa commented Nov 1, 2023

Fixed in test and prod now. Thanks, all!

CaseyTa closed this as completed Nov 1, 2023

isbluis commented Nov 28, 2023

A similar issue has been happening on CI (with a response of 502: Bad Gateway) for the past week or so:

wget https://collaboratory-api.ci.transltr.io/meta_knowledge_graph

[screenshot: wget output showing a 502 Bad Gateway response from the CI endpoint]

isbluis reopened this Nov 28, 2023

CaseyTa commented Nov 28, 2023

@isbluis Thanks!

@vemonet I'm seeing 502 on the /health endpoint for CI and "status": "ok" on TEST and PROD. Could you please check on CI? Thanks!


isbluis commented Jan 17, 2024

Hi @CaseyTa. Just following up on this in anticipation of next week's Relay, as this issue is still present in CI (we've logged over 700 failures in the past 3 days).

Let me know if you need any info from our team. Thanks!


vemonet commented Jan 17, 2024

Hi @isbluis, it seems like https://collaboratory-api.ci.transltr.io is permanently returning 502 Bad Gateway.

I am not sure why, though; the latest commit deployed on our development server does not show the same error: https://api.collaboratory.semanticscience.org/docs

It would be nice if we could check ourselves which commit is deployed on CI, trigger deployment of the latest commit ourselves (it seems CI does not always automatically redeploy the latest commit), and, ideally, see the logs of the container.


isbluis commented Jan 18, 2024

Hi @vemonet. Yeah, we have also had some issues in ARAX trying to get visibility into what is happening in those CI instances. I am not on the ARS team, nor do I configure servers, so I am of limited help in this regard. Maybe request a bounce and see if that fixes it? (Yeah, the typical IT "solution" :) ) Thanks.


isbluis commented Jan 26, 2024

This appears to be resolved. Closing.

isbluis closed this as completed Jan 26, 2024

isbluis commented Feb 23, 2024

Hello again,
A similar issue has been happening in test for the past week or so -- looks like an incorrect URL? (the one specified in SmartAPI is https://collaboratory-api.test.transltr.io/meta_knowledge_graph )

[screenshot: failed meta_knowledge_graph request against the TEST endpoint]

isbluis reopened this Feb 23, 2024

isbluis commented Mar 6, 2024

Hi @CaseyTa. Just following up on this in anticipation of this week's closing of the code window, as this issue is still present in TEST (we've logged almost 3500 failures since Feb 16).


CaseyTa commented Mar 11, 2024

@isbluis Thanks for the notice. ITRB were able to resolve the issue for us.

CaseyTa closed this as completed Mar 11, 2024

isbluis commented Mar 13, 2024

Hi @CaseyTa. It appears that the issue is still present, as ARAX repeatedly fails to retrieve a meta_knowledge_graph from the TEST, CI, and sometimes even PRODUCTION instances. The problem seems to be that the response takes a very long time, so we time out and move on without expanding.

As an example, you can try this on the command-line:
time wget https://collaboratory-api.test.transltr.io/meta_knowledge_graph
In my testing, this takes over half a minute each time:
[screenshot: time wget output showing the request taking over half a minute]

I don't know the source of the latency, but this appears specific to infores:knowledge-collaboratory.
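To tell consistent slowness apart from intermittent timeouts, a rough timing loop along these lines (just a sketch using the Python requests library; the sample size and the 120 s ceiling are arbitrary choices, not ARAX settings) can summarize a few consecutive calls:

```python
import statistics
import time

import requests

URL = "https://collaboratory-api.test.transltr.io/meta_knowledge_graph"
ATTEMPTS = 5  # small sample size, purely for illustration

timings = []
for i in range(ATTEMPTS):
    start = time.monotonic()
    try:
        r = requests.get(URL, timeout=120)
        timings.append(time.monotonic() - start)
        print(f"attempt {i + 1}: HTTP {r.status_code} in {timings[-1]:.1f} s")
    except requests.exceptions.RequestException as exc:
        print(f"attempt {i + 1}: failed ({exc})")

if timings:
    print(f"median: {statistics.median(timings):.1f} s over {len(timings)} successful calls")
```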

Hope this helps!

isbluis reopened this Mar 13, 2024

CaseyTa commented Mar 13, 2024

@vemonet Could we add a cache to the meta_knowledge_graph endpoint?


vemonet commented Apr 4, 2024

Hi, every time I check the metaKG it consistently takes 3 to 4 s to respond.

But since it queries the public Nanopub network endpoint, the time may vary depending on the load.

Yes, @CaseyTa, you can add a cache for the meta_kg endpoint.


CaseyTa commented Apr 4, 2024

@vemonet Thanks, I did test the meta_kg endpoints a few weeks ago and observed response times consistently around 30 sec, but I am seeing 3-5 sec response times right now. Will see if we can get caching added in this or next sprint cycle.
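For reference, a minimal sketch of the caching idea, assuming a FastAPI route and a hypothetical build_meta_kg() helper standing in for the existing Nanopub query logic (all names here are illustrative, not the actual Knowledge Collaboratory code); a simple in-process TTL cache lets repeat requests return immediately:

```python
import time
from typing import Any

from fastapi import FastAPI

app = FastAPI()

CACHE_TTL_SECONDS = 60 * 60  # 1 hour; an arbitrary choice, tune as needed
_cache: dict[str, Any] = {"expires_at": 0.0, "value": None}


def build_meta_kg() -> dict[str, Any]:
    # Hypothetical stand-in for the existing (slow) logic that queries the
    # public Nanopublication network and assembles the meta knowledge graph.
    return {"nodes": {}, "edges": []}


@app.get("/meta_knowledge_graph")
def get_meta_knowledge_graph() -> dict[str, Any]:
    # Serve the cached meta KG while it is fresh; rebuild it only when the
    # TTL has expired, so most requests respond immediately.
    now = time.monotonic()
    if _cache["value"] is None or now >= _cache["expires_at"]:
        _cache["value"] = build_meta_kg()
        _cache["expires_at"] = now + CACHE_TTL_SECONDS
    return _cache["value"]
```

The TTL value is arbitrary; building the meta KG once at startup or on a scheduled refresh would work just as well.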


CaseyTa commented May 17, 2024

We will track this issue in the Knowledge Collab repo so that we can take it off the hands of the TAQA group.

CaseyTa closed this as completed May 17, 2024

isbluis commented May 31, 2024

This has been an issue again in CI and dev since about May 15. The other two environments (prod and test) appear to be fine.


isbluis commented Jun 17, 2024

This is currently happening in PRODUCTION as of this weekend. Perhaps related to TRAPI 1.5 vs. 1.4?

isbluis reopened this Jun 17, 2024
CaseyTa assigned micheldumontier and unassigned vemonet Jul 10, 2024

CaseyTa commented Jul 10, 2024

MaastrichtU-IDS/knowledge-collaboratory#17 addresses this issue. Tested working in CI.


CaseyTa commented Jul 11, 2024

Fixes deployed and working in ITRB-Test now.
