@thomasyu888 @gkowalski In the current design of the infrastructure, the Orchestrator gets a page (as in "paginated response") of clinical notes from a Data Node (e.g. 50 clinical notes), sends them to the NLP Tool being evaluated, receives the results, and repeats with the next page of clinical notes. In addition to letting us control the flow of information to the NLP Tool, which limits its memory needs, this approach lets us evaluate and ideally report the following metrics (a rough sketch of this loop follows the list):
Completion rate: number of notes processed / number of notes in the dataset
Time required to process a clinical note (average, std)
the timer starts after the request has been sent (clinical notes sent to the NLP Tool)
the timer stops when all the responses have been received from the NLP Tool for the clinical notes sent
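For illustration, here is a minimal sketch of that loop and how the two metrics could be computed. `fetch_page` and `annotate` are hypothetical placeholders for the Data Node and NLP Tool API clients, not the actual interfaces:

```python
import statistics
import time


def evaluate(fetch_page, annotate, total_notes, page_size=50):
    """Pull notes page by page, send each page to the NLP Tool, track metrics."""
    per_note_seconds = []
    processed = 0
    offset = 0

    while offset < total_notes:
        notes = fetch_page(offset=offset, limit=page_size)
        if not notes:
            break

        start = time.monotonic()            # timer starts when the request is sent
        annotations = annotate(notes)       # blocks until all responses are back
        elapsed = time.monotonic() - start  # timer stops when responses received

        # Individual note timings are not visible because the whole page is sent
        # at once, so each note in the page is credited the page average.
        per_note_seconds.extend([elapsed / len(notes)] * len(notes))
        processed += len(notes)
        offset += page_size

        print(f"completion rate: {processed / total_notes:.1%}")

    return {
        "completion_rate": processed / total_notes,
        "mean_seconds_per_note": statistics.mean(per_note_seconds),
        "std_seconds_per_note": statistics.pstdev(per_note_seconds),
    }
```

The completion rate computed after each page could be pushed to a log/ELK rather than printed, which would support the reporting ideas below.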
The motivation for reporting the completion rate to the user is that it allows them to better predict when the results will be available. The user can also use it to track whether the tool is taking too much time to complete. For the staff maintaining a Data Hosting Site, it would be nice to have a report in ELK that shows the Tools being evaluated and their completion rate.
The motivation for reporting information about the processing time is that a hospital browsing an NLP Sandbox Leaderboard for a tool to use in production may identify that a Tool would take too much time to process their volume of clinical notes. One option could be to extrapolate and show the time required to process 1 million notes. It's important to note that any timing information is very much dependent on the spec of the infrastructure used (number and frequency of CPU cores, etc.), so we should be able to provide information about the spec used when reporting timing information. Note that this spec may vary from one Data Hosting Site to another, in which case we would probably want to report the time for each dataset / Data Hosting Site used to evaluate an NLP Tool.
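As a rough illustration of that extrapolation, assuming we report the mean per-note time measured above (the field names and the spec format here are made up):

```python
def extrapolated_report(mean_seconds_per_note, spec):
    """Estimate the time to process 1 million notes and attach the infra spec."""
    one_million = 1_000_000
    return {
        "estimated_hours_per_million_notes": mean_seconds_per_note * one_million / 3600,
        "infrastructure_spec": spec,  # e.g. CPU model, core count, frequency, RAM
    }


report = extrapolated_report(
    mean_seconds_per_note=0.12,
    spec={"cpu_cores": 4, "cpu_freq_ghz": 2.5, "ram_gb": 16},
)
```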
One important distinction to make is that currently the orchestrator doesn't do any of those things. The workflow that you see in this repository would be in charge of doing those things and the orchestrator is responsible for connecting participant submissions with this workflow.
One of my biggest concerns is that there isn't an "elegant" way with CWL to
Get 50 notes
Process 50 notes
Annotate with metrics for those 50
Repeat step 1 until finished
Currently the workflow would be (roughly sketched after the list):
Get a million notes but split them into chunks of 50
Process the chunks of 50 in parallel and annotate metrics.
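Not CWL, but here is the shape of that approach sketched in Python for illustration; the real workflow would express the scatter in CWL:

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(notes, size=50):
    """Split the full list of notes into pages of `size` notes."""
    for i in range(0, len(notes), size):
        yield notes[i:i + size]


def run_all(notes, annotate):
    chunks = list(chunked(notes))          # step 1: split everything up front
    with ThreadPoolExecutor() as pool:     # step 2: process chunks in parallel
        results = list(pool.map(annotate, chunks))
    # metrics (e.g. time per chunk) would be annotated alongside each result
    return results
```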
I think the above metrics are obtainable; it will just take an example submission to figure out what is and isn't possible.
Step 3 would be out of the loop: we process all the clinical notes and then we evaluate the performance. The loop over steps 1-2 could be implemented in the NLP Sandbox Client as one command.
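A hypothetical sketch of what such a command could look like; the command name and flags are invented here, not the client's actual CLI:

```python
import argparse


def main():
    parser = argparse.ArgumentParser(prog="nlp-sandbox-client annotate-dataset")
    parser.add_argument("--data-node-url", required=True)
    parser.add_argument("--tool-url", required=True)
    parser.add_argument("--page-size", type=int, default=50)
    args = parser.parse_args()

    # Steps 1-2: page through the Data Node and send each page to the NLP Tool
    # (see the evaluation loop sketched earlier). Step 3 (scoring the collected
    # annotations) would run as a separate command once everything is processed.
    ...


if __name__ == "__main__":
    main()
```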