
1 data point per document vs 1 data point per package #18

Open

Bisaloo opened this issue Sep 16, 2024 · 17 comments

Labels: question (Further information is requested)

Bisaloo (Collaborator) commented Sep 16, 2024

This again came up in a discussion with @avinashladdha:

Do we want to have 1 data point per document or 1 data point per package? How to make this happen?

From a user point of view, it probably makes more sense to have a single data point (single point on the map & single search answer) per package.

Currently, we have multiple documents per package, so if we want a single point per package, how can we achieve this?

I don't know if it makes sense to concatenate all documents to get a single document per package, as we may end up averaging points with a large amount of variability and obtain an average that is not meaningful.

@avinashladdha mentioned we could have a post-process deduplication step where we keep only the best score / best matching document for each package. Are there any downsides to this approach? How could we apply something similar to the map?

Bisaloo added the question label on Sep 16, 2024
avinashladdha commented Sep 16, 2024

> From a user point of view, it probably makes more sense to have a single data point (single point on the map & single search answer) per package.

Currently, we return the final response at the package level rather than the document/module level, so the user will get one data point with either approach.

a. 1-document-1-embedding approach:
Calculate the similarity score for each document in a package and keep the highest score. The top 3 scores (and thus the relevant package names) across all packages are returned as the final output.

b. 1-package-1-embedding approach:
Calculate a similarity score for each package (which requires aggregating embeddings across all of its documents) and return the top 3 results.
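A minimal sketch of the two strategies, assuming one embedding per document already exists; the names (`query_emb`, `doc_embeddings`, `package_of_doc`) are illustrative, not the project's actual code:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def top3_per_document(query_emb, doc_embeddings, package_of_doc):
    """a. Score every document, keep the best score per package."""
    best = {}
    for emb, pkg in zip(doc_embeddings, package_of_doc):
        score = cosine(query_emb, emb)
        best[pkg] = max(score, best.get(pkg, -1.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:3]

def top3_per_package(query_emb, doc_embeddings, package_of_doc):
    """b. Aggregate (here: average) document embeddings per package first."""
    grouped = {}
    for emb, pkg in zip(doc_embeddings, package_of_doc):
        grouped.setdefault(pkg, []).append(emb)
    scores = {pkg: cosine(query_emb, np.mean(embs, axis=0))
              for pkg, embs in grouped.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
```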

avinashladdha commented:

Based on what is more relevant for the Epi package search, we can take either route.

From the embedding/visualisation point of view, I would note:
a. 1 embedding for 1 document:
Concatenating all documents might not be the best approach, as it will dilute individual document nuances, and we might need to break the text into chunks if the combined document exceeds a size threshold, in which case we will need to aggregate the chunk embeddings anyway.

Other approaches to aggregating embeddings for a single package could be the following (see the sketch after this list):

  1. Averaging embeddings for a package across the constituent documents.
  2. Better still, average the vectors and then subtract the first principal component, which reduces the dominance of common words and may retain more meaningful semantic information.
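A minimal sketch of option 2, in the spirit of the SIF post-processing of Arora et al. (average, then remove the projection onto the first principal component); the variable names are illustrative assumptions:

```python
import numpy as np

def package_vectors(doc_embeddings, package_of_doc):
    """Average document embeddings per package."""
    grouped = {}
    for emb, pkg in zip(doc_embeddings, package_of_doc):
        grouped.setdefault(pkg, []).append(emb)
    pkgs = sorted(grouped)
    return pkgs, np.stack([np.mean(grouped[p], axis=0) for p in pkgs])

def remove_first_pc(vectors):
    """Subtract each vector's projection onto the first principal component."""
    centered = vectors - vectors.mean(axis=0)
    # The first right-singular vector of the centered matrix is the
    # direction of the first principal component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    return vectors - np.outer(vectors @ pc1, pc1)
```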

Bisaloo reopened this on Sep 16, 2024
Bisaloo (Collaborator, Author) commented Sep 23, 2024

This was discussed today with the WHO Collaboratory team at our monthly stand-up, and there was no strong push either way.

There seemed to be a slight preference for returning a specific tool (i.e., 1 data point per package), with the caveat that we should then also indicate which document led to the high score.

One option may also be to create and present both alternatives and see which one gathers more positive user feedback.

paulkorir commented Sep 25, 2024

I also agree that we should have results drill down to the module level, though I'm not sure how many embeddings this implies. From my experiments (https://github.com/paulkorir/working-with-embeddings/blob/master/experiments.py), I would imagine that you will only have one embedding model. I could be wrong.

paulkorir commented:
OK. I've been getting up to speed with the topic, and it seems to me that applying an embedding model to a set of documents results in a set of vectors. There is only ever one embedding model at play. This embedding model can operate at the level of the word, sentence, or document. Therefore, the decision to be made is which level of embedding will be most useful. In my opinion, we should try them all and examine the results to select the best one.
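As an illustration, a minimal sketch of encoding at different granularities, assuming a sentence-transformers model (the project may well use a different one); the text and splitting are deliberately crude:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = "finalsize calculates the final size of an epidemic. It accepts contact and demography data."
sentences = document.split(". ")   # naive sentence split, for illustration only

doc_vector = model.encode(document)             # one vector for the whole document
sent_vectors = model.encode(sentences)          # one vector per sentence
word_vectors = model.encode(document.split())   # crude word-level vectors
```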

chartgerink (Member) commented:
For what it's worth, the 2D map as we have talked about it until now has always been at the package level. If we do it at the document level, we may or may not get a cluster per package. 👍

paulkorir commented:
I hear you. I believe it may be most useful for the user to see the viz in terms of the final tool they need, not necessarily the package. In any case, it would be useful for the user to toggle between the package and module levels. At the package level, they will know what to install; at the module level, they will know which function to run.

Bisaloo (Collaborator, Author) commented Sep 26, 2024

I don't think we can identify specific functions with the initial infrastructure because the source data (= the documentation we feed to the language model) is not structured by function.

What can be done is what Dina proposed: we return the package name and a link to the source document that led us to return this result. From there, the user can read the document and see how to perform their task, which will often be a combination of steps/functions.

In a future version, we can try to make a "best guess" at the function call(s) that perform the queried task, but I believe that is a distinct issue, likely one that will require the generative features of our language model. We can open a new issue to track this.

paulkorir commented:
Can this be solved at the level of documentation extraction? It could be substantially easier to do this during data extraction than downstream during search.

Bisaloo (Collaborator, Author) commented Sep 26, 2024

> Can this be solved at the level of documentation extraction?

No, because a large portion of the source documents do not present the tool function by function but by task or topic, and these tasks usually involve multiple functions.

See for example https://epiforecasts.io/EpiNow2/articles/estimate_infections_workflow.html or https://epiverse-trace.github.io/finalsize/articles/finalsize.html

paulkorir commented:
I see. That makes sense. However, I thought that the reference documentation (e.g. https://epiverse-trace.github.io/finalsize/reference/dot-final_size.html) would also be included. These would be at the function level.

paulkorir commented:
This one is even better, and it pertains to a single function: https://epiverse-trace.github.io/finalsize/reference/final_size.html.

Bisaloo (Collaborator, Author) commented Sep 26, 2024

> I thought that the reference documentation (e.g. epiverse-trace.github.io/finalsize/reference/dot-final_size.html) would also be included.

Yes, both this and the other type of document I shared are included, but it is unclear which ones will usually lead to better results. This is why I propose we delay this specific feature until we have good results at the package level and can identify which type of document (reference manual or articles/vignettes) produced those results.

Since it seems we are slightly deviating from the initial conversation, I have opened #21.

In this issue, let's try to stick to the initial question: how do we go from multiple documents per package to 1 point per package? Should we concatenate documents before feeding them to the LM? Should we compute embeddings per document but only return the one with the highest score? Etc.

Bisaloo added this to the search 0.1.0 milestone on Sep 26, 2024
Bisaloo closed this as completed by moving to Done in Epiverse phase 2 on Sep 26, 2024
Bisaloo reopened this on Sep 26, 2024
Bisaloo (Collaborator, Author) commented Oct 17, 2024

The current approach that Avinash is using to summarise the multiple documents into a single data point is to average the embeddings.

I had a quick look at the approach using a PCA to generate the map (to be refined in epiverse-connect/epiverse-map#10), and the documents for a given package (one colour per package on the plot below) are spread across the map. I'm therefore afraid that the averages will not represent what we want.
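For reference, a minimal sketch of the projection step behind such a map, assuming per-document embeddings in a `doc_embeddings` array with one package label per row; the names are illustrative, not the actual pipeline:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_document_map(doc_embeddings, package_of_doc):
    """Project (n_docs, dim) embeddings to 2D and colour points by package."""
    coords = PCA(n_components=2).fit_transform(doc_embeddings)
    for pkg in sorted(set(package_of_doc)):
        mask = np.asarray(package_of_doc) == pkg
        plt.scatter(coords[mask, 0], coords[mask, 1], s=10, label=pkg)
    plt.legend(fontsize="small")
    plt.show()
```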

[Figure: 2D PCA projection of document embeddings, one colour per package; documents from the same package are spread across the map]

I wonder if we could have better results by:

avinashladdha commented:
The primary concern with concatenating documents is that the resulting embeddings will be heavily influenced by the number of documents in each folder. Folders with more documents will have a disproportionate impact on the overall representation.
I am also unsure how to interpret two matrices when one has 5000 data points (matrix A) and the other has 1000 data points (matrix B): to make them the same dimensions (required for computation in the next steps), we would pad the smaller one with 4000 zeros.

paulkorir commented:
Noted. However, the search process finds the set of vectors that match the search vector most closely, and provided the resulting embeddings are non-random (they should be, given the encoded semantic content), the number of embeddings per folder should not be a problem. It would be useful to check whether the search results are distorted by a disproportionate number of embeddings.

Bisaloo (Collaborator, Author) commented Oct 22, 2024

> Folders with more documents will have a disproportionate impact on the overall representation.

I don't follow why this would be the case. As long as we have a single vector per package, all packages should have the same weight, no?
