Feature/ml pipeline #29

Merged
merged 25 commits into from
Apr 5, 2024

@nmenezes0 nmenezes0 commented Mar 19, 2024

Context

Add the ML pipeline to generate the themes/topics and save them to the DB. For each question in a consultation, we need to classify the free-text responses into topics (also called "themes") and save this information to the DB.

Out of scope for this PR:

  • Adding the LLM calls to generate the theme summaries will be done separately.
  • Evaluation/improvement of the modelling approach - at the moment, we just want to implement the existing workflow that was used for previous consultations.
  • Dealing with outliers - at the moment they are just saved like any other topic (as we do still want them).

Changes proposed in this pull request

  • Adding BERTopic and other ML packages.
  • Adding a question field to the Theme model. This feels like a bit of duplication, as there is also a question field on the Answer model, but it allows us to add uniqueness criteria and simplifies the logic. The question is only assigned to the answer and theme once, so it isn't going to change and get out of date.
  • Adding the ML pipeline to assign topics ("themes") for each question - using BERTopic.
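
To illustrate why a question field on the theme makes a uniqueness rule possible, here is a minimal, framework-neutral sketch using Python's built-in `sqlite3`. It is not the schema from this PR: the table layout, the `topic_id` and `keywords` columns, the choice of `(question_id, topic_id)` as the unique key, and the helper `get_or_create_theme` are all assumptions made for the example.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical question/theme/answer tables (illustrative only)."""
    conn.executescript(
        """
        CREATE TABLE question (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE theme (
            id INTEGER PRIMARY KEY,
            question_id INTEGER NOT NULL REFERENCES question(id),
            topic_id INTEGER NOT NULL,
            keywords TEXT NOT NULL,
            -- Assumed uniqueness rule: one row per topic per question.
            UNIQUE (question_id, topic_id)
        );
        CREATE TABLE answer (
            id INTEGER PRIMARY KEY,
            question_id INTEGER NOT NULL REFERENCES question(id),
            free_text TEXT NOT NULL,
            theme_id INTEGER REFERENCES theme(id)
        );
        """
    )


def get_or_create_theme(
    conn: sqlite3.Connection, question_id: int, topic_id: int, keywords: str
) -> int:
    """Return the id of the theme for (question, topic), inserting it if absent."""
    # The UNIQUE constraint turns a repeated insert into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO theme (question_id, topic_id, keywords) "
        "VALUES (?, ?, ?)",
        (question_id, topic_id, keywords),
    )
    row = conn.execute(
        "SELECT id FROM theme WHERE question_id = ? AND topic_id = ?",
        (question_id, topic_id),
    ).fetchone()
    return row[0]
```

Without the question on the theme, the same topic number produced for two different questions could not be told apart at the database level, so the lookup would have to go through the answers instead.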

ML pipeline overview for a given question (with free text that needs to be classified):

  • Get the answers for a given question i.e. the free text responses corresponding to that question.
  • Get embeddings for the list of free text responses i.e. representations of each response as vectors.
  • Using these embeddings, generate a topic model, i.e. cluster the vectors above so that each response is classified into a topic.
  • Save each of these topics/themes to the DB, and save the theme for each answer.
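
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the names `ThemeAssignment`, `assign_themes`, `embed_with_sentence_transformers` and `fit_bertopic` are hypothetical, the embedding model `all-MiniLM-L6-v2` is an assumed choice, and saving to the DB is left to the caller. The embedding and topic-model steps are passed in as functions so that the flow can be exercised without loading any models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class ThemeAssignment:
    answer_id: int
    # BERTopic labels outliers as topic -1; they are kept like any other topic.
    topic_id: int


def embed_with_sentence_transformers(texts: Sequence[str]):
    """Embed free-text responses as vectors (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(list(texts))


def fit_bertopic(texts: Sequence[str], embeddings) -> List[int]:
    """Cluster the embeddings into topics with BERTopic's default settings."""
    from bertopic import BERTopic  # imported lazily

    topic_model = BERTopic()
    topics, _probs = topic_model.fit_transform(list(texts), embeddings=embeddings)
    return list(topics)


def assign_themes(
    answers: Sequence[Tuple[int, str]],
    embed: Callable = embed_with_sentence_transformers,
    fit_topics: Callable = fit_bertopic,
) -> List[ThemeAssignment]:
    """Run the pipeline for one question.

    `answers` is a list of (answer_id, free_text) pairs for that question.
    """
    answer_ids = [answer_id for answer_id, _ in answers]
    texts = [text for _, text in answers]
    embeddings = embed(texts)                   # responses -> vectors
    topic_ids = fit_topics(texts, embeddings)   # vectors -> topic per response
    return [
        ThemeAssignment(answer_id, int(topic_id))
        for answer_id, topic_id in zip(answer_ids, topic_ids)
    ]
```

The caller would then create one theme per distinct `topic_id` for the question and set the theme on each answer. Swapping in stub functions for `embed` and `fit_topics` is one way to test that the whole flow runs without downloading a model.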

Follows the first part of this: https://github.com/i-dot-ai/ova-consultation/blob/main/run_full_analysis_pipeline.py (not the LLM summary bit). BERTopic docs are pretty good too: https://maartengr.github.io/BERTopic/.

Guidance to review

Check the pipeline runs and things are saved to the DB as appropriate. This is covered in the tests, but check you are happy that the tests exercise the entire pipeline and verify that themes are saved to the DB.

Check the ML pipeline gets topics as per Jonah's code: https://github.com/i-dot-ai/ova-consultation/blob/main/run_full_analysis_pipeline.py

Link to JIRA ticket

https://technologyprogramme.atlassian.net/browse/CON-47

Things to check

  • I have added any new ENV vars in all deployed environments and updated the .env.example and .env.test files in the repo

@nmenezes0 nmenezes0 marked this pull request as draft March 19, 2024 23:50
@nmenezes0 nmenezes0 force-pushed the feature/ml-pipeline branch 2 times, most recently from 5bdbbfb to 2c8f09f Compare March 20, 2024 17:38
@nmenezes0 nmenezes0 force-pushed the feature/ml-pipeline branch 2 times, most recently from f9626bb to b37e273 Compare March 25, 2024 13:24
@nmenezes0 nmenezes0 force-pushed the feature/ml-pipeline branch from 139e02b to 58f641b Compare March 29, 2024 20:12
@nmenezes0 nmenezes0 changed the title Work in progress: Feature/ml pipeline Feature/ml pipeline Apr 2, 2024
@nmenezes0 nmenezes0 marked this pull request as ready for review April 2, 2024 09:12
@nmenezes0 nmenezes0 force-pushed the feature/ml-pipeline branch from 58f641b to 4f4b443 Compare April 2, 2024 09:13
@nmenezes0 nmenezes0 force-pushed the feature/ml-pipeline branch 2 times, most recently from 977ef03 to 84124e7 Compare April 4, 2024 11:14

@rachaelcodes rachaelcodes left a comment

I've a couple of clarifying questions, but otherwise LGTM

@nmenezes0 nmenezes0 force-pushed the feature/ml-pipeline branch from fdd0ad5 to 1f15c6c Compare April 5, 2024 13:08
@nmenezes0 nmenezes0 merged commit c308e03 into main Apr 5, 2024
6 checks passed
4 participants