[DOC] Make Joiner example 07 more compelling #1148

Open
Vincent-Maladiere opened this issue Nov 20, 2024 · 0 comments
Labels
documentation Add or improve the documentation no changelog needed

Describe the issue linked to the documentation

As @jeromedockes mentioned in #1145, example 07 has several flaws:

As a second step, we can work on improving that example, maybe using a different dataset. Indeed, at the moment:
- it downloads a very large dataset
- the join with weather data does not improve predictions significantly IIRC
- the predictions are not super far from chance level which makes the example less compelling IMO

Suggest a potential alternative/fix

We could try to either:

  • Find another dataset on which fuzzy joining would boost performance
  • Change the task to joining itself, without focusing on learning. A common use case for people working with databases is building "proxy keys" from several columns. This technique helps join tables that come from different storage systems or databases and share no foreign key. Even without a downstream learning task, we can quantify the quality of the fuzzy join with, e.g., precision and recall, counting incorrect joins as false positives and missed joins as false negatives.
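
To make the second option concrete, here is a minimal sketch of scoring a fuzzy join with precision and recall. Everything in it is an assumption for illustration: the toy tables, the `proxy_key`, `fuzzy_join_pairs` and `evaluate_join` helpers are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for skrub's own fuzzy matching, so this is not how `Joiner` works internally.

```python
from difflib import SequenceMatcher

# Hypothetical toy tables from two systems that share no foreign key.
left = [
    {"id": "L1", "name": "Acme Corp", "city": "Paris"},
    {"id": "L2", "name": "Globex", "city": "Berlin"},
    {"id": "L3", "name": "Initech", "city": "Austin"},
    {"id": "L4", "name": "Umbrella", "city": "Lyon"},
]
right = [
    {"id": "R1", "name": "ACME Corporation", "city": "paris"},
    {"id": "R2", "name": "Globex GmbH", "city": "Berlin"},
    {"id": "R3", "name": "Initrode", "city": "Austin"},
    {"id": "R4", "name": "UMB. Intl", "city": "Lyon"},
]
# Hand-labelled ground truth (assumed): Initech and Initrode are different
# companies, while "Umbrella" and "UMB. Intl" are the same one.
true_pairs = {("L1", "R1"), ("L2", "R2"), ("L4", "R4")}


def proxy_key(row):
    """Build a normalized proxy key from several columns."""
    return f"{row['name']} {row['city']}".lower().strip()


def fuzzy_join_pairs(left, right, threshold):
    """Pair each left row with its most similar right row, if similar enough."""
    pairs = set()
    for l_row in left:
        scored = [
            (SequenceMatcher(None, proxy_key(l_row), proxy_key(r_row)).ratio(),
             r_row["id"])
            for r_row in right
        ]
        best_score, best_id = max(scored)
        if best_score >= threshold:
            pairs.add((l_row["id"], best_id))
    return pairs


def evaluate_join(predicted, truth):
    """Precision and recall of predicted (left_id, right_id) pairs."""
    true_positives = len(predicted & truth)  # correct joins
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


predicted_pairs = fuzzy_join_pairs(left, right, threshold=0.8)
precision, recall = evaluate_join(predicted_pairs, true_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Sweeping `threshold` and plotting precision against recall would give the example a quantitative story without any downstream learner; in a real version of example 07, the matching step would come from skrub rather than `difflib`.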