Add currency tests #53
Conversation
Hey @ivsanro1! It seems that in most cases the support is 1. What do you think about considering those as test cases instead? These examples seem very similar to the existing test cases we have (https://github.com/scrapinghub/price-parser/blob/master/tests/test_price_parsing.py), but with more variety in currencies.
My main worry about these things is the workflow. If we go with an evaluation approach, we should figure out how contributors should be using it. Probably we should run it on CI. But it's less clear to me what to do with the results.
Hey @kmike, thanks for the suggestion! I also thought about this, but how would we handle the failing cases? Should we add them in a different test and mark them as xfail? I think it would be nice to easily add cases that we know we fail until they're fixed.
Good point. I also think putting it in CI would be the best option if we decide to go with the evaluation and not move it to tests.
This sounds good! That's exactly how the current testing suite works: each example is a separate test (via pytest.mark.parametrize), and the ones which are failing are marked as xfail. The existing testing suite was actually created in a similar way: a few hundred example input strings from a dataset were annotated and converted to tests. Then there was a round of improving the parsing rules, and after that another set of examples was added to control the overfitting (mostly to get an impression; that's not very scientific).
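For reference, a minimal sketch of this parametrize-plus-xfail pattern could look like the following. The names and example strings here are illustrative only; the real tests/test_price_parsing.py is organized differently.

```python
# Simplified sketch of the "parametrize + xfail" pattern described above.
import pytest
from price_parser import Price

CURRENCY_EXAMPLES = [
    ("22,90 €", "€"),
    ("$ 1.99", "$"),
    # A hypothetical case the parser is assumed not to handle yet: it stays
    # in the suite as an expected failure and documents the desired behaviour.
    pytest.param("two dollars and fifty cents", "$", marks=pytest.mark.xfail),
]

@pytest.mark.parametrize("text,currency", CURRENCY_EXAMPLES)
def test_currency_extraction(text, currency):
    # Each example becomes its own test; fixing an xfail case later only
    # requires removing the xfail mark.
    assert Price.fromstring(text).currency == currency
```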
Unlike with standard ML problems, I'm not too worried about overfitting here, i.e. it seems fine to fix xfail tests one by one, as in most cases it'd just mean adding more aliases for currencies. It could still be good to do a second round afterwards, e.g. to confirm that the accuracy improves after the first round of fixes.
By the way, in the current testing suite there are a lot of overlapping or similar examples; that's because many websites show prices in a similar way. So it's more like a dataset than a hand-crafted testing suite, with a few exceptions. This allows us to make trade-off decisions: it could be the case that a rule fixes some examples and breaks others; we can use the test counts to help us decide which way to go in such situations.
I see! Thanks for the explanation 👍 Then, instead of adding this evaluation script, I'll add all the examples as tests, and mark the ones that fail as xfail. Please feel free to close the PR if you consider it appropriate.
You can also implement the new approach in this PR if you want, instead of creating a separate one. |
Nice tests, thanks @ivsanro1! 👍
Those were the last tests I wanted to add for this PR. I was also thinking of using this PR to implement these cases and then move the tests from xfail to pass, but it might take some time. Should we merge it and add the cases to price-parser in another PR, @kmike @Gallaecio?
Sounds good to me.
Before merging, could you please ensure that all tests pass, @ivsanro1? I think some of them currently fail.
Yes, of course! I made the changes in 7529b2e. These mypy errors were showing:

I didn't see any particular problems with these lines. I saw we could update […]. Also, I added the […]. Now everything is passing:
It may make sense to look into addressing #52 before merging here. |
Thanks @ivsanro1! |
Hello,
Since we might add some features for fuzzy search, especially when the currency is not found (see #28 (comment)), we think it'd be nice to have some evaluation script and dataset that we can use to compare the quality of the extractions (thanks for the idea, @lopuhin!).

The purpose of the evaluation is to compare the currency extraction quality of `Price.fromstring`. If the functionality of this function is extended via a PR (e.g. we perform a fuzzy search when the currency is not found), we should be able to see an improvement in the metrics.

I decided not to add it as a test because the evaluation metrics can be interpreted in several ways.

For now, the metrics are simply global accuracy and accuracy per symbol (ground truth) for quality evaluation, and total extraction time and extraction time per sample for performance evaluation. These metrics can be easily extended, as the script is quite simple (`evaluation/evaluate_currency_extraction.py`). If in the future we add other evaluation metrics or datasets, this script can be improved and generalized (e.g. by adding `argparse`). For now I decided to keep it simple.

The dataset (`dataset_eval.json`) can also be easily extended to include more cases. Here's a sample of `dataset_eval.json`:

The evaluation script will extract the currency from `string` using `Price.fromstring` and compare it with `currency`. A correct extraction happens when both are equal.

With the current dataset and price_parser versions, this is the output of the evaluation:
I'm open to suggestions
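For illustration, here is a minimal sketch of the kind of evaluation loop described above. It assumes the dataset is a JSON list of objects with `string` and `currency` keys, as described in this PR; the structure and output format are assumptions, and the actual `evaluation/evaluate_currency_extraction.py` may be organized differently.

```python
# Hypothetical sketch: load a JSON dataset of {"string": ..., "currency": ...}
# entries, run Price.fromstring on each string, and report global accuracy,
# per-symbol accuracy, and timing.
import json
import time
from collections import defaultdict

from price_parser import Price


def evaluate(dataset_path: str = "dataset_eval.json") -> None:
    with open(dataset_path, encoding="utf-8") as f:
        # e.g. [{"string": "22,90 €", "currency": "€"}, ...] (illustrative)
        dataset = json.load(f)

    correct = 0
    per_symbol = defaultdict(lambda: [0, 0])  # ground-truth symbol -> [correct, total]

    start = time.perf_counter()
    for sample in dataset:
        predicted = Price.fromstring(sample["string"]).currency
        truth = sample["currency"]
        per_symbol[truth][1] += 1
        if predicted == truth:  # a correct extraction means exact equality
            correct += 1
            per_symbol[truth][0] += 1
    elapsed = time.perf_counter() - start

    print(f"Global accuracy: {correct / len(dataset):.4f}")
    for symbol, (ok, total) in sorted(per_symbol.items()):
        print(f"  {symbol}: {ok}/{total}")
    print(f"Total extraction time: {elapsed:.4f}s "
          f"({elapsed / len(dataset):.6f}s per sample)")


if __name__ == "__main__":
    evaluate()
```

When comparing a parsing change, the global accuracy and timing numbers are the most useful before/after signals; the per-symbol breakdown mainly helps spot which currencies a change fixes or breaks.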