Metric calculation is bogus #223

Open
nanounanue opened this issue Sep 19, 2017 · 5 comments

@nanounanue
Contributor

nanounanue commented Sep 19, 2017

Precision calculation is currently taking predictions for several as-of dates and calculating precision across all of them together, resulting in bogus results. We need to look at how to do it for each as-of date separately and then aggregate, or something more reasonable.
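For concreteness, a minimal sketch of the difference, assuming an illustrative predictions frame with as_of_date, score, and label columns and a precision-at-top-k style metric; the column names, data, and helper are hypothetical, not Triage's actual schema or evaluation code:

import pandas as pd

# Illustrative predictions: one row per entity per as-of date (made-up values).
preds = pd.DataFrame({
    "as_of_date": ["2017-01-01"] * 4 + ["2017-02-01"] * 4,
    "score":      [0.9, 0.8, 0.3, 0.1, 0.7, 0.6, 0.4, 0.2],
    "label":      [1,   0,   0,   0,   1,   1,   0,   0],
})

def precision_at_top_k(df, k=2):
    # precision among the k highest-scoring rows
    return df.nlargest(k, "score")["label"].mean()

# Current behaviour: one cutoff over all as-of dates pooled together.
pooled = precision_at_top_k(preds, k=2)                     # 0.5

# Alternative: one precision per as-of date, then aggregate.
per_date = pd.Series({
    date: precision_at_top_k(group, k=2)
    for date, group in preds.groupby("as_of_date")
})                                                          # 0.5 and 1.0
print(pooled, per_date.mean())                              # 0.5 vs 0.75

The pooled number is dominated by whichever as-of dates happen to produce the highest scores, which is the effect described above.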

@nanounanue nanounanue added the bug label Sep 19, 2017
@thcrock
Contributor

thcrock commented Jan 19, 2018

Not actionable as written. Closing, can reopen with more details if needed.

@thcrock thcrock closed this as completed Jan 19, 2018
@thcrock thcrock reopened this Jan 19, 2018
@thcrock thcrock removed the bug label Feb 1, 2018
@nanounanue nanounanue changed the title from "Precision calculation is bogus" to "Metric calculation is bogus" Feb 6, 2019
@nanounanue
Contributor Author

Given the following temporal configuration:

temporal_config:
    feature_start_time: '2010-01-04'
    feature_end_time: '2019-01-01'
    label_start_time: '2015-02-01'
    label_end_time: '2019-01-01'

    model_update_frequency: '1y'
    training_label_timespans: ['1month']
    training_as_of_date_frequencies: '1month'

    test_durations: '1y'
    test_label_timespans: ['1month']
    test_as_of_date_frequencies: '1month'

Resulting in the following time splits:

[Timechop plot: inspections_baseline.png]

As you can see, we will generate predictions at 12 different as-of dates in the test period using the trained model.

Should we get 12 different metric calculations? An array? Just the total one?
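To make the options concrete, a hedged sketch with placeholder numbers (not Triage's evaluations schema): twelve per-as-of-date values versus a single aggregate computed from them.

import pandas as pd

# Placeholder per-as-of-date metric values for the 12 monthly test as-of dates
# implied by test_durations: '1y' and test_as_of_date_frequencies: '1month'.
per_date = pd.Series(
    [0.40, 0.38, 0.41, 0.39, 0.37, 0.36, 0.35, 0.36, 0.34, 0.33, 0.31, 0.30],
    index=pd.date_range("2018-01-01", periods=12, freq="MS"),
    name="precision",
)

print(per_date)          # "an array": one value per as-of date
print(per_date.mean())   # "just the total one": a single aggregate for the split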

@ecsalomon
Contributor

ecsalomon commented Feb 7, 2019

My feeling on this is that there should be a different set of parameters in your temporal config, test_frequency and test_interval or some such, that determines how many and which test matrices your model is evaluated on, while test_duration and test_example_frequency determine how many and which dates go into a single evaluation (whether combining all of the dates in the way currently done makes sense is, I think, debatable). When we initially wrote the test_duration and test_example_frequency keys, we were thinking of cases where test predictions are also event-based, so each date may be sparsely labeled and combining multiple dates is necessary.
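Written out as a sketch, with the caveat that these key names are suggestions in this thread, not existing Triage/timechop parameters:

# Hypothetical split of responsibilities in the temporal config
# (key names and values are illustrative only).
proposed_temporal_config = {
    # how many and which test matrices a model is evaluated on
    "test_frequency": "3month",
    "test_interval": "1y",
    # how many and which dates feed a *single* evaluation
    # (the role of the existing test_duration / test_example_frequency keys)
    "test_duration": "1y",
    "test_example_frequency": "1month",
}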

I feel like there are already issues to this effect somewhere.

@ecsalomon
Contributor

Ah, yes, I said the same thing in #378. Doesn't make me right, just consistent. :)

@ecsalomon
Contributor

Another thought on this: We are doing evaluations the same way (making one evaluation over all dates) in both test and train. For EWS problems, presumably, this method is equally bogus in both train and test. Should there be a flag to control this behavior?

ecsalomon added a commit that referenced this issue Apr 24, 2019
This commit addresses #663, #378, #223 by allowing a model to be
evaluated multiple times, and thereby allowing users to see whether
the performance of a single trained model degrades over the time following
training.

Users must now set a timechop parameter, `test_evaluation_frequency`, that
will add multiple test matrices to a time split. A model will be tested
once on each matrix in its list. Matrices are added until they reach the
label time limit, testing all models on the final test period (assuming
that model_update_frequency is evenly divisible by
test_evaluation_frequency).

This initial commit only makes changes to timechop proper. Remaining
work includes:

- Write tests for the new behavior
- Make timechop plotting work with new behavior

New issues that I do not plan to address in the forthcoming PR:

- Incorporate multiple evaluation times into audition and/or
  postmodeling
- Maybe users should be able to set a maximum evaluation horizon so that
  early models are not tested for, say, 100 time periods
- Evaluation time-splitting could (or should) eventually not be done with
  pre-made matrices but on the fly at evaluation time
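A rough sketch of the layout the `test_evaluation_frequency` parameter described above implies: evaluation start times spaced by that frequency from the end of training until the label time limit. This is an illustration under those assumptions, not Timechop's actual implementation.

from datetime import datetime
from dateutil.relativedelta import relativedelta

def evaluation_start_times(train_end, label_end_time, test_evaluation_frequency):
    # lay out test evaluations every `test_evaluation_frequency`
    # from the end of training until the label time limit
    times = []
    current = train_end
    while current + test_evaluation_frequency <= label_end_time:
        times.append(current)
        current += test_evaluation_frequency
    return times

# A model trained through 2018-01-01, evaluated quarterly until labels run out
# (2019-01-01), yields four evaluation start times:
# 2018-01-01, 2018-04-01, 2018-07-01, 2018-10-01.
print(evaluation_start_times(
    datetime(2018, 1, 1),
    datetime(2019, 1, 1),
    relativedelta(months=3),
))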