Cross-validation and model selection #5

Open
3 tasks
gully opened this issue Jun 13, 2017 · 2 comments

Comments

@gully
Member

gully commented Jun 13, 2017

We'll need:

  • A function that computes the cross-validation score for subsamples of the data
  • A function that repeats the cross-validation over a range of model complexities
  • A function that ties this all together for multiple sources
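
A minimal sketch of how these three pieces might fit together, assuming k-fold cross-validation of an ordinary least-squares fit and using polynomial degree as a stand-in for "model complexity". The function names (`cv_score`, `scan_complexity`, `select_for_sources`), the fold count, and the `name -> (t, y)` input layout are all hypothetical choices for illustration, not anything fixed by this project.

```python
import numpy as np


def cv_score(design, y, n_folds=5, seed=0):
    """Mean squared prediction error from k-fold CV of a linear least-squares fit."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices and split them into roughly equal folds.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.ones(len(y), dtype=bool)
        train[held_out] = False
        # Fit on the training subsample only.
        coeffs, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        # Score on the held-out subsample.
        resid = y[held_out] - design[held_out] @ coeffs
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))


def scan_complexity(t, y, max_degree=6, n_folds=5):
    """CV score for polynomial models of degree 0..max_degree (assumed complexity axis)."""
    scores = {}
    for degree in range(max_degree + 1):
        design = np.vander(t, degree + 1, increasing=True)
        scores[degree] = cv_score(design, y, n_folds=n_folds)
    return scores


def select_for_sources(light_curves, **kwargs):
    """Run the scan for every source; light_curves maps name -> (t, y)."""
    results = {}
    for name, (t, y) in light_curves.items():
        scores = scan_complexity(t, y, **kwargs)
        best = min(scores, key=scores.get)  # lowest CV error wins
        results[name] = {"cv_scores": scores, "best_degree": best}
    return results
```

The same `cv_score` would work unchanged for a design matrix that also contains sine and cosine columns, since it only assumes the model is linear in its coefficients.
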
@gully
Member Author

gully commented Jun 13, 2017

Our ultimate goal is to have a dictionary of dictionaries, with one inner dictionary per source, that contains entries for:

  • Top 5 periods as determined from multiterm Lomb-Scargle
  • Lomb-Scargle scores of those top 5 periods
  • Linear regression coefficients for the underlying polynomial, with the number of terms set by cross-validation
  • Linear regression coefficients for sines and cosines for each of the 5 periods
  • Number of sines and cosines (up to five) desired by cross-validation
  • Linear regression coefficients for sines and cosines for the cross-validated subset of terms (non-orthogonal!)

This dictionary is a dimensionality-reduced representation of the data in the interval.
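
One possible layout for that dictionary of dictionaries is sketched below. The key names and the source labels are hypothetical placeholders, and the values are left empty rather than filled with made-up numbers; the only thing taken from the list above is which quantities get stored.

```python
def empty_source_entry(n_periods=5):
    """Return an unfilled summary for one source (all key names are hypothetical)."""
    return {
        # Top periods from the multiterm Lomb-Scargle periodogram.
        "top_periods": [None] * n_periods,
        # Periodogram score at each of those periods.
        "ls_scores": [None] * n_periods,
        # Polynomial coefficients; the length is chosen by cross-validation.
        "poly_coeffs": [],
        # One sine and one cosine coefficient per period.
        "sincos_coeffs_all": [None] * (2 * n_periods),
        # How many periods cross-validation decides to keep (0..n_periods).
        "n_sincos_terms": None,
        # Coefficients refit using only the cross-validated subset of terms.
        "sincos_coeffs_cv": [],
    }


# Outer dictionary keyed by source name (placeholder names).
summary = {name: empty_source_entry() for name in ("source_A", "source_B")}
```

The refit coefficients are stored separately from the full set because the sine and cosine terms are not orthogonal, so dropping terms changes the remaining coefficients.
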

Once we have this, we can do lots of fun things. We can go back and inspect the residual spectrum: what is the actual noise distribution? How many outliers (cosmic rays, flares) are there, and where are they? We could then redo everything with a refined noise model, masked cosmic rays, and maybe non-linear regression methods.
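
As a rough illustration of the outlier question, the sketch below flags residuals that sit far from the median, using the median absolute deviation as a robust noise estimate. This is an assumed approach rather than a decided one; the function name `flag_outliers` and the 5-sigma threshold are hypothetical, and the 1.4826 factor converts a MAD to a standard deviation only for Gaussian noise.

```python
import numpy as np


def flag_outliers(residuals, n_sigma=5.0):
    """Boolean mask of points more than n_sigma robust sigmas from the median."""
    residuals = np.asarray(residuals, dtype=float)
    center = np.median(residuals)
    # Median absolute deviation: insensitive to a handful of large outliers.
    mad = np.median(np.abs(residuals - center))
    # Scale MAD to an equivalent Gaussian standard deviation (assumes Gaussian core).
    sigma = 1.4826 * mad
    return np.abs(residuals - center) > n_sigma * sigma
```

The resulting mask could be used to exclude cosmic rays or flares before refitting, and `np.flatnonzero(mask)` would give their positions.
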

@gully
Member Author

gully commented Jun 13, 2017

Note that it's a little awkward that we're mixing multiterm Lomb-Scargle with the top N periods. Strictly speaking, those top N periods arise from the assumption of an underlying Fourier series, so we should actually have N_top_periods x N_Fourier_terms = 5 x 4 = 20 (times 2 = 40, for sines and cosines!) linearly regressed coefficients in our model. However, that's not the right thing to do, since many of the top N periods are actually aliases of the main period, by design. So what we're doing is some weird approximation of strictly Fourier methods. Our strategy has the drawback of being non-orthogonal, but has the (potential, unproven) benefit of picking up real physics that has multiple periods (e.g. differential rotation? multiple stars? weird physics?). Let's try it anyway...
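
To make the coefficient counting concrete, here is a small sketch of the design matrix for the sine and cosine terms. With one Fourier term per period it has 2 columns per period (10 for five periods); with four terms per period it has the full 5 x 4 x 2 = 40. The function name `sincos_design` is hypothetical, and treating the extra Fourier terms as integer harmonics of each period is an assumption of this sketch.

```python
import numpy as np


def sincos_design(t, periods, n_terms=1):
    """Columns of sin and cos at each period and its first n_terms harmonics."""
    t = np.asarray(t, dtype=float)
    columns = []
    for period in periods:
        for k in range(1, n_terms + 1):
            phase = 2.0 * np.pi * k * t / period
            columns.append(np.sin(phase))
            columns.append(np.cos(phase))
    # Shape: (len(t), 2 * len(periods) * n_terms)
    return np.column_stack(columns)
```

The coefficients then come from an ordinary linear least-squares solve, for example `np.linalg.lstsq(sincos_design(t, periods), y, rcond=None)`. Because the columns for different periods are generally not orthogonal on an unevenly sampled time grid, the fitted coefficients depend on which periods are included.
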
