Rework trainset/dataset code so that it can flip the components/roll mags if secondary is deeper #79

Closed
SteveOv opened this issue Jul 31, 2024 · 3 comments

SteveOv commented Jul 31, 2024

Following on from investigation in #68

I suspect this will involve generating the csv and tfrecord files together, so that the csv can be kept in sync if we have to flip the components to match a revised mags feature.

Will also need to revise the training set distributions, as initial investigations show that this change will lead to a significantly wider range of k.

Some noddy sample code from the initial investigation: [snippet-roll-LC-to-switch.txt](https://github.com/user-attachments/files/16446430/snippet-roll-LC-to-switch.txt)
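The attached snippet isn't reproduced here, but the core idea can be sketched as follows. This is a minimal illustration, assuming a phase-folded mags array with the eclipse to be promoted at a known phase; the function and argument names are hypothetical, not the project's actual API:

```python
import numpy as np

def roll_mags_to_phase_zero(mags: np.ndarray, phase: float) -> np.ndarray:
    """Roll a phase-folded mags curve so the eclipse currently at
    `phase` (0..1) moves to phase zero (bin 0)."""
    shift = -int(round(phase * len(mags)))
    return np.roll(mags, shift)
```

For example, with a secondary eclipse at phase 0.5 and a 1024-bin curve, the whole curve is rolled by 512 bins so the secondary lands at bin 0.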

@SteveOv SteveOv added the enhancement New feature or request label Jul 31, 2024
@SteveOv SteveOv self-assigned this Jul 31, 2024
SteveOv commented Aug 1, 2024

First job is to refactor the code for generating the trainsets & datasets. Currently this is split into two distinct steps: generating the trainset CSV files, then using the contents of the CSVs to generate the mags/LCs and write the corresponding dataset files. We'll need to rework the code (mainly what's in datasets) to do both concurrently. This will make it possible to amend an instance's params based on the results of generating its LC model.

SteveOv added a commit that referenced this issue Aug 2, 2024
…#79)

Big change
Previously, creating a dataset was a two-step process:
1. generate full set of instances with params for each and save to CSV file(s)
2. make the corresponding tfrecord(s) from the CSVs, using the params to create the model LC upon which the mags feature is based. This is also where the train/val/test split happens.

With this change, both the csvs and tfrecords are written concurrently. The benefit is that we now assemble all of the data for each instance (including the mags feature) before saving to both csv and tfrecord, meaning we can add logic which may discard or modify an instance based on the outcome of generating its mags feature. Previously, membership and labels were decided at the CSV stage and could not be updated when generating the mags data.
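The single-pass approach can be sketched like this. The helper names (record_writer, is_usable) are hypothetical stand-ins for the real serialization and filtering logic, not the project's actual API:

```python
import csv

def make_dataset(instances, csv_path, record_writer, is_usable):
    """Assemble each instance fully (params + mags feature), allowing it
    to be amended or discarded, then write it to the CSV and the dataset
    file in the same pass so the two always stay in sync."""
    written = 0
    with open(csv_path, "w", newline="") as f:
        writer = None
        for inst in instances:
            if not is_usable(inst):      # discard based on the mags outcome
                continue
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(inst))
                writer.writeheader()
            writer.writerow(inst)        # the params row
            record_writer(inst)          # e.g. a serialized tfrecord Example
            written += 1
    return written
```

Because both outputs are written from the same fully assembled instance, any swap or discard decision is automatically reflected in both files.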

Because of changes (fixes) to the random generators/seeds, datasets generated from here on will not have the same members as previously. However, consistency is now improved going forward.

Fringe benefit: the new code uses less RAM and is slightly faster than the previous approach.
SteveOv added a commit that referenced this issue Aug 2, 2024
The instance generator functions no longer yield from a for loop based on an instance_count argument. Instead they yield indefinitely, with datasets.make_dataset_file() closing the generator once it has the required number of usable instances. This will allow make_dataset_file() to discard instances which cause an error or fail to meet some future criterion.
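The pattern described above can be sketched as follows; the names and the instance contents are illustrative, not the actual project code:

```python
import random

def generate_instances(seed: int = 42):
    """Yield candidate instances indefinitely; the consumer decides
    when to stop by closing the generator."""
    rng = random.Random(seed)
    while True:
        yield {"k": rng.uniform(0.1, 2.0)}

def take_usable(gen, count, is_usable):
    """Draw from `gen` until `count` usable instances are collected,
    then close the generator (mirroring make_dataset_file())."""
    chosen = []
    try:
        for inst in gen:
            if is_usable(inst):
                chosen.append(inst)
                if len(chosen) == count:
                    break
    finally:
        gen.close()
    return chosen
```

The key design point is that the consumer, not the generator, owns the stopping condition, so discarded instances are simply replaced by further draws.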
SteveOv added a commit that referenced this issue Aug 2, 2024
Since the refactoring these are no longer true.
SteveOv added a commit that referenced this issue Aug 2, 2024
Previously the plot_trainset_histogram() function was hard-coded to look for trainset*.csv files. Its replacement now takes an iterable of files, so it is up to the caller to glob the appropriate files (which can now be named whatever is wanted).

Took the opportunity to further revise this to return a fig as the other plot functions in this module do.
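A minimal sketch of the revised calling pattern, assuming CSV inputs containing the plotted column; the function and column names are hypothetical, not the module's actual signatures:

```python
from typing import Iterable
import csv
import matplotlib
matplotlib.use("Agg")            # headless backend; no display needed
import matplotlib.pyplot as plt

def plot_histograms(csv_files: Iterable[str], column: str = "k"):
    """Plot a histogram of `column` across the given CSV files and
    return the Figure, as the other plot functions in the module do.
    The caller is responsible for globbing whichever files it wants."""
    values = []
    for file in csv_files:
        with open(file, newline="") as f:
            for row in csv.DictReader(f):
                values.append(float(row[column]))
    fig, ax = plt.subplots()
    ax.hist(values, bins=30)
    ax.set_xlabel(column)
    return fig
```

A caller would then do something like `plot_histograms(glob.glob("my-dataset*.csv"))`, keeping file-naming conventions out of the plotting code.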
SteveOv added a commit that referenced this issue Aug 5, 2024
The check for usable instances is now handled as a callback from make_dataset_file() to a function in the make_* modules. Previously this was effectively the other way round: a function in datasets called by the make_* modules.

This change will allow the criteria to be specific to the dataset and for the check to be used within the logic of the controlling make_dataset_file() function.
SteveOv added a commit that referenced this issue Aug 6, 2024
…s to #79)

Added logic which will update the params dict to swap the primary/secondary stars over and will roll the secondary eclipse to phase zero if the original secondary eclipse is found to be deeper than the primary.

Controlling flag defaults to False so that existing behaviour is unchanged.
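A hedged sketch of that swap/roll logic, assuming a mags array where larger values mean deeper eclipses, and illustrative param key names (R1/R2 etc.); this is not the project's actual implementation:

```python
import numpy as np

def swap_if_secondary_deeper(params: dict, mags: np.ndarray,
                             phase_sec: float = 0.5):
    """If the eclipse at `phase_sec` is deeper than the one at phase
    zero, swap the star-1/star-2 params and roll the curve so the
    deeper eclipse sits at phase zero."""
    n = len(mags)
    if mags[int(phase_sec * n) % n] <= mags[0]:   # primary already deeper
        return params, mags
    swapped = dict(params)
    for a, b in (("R1", "R2"), ("T1", "T2"), ("M1", "M2")):
        swapped[a], swapped[b] = params[b], params[a]
    swapped["k"] = 1.0 / params["k"]              # the radius ratio inverts
    return swapped, np.roll(mags, -int(round(phase_sec * n)))
```

With the controlling flag off the function is never called, so existing behaviour is preserved; turned on, any instance whose secondary is deeper gets its params and mags rewritten consistently before being saved.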
SteveOv commented Aug 12, 2024

This is now done. Applying this to the full synthetic-mist-tess-dataset test dataset leads to about a quarter of the instances having their components swapped over.

What we also find is that the ranges for k & J are greatly extended for extreme systems with a large difference in the component masses. This can be handled by applying maximum criteria to the k, J and qphot params (applied after any swap has been made). Alternatively, we could restrict the difference in the component masses when the instances are originally generated (before any decision to swap is made).
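Such maximum criteria could be applied with a simple post-swap filter along these lines; the limit values and key names are placeholders, not the project's chosen cut-offs:

```python
def within_limits(params: dict, max_k: float = 3.0, max_j: float = 3.0,
                  max_qphot: float = 5.0) -> bool:
    """Reject instances whose k, J or qphot exceed the chosen maxima.
    Intended to run after any primary/secondary swap so the limits act
    on the final label values."""
    return (params["k"] <= max_k
            and params["J"] <= max_j
            and params["qphot"] <= max_qphot)
```

This slots naturally into the usable-instance callback described earlier, so over-range instances are simply discarded and replaced by further draws from the generator.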

On testing with these "swap" datasets we see a consistent phenomenon: a "bar" of very poor predictions where k is predicted to be near zero. Sometimes it is horizontal, as in the image below, and with other training sets both horizontal and vertical (where the label for k is near 1 but the predictions extend upwards). These bars extend most of the way to k ~ 1.0, so they are not just instances with k outside the trained range. The majority of affected predictions are for transiting systems.

[Figures: predictions-nonmc-vs-labels-synthetic-mist-tess-dataset-swap10 (all instances) and predictions-nonmc-vs-labels-synthetic-mist-tess-dataset-swap10-transiting (transiting instances)]

This needs investigating and resolving.

SteveOv commented Aug 12, 2024

Closing this issue; #80 has been raised to cover the investigation of the above.

@SteveOv SteveOv closed this as completed Aug 12, 2024