Rework trainset/dataset code so that it can flip the components/roll mags if secondary is deeper #79

Closed
SteveOv opened this issue Jul 31, 2024 · 3 comments

SteveOv commented Jul 31, 2024

Following on from investigation in #68

I suspect this will involve generating the csv and tfrecord files together, so that the csv can be kept in sync if we have to flip the components to match a revised mags feature.

Will also need to revise the training set distributions, as initial investigations show that this change will lead to a significantly wider range of k.

Some noddy sample code from the initial investigation: [snippet-roll-LC-to-switch.txt](https://github.com/user-attachments/files/16446430/snippet-roll-LC-to-switch.txt)
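The attached snippet isn't reproduced here, but the core idea can be sketched as follows. This is a minimal illustration, assuming a phase-folded mags array with the eclipse to be promoted at a known phase; the function and argument names are hypothetical, not the project's actual API:

```python
import numpy as np

def roll_mags_to_phase_zero(mags: np.ndarray, phase: float) -> np.ndarray:
    """Roll a phase-folded mags curve so the eclipse currently at
    `phase` (0..1) moves to phase zero (bin 0)."""
    shift = -int(round(phase * len(mags)))
    return np.roll(mags, shift)
```

For example, with a secondary eclipse at phase 0.5 and a 1024-bin curve, the whole curve is rolled by 512 bins so the secondary lands at bin 0.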

@SteveOv SteveOv added the enhancement New feature or request label Jul 31, 2024
@SteveOv SteveOv self-assigned this Jul 31, 2024
SteveOv commented Aug 1, 2024

First job is to refactor the code for generating the trainsets & datasets. Currently this is split into two distinct steps: generating the trainset CSV files, then using the contents of the CSVs to generate the mags/LCs and write the corresponding dataset files. We'll need to rework the code (mainly what's in datasets) to do both concurrently. This will make it possible to amend an instance's params based on the results of generating its LC model.

SteveOv added a commit that referenced this issue Aug 2, 2024
…#79)

Big change
Previously, creating a dataset was a two-step process:
1. generate full set of instances with params for each and save to CSV file(s)
2. make the corresponding tfrecord(s) from the CSVs, using the params to create the model LC upon which the mags feature is based. This is also where the train/val/test split happens.

With this change, both the csvs and tfrecords are written concurrently. The benefit is that we now assemble all of the data for each instance (including the mags feature) before saving to both csv and tfrecord, meaning we can add logic which may discard or modify an instance based on the outcome of generating its mags feature. Previously, membership and labels were decided at the CSV stage and could not be updated when generating the mags data.
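The single-pass approach can be sketched like this. The helper names (record_writer, is_usable) are hypothetical stand-ins for the real serialization and filtering logic, not the project's actual API:

```python
import csv

def make_dataset(instances, csv_path, record_writer, is_usable):
    """Assemble each instance fully (params + mags feature), allowing it
    to be amended or discarded, then write it to the CSV and the dataset
    file in the same pass so the two always stay in sync."""
    written = 0
    with open(csv_path, "w", newline="") as f:
        writer = None
        for inst in instances:
            if not is_usable(inst):      # discard based on the mags outcome
                continue
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(inst))
                writer.writeheader()
            writer.writerow(inst)        # the params row
            record_writer(inst)          # e.g. a serialized tfrecord Example
            written += 1
    return written
```

Because both outputs are written from the same fully assembled instance, any swap or discard decision is automatically reflected in both files.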

Because of changes (fixes) to the random generators/seeds, datasets generated from here on will not have the same members as previously. However, consistency is now improved going forward.

Fringe benefit: the new code uses less RAM and is slightly faster than the previous approach.
SteveOv added a commit that referenced this issue Aug 2, 2024
The instance generator functions no longer yield from a for loop based on an instance_count argument. Instead they yield indefinitely, with datasets.make_dataset_file() closing the generator once it has the required number of usable instances. This will allow make_dataset_file() to discard instances which cause an error or fail to meet some future criterion.
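The pattern described above can be sketched as follows; the names and the instance contents are illustrative, not the actual project code:

```python
import random

def generate_instances(seed: int = 42):
    """Yield candidate instances indefinitely; the consumer decides
    when to stop by closing the generator."""
    rng = random.Random(seed)
    while True:
        yield {"k": rng.uniform(0.1, 2.0)}

def take_usable(gen, count, is_usable):
    """Draw from `gen` until `count` usable instances are collected,
    then close the generator (mirroring make_dataset_file())."""
    chosen = []
    try:
        for inst in gen:
            if is_usable(inst):
                chosen.append(inst)
                if len(chosen) == count:
                    break
    finally:
        gen.close()
    return chosen
```

The key design point is that the consumer, not the generator, owns the stopping condition, so discarded instances are simply replaced by further draws.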
SteveOv added a commit that referenced this issue Aug 2, 2024
Since the refactoring these are no longer true.
SteveOv added a commit that referenced this issue Aug 2, 2024
Previously the plot_trainset_histogram() function was hard-coded to look for trainset*.csv files. Its replacement now takes an iterable of files, so it is up to the caller to glob the appropriate files (which can now be named whatever is wanted).

Took the opportunity to further revise this to return a fig as the other plot functions in this module do.
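A minimal sketch of the revised calling pattern, assuming CSV inputs containing the plotted column; the function and column names are hypothetical, not the module's actual signatures:

```python
from typing import Iterable
import csv
import matplotlib
matplotlib.use("Agg")            # headless backend; no display needed
import matplotlib.pyplot as plt

def plot_histograms(csv_files: Iterable[str], column: str = "k"):
    """Plot a histogram of `column` across the given CSV files and
    return the Figure, as the other plot functions in the module do.
    The caller is responsible for globbing whichever files it wants."""
    values = []
    for file in csv_files:
        with open(file, newline="") as f:
            for row in csv.DictReader(f):
                values.append(float(row[column]))
    fig, ax = plt.subplots()
    ax.hist(values, bins=30)
    ax.set_xlabel(column)
    return fig
```

A caller would then do something like `plot_histograms(glob.glob("my-dataset*.csv"))`, keeping file-naming conventions out of the plotting code.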
SteveOv added a commit that referenced this issue Aug 5, 2024
The check for usable instances is now handled as a callback from make_dataset_file() to a function in the make_* modules. Previously this was effectively the other way round: a function in datasets called by the make_* modules.

This change will allow the criteria to be specific to the dataset and for the check to be used within the logic of the controlling make_dataset_file() function.
SteveOv added a commit that referenced this issue Aug 6, 2024
…s to #79)

Added logic which will update the params dict to swap the primary/secondary stars over and will roll the secondary eclipse to phase zero if the original secondary eclipse is found to be deeper than the primary.

Controlling flag defaults to False so that existing behaviour is unchanged.
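A hedged sketch of that swap/roll logic, assuming a mags array where larger values mean deeper eclipses, and illustrative param key names (R1/R2 etc.); this is not the project's actual implementation:

```python
import numpy as np

def swap_if_secondary_deeper(params: dict, mags: np.ndarray,
                             phase_sec: float = 0.5):
    """If the eclipse at `phase_sec` is deeper than the one at phase
    zero, swap the star-1/star-2 params and roll the curve so the
    deeper eclipse sits at phase zero."""
    n = len(mags)
    if mags[int(phase_sec * n) % n] <= mags[0]:   # primary already deeper
        return params, mags
    swapped = dict(params)
    for a, b in (("R1", "R2"), ("T1", "T2"), ("M1", "M2")):
        swapped[a], swapped[b] = params[b], params[a]
    swapped["k"] = 1.0 / params["k"]              # the radius ratio inverts
    return swapped, np.roll(mags, -int(round(phase_sec * n)))
```

With the controlling flag off the function is never called, so existing behaviour is preserved; turned on, any instance whose secondary is deeper gets its params and mags rewritten consistently before being saved.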
SteveOv commented Aug 12, 2024

This is now done. Applying this to the full synthetic-mist-tess-dataset test dataset leads to about a quarter of the instances having their components swapped over.

What we also find is that the ranges for k & J are greatly extended for extreme systems with a large difference in the component masses. This can be handled by applying maximum criteria to the k, J and qphot params (applied after any swap has been made). Alternatively, we could restrict the difference in the component masses when the instances are originally generated (before any decision to swap is made).
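Such maximum criteria could be applied with a simple post-swap filter along these lines; the limit values and key names are placeholders, not the project's chosen cut-offs:

```python
def within_limits(params: dict, max_k: float = 3.0, max_j: float = 3.0,
                  max_qphot: float = 5.0) -> bool:
    """Reject instances whose k, J or qphot exceed the chosen maxima.
    Intended to run after any primary/secondary swap so the limits act
    on the final label values."""
    return (params["k"] <= max_k
            and params["J"] <= max_j
            and params["qphot"] <= max_qphot)
```

This slots naturally into the usable-instance callback described earlier, so over-range instances are simply discarded and replaced by further draws from the generator.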

On testing with these "swap" datasets we see a consistent phenomenon: a "bar" of very poor predictions where k is predicted to be near zero. Sometimes it is horizontal, as in the image below, and with other training sets both horizontal and vertical (where the label for k is near 1 but the predictions extend upwards). These bars extend most of the way to k ~ 1.0, so they are not just instances with k outside the trained range. The majority of affected predictions are for transiting systems.

[Figures: predictions-nonmc-vs-labels-synthetic-mist-tess-dataset-swap10 (all instances) and predictions-nonmc-vs-labels-synthetic-mist-tess-dataset-swap10-transiting (transiting instances)]

This needs investigating and resolving.

SteveOv commented Aug 12, 2024

Closing this issue; #80 has been raised to cover the investigation of the above.

@SteveOv SteveOv closed this as completed Aug 12, 2024