
Compare DNN and XGB results #151

Open
Tracked by #54
bfhealy opened this issue Nov 3, 2022 · 11 comments
bfhealy commented Nov 3, 2022

Once DNN training is complete, we should begin our inference on the same fields as analyzed by the XGB algorithm to compare the results from each. The comparison may identify areas of improvement for one or both algorithms. We can then expand inference to all fields.

bfhealy commented Nov 30, 2022

@markkennedy3 Once all 20 fields of DNN inference are available, could you please remake your multi-panel comparison plot from a few months ago?

mkenne15 commented:
Will do. Do you want me to include the notebooks for producing them in scope? Maybe in the tools dir?

@bfhealy
Copy link
Collaborator Author

bfhealy commented Nov 30, 2022

That sounds like a good idea, thanks!

@mkenne15
Copy link

mkenne15 commented Dec 6, 2022

Are the DR5 predictions for the XGB model the ones we'd like to compare the current DNN results to? If so, I've recreated the plot I sent to you a few months ago, but replaced the DNN results with those from the 20 fields predictions (attached). I've had to do a bit of massaging as the msms column in the XGB results is now called emsms in the DNN table.
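The column "massaging" mentioned above can be done with a pandas rename before merging the two tables. A minimal sketch, with toy tables standing in for the real XGB and DNN catalogs (only the `msms`/`emsms` column names come from the comment; the `_id` key and values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy tables: the XGB results call the column "msms",
# while the DNN results call the same quantity "emsms".
xgb = pd.DataFrame({"_id": [1, 2], "msms": [0.1, 0.8]})
dnn = pd.DataFrame({"_id": [1, 2], "emsms": [0.2, 0.7]})

# Harmonize the column name, then merge on the object identifier,
# keeping both classifiers' probabilities with distinct suffixes.
dnn = dnn.rename(columns={"emsms": "msms"})
merged = xgb.merge(dnn, on="_id", suffixes=("_xgb", "_dnn"))
print(merged.columns.tolist())  # ['_id', 'msms_xgb', 'msms_dnn']
```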

[Image: XGB_vs_DNN_Distributions]

bfhealy commented Dec 6, 2022

Thanks Mark, this is a very helpful visualization! I think the existing DR5 XGB predictions are the ones we should compare to the new DNN results.

AshishMahabal commented:
Thanks, Mark. Can you clarify which objects are shown in the plots? For instance, in the VNV plot the XGB column is close to one while the DNN column is close to zero. Also, the binwidths for XGB and DNN seem to be different.

mkenne15 commented Dec 7, 2022

Hey Ashish. Each figure plots the histogram of probabilities that either the XGB or DNN has assigned to all objects that each classifier has seen, for all of the different classifications. The binwidths are indeed different; at the moment I'm just letting numpy automatically determine the appropriate binwidths for each class.

There are also some peculiar behaviours occurring here. As you highlighted, vnv is very different between the XGB and DNN classifiers (the XGB classifies just about everything as VNV, while the DNN classifies almost nothing as VNV). The DNN distribution makes sense to me: these results come from running the DNN on 20 fields, and we'd expect the majority of objects to be non-variable, assuming we classified absolutely every object in those fields rather than pre-selecting variable objects (Brian, maybe you could clarify?). I need to think a bit more about the XGB distribution, but other panels suggest the XGB is struggling anyway (consider YSOs, where it assigns every object a minimum 33% chance of being a YSO!).
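The binwidth mismatch described above comes from per-array automatic binning; fixing one set of edges for both classifiers avoids it. A minimal numpy sketch, with beta-distributed toy probabilities standing in for the real XGB/DNN columns (the skew parameters are invented to mimic the "everything near 1" vs. "everything near 0" behaviour):

```python
import numpy as np

rng = np.random.default_rng(0)
p_xgb = rng.beta(5, 1, size=1000)  # toy XGB probabilities, skewed high
p_dnn = rng.beta(1, 5, size=1000)  # toy DNN probabilities, skewed low

# bins="auto" picks a different binwidth for each array...
auto_xgb = np.histogram_bin_edges(p_xgb, bins="auto")
auto_dnn = np.histogram_bin_edges(p_dnn, bins="auto")

# ...so for a fair comparison, use one fixed set of edges for both.
edges = np.linspace(0.0, 1.0, 21)  # 20 bins of width 0.05
counts_xgb, _ = np.histogram(p_xgb, bins=edges)
counts_dnn, _ = np.histogram(p_dnn, bins=edges)
```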

AshishMahabal commented:
Density plots with variable binwidths can be an issue. Can you produce pure histograms for one or two classes (vnv and yso would be great, since you named those)? That is, a numeric y-axis rather than density. In fact, it could be worth plotting XGB and DNN in two separate panels so that a side-by-side comparison can be done.
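A minimal matplotlib sketch of the request above: raw-count histograms (no `density=True`), shared fixed bins, and two side-by-side panels. The beta-distributed toy probabilities are invented stand-ins for the real P(vnv) columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
p_xgb = rng.beta(5, 1, size=1000)  # toy stand-in for XGB P(vnv)
p_dnn = rng.beta(1, 5, size=1000)  # toy stand-in for DNN P(vnv)

bins = np.linspace(0.0, 1.0, 21)  # one fixed binning for both panels
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
n_xgb, _, _ = ax1.hist(p_xgb, bins=bins)  # counts, not density
n_dnn, _, _ = ax2.hist(p_dnn, bins=bins)
ax1.set_title("XGB P(vnv)")
ax2.set_title("DNN P(vnv)")
ax1.set_ylabel("Number of objects")
fig.savefig("vnv_counts.png")
```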

mkenne15 commented Dec 7, 2022

Hey Ashish. Do you mean something like this? (I haven't done side-by-side panels, but rather set up separate y-axes to allow for easier comparison.)

[Image: VNV_probabilities]

mkenne15 commented Dec 7, 2022

As a quick follow-up, I've recreated the plots but filtered the DNN results to remove any object with P(VNV) < 0.9, so that when an object is classified as variable, we can better see what probabilities it receives for the other classes. (This probably makes more sense to look at than the previous large panel plot, as the spikes at 0 for each class in the DNN results are due to non-variable sources.)
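The P(VNV) cut described above is a one-line pandas filter. A minimal sketch with a hypothetical toy prediction table (the `vnv`/`yso` column names follow the class names in this thread; the values are invented):

```python
import pandas as pd

# Toy DNN prediction table; "vnv" is assumed to hold the
# variable/non-variable probability for each object.
dnn = pd.DataFrame({
    "_id": [1, 2, 3, 4],
    "vnv": [0.95, 0.10, 0.92, 0.50],
    "yso": [0.30, 0.01, 0.70, 0.05],
})

# Keep only confidently variable objects, so the spikes at 0 from
# non-variable sources drop out of the per-class histograms.
variable = dnn[dnn["vnv"] >= 0.9]
print(variable["_id"].tolist())  # [1, 3]
```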

[Image: XGB_vs_DNN_Distributions_vnv]

bfhealy commented Dec 7, 2022

Thanks Mark - I can confirm that the DNN 20-field classifications are performed on all sources in each field, not just those previously identified as variable. Thus it could make sense for many of the results to be labeled as non-variable, especially if the training set for the vnv classifier has any bias toward the brighter sources in a field:

While the mean Gaia G mag of variable sources in the current training set is ~16.9, my examination of a few fields finds a mean G mag ~18.3. The fainter variable sources in these fields may not be picked up by the vnv classifier due to added noise in their light curves.

Relatedly, a challenge with AGN/YSO is their tendency to display stochastic variability that can be confused with noise for faint sources. It is possible that many faint variables that are not intrinsically AGN/YSO are being classified as such because the training doesn't establish a clear enough distinction between faint variables and true AGN/YSO.
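The magnitude bias described above (training-set mean G ~16.9 vs. field mean G ~18.3) can be checked with a simple comparison of the two magnitude distributions. A toy sketch, with Gaussian samples invented to mimic the quoted means (the real analysis would use the actual Gaia G columns of the training set and field catalogs):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy Gaia G magnitudes: a training set biased bright vs. a full field.
g_train = rng.normal(16.9, 1.0, size=5000)
g_field = rng.normal(18.3, 1.0, size=5000)

# Difference of means quantifies how much fainter the field is than
# the training set; a large positive value signals the bias above.
bias = g_field.mean() - g_train.mean()
```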
