Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix base plot "checking for variable name" #75

Merged
merged 2 commits into from
Feb 22, 2022

Conversation

mnlang
Copy link
Contributor

@mnlang mnlang commented Feb 21, 2022

Alex (@alexpghayes),

this contains a minor PR for fixing the warnings in the base plot generic encountered in PR #74 by @zeileis. The plot function currently checks if the variable name is a single upper case letter.

This only works for existing distribution objects:

R> N <- Normal(1)
R> plot(N)
R> plot(Normal(1))
Warning message:
In if (nchar(variable_name) == 1 & toupper(variable_name) == variable_name) { :
  the condition has length > 1 and only the first element will be used

More generally, would you like me to write an issue or even a PR to unify the plot functions? Currently there is a plot.distributions3() with the boolean argument cdf, as well as the functions plot_pdf() and plot_cdf() based on ggplot2:

  • In a first step, it would be reasonable to unify plot() and autoplot() with the arguments type = c("pdf", "cdf")).
  • In a further step it might be even possible to provide an interface to the package ggdist?

@zeileis
Copy link
Collaborator

zeileis commented Feb 21, 2022

Thanks, Moritz. (As I had already indicated in the Poisson PR I was about to provide a fix for that in a separate PR.)

The problem is that substitute(x) is not enough as this provides a "call" here. The usual fix would be to use deparse(substitute(x)). In addition the && can be added as you proposed. Finally, letting the user choose a variable_name explicitly would also be helpful in some cases.

Regarding the type = c("pdf", "cdf") argument you suggest in the comment above: I would recommend to use a different argument name for that. It would be nice to export the type = NULL argument and then switch to type = "p" etc. as appropriate - but then the user can also set, for example, type = "o" or type = "h". This is what I would have liked to do in the poisson vignette.

@zeileis
Copy link
Collaborator

zeileis commented Feb 21, 2022

Separate but related question (possibly better for a separate issue): The way the plot() method determines whether a distribution is discrete or not is quite ugly and not extensible. Should this be modularized in some way? There is already the support() extractor but it only returns the minimum and maximum for a given distribution. Maybe adding some sort of extractor for the "minimum increment" would be helpful? This would be 0 for continuous distributions and 1 for most discrete distributions.

@alexpghayes
Copy link
Owner

More generally, would you like me to write an issue or even a PR to unify the plot functions? Currently there is a plot.distributions3() with the boolean argument cdf, as well as the functions plot_pdf() and plot_cdf() based on ggplot2:

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

In a further step it might be even possible to provide an interface to the package ggdist?

I'm intrigued, what are you thinking here?

The way the plot() method determines whether a distribution is discrete or not is quite ugly and not extensible.

I'm open to changing this; basically the current way we represent a distribution is a quickly thrown together hack. I started thinking about a more careful class system a while back but didn't really have the capacity to push it anywhere. As a start it seems like subclasses for discrete and for continuous distributions would be an easy win, and I would welcome a PR to make those changes.

@mnlang
Copy link
Contributor Author

mnlang commented Feb 21, 2022

The problem is that substitute(x) is not enough as this provides a "call" here. The usual fix would be to use deparse(substitute(x)). In addition the && can be added as you proposed. Finally, letting the user choose a variable_name explicitly would also be helpful in some cases.

My bad. I have adjusted the code accordingly!

@mnlang
Copy link
Contributor Author

mnlang commented Feb 21, 2022

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

I understand your point, on the other hand plot() and autoplot() distinguish more clearly between base and ggplot2 graphics. Aliasing might really be a good compromise by first checking if ggplot2 is loaded and rendering the plot accordingly?

But thinking about it again, wouldn't that imply that the arguments are 1:1 interchangeable - which will not be really the case.
Hence, I think we should rather work with plot and autoplot and put parts of the autoplot function into individual exported geoms (as it's already partly done).

What do you think, @alexpghayes and @zeileis?

@rmtrane
Copy link
Collaborator

rmtrane commented Feb 21, 2022

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

I understand your point, on the other hand plot() and autoplot() distinguish more clearly between base and ggplot2 graphics. Aliasing might really be a good compromise by first checking if ggplot2 is loaded and rendering the plot accordingly?

But thinking about it again, wouldn't that imply that the arguments are 1:1 interchangeable - which will not be really the case. Hence, I think we should rather work with plot and autoplot and put parts of the autoplot function into individual exported geoms (as it's already partly done).

What do you think, @alexpghayes and @zeileis?

When I first wrote the plot_cdf and plot_pdf functions, I simply needed something quick and reusable for plotting CDFs and PDFs for an intro stats class I taught. Coincidentally, @paulnorthrop wrote the plot and autoplot function at about the same time, and we decided it was easier for the time being to simply include both. It was mainly due to a lack of time to find out what would be the best approach.

My original idea was to create geom_pdf and geom_cdf to make it easy to both plot PDFs and CDFs, but also to overlay distributions onto histograms. My goal originally was to get to a point where I could do something like

# A normal curve
ggplot() + geom_pdf(d = Normal()) 

# Normal curve on top of histogram. Might be super tricky cause you need to be careful about
# bin sizes and normalize, but never got to a point where that was the issue...
ggplot(data = students, aes(x = height)) + geom_histogram(...) + geom_pdf(d = Normal(..., ...))

Unfortunately, I didn't have time to dive deep enough into how to write geom_* and stat_* functions -- as you can tell from my code, I tried, but didn't quite get there. I still think this would be the most elegant solution, but in my (limited) experience, extending ggplot2 is hard...

As @alexpghayes, I would prefer to keep the plot_cdf() and plot_pdf(). Personally, I wouldn't mind making ggplot2 a dependency for this and getting rid of plot and autoplot, but I understand if you want to keep a base option around.

Anyway, just wanted to chime in. If anyone wants to work on an implementation that functions as a ggplot2 extension, I think that would be awesome, and I'd happily help.

@zeileis
Copy link
Collaborator

zeileis commented Feb 21, 2022

Thanks everybody for the discussion so far. I will try to structure my feedback on the different aspects a bit.

Discrete vs. continuous

  • As the nature of the variable is already implied by the class name itself, I wouldn't do anything else to store that information in the object itself, but put it into a dedicated method like is_discrete() or minincrement() or something along those lines.
  • My feeling is that separate classes for this would be overkill. I would typically use a different class if the structure of the object changes, e.g., some list elements don't exist or have different names, etc. But this is not the case here.
  • Finally, I think it will be hard to come up with a structure that generally covers all possible forms of support that are conceivable (this could be very complex). But most of the cases we need in practice are either continuous in an interval or discrete with steps of 1. So I would aim at making the computations for these important special cases easy. Everything else could be handled by writing a dedicated plot.class method that might call the general plot.distribution function internally after setting up some variables (like ranges, breaks, etc.).

Functions vs. methods

  • I don't like the split into plot_pdf() and plot_cdf() because this is not very "DRY" (don't repeat yourself). I think there is more what the PDF and CDF visualizations have in common than what differs, so I would keep them in a single function.
  • I'm not sure if the same reasoning would apply to the base graphics vs. ggplot2 visualizations. I think many aspects between these are quite different so my feeling is that they could be kept separate.
  • I agree with Moritz that plot() and autoplot() are nice and intuitively distinguish between the plotting systems. (And even if plot_*df() functions are kept around I would recommend building the methods on top of them, just for convenience.)

ggplot2 dependency

  • Given that visualization is not the main focus of the package I think it is worth trying not to have ggplot2 as a strong dependency but just as a Suggests for example. In other packages I have done so by registering the autoplot() methods only conditionally on ggplot2 being loaded, adding a stopifnot(requireNamespace("ggplot2")) in the function, and only using fully-qualified ggplot:: function names internally.
  • But Moritz might have thought about this some more than I have and has also looked into other infrastructure for geoms etc.

@alexpghayes
Copy link
Owner

Merging the fix in for now, let's continue discussion about plotting code refactor at #77.

@alexpghayes alexpghayes merged commit ef02528 into alexpghayes:main Feb 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants