-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Survey data into ggplot2 #6
Comments
Hey Thomas, Do you have an example visual you would like to shoot for? |
The problem is primarily how to specify the design object as a data argument, but there's also the secondary issue that weights and clusters should ideally be provided to geoms. Consider these two from ?stat_smooth
and
A survey design object is not a data frame and doesn't inherit from data.frame: it contains a data frame and also contains a bunch of metadata including weights and cluster identifiers. In a perfect world, I'd be able to do
to get a weighted scatterplot of DBP against age (weighted using alpha channel or using hexbins or using thinning, all as in survey::svyplot), with a weighted smoother. I could get the data into ggplot() by supplying the original data frame, but that will only work if no transformations have been done on the survey object (including to construct the metadata). Also, the survey package functions for smoothing, fitting lines, and so on expect their data as survey design object, not a data frame. The simplest solution may be a set of adapters that convert to data frames going into ggplot and to design objects coming out? |
This is an interesting and exciting idea. I think there are at least a couple of potential approaches:
I'm keen to hear other takes on this too. |
There's a further option which is inheriting from |
Hello, I often work with survey data, with the svyby_barplot <- function(x, by, design){
# Loads required packages if not already loaded
library(labelled)
library(reshape2)
library(ggplot2)
library(survey)
# Extracts variable names to insert them as legend in the plot
var_x <- paste(deparse(substitute(design)), "$variables$", x[2], sep = "")
lab_x <- var_label(eval(parse(text = var_x)))
lab_leg <- paste(deparse(substitute(design)), "$variables$", by[2], sep = "")
lab_leg <- var_label(eval(parse(text = lab_leg)))
# Uses the survey package to compute estimates and confidence intervals
byout <- svyby(x, by, design, FUN=svymean, na.rm=TRUE, vartype="ci")
## Fixing an apparent bug of survey package that pastes variable name and modality name in the svyby output
# Counting the number of character to delete
y <- as.numeric(nchar(x)[2])
#prepares a table with the right format for ggplot2
byout[,-1] <- byout[,-1]*100
col_est <- (ncol(byout)-1)/3
estimates <- melt(byout[,1:col_est+1])
low_ci <- melt(byout[,(col_est+1):(col_est*2)+1])
high_ci <- melt(byout[,(col_est*2+1):(col_est*3)+1])
out <- cbind(estimates, low_ci[,2], high_ci[,2])
out$by <- byout[,1]
colnames(out) <- c("cat", "est", "lci", "hci", "by")
# if y, take out the variable name
if(y>0){
out[,1] <- substring(out[,1], y+1)
}
# si, factorisée, on réintègre les facteurs de la variable d'origine
# pour afficher les résultats dans le même ordre
if (!is.null(levels(eval(parse(text = var_x))))) {
out$cat <- factor(out$cat, levels(eval(parse(text = var_x))))
}
# Produces the graph
graph <- ggplot(data=out, aes(x=cat, y=est, fill=by))
graph + theme(panel.border = element_blank(),
panel.grid.major = element_blank()) +
geom_bar(stat="identity", position=position_dodge()) +
geom_errorbar(aes(ymin=lci, ymax=hci), width=.1, position=position_dodge(0.9) ) +
geom_point(position=position_dodge(0.9)) +
scale_y_continuous(limits = c(-0.5,100)) +
theme (axis.text.x = element_text(angle=90, vjust=0.5)) +
scale_fill_discrete(name=lab_leg) +
# labs(title = svy_lab) +
xlab(lab_x) +
ylab("%")
} The DHS program (one of the main technical assistance programs to statistical offices in developing countries) provides example datasets of health survey data at this page: http://dhsprogram.com/data/Download-Model-Datasets.cfm library(haven)
library(survey)
dhs_demo <- as_factor(read_dta("ZZHR62FL.DTA"))
dhs_design <- svydesign(ids=~hv021+hv002, strata=~hv023, weights=~hv005, data=dhs_demo)
svyby_barplot(~hv201, by=~hv025, design=dhs_design) It produces the following graph: I use similar functions to produce maps or more readable tables to print with knitr/RMarkdown. I'm not sure that it helps, but in any case... |
seems useful. Thank you! |
Either for its own sake or as an example of how to do it, develop add-on to ggplot2 to allow weighted survey data.
Data from complex multistage surveys comes with weights and clusters that are often relevant for display, whether directly (opacity, hexbinning) or in stat() and stat_smooth() aesthetics. I don't know from ggplot2, so it would be good to work with people who do.
The text was updated successfully, but these errors were encountered: