short course
Analyze Survey Data for Free (http://asdfree.com/)
Public Microdata From An Easy to Type Website
List by presentation order. Email and office phone and fax numbers are to be included. It is essential that the Education Department at ASA is notified of any changes that occur between the time of submission and the time of presentation.
Anthony Damico
Provide an abstract not to exceed 200 words of the proposed course including the prerequisite for the anticipated audience. If the course is selected, this abstract will be used for advertising purposes in the registration material and on the JSM web site. Prerequisite knowledge or assumptions regarding the background of the attendees must be included in the abstract. If the abstract is more than 200 words, it will be edited by ASA.
Governments, NGOs, and other research institutes spend billions of dollars each year collecting demographic, economic, and health information about their populations. These efforts form the basis of many official reports, academic journal articles, and public health surveillance systems, each of which motivates public policy or informs the public to varying degrees. Depending on the sensitivity of the topic, these sponsoring organizations often publish household-level, person-level, or company-level datasets alongside their final summary report. This response-level data (commonly known as microdata) allows external researchers both to reproduce the original findings and to focus more deeply on segments of the population not discussed in the data products released by the original investigators. For example, the Census Bureau publishes an annual report, "Income and Poverty in the United States," with a series of tables, and also a database with one record per individual within each sampled household. While the Bureau helpfully provides many different cross-tabulations of its results, an external researcher might find utility in this dataset by investigating other groups (such as different age cutoffs or dollar thresholds), and so the public microdata files allow continued research where it otherwise might end. The website http://asdfree.com/ offers obsessively detailed instructions to analyze a wide variety of publicly available datasets using the R language. This resource generally contains three core components, each with step-by-step instructions: (1) Download automation or data acquisition; (2) Helpfully-noted analysis examples; (3) Replication of published estimates to prove correct methodology.
Provide a detailed outline of the entire program. Describe what will occur during each segment. DO NOT INCLUDE chapters of an upcoming book. Provide a description of the target audience.
Researchers interested in conducting original research with the extremely rich and varied amount of public data available. This course could be of interest to anyone hoping to learn more about quantitative research, economics, public policy, demography, or any other field reliant on social statistics to better understand individuals and businesses.
Both beginner and advanced R users are welcome; however, some understanding of R syntax will be helpful, depending on the complexity of the microdata chosen. The instructor will attempt to guide participants toward datasets appropriate for their coding skill level.
- The NHANES mobile examination center performs in-person dental examinations and blood labs on a representative sample of the country, but not a simple random sample of the country.
- In-person interviewers administer the Consumer Expenditure Survey and the American Time Use Survey by instructing respondents how to record every expenditure in a ledger and every ten minutes in a journal, respectively. Both result in representative samples, but neither is a simple random sample.
- The American Housing Survey visits each selected housing unit, collecting information with Computer-Assisted Personal Interviewing on both occupied and unoccupied housing units. Again, this allows for a dataset that generalizes to the country without being a simple random sample.
Mobile Examination Center: https://blogs.cdc.gov/nchs/2013/04/17/164/
over-sampling of the wealthy: https://web.archive.org/web/20240620102045/https://www.bis.org/ifc/publ/ifcb28zzn.pdf
Fundamentally, a complex sample survey aims to save money on the transportation costs of its interviewers by sampling geographies first and then people (or businesses or structures) within those geographies. So instead of sampling individuals nationwide, a survey administrator samples twenty towns and cities across the country and then, within those geographic areas, samples multiple individuals. Before any sampling occurs, everyone nationwide still has the same probability of being sampled, but once the first stage occurs (when geographies are sampled), the residents of the sampled geographies have a much higher probability of inclusion and everyone else's inclusion probability goes to zero. Now, instead of sending an in-person survey team to ten thousand different interview locations across the country, the team only needs to travel to twenty towns and cities. Suddenly, the interviewer transportation budget looks much nicer.
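The two-stage idea can be sketched with a small base R simulation. This is only an illustration under made-up assumptions: the population of 1,000 equally sized towns, the per-town sample of 500 residents, and every object name are hypothetical and chosen to match the "twenty towns, ten thousand interviews" example, not taken from any real survey.

```r
set.seed(2024)

# Hypothetical population: 1,000 towns with 1,000 residents each
n_towns <- 1000
town_size <- 1000
population <- data.frame(
  town = rep(seq_len(n_towns), each = town_size),
  person = rep(seq_len(town_size), times = n_towns)
)

# Stage one: sample 20 towns
sampled_towns <- sample(seq_len(n_towns), size = 20)

# Stage two: sample 500 residents within each sampled town
in_sampled_towns <- population[population$town %in% sampled_towns, ]
two_stage <- do.call(
  rbind,
  lapply(split(in_sampled_towns, in_sampled_towns$town), function(town_df) {
    town_df[sample(nrow(town_df), size = 500), ]
  })
)

# A simple random sample of the same size, for comparison
srs <- population[sample(nrow(population), size = nrow(two_stage)), ]

# Number of distinct towns the interviewers would need to visit
towns_two_stage <- length(unique(two_stage$town))
towns_srs <- length(unique(srs$town))

# Before any sampling, every resident has the same inclusion probability
overall_prob <- (20 / n_towns) * (500 / town_size)

print(c(two_stage = towns_two_stage, srs = towns_srs))
```

Both samples contain ten thousand people, but the two-stage sample confines the interviews to the twenty sampled towns, while the simple random sample scatters them across many more.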
?svymean
Example code to show that the interval from confint() incorrectly narrows when the clustering variable is removed
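A minimal sketch of that comparison, assuming the `survey` package is installed. It uses the package's bundled `apiclus1` example data (a cluster sample of California school districts) rather than any dataset from the course. The two designs differ only in whether the district identifier `dnum` is declared as the cluster; the object names are illustrative.

```r
library(survey)

# Example data shipped with the survey package
data(api)

# Correct design: districts (dnum) declared as the sampling clusters
with_clusters <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1)

# Incorrect design: clustering variable dropped, same weights
without_clusters <- svydesign(id = ~1, weights = ~pw, data = apiclus1)

# Same weighted mean either way; only the variance estimate changes
mean_with <- svymean(~api00, with_clusters)
mean_without <- svymean(~api00, without_clusters)

ci_with <- confint(mean_with)
ci_without <- confint(mean_without)

# Width of each 95% confidence interval
width_with <- unname(ci_with[, 2] - ci_with[, 1])
width_without <- unname(ci_without[, 2] - ci_without[, 1])

print(ci_with)
print(ci_without)
```

The point estimate is the same under both designs because the weights are unchanged. The interval from the design without clusters is narrower, which overstates the precision of a sample in which schools from the same district resemble one another.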
Course participants will discuss their publication history or experience with publicly available datasets, and the research questions they have answered (or would like to answer) with them. (The instructor will take notes for the post-break discussion.)
PDF pages 5 thru 8: https://academic.oup.com/jssam/article/11/4/743/7136601?login=false
(The participants might collectively agree on which dataset they have most familiarity with or interest in.)
Each dataset presented on asdfree includes three major components: 1. Download automation or data acquisition; 2. Helpfully-noted analysis examples; 3. Replication of published estimates to prove correct methodology. We will walk through each of these segments for one dataset, with participants testing out the same R code on their own laptops. Because the entries share a highly similar structure, participants should quickly see that once they can get started with any one entry, it is quite simple to apply the same knowledge to all of the others.
(instructor will use notes from pre-break discussion)
As an example, if a participant mentions interest in health insurance coverage in the United States, we might review the strengths and weaknesses of different surveys on the topic. SIPP interviews individuals every year for multiple years, and asks about every single month of coverage. CPS interviews individuals with the full ASEC one time, asking about health insurance at the point of interview and also monthly through the prior year. CPS also asks many labor force questions, and is representative at the state level. NHIS asks about health insurance only at the single interview, but also asks many health status and health behavior questions. BRFSS asks only a single question about health insurance, but it has a large sample size, even at the state level.
Figure #1: https://www.kff.org/medicare/issue-brief/retiree-health-benefits-going-going-nearly-gone
Selecting any dataset from the list of available datasets on asdfree.com, participants will follow a single entry from start to finish.
(a) Learning outcomes (performance objectives): The proposal must include a clear and concise statement of intended learning outcomes for the course. Learning outcomes are statements that identify what knowledge, skills and/or attitudes attendees are expected to accomplish/demonstrate as a result of the course. The attainment of the stated learning outcomes will be assessed as part of the CE Course evaluation process at the conclusion of the course so it is imperative that the presenter teach to these objectives.
Participants will ideally complete this course with the ability to explain why governments and other organizations fund complex sample surveys rather than drawing simple random samples. Participants will also feel confident reproducing published statistics directly from publicly available microdata. And because the entries on http://asdfree.com/ share a highly similar structure, participants should quickly see that once they can get started with any one entry, it is quite simple to apply the same knowledge to all of the others.
(b) Content and instructional methods: The presenter must include a description of course content and instructional strategies based on the learning outcomes (performance objectives).
This hands-on course will begin with a PowerPoint-free discussion of the motivations behind survey methodology, followed by a mix of discussion of the wide breadth of available public data and time to test out this three-step syntax with its author present and available for questions.
Paragraph highlighting instructor’s background and experience with subject. DO NOT include resumes and/or curriculum vitae.
Anthony Damico is an independent consultant who conducts data analysis for health care policy research. He has published in peer-reviewed policy and methods journals using the R, SAS, Stata, and SUDAAN statistical programming languages. Prior to becoming an independent consultant, he was with the Kaiser Family Foundation in Washington, D.C. Anthony holds a Bachelor’s degree in Mathematics from Oberlin College and a Master’s in Health Policy from Johns Hopkins University.
Each presentation will be provided with one screen, one data projector and one lavaliere microphone. A flip chart and second screen are available at no extra charge upon request. Presenters desiring additional AV equipment are responsible for additional equipment expense. Details are available upon request.
The presenter will need only a projector and wifi for a laptop.
Participants will need a laptop with wifi and R installed.