Testing sampling weights on lsasim
Thank you for your help in testing lsasim, an R package for simulating Large Scale Assessment (LSA) data.
This document is intended to aid the development of the next stable version of lsasim (possibly numbered 2.1.0). The current stable version of lsasim (2.0.0) is available on CRAN and on GitHub.
I invite you to read the next subsections of this introduction—available on the other tabs up there—even if you’re familiar with lsasim. The final subsection (How to contribute to testing) is especially useful in showing you how to give feedback on your tests.
This document should be a standalone guide to working with the cluster_gen function of lsasim for testing purposes. However, it is still a work in progress, and feedback on missing or incorrect information is welcome. In addition, the help file of cluster_gen may be useful in understanding how the function works. You can access the function documentation by running help("cluster_gen", "lsasim") or ?lsasim::cluster_gen in the R terminal.
The table below contains information about the three stable releases of lsasim. The innovations under testing in this document will be part of the next stable release.
Version | CRAN release date | Innovations |
---|---|---|
1.0.0 | 2017-02-23 | Simulates cognitive and background test data |
1.0.1 | 2017-05-10 | Bug fixes |
2.0.0 | 2019-09-12 | Expanded functionality of background questionnaires |
We appreciate any help in the development of lsasim. To make the best of everyone’s time, though, it is desirable that the tester has:
- Access to R version 3.6.0 or newer
- Permission to install R packages on their working computer
- Knowledge of sampling weights, especially:
  - How to calculate sampling weights
  - How those weights are usually calculated in LSAs
  - How such data is usually displayed to analysts of LSA datasets
In order to keep things organized (and make sure your contribution gets officially recorded), bugs should ideally be reported to https://github.com/tmatta/lsasim/issues/. This requires you to have a (free) GitHub account. If you have found several examples of the same issue, please report them as one issue.
As an alternative to using our GitHub issues tracker, you can send an e-mail to the package maintainer.
- Replicate weights
- Within and between group correlation
The development version of lsasim can be downloaded from GitHub by issuing the following commands on your R console.
First, install the remotes package. You can skip this step if remotes is already installed on your machine. If you don’t know whether remotes is installed, try running library(remotes) and see if there are any errors.
install.packages("remotes")
If the installation goes well, you should see this at the bottom of the output:
## * DONE (remotes)
##
## The downloaded source packages are in
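Instead of calling library(remotes) and watching for an error, the check can be made explicit. The snippet below is a minimal sketch in base R; the helper name needs_install is hypothetical and is not part of remotes or lsasim.

```r
# Hypothetical helper: TRUE when `pkg` cannot be loaded from the current library paths.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Only suggest an installation when remotes is actually missing.
if (needs_install("remotes")) {
  message("remotes is missing; run install.packages(\"remotes\")")
} else {
  message("remotes is already installed")
}
```

requireNamespace does not attach the package to the search path, so this check leaves the R session unchanged.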
Next, we use the install_github function to install the development version of lsasim locally. There are actually two versions to choose from:
- The recommended version, 2.0.0.9103 (older, but more stable and with results comparable to this document)
- The bleeding edge version (newer, but less stable and with results that will differ from this document even with equal seeds)
To install the recommended version, please run the following on your R terminal:
remotes::install_github("tmatta/lsasim", ref="v2.0.0.9103")
The bleeding edge version (> 2.0.0.9103) is available by simply changing the ref argument:
remotes::install_github("tmatta/lsasim", ref="develop")
Note: Installing the version from the develop branch will result in more features but results that are different from the ones shown in this document. If you would like to reproduce the results shown here, you must install version 2.0.0.9103.
After issuing install_github, R will tell the user it is checking, preparing, executing and testing the installation of lsasim. The most important output is the final message, which should read “DONE (lsasim)”. It could also read something like “Skipping install of ‘lsasim’ from a github remote, the SHA1 (…) has not changed since last install”, which means that you already have the latest version. In that case, you can force the installation by including force=TRUE as an argument to install_github. This can be useful when a new version is available but R fails to recognize the difference between that version and the one installed on your computer.
Finally, we load the installed lsasim package into our current R session and check the build version. Your output of packageVersion should match the output below (boxes containing lines beginning with double hashes, ##, are the expected output).
library(lsasim)
packageVersion("lsasim")
## [1] '2.0.0.9103'
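The comparison with the expected version can also be done programmatically rather than by eye. The sketch below uses only base R version objects and the version strings quoted in this document; the helper is_recommended is a hypothetical name introduced for illustration.

```r
# Version strings taken from this document.
recommended <- package_version("2.0.0.9103")
cran_stable <- package_version("2.0.0")

# Hypothetical helper: does a version (string or version object) match the
# recommended build?
is_recommended <- function(installed) {
  as.package_version(installed) == recommended
}

is_recommended("2.0.0.9103")  # TRUE
is_recommended("2.0.0")       # FALSE
recommended > cran_stable     # TRUE: the development build sorts after the CRAN release
```

With lsasim installed, the same helper could be called as is_recommended(packageVersion("lsasim")).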
Once lsasim is installed and loaded, you are ready to test it. Click the next tab to continue.
This test concerns the generation of sampling weights for background questionnaire data generated in a hierarchical structure. Each hierarchical level is composed of clusters, which can be sampled from a population using either Simple Random Sampling (SRS) or with Probabilities Proportional to Size (PPS).
Basic background questionnaire data generation is handled by the function questionnaire_gen, present in lsasim since its first release. Cluster background data generation works through a function called cluster_gen, which calls questionnaire_gen on each cluster level.
We will start with a simple example, where 2 schools and 10 students in each school are selected. This structure is represented by the following vector:
n1 <- c(2, 10)
The structure can be checked with the function draw_cluster_structure, which creates a visual representation of the hierarchical tree in the R console:
draw_cluster_structure(n1) # pay no mind to the "NULL" printed at the end
## school1 (10 students)
## school2 (10 students)
## NULL
It may not look like much now, but when more complex scenarios start showing up, this visual representation can really help one understand what is going on!
In order to generate clustered responses for n1, we call the cluster_gen function, which is the star of this test. The first argument of cluster_gen is called n and corresponds to the number of sampled observations on each level. Two ways of calling cluster_gen with n = n1 are cluster_gen(n = n1) and cluster_gen(n1), where omitting n = just tells R to assume that the order of the arguments you are passing is the same one the function expects. To see the argument order that cluster_gen expects, see the “Usage” section of the ?cluster_gen help page.
The set.seed function we call right before cluster_gen is there to make sure that your data will match the output below. If that command is dropped, the test results will change each time cluster_gen is called.
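The effect of set.seed can be seen in isolation with a minimal base R sketch that does not involve lsasim at all; rnorm stands in here for any function that draws random numbers.

```r
set.seed(1234)
first <- rnorm(3)

set.seed(1234)       # same seed again
second <- rnorm(3)

third <- rnorm(3)    # no reseeding: the generator simply continues

identical(first, second)  # TRUE
identical(first, third)   # FALSE
```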
set.seed(1234)
cluster_gen(n1)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for schools
## Total respondents: 20 (10 + 10)
## school1 (10 students)
## school2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (20)
## $school
## $school[[1]]
## subject q1 q2 q3 q4 q5 q6 school.weight within.school.weight final.student.weight
## 1 1 0.005006950 2 2 1 2 2 1 1 1
## 2 2 -0.037630263 1 1 1 2 1 1 1 1
## 3 3 0.723976061 2 2 1 2 1 1 1 1
## 4 4 -0.496738863 2 2 1 2 1 1 1 1
## 5 5 0.011395161 2 1 1 2 1 1 1 1
## 6 6 0.009859946 1 2 2 1 2 1 1 1
## 7 7 0.678271423 1 1 1 2 2 1 1 1
## 8 8 1.029563029 2 2 1 2 1 1 1 1
## 9 9 -1.729528504 1 2 2 2 1 1 1 1
## 10 10 -2.204348095 1 1 2 2 2 1 1 1
## uniqueID
## 1 student1_school1
## 2 student2_school1
## 3 student3_school1
## 4 student4_school1
## 5 student5_school1
## 6 student6_school1
## 7 student7_school1
## 8 student8_school1
## 9 student9_school1
## 10 student10_school1
##
## $school[[2]]
## subject q1 q2 q3 q4 q5 q6 school.weight within.school.weight final.student.weight
## 1 1 -0.242559707 1 1 1 1 2 1 1 1
## 2 2 2.187119161 1 2 2 2 2 1 1 1
## 3 3 -0.581727450 1 1 2 1 1 1 1 1
## 4 4 0.700080227 2 1 2 1 1 1 1 1
## 5 5 1.492176579 1 2 1 1 1 1 1 1
## 6 6 0.526553441 1 1 2 1 2 1 1 1
## 7 7 1.037772101 2 2 2 2 2 1 1 1
## 8 8 -1.860716351 1 1 2 1 1 1 1 1
## 9 9 -0.426574240 2 1 2 1 1 1 1 1
## 10 10 -0.001137045 1 1 2 1 1 1 1 1
## uniqueID
## 1 student1_school2
## 2 student2_school2
## 3 student3_school2
## 4 student4_school2
## 5 student5_school2
## 6 student6_school2
## 7 student7_school2
## 8 student8_school2
## 9 student9_school2
## 10 student10_school2
Notice how cluster_gen prints the cluster structure as well as other important information before showing the background data itself. This can be disabled by inserting verbose = FALSE into the cluster_gen call.
By default, cluster_gen will determine the number of continuous (X) and categorical (W) background questions. In this case, X = {X1} (represented in the output by q1) and W = {W1, …, W5} (represented in the output by q2 through q6). This can be customized, and for the sake of simplicity, we will have only one categorical background variable and no continuous variables. This time, the output will also be assigned to data, which is finally printed for us to see what it looks like.
set.seed(2345)
data <- cluster_gen(n1, n_X = 0, n_W = list(1))
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for schools
## Total respondents: 20 (10 + 10)
## school1 (10 students)
## school2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (20)
data
## $school
## $school[[1]]
## subject q1 school.weight within.school.weight final.student.weight uniqueID
## 1 1 3 1 1 1 student1_school1
## 2 2 3 1 1 1 student2_school1
## 3 3 1 1 1 1 student3_school1
## 4 4 1 1 1 1 student4_school1
## 5 5 3 1 1 1 student5_school1
## 6 6 3 1 1 1 student6_school1
## 7 7 3 1 1 1 student7_school1
## 8 8 4 1 1 1 student8_school1
## 9 9 4 1 1 1 student9_school1
## 10 10 2 1 1 1 student10_school1
##
## $school[[2]]
## subject q1 school.weight within.school.weight final.student.weight uniqueID
## 1 1 1 1 1 1 student1_school2
## 2 2 4 1 1 1 student2_school2
## 3 3 1 1 1 1 student3_school2
## 4 4 4 1 1 1 student4_school2
## 5 5 4 1 1 1 student5_school2
## 6 6 2 1 1 1 student6_school2
## 7 7 4 1 1 1 student7_school2
## 8 8 2 1 1 1 student8_school2
## 9 9 3 1 1 1 student9_school2
## 10 10 4 1 1 1 student10_school2
Notice how n_W is defined as a list where each element (only one in this case) corresponds to the number of variables at a particular level. This is done so that n_W can support more complex calls such as n_W = list(list(2, 2), 5), which corresponds to telling cluster_gen that the first level will have two binary categorical variables and the second level will have 5 categorical variables (the number of categories being randomly determined).
Let us now consider a second hierarchical structure, composed of a cluster of 2 schools which are divided into 3 classes each; each class contains 5 students:
n2 <- c(2, 3, 5)
set.seed(2345)
cluster_gen(n2)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for schools, classes
## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)
## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.teacher.weight should add up to the number of teachers in the population (6)
## - Calculating SRS weights at the class level
## class.weight should add up to the number of classes in the population (6, counting once per class)
## $school
## $school[[1]]
## subject q1 q2 q3 school.weight within.school.weight final.teacher.weight uniqueID
## 1 1 -0.16566248 1 2 1 1 1 class1_school1
## 2 2 -0.88234450 1 1 1 1 1 class2_school1
## 3 3 -0.01332182 2 2 1 1 1 class3_school1
##
## $school[[2]]
## subject q1 q2 q3 school.weight within.school.weight final.teacher.weight uniqueID
## 1 1 0.07879383 1 1 1 1 1 class1_school2
## 2 2 -0.88209970 2 2 1 1 1 class2_school2
## 3 3 0.89263571 2 1 1 1 1 class3_school2
##
##
## $class
## $class[[1]]
## subject q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1 1 2.1527725 2 1 2 1 2 1 1 1
## 2 2 0.5173488 1 2 2 1 2 1 1 1
## 3 3 -1.2601526 2 1 2 2 2 1 1 1
## 4 4 0.4095549 1 1 1 2 1 1 1 1
## 5 5 -0.3379999 2 1 2 1 1 1 1 1
## uniqueID
## 1 student1_class1_school1
## 2 student2_class1_school1
## 3 student3_class1_school1
## 4 student4_class1_school1
## 5 student5_class1_school1
##
## $class[[2]]
## subject q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1 1 -2.15015256 1 1 2 2 1 1 1 1
## 2 2 1.63216373 2 1 1 1 2 1 1 1
## 3 3 0.47573673 2 2 1 2 2 1 1 1
## 4 4 -1.10436289 1 2 2 2 2 1 1 1
## 5 5 -0.05614962 2 1 1 2 1 1 1 1
## uniqueID
## 1 student1_class2_school1
## 2 student2_class2_school1
## 3 student3_class2_school1
## 4 student4_class2_school1
## 5 student5_class2_school1
##
## $class[[3]]
## subject q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1 1 1.0676829 1 1 1 1 1 1 1 1
## 2 2 -1.0448467 2 1 1 2 2 1 1 1
## 3 3 0.7418229 1 1 2 1 1 1 1 1
## 4 4 -0.2396375 2 2 1 1 1 1 1 1
## 5 5 0.5653863 1 2 2 1 1 1 1 1
## uniqueID
## 1 student1_class3_school1
## 2 student2_class3_school1
## 3 student3_class3_school1
## 4 student4_class3_school1
## 5 student5_class3_school1
##
## $class[[4]]
## subject q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1 1 -0.31211831 2 2 1 1 2 1 1 1
## 2 2 -1.06488440 2 2 1 2 1 1 1 1
## 3 3 0.06095831 2 1 1 1 2 1 1 1
## 4 4 0.74802298 1 2 1 1 2 1 1 1
## 5 5 2.74479129 1 1 1 1 2 1 1 1
## uniqueID
## 1 student1_class1_school2
## 2 student2_class1_school2
## 3 student3_class1_school2
## 4 student4_class1_school2
## 5 student5_class1_school2
##
## $class[[5]]
## subject q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1 1 0.6141850 2 1 1 1 1 1 1 1
## 2 2 1.8841624 1 1 2 1 2 1 1 1
## 3 3 -0.2516623 2 2 2 2 1 1 1 1
## 4 4 0.7501333 2 1 2 2 2 1 1 1
## 5 5 0.4777128 2 1 1 1 2 1 1 1
## uniqueID
## 1 student1_class2_school2
## 2 student2_class2_school2
## 3 student3_class2_school2
## 4 student4_class2_school2
## 5 student5_class2_school2
##
## $class[[6]]
## subject q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1 1 -0.4050786 2 1 1 2 1 1 1 1
## 2 2 0.4307551 2 2 1 2 1 1 1 1
## 3 3 -0.3358192 2 1 2 1 2 1 1 1
## 4 4 -0.4681827 1 2 1 2 2 1 1 1
## 5 5 0.5989933 1 1 2 2 2 1 1 1
## uniqueID
## 1 student1_class3_school2
## 2 student2_class3_school2
## 3 student3_class3_school2
## 4 student4_class3_school2
## 5 student5_class3_school2
Notice how the output above contains 2 school questionnaires with 3 answers each (from the teachers who answered for the classes) as well as 2 × 3 = 6 class questionnaires, each of which was answered by the 5 students in that class. The teacher questionnaires all share the same structure, with one X and 2 W variables, and the student questionnaires also share a single structure, with one X and 5 Ws. By default, the means of the continuous variables are the same (0), and the proportions of the categorical variables are randomly determined.
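The counts in the previous paragraph follow directly from the structure vector. The arithmetic can be sketched in base R as follows; the names given to the elements of n2 are added here only for readability.

```r
n2 <- c(school = 2, class = 3, student = 5)

school_questionnaires <- n2[["school"]]                  # 2 data frames under $school
rows_per_school_q     <- n2[["class"]]                   # 3 rows (one per class) in each
class_questionnaires  <- n2[["school"]] * n2[["class"]]  # 6 data frames under $class
rows_per_class_q      <- n2[["student"]]                 # 5 rows (one per student) in each

total_respondents <- school_questionnaires * rows_per_school_q +
  class_questionnaires * rows_per_class_q
total_respondents  # 36
```

The result matches the “Total respondents: 36” line printed by cluster_gen(n2).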
n1 and n2 are unnamed vectors, so cluster_gen determined the names of the clusters itself using a pre-built sequence. Nonetheless, the user is free to use whatever labels they want. This can be done either by passing names to the n argument or by passing character vectors to the cluster_labels and resp_labels arguments. See the examples below:
cluster_gen(n = c(a = 2, b = 3))
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for as, bs
## Total respondents: 6 (3 + 3)
## a1 (3 bs)
## a2 (3 bs)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating SRS weights at the a level
## a.weight should add up to the number of as in the population (2, counting once per a)
## $a
## $a[[1]]
## subject q1 q2 q3 q4 q5 q6 q7 q8 q9 a.weight within.a.weight final.b.weight
## 1 1 3.9427498 -3.5449571 1 1 1 1 2 1 2 1 1 1
## 2 2 -0.5029523 0.3083131 1 2 1 1 2 2 2 1 1 1
## 3 3 -0.6693996 -0.6230736 1 2 1 1 1 1 1 1 1 1
## uniqueID
## 1 b1_a1
## 2 b2_a1
## 3 b3_a1
##
## $a[[2]]
## subject q1 q2 q3 q4 q5 q6 q7 q8 q9 a.weight within.a.weight final.b.weight uniqueID
## 1 1 0.2319406 1.4823528 2 2 2 1 2 1 2 1 1 1 b1_a2
## 2 2 1.9505050 0.2887602 1 1 1 2 2 1 2 1 1 1 b2_a2
## 3 3 0.4564811 0.5387870 2 1 2 2 2 1 2 1 1 1 b3_a2
cluster_gen(n = c(2, 3), cluster_labels = c("group"), resp_labels = c("person"))
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for groups
## Total respondents: 6 (3 + 3)
## group1 (3 persons)
## group2 (3 persons)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating SRS weights at the group level
## group.weight should add up to the number of groups in the population (2, counting once per group)
## $group
## $group[[1]]
## subject q1 q2 q3 q4 group.weight within.group.weight final.person.weight uniqueID
## 1 1 -0.3544601 1 1 2 1 1 1 person1_group1
## 2 2 0.5769583 1 1 1 1 1 1 person2_group1
## 3 3 -0.6274591 1 1 2 1 1 1 person3_group1
##
## $group[[2]]
## subject q1 q2 q3 q4 group.weight within.group.weight final.person.weight uniqueID
## 1 1 0.5236884 1 2 2 1 1 1 person1_group2
## 2 2 -1.0585904 2 2 1 1 1 1 person2_group2
## 3 3 -0.0260521 1 2 2 1 1 1 person3_group2
Your data should vary from the output above (due to the lack of a fixed seed), but the labels and the hierarchical structure should be the same.
As we said before, n corresponds to the number of sampled observations on each level. This means that each level will have the same number of sublevels, in what one could call a symmetric hierarchical structure. Asymmetric structures can also be defined, and they use the following syntax (the list below is named for convenience, but it may also be unnamed):
n3 <- list(sch = 3, cls = c(2, 1, 2), stu = c(5, 4, 2, 3, 2))
The list above corresponds to 3 schools containing 2, 1 and 2 classes, respectively. These 5 classes contain 5, 4, 2, 3 and 2 students, in that order.
As you can imagine, this sort of structure can quickly become hard to picture. This is when the draw_cluster_structure function can be helpful:
draw_cluster_structure(n3)
## sch1
## ├─sch1_cls1 (5 stus)
## └─sch1_cls2 (4 stus)
## sch2
## └─sch2_cls1 (2 stus)
## sch3
## ├─sch3_cls1 (3 stus)
## └─sch3_cls2 (2 stus)
## NULL
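An asymmetric list such as n3 is only coherent if each level has one entry per unit of the level above it. The base R sketch below checks that and derives the number of students per school; the variable names school_of_class and students_per_school are introduced here for illustration.

```r
n3 <- list(sch = 3, cls = c(2, 1, 2), stu = c(5, 4, 2, 3, 2))

# Each level must have one entry per unit of the level above it.
stopifnot(length(n3$cls) == n3$sch)       # one class count per school
stopifnot(length(n3$stu) == sum(n3$cls))  # one student count per class

# Map every class to its school, then add up the students per school.
school_of_class <- rep(seq_len(n3$sch), times = n3$cls)
students_per_school <- tapply(n3$stu, school_of_class, sum)
students_per_school  # 9, 2 and 5 students
```

These totals agree with the tree drawn above (5 + 4, 2, and 3 + 2).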
As an exercise, try calling cluster_gen(n3) and see if the number of responses corresponds to your expectations.
n can also be passed as a range of values, randomly determined by the function. For example, if we set
n4 <- list(school = 4, class = ranges(5, 10), student = ranges(20, 50))
Then, once we call cluster_gen on n4, we are telling R that each of the 4 schools has between 5 and 10 classes, and each class has between 20 and 50 students. Let us use draw_cluster_structure to see what the generated structure looks like:
set.seed(6789)
draw_cluster_structure(n4)
## school1
## ├─school1_class1 (46 students)
## ├─school1_class2 (31 students)
## ├─school1_class3 (38 students)
## ├─school1_class4 (37 students)
## ├─school1_class5 (34 students)
## ├─school1_class6 (34 students)
## ├─school1_class7 (26 students)
## ├─school1_class8 (48 students)
## └─school1_class9 (45 students)
## school2
## ├─school2_class1 (40 students)
## ├─school2_class2 (42 students)
## ├─school2_class3 (30 students)
## ├─school2_class4 (24 students)
## ├─school2_class5 (22 students)
## └─school2_class6 (48 students)
## school3
## ├─school3_class1 (32 students)
## ├─school3_class2 (35 students)
## ├─school3_class3 (41 students)
## ├─school3_class4 (45 students)
## ├─school3_class5 (35 students)
## ├─school3_class6 (21 students)
## ├─school3_class7 (29 students)
## └─school3_class8 (26 students)
## school4
## ├─school4_class1 (42 students)
## ├─school4_class2 (22 students)
## ├─school4_class3 (48 students)
## ├─school4_class4 (37 students)
## ├─school4_class5 (27 students)
## ├─school4_class6 (41 students)
## ├─school4_class7 (42 students)
## ├─school4_class8 (40 students)
## ├─school4_class9 (37 students)
## └─school4_class10 (47 students)
## NULL
So far, we have only worked with the sampled elements, which are passed as the first argument of cluster_gen. By default, cluster_gen assumes N = n, meaning that n actually corresponds to a census (where all the elements of the population are selected). In practice, though, this is rarely the case, and cluster_gen can receive other values to indicate the population structure under the N argument. See the examples below:
n5 <- c(3, 4)
N5 <- 2
This is the most basic way to determine a different population size: by passing a single number to N. In that case, N will be interpreted as a multiplier of n. In other words, the syntax above basically says that the sample is composed of 3 schools and 4 students in each school, whereas the population is twice as large at all levels. This is all explicit when cluster_gen is called (see the hierarchical structures printed below):
data5 <- cluster_gen(n = n5, N = N5)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Population structure
## school1 (8 students)
## school2 (8 students)
## school3 (8 students)
## school4 (8 students)
## school5 (8 students)
## school6 (8 students)
## Sampled structure
## Generating questionnaires for schools
## Total respondents: 12 (4 + 4 + 4)
## school1 (4 students)
## school2 (4 students)
## school3 (4 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (48)
In the example above, the questionnaire answers are stored in data5, which is why they do not appear in the R terminal. The user messages are still printed, as they are not stored in data5.
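The multiplier interpretation of a scalar N can be reproduced with plain arithmetic. The sketch below is base R only and does not call lsasim; N5_expanded is a name introduced here for the implied population structure.

```r
n5 <- c(3, 4)  # sampled: 3 schools, 4 students per school
N5 <- 2        # scalar N, read as a multiplier of n

N5_expanded <- n5 * N5                    # 6 schools, 8 students per school
population_students <- prod(N5_expanded)
sampled_students    <- prod(n5)

N5_expanded          # 6 8
population_students  # 48
sampled_students     # 12
```

Both totals match the messages printed for data5: 12 respondents in the sample and 48 students in the population.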
The population structure can also be explicitly defined:
n6 <- c(3, 4)
N6 <- c(4, 5)
data6 <- cluster_gen(n = n6, N = N6)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Population structure
## school1 (5 students)
## school2 (5 students)
## school3 (5 students)
## school4 (5 students)
## Sampled structure
## Generating questionnaires for schools
## Total respondents: 12 (4 + 4 + 4)
## school1 (4 students)
## school2 (4 students)
## school3 (4 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (20)
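To see why the final weights should add up to 20 here, the weights can be worked out by hand. The sketch below is not lsasim's internal code: it assumes textbook inverse-probability weights for a two-level design in which every school has the same size, a case where PPS and SRS selection probabilities coincide. The helper two_stage_weights is hypothetical.

```r
# Hypothetical helper (not part of lsasim): inverse-probability weights for a
# two-level design where every school has the same size.
two_stage_weights <- function(n, N) {
  school_weight <- N[1] / n[1]  # schools in population / schools sampled
  within_weight <- N[2] / n[2]  # students per school / students sampled per school
  c(school = school_weight,
    within_school = within_weight,
    final_student = school_weight * within_weight)
}

w6 <- two_stage_weights(n = c(3, 4), N = c(4, 5))
w6

# Summing the final weight over the 3 * 4 sampled students recovers the
# population size.
sum(rep(w6[["final_student"]], times = 3 * 4))  # 20
```

Applied to n = c(3, 4) and N = c(2, 3), the same formula gives 0.6666667, 0.75 and 0.5, which is consistent with the weights cluster_gen prints for that case later in this document.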
Just like n, N can also be defined as a list:
n7 <- list(3, c(4, 2, 3))
N7 <- list(10, c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))
data7 <- cluster_gen(n = n7, N = N7)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Population structure
## school1 (10 students)
## school2 (11 students)
## school3 (12 students)
## school4 (13 students)
## school5 (14 students)
## school6 (15 students)
## school7 (16 students)
## school8 (17 students)
## school9 (18 students)
## school10 (19 students)
## Sampled structure
## Generating questionnaires for schools
## Total respondents: 9 (4 + 2 + 3)
## school1 (4 students)
## school2 (2 students)
## school3 (3 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (145)
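When both n and N are lists, it helps to confirm that each second element has one entry per school and that the totals agree with the printed messages. A minimal base R sketch of that check:

```r
n7 <- list(3, c(4, 2, 3))
N7 <- list(10, c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))

# One size per population school, one sample size per sampled school.
stopifnot(length(N7[[2]]) == N7[[1]])
stopifnot(length(n7[[2]]) == n7[[1]])

sum(N7[[2]])  # 145 students in the population
sum(n7[[2]])  # 9 sampled students
```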
Mixing ranges for n and explicit lists for N is also possible.
set.seed(345)
n8 <- list(3, ranges(5, 10))
N8 <- list(10, c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))
data8 <- cluster_gen(n = n8, N = N8)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Population structure
## school1 (10 students)
## school2 (11 students)
## school3 (12 students)
## school4 (13 students)
## school5 (14 students)
## school6 (15 students)
## school7 (16 students)
## school8 (17 students)
## school9 (18 students)
## school10 (19 students)
## Sampled structure
## Generating questionnaires for schools
## Total respondents: 25 (9 + 7 + 9)
## school1 (9 students)
## school2 (7 students)
## school3 (9 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (145)
n9 <- list(3, ranges(5, 10))
N9 <- list(10, ranges(50, 100))
data9 <- cluster_gen(n = n9, N = N9)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Population structure
## school1 (79 students)
## school2 (65 students)
## school3 (59 students)
## school4 (59 students)
## school5 (64 students)
## school6 (98 students)
## school7 (82 students)
## school8 (63 students)
## school9 (97 students)
## school10 (56 students)
## Sampled structure
## Generating questionnaires for schools
## Total respondents: 23 (5 + 8 + 10)
## school1 (5 students)
## school2 (8 students)
## school3 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (722)
There are other structure combinations for n and N which give output that could be confusing for a user. One example, illustrated below, is when the population is smaller than the sample. If you find other misbehaving or otherwise noteworthy examples, please report them.
set.seed(345)
n10 <- c(3, 4)
N10 <- c(2, 3)
cluster_gen(n = n10, N = N10) # notice the missing weights
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Population structure
## school1 (3 students)
## school2 (3 students)
## Sampled structure
## Generating questionnaires for schools
## Total respondents: 12 (4 + 4 + 4)
## school1 (4 students)
## school2 (4 students)
## school3 (4 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the school level
## final.student.weight should add up to the number of students in the population (6)
## $school
## $school[[1]]
## subject q1 q2 q3 q4 q5 school.weight within.school.weight final.student.weight
## 1 1 1.7863951 2 2 1 2 0.6666667 0.75 0.5
## 2 2 0.1956436 1 2 1 2 0.6666667 0.75 0.5
## 3 3 0.7482214 2 2 1 2 0.6666667 0.75 0.5
## 4 4 -0.2938856 1 2 1 1 0.6666667 0.75 0.5
## uniqueID
## 1 student1_school1
## 2 student2_school1
## 3 student3_school1
## 4 student4_school1
##
## $school[[2]]
## subject q1 q2 q3 q4 q5 school.weight within.school.weight final.student.weight
## 1 1 -0.4178153 2 2 1 2 0.6666667 0.75 0.5
## 2 2 -1.2977486 2 1 1 2 0.6666667 0.75 0.5
## 3 3 -1.0928678 2 1 1 2 0.6666667 0.75 0.5
## 4 4 -0.5008169 2 2 1 2 0.6666667 0.75 0.5
## uniqueID
## 1 student1_school2
## 2 student2_school2
## 3 student3_school2
## 4 student4_school2
##
## $school[[3]]
## subject q1 q2 q3 q4 q5 school.weight within.school.weight final.student.weight
## 1 1 -1.8998894 2 2 2 1 NA NA NA
## 2 2 0.2750867 1 2 2 1 NA NA NA
## 3 3 1.8452385 1 1 1 2 NA NA NA
## 4 4 -0.2048277 1 2 1 2 NA NA NA
## uniqueID
## 1 student1_school3
## 2 student2_school3
## 3 student3_school3
## 4 student4_school3
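Missing weights like the ones above are easy to overlook in longer outputs, so a programmatic check can help. The sketch below works on weights transcribed by hand from the output above instead of calling lsasim; the object final_weights is introduced here for illustration.

```r
# Final student weights transcribed from the output above
# (4 students per sampled school).
final_weights <- list(
  school1 = rep(0.5, 4),
  school2 = rep(0.5, 4),
  school3 = rep(NA_real_, 4)
)

# Which sampled schools have missing weights?
has_missing <- vapply(final_weights, anyNA, logical(1))
names(which(has_missing))  # "school3"

# Sum of the available weights, to compare with the stated population total of 6.
sum(unlist(final_weights), na.rm = TRUE)  # 4
```

Assuming a cluster_gen result x with the list structure shown in this document, the same test could be written as vapply(x$school, function(d) anyNA(d$final.student.weight), logical(1)).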
Understanding the commands above is all that you need to start checking the sampling weights. However, you might be interested in knowing some other things that cluster_gen can already do.
For instance, the user might be interested in keeping this cluster structure, but only generating questionnaires at the student level. This can be done by running
set.seed(2345)
df <- cluster_gen(n2, separate_questionnaires = FALSE)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for students
## Total respondents: 30 (5 + 5 + 5 + 5 + 5 + 5)
## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating SRS weights at the class level
## class.weight should add up to the number of classes in the population (6, counting once per class)
Print df and notice how the data generated is different from the previous one even though both calls share the same seed. This is because the earlier call generates student questionnaires only after the teacher questionnaires, so the state of the random number generator is effectively different by the time the student questionnaires are generated.
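The same mechanism can be illustrated without lsasim. In the base R sketch below, rnorm calls stand in for questionnaire generation; the variable names are illustrative only.

```r
set.seed(2345)
students_only <- rnorm(5)   # stands in for generating student data first

set.seed(2345)
teachers <- rnorm(3)        # stands in for generating teacher data first
students_after <- rnorm(5)  # same seed, but the generator has already advanced

identical(students_only, students_after)  # FALSE
identical(students_only[1:3], teachers)   # TRUE: the first draws coincide
```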
Back to the case of separate questionnaires, the user may want to collapse the questionnaires per level, so that all the questionnaires on the same level are put together; alternatively, all the questionnaires can be collapsed into one data frame, with answers from higher levels being repeated at the lowest level. Perhaps this can be better understood in the example below. The relevant argument here is collapse; n_X = 0, n_W = 1 and calc_weights = FALSE were set to make the output shorter, thus making it easier to understand the effect of different collapse options.
set.seed(1); cluster_gen(n2, n_X = 0, n_W = 1, calc_weights = FALSE, collapse = "none") # default behavior
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for schools, classes
## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)
## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## $school
## $school[[1]]
## subject q1 uniqueID
## 1 1 2 class1_school1
## 2 2 2 class2_school1
## 3 3 2 class3_school1
##
## $school[[2]]
## subject q1 uniqueID
## 1 1 1 class1_school2
## 2 2 2 class2_school2
## 3 3 3 class3_school2
##
##
## $class
## $class[[1]]
## subject q1 uniqueID
## 1 1 2 student1_class1_school1
## 2 2 2 student2_class1_school1
## 3 3 2 student3_class1_school1
## 4 4 2 student4_class1_school1
## 5 5 1 student5_class1_school1
##
## $class[[2]]
## subject q1 uniqueID
## 1 1 2 student1_class2_school1
## 2 2 2 student2_class2_school1
## 3 3 2 student3_class2_school1
## 4 4 2 student4_class2_school1
## 5 5 1 student5_class2_school1
##
## $class[[3]]
## subject q1 uniqueID
## 1 1 2 student1_class3_school1
## 2 2 1 student2_class3_school1
## 3 3 1 student3_class3_school1
## 4 4 1 student4_class3_school1
## 5 5 1 student5_class3_school1
##
## $class[[4]]
## subject q1 uniqueID
## 1 1 2 student1_class1_school2
## 2 2 3 student2_class1_school2
## 3 3 3 student3_class1_school2
## 4 4 1 student4_class1_school2
## 5 5 3 student5_class1_school2
##
## $class[[5]]
## subject q1 uniqueID
## 1 1 1 student1_class2_school2
## 2 2 1 student2_class2_school2
## 3 3 1 student3_class2_school2
## 4 4 1 student4_class2_school2
## 5 5 4 student5_class2_school2
##
## $class[[6]]
## subject q1 uniqueID
## 1 1 2 student1_class3_school2
## 2 2 2 student2_class3_school2
## 3 3 1 student3_class3_school2
## 4 4 2 student4_class3_school2
## 5 5 1 student5_class3_school2
set.seed(1); cluster_gen(n2, n_X = 0, n_W = 1, calc_weights = FALSE, collapse = "partial")
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for schools, classes
## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)
## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## $school
## subject q1 uniqueID
## 1 1 2 class1_school1
## 2 2 2 class2_school1
## 3 3 2 class3_school1
## 4 4 1 class1_school2
## 5 5 2 class2_school2
## 6 6 3 class3_school2
##
## $class
## subject q1 uniqueID
## 1 1 2 student1_class1_school1
## 2 2 2 student2_class1_school1
## 3 3 2 student3_class1_school1
## 4 4 2 student4_class1_school1
## 5 5 1 student5_class1_school1
## 6 6 2 student1_class2_school1
## 7 7 2 student2_class2_school1
## 8 8 2 student3_class2_school1
## 9 9 2 student4_class2_school1
## 10 10 1 student5_class2_school1
## 11 11 2 student1_class3_school1
## 12 12 1 student2_class3_school1
## 13 13 1 student3_class3_school1
## 14 14 1 student4_class3_school1
## 15 15 1 student5_class3_school1
## 16 16 2 student1_class1_school2
## 17 17 3 student2_class1_school2
## 18 18 3 student3_class1_school2
## 19 19 1 student4_class1_school2
## 20 20 3 student5_class1_school2
## 21 21 1 student1_class2_school2
## 22 22 1 student2_class2_school2
## 23 23 1 student3_class2_school2
## 24 24 1 student4_class2_school2
## 25 25 4 student5_class2_school2
## 26 26 2 student1_class3_school2
## 27 27 2 student2_class3_school2
## 28 28 1 student3_class3_school2
## 29 29 2 student4_class3_school2
## 30 30 1 student5_class3_school2
set.seed(1); cluster_gen(n2, n_X = 0, n_W = 1, calc_weights = FALSE, collapse = "full")
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for schools, classes
## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)
## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## subject q1.student uniqueID.student q1.teacher
## 1 1 2 student1_class1_school1 2
## 2 2 2 student2_class1_school1 2
## 3 3 2 student3_class1_school1 2
## 4 4 2 student4_class1_school1 2
## 5 5 1 student5_class1_school1 2
## 6 6 2 student1_class1_school2 1
## 7 7 3 student2_class1_school2 1
## 8 8 3 student3_class1_school2 1
## 9 9 1 student4_class1_school2 1
## 10 10 3 student5_class1_school2 1
## 11 11 2 student1_class2_school1 2
## 12 12 2 student2_class2_school1 2
## 13 13 2 student3_class2_school1 2
## 14 14 2 student4_class2_school1 2
## 15 15 1 student5_class2_school1 2
## 16 16 1 student1_class2_school2 2
## 17 17 1 student2_class2_school2 2
## 18 18 1 student3_class2_school2 2
## 19 19 1 student4_class2_school2 2
## 20 20 4 student5_class2_school2 2
## 21 21 2 student1_class3_school1 2
## 22 22 1 student2_class3_school1 2
## 23 23 1 student3_class3_school1 2
## 24 24 1 student4_class3_school1 2
## 25 25 1 student5_class3_school1 2
## 26 26 2 student1_class3_school2 3
## 27 27 2 student2_class3_school2 3
## 28 28 1 student3_class3_school2 3
## 29 29 2 student4_class3_school2 3
## 30 30 1 student5_class3_school2 3
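The repetition of higher-level answers seen with collapse = "full" can be imitated on toy data with base R. The sketch below is only an illustration: the objects teacher_q and student_q and the key column class_id are hypothetical names, and the join via merge is an assumption made for clarity, not a description of how lsasim performs the collapse internally.

```r
# Toy data: one teacher answer per class, several student answers per class.
# All object and column names here are hypothetical, chosen for illustration.
teacher_q <- data.frame(
  class_id = c("class1_school1", "class2_school1"),
  q1       = c(2, 1)
)
student_q <- data.frame(
  student_id = c("student1", "student2", "student1", "student2"),
  class_id   = rep(c("class1_school1", "class2_school1"), each = 2),
  q1         = c(2, 2, 1, 3)
)

# Join on the shared key; the suffixes tell the two q1 columns apart.
# Each teacher answer is repeated once per student in that class.
full <- merge(student_q, teacher_q, by = "class_id",
              suffixes = c(".student", ".teacher"))
print(full)
```

The resulting data frame has one row per student, with the q1.teacher value duplicated within each class, which mirrors the q1.student and q1.teacher columns in the output above.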
This is the final section of this document, and at this point you are
assumed to be familiar with how cluster_gen works, but before
proceeding there is one last argument you should be familiar with,
called sampling_method.
Consider the example below. The n* numbering is reset for convenience,
and print_pop_structure was set to FALSE to suppress the otherwise
lengthy output of the population structure (the tree would list
5 × 9 × 6 = 270 classes). You can check draw_cluster_structure(N1) for
yourself if you’re interested:
n1 <- c(2, 3, 2, 10)
N1 <- c(5, 9, 6, 50)
data1 <- cluster_gen(n = n1, N = N1, print_pop_structure = FALSE)
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for states, schools, classes
## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)
## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
## ├─state1_school3_class1 (10 students)
## └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
## ├─state2_school3_class1 (10 students)
## └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating SRS weights at the state level
## state.weight should add up to the number of states in the population (5, counting once per state)
## - Calculating PPS weights at the school level
## final.teacher.weight should add up to the number of teachers in the population (270)
## - Calculating SRS weights at the class level
## class.weight should add up to the number of classes in the population (270, counting once per class)
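The totals in parentheses above can be reproduced by hand from N1. The sketch below assumes a balanced population (every state has the same number of schools, and so on), which holds for this example because N1 is a plain vector; the name pop_totals is introduced here for illustration only.

```r
# Population sizes per level, as in the example above:
# 5 states, 9 schools per state, 6 classes per school, 50 students per class.
N1 <- c(5, 9, 6, 50)

# With a balanced structure, the cumulative product gives the number of
# units at each level of the population.
pop_totals <- cumprod(N1)
names(pop_totals) <- c("states", "schools", "classes", "students")
print(pop_totals)
```

This yields 5 states, 45 schools, 270 classes and 13500 students. The first and third values are the totals reported for state.weight and class.weight; 270 is also the number of teachers, assuming one teacher per class.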
Since N != n, the output of cluster_gen included some information on
sampling weights. This is crucial for understanding how the weights were
calculated and for checking whether they were calculated correctly. The
default behavior of cluster_gen is to use PPS (Probability
Proportional to Size) whenever it detects “school” as a label and SRS
(Simple Random Sampling) otherwise. This can, however, be changed. See
the following examples (also notice how print_pop_structure was
abbreviated; this is OK as long as it’s still clear to R what you are
referring to):
data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = "mixed") # default
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for states, schools, classes
## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)
## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
## ├─state1_school3_class1 (10 students)
## └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
## ├─state2_school3_class1 (10 students)
## └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating SRS weights at the state level
## state.weight should add up to the number of states in the population (5, counting once per state)
## - Calculating PPS weights at the school level
## final.teacher.weight should add up to the number of teachers in the population (270)
## - Calculating SRS weights at the class level
## class.weight should add up to the number of classes in the population (270, counting once per class)
data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = "SRS") # always SRS
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for states, schools, classes
## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)
## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
## ├─state1_school3_class1 (10 students)
## └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
## ├─state2_school3_class1 (10 students)
## └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating SRS weights at the state level
## state.weight should add up to the number of states in the population (5, counting once per state)
## - Calculating SRS weights at the school level
## school.weight should add up to the number of schools in the population (45, counting once per school)
## - Calculating SRS weights at the class level
## class.weight should add up to the number of classes in the population (270, counting once per class)
data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = "PPS") # always PPS
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for states, schools, classes
## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)
## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
## ├─state1_school3_class1 (10 students)
## └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
## ├─state2_school3_class1 (10 students)
## └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the state level
## final.principal.weight should add up to the number of principals in the population (45)
## - Calculating PPS weights at the school level
## final.teacher.weight should add up to the number of teachers in the population (270)
## - Calculating PPS weights at the class level
## final.student.weight should add up to the number of students in the population (13500)
data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = c("PPS", "PPS", "SRS")) # customized
## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────
## Generating questionnaires for states, schools, classes
## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)
## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
## ├─state1_school3_class1 (10 students)
## └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
## ├─state2_school3_class1 (10 students)
## └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────
## - Calculating PPS weights at the state level
## final.principal.weight should add up to the number of principals in the population (45)
## - Calculating PPS weights at the school level
## final.teacher.weight should add up to the number of teachers in the population (270)
## - Calculating SRS weights at the class level
## class.weight should add up to the number of classes in the population (270, counting once per class)
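When checking values by hand, it may help to have the generic textbook definitions of design weights at hand. The sketch below assumes the usual formulation, in which a weight is the inverse of the inclusion probability; srs_weight and pps_weight are hypothetical helpers written for this guide, and whether lsasim matches them in every case is exactly what the tests are meant to find out.

```r
# Textbook design weights (inverse of the inclusion probability).
# srs_weight() and pps_weight() are hypothetical helpers, not part of lsasim.

# SRS: every one of the N units is selected with probability n / N.
srs_weight <- function(n, N) N / n

# PPS: unit i is selected with probability n * size_i / sum(size).
pps_weight <- function(n, size) sum(size) / (n * size)

# States in the example: 2 sampled out of 5.
w_state <- srs_weight(n = 2, N = 5)
print(w_state)        # 2.5 per sampled state
print(2 * w_state)    # weights of the 2 sampled states add up to 5

# Schools within one state: 3 sampled out of 9, each with 6 classes.
# With equal sizes, PPS gives the same weight as SRS.
w_school <- pps_weight(n = 3, size = rep(6, 9))
print(w_school[1])                      # 3
print(w_school[1] == srs_weight(3, 9))  # TRUE
```

Because the population in this example is balanced, SRS and PPS weights coincide at each level; unbalanced populations, given as lists in N, are where the two methods should start to differ.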
As a tester, your main task is to check the calculation of all sampling weights. The weights were calculated based on the PISA Data Analysis Manual, which contains one chapter explaining how such weights are calculated, but of course there are other valid references on the subject you may check.
When validating the output of cluster_gen, please check:
- If the “information on sampling weights” is correct (especially the totals in parentheses)
- If the labels of the *.weight columns are correct
- If the values of the *.weight columns are correct
If an error is found in the weight columns, its origin is likely either
in the *.weight or the within.*.weight column. The final.*.weight
column is calculated as the product of those two, so errors found there
are nothing but a propagation of errors in the others.
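The product relationship can be checked mechanically. The sketch below uses a toy data frame; the column names follow the *.weight, within.*.weight and final.*.weight pattern but are assumptions, as is the helper check_final_weight, so adjust them to the columns you actually see in the output of cluster_gen.

```r
# Toy data frame with hypothetical weight columns following the naming
# pattern *.weight, within.*.weight and final.*.weight.
w <- data.frame(
  school.weight        = c(3, 3, 3),
  within.school.weight = c(3, 3, 3),
  final.teacher.weight = c(9, 9, 9)
)

# Hypothetical helper: TRUE if the final weight equals the product of the
# other two columns, up to floating-point tolerance.
check_final_weight <- function(df, base, within, final) {
  isTRUE(all.equal(df[[base]] * df[[within]], df[[final]]))
}

ok <- check_final_weight(w, "school.weight", "within.school.weight",
                         "final.teacher.weight")
print(ok)
```

If the check passes but the final weights still look wrong, the problem lies in one of the two factor columns rather than in the multiplication.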
Thank you for also reporting any other errors found. Please read the “How to give feedback” section for more information about how to report errors.
Here are some examples of pairs of n and N which can be used to get
you started. Use each (n*, N*) pair as input to cluster_gen.
n1 <- 1:4
N1 <- 5
n2 <- c(5, 1, 3)
N2 <- list(6, c(2, 4, 2, 1, 6, 7), rep(10, sum(c(2, 4, 2, 1, 6, 7))))
n3 <- list(3, c(4, 2, 4), c(8, 2, 1, 3, 4, 6, 9, 10, 2, 10))
N3 <- c(3, 4, 10)
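One possible way to run all pairs in a single pass is sketched below. It only uses the cluster_gen arguments shown in this document; the objects pairs, try_pair, results and failed are hypothetical names. Errors are captured rather than stopping the loop, so that a failing pair does not hide the others, and the calls are skipped altogether if lsasim is not installed.

```r
# The three (n, N) pairs suggested above.
n1 <- 1:4
N1 <- 5
n2 <- c(5, 1, 3)
N2 <- list(6, c(2, 4, 2, 1, 6, 7), rep(10, sum(c(2, 4, 2, 1, 6, 7))))
n3 <- list(3, c(4, 2, 4), c(8, 2, 1, 3, 4, 6, 9, 10, 2, 10))
N3 <- c(3, 4, 10)

pairs <- list(list(n = n1, N = N1), list(n = n2, N = N2), list(n = n3, N = N3))

# Run cluster_gen on one pair; return the error object instead of stopping.
try_pair <- function(p) {
  tryCatch(
    lsasim::cluster_gen(n = p$n, N = p$N, print_pop_structure = FALSE),
    error = function(e) e
  )
}

# Only run the simulations if lsasim is installed.
if (requireNamespace("lsasim", quietly = TRUE)) {
  results <- lapply(pairs, try_pair)
  # TRUE marks a pair for which cluster_gen raised an error.
  failed <- vapply(results, inherits, logical(1), what = "error")
  print(failed)
}
```

A pair flagged as failed is not necessarily a bug (the input itself may be invalid), but the captured error message in results is a good starting point for a report.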
The true power of having multiple collaborators is that several brains can come up with more examples than one alone. Use your imagination, try to break the package, and find as many examples as you can that don’t work as they should.