Testing sampling weights on lsasim

wleoncio edited this page Oct 22, 2019 · 2 revisions

Introduction

About this document

Thank you for your help in testing lsasim, an R package for simulating Large Scale Assessment (LSA) data.

This document is intended to aid the development of the next stable version of lsasim (possibly numbered 2.1.0). The current stable version of lsasim (2.0.0) is available on CRAN and on GitHub.

I invite you to read the next subsections of this introduction—available on the other tabs up there—even if you’re familiar with lsasim. The final subsection (How to contribute to testing) is especially useful in showing you how to give feedback on your tests.

This document should be a standalone guide to working with the cluster_gen function of lsasim for testing purposes. However, it is still a work in progress and feedback on missing or incorrect information is welcome. In addition, the help file of cluster_gen may be useful in understanding how the function works. You can access the function documentation by running help("cluster_gen", "lsasim") or ?lsasim::cluster_gen in the R terminal.

Quick history of lsasim

The table below contains information about the three stable releases of lsasim. The innovations under testing in this document will be part of the next stable release.

Version   CRAN release date   Innovations
1.0.0     2017-02-23          Simulates cognitive and background test data
1.0.1     2017-05-10          Bug fixes
2.0.0     2019-09-12          Expanded functionality of background questionnaires

How to contribute to testing

We appreciate any help in the development of lsasim. In order to make the best of everyone’s time, though, it is desirable that the tester has:

  1. Access to R version 3.6.0 or newer
  2. Permission to install R packages in their working computer
  3. Knowledge of sampling weights, especially:
    1. How to calculate sampling weights
    2. How those weights are usually calculated in LSAs
    3. How such data is usually displayed to analysts of LSA datasets

How to give feedback

In order to keep things organized (and make sure your contribution gets officially recorded), bugs should ideally be reported to https://github.com/tmatta/lsasim/issues/. This requires you to have a (free) GitHub account. If you have found several examples of the same issue, please report them as one issue.

As an alternative to using our GitHub issues tracker, you can send an e-mail to the package maintainer.

Future features to be tested

  • Replicate weights
  • Within and between group correlation

Testing sampling weights

Installing lsasim (development version)

The development version of lsasim can be downloaded from GitHub by issuing the following commands on your R console.

First, install the remotes package. You can skip this step if remotes is already installed on your machine. If you don’t know if remotes is installed on your machine, try running library(remotes) and see if there are any errors.

install.packages("remotes")

If the installation goes well, you should see this at the bottom of the output:

## * DONE (remotes)
## 
## The downloaded source packages are in
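As an alternative to calling library(remotes) and watching for errors, the check can be done programmatically. The snippet below is a minimal sketch using base R only; the helper name is_installed is made up for illustration and is not part of lsasim or remotes.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # a package shipped with R, so this should be TRUE

# Only suggest installing remotes when it is actually missing
if (!is_installed("remotes")) {
  message("remotes is not installed; run install.packages(\"remotes\")")
}
```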

Next, we use the install_github function to install the development version of lsasim locally. There are actually two versions to choose from:

  1. The recommended version, 2.0.0.9103 (older, but more stable and with results comparable to this document)
  2. The bleeding edge version (newer, but less stable and with results that will differ from this document even with equal seeds)

To install the recommended version, please run the following on your R terminal:

remotes::install_github("tmatta/lsasim", ref="v2.0.0.9103")

The bleeding edge version (> 2.0.0.9103) is available by simply changing the ref argument:

remotes::install_github("tmatta/lsasim", ref="develop")

Note: Installing the version from the develop branch will result in more features but results that are different from the ones shown in this document. If you would like to reproduce the results shown here, you must install version 2.0.0.9103.

After issuing install_github, R will tell the user it is checking, preparing, executing and testing the installation of lsasim. The most important output is the final message, which should read “DONE (lsasim)”. It could also read something like “Skipping install of ‘lsasim’ from a github remote, the SHA1 (…) has not changed since last install”, which means that you already have the latest version. In that case, you can force the installation by including force=TRUE as an argument to install_github. This can be useful when a new version is available but R fails to recognize the difference between that version and the one installed on your computer.

Finally, we load the installed lsasim package into our current R session and check the build version. Your output of packageVersion should match the output below (boxes containing lines beginning with double hashes (##) are the expected output).

library(lsasim)
packageVersion("lsasim")

## [1] '2.0.0.9103'
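If you prefer not to compare the version by eye, the comparison can be scripted. This is only a sketch: version_matches is a made-up helper name, and the check on the installed package runs only if lsasim is available.

```r
# Hypothetical helper: compare a version against the recommended build
version_matches <- function(installed, expected = "2.0.0.9103") {
  as.package_version(installed) == as.package_version(expected)
}

version_matches("2.0.0.9103")  # the recommended development build
version_matches("2.0.0")       # the CRAN release, which lacks cluster_gen changes

# Check the locally installed copy, if there is one
if (requireNamespace("lsasim", quietly = TRUE)) {
  print(version_matches(utils::packageVersion("lsasim")))
}
```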

Once lsasim is installed and loaded, you are ready to test it. Click the next tab to continue.

Generating clustered test data

This test concerns the generation of sampling weights for background questionnaire data generated in a hierarchical structure. Each hierarchical level is composed of clusters, which can be sampled from a population using either Simple Random Sampling (SRS) or with Probabilities Proportional to Size (PPS).
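To make the difference between the two sampling schemes concrete, here is a small base-R sketch of the textbook selection probabilities. The school sizes and the number of sampled schools are made-up values, and the code does not call lsasim or reproduce how cluster_gen computes its weights.

```r
# Hypothetical population: 4 schools with different numbers of students
school_sizes <- c(10, 20, 30, 40)
n_sampled <- 2  # number of schools to be sampled (assumed)

# SRS: every school has the same chance of selection
p_srs <- rep(n_sampled / length(school_sizes), length(school_sizes))

# PPS: the chance of selection grows with the size of the school
p_pps <- n_sampled * school_sizes / sum(school_sizes)

# Sampling weights are the inverse of the selection probabilities
w_srs <- 1 / p_srs
w_pps <- 1 / p_pps

data.frame(school_sizes, p_srs, p_pps, w_srs, w_pps)
```

Under both schemes the selection probabilities add up to the number of sampled schools. When all schools have the same size, the two schemes give the same probabilities.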

Basic background questionnaire data generation is handled by the function questionnaire_gen, present in lsasim since its first release. The way cluster background data generation works is through a function called cluster_gen, which calls questionnaire_gen on each cluster level.

Two-level structures

We will start with a simple example, where 2 schools and 10 students in each school are selected. This structure is represented by the following vector:

n1 <- c(2, 10)

The structure can be checked with the function draw_cluster_structure, which creates a visual representation of the hierarchical tree in the R console:

draw_cluster_structure(n1)  # pay no mind to the "NULL" printed at the end

## school1 (10 students)
## school2 (10 students)

## NULL

It may not look like much now, but when more complex scenarios start showing up, this visual representation can really help one understand what is going on!

In order to generate clustered responses for n1, we call the cluster_gen function, which is the star of this test. The first argument of cluster_gen is called n and corresponds to the number of sampled observations on each level. Two ways of calling cluster_gen with n = n1 are cluster_gen(n = n1) and cluster_gen(n1), where omitting n = just tells R to assume that the order of the arguments you are passing is the same one the function expects. To see the argument order that cluster_gen expects, see the “Usage” section of the ?cluster_gen help page.

The set.seed function we call right before cluster_gen is there to make sure that your data will match the output below. If that command is dropped, the test results will change each time cluster_gen is called.

set.seed(1234)
cluster_gen(n1)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for schools

## Total respondents: 20 (10 + 10)

## school1 (10 students)
## school2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (20)

## $school
## $school[[1]]
##    subject           q1 q2 q3 q4 q5 q6 school.weight within.school.weight final.student.weight
## 1        1  0.005006950  2  2  1  2  2             1                    1                    1
## 2        2 -0.037630263  1  1  1  2  1             1                    1                    1
## 3        3  0.723976061  2  2  1  2  1             1                    1                    1
## 4        4 -0.496738863  2  2  1  2  1             1                    1                    1
## 5        5  0.011395161  2  1  1  2  1             1                    1                    1
## 6        6  0.009859946  1  2  2  1  2             1                    1                    1
## 7        7  0.678271423  1  1  1  2  2             1                    1                    1
## 8        8  1.029563029  2  2  1  2  1             1                    1                    1
## 9        9 -1.729528504  1  2  2  2  1             1                    1                    1
## 10      10 -2.204348095  1  1  2  2  2             1                    1                    1
##             uniqueID
## 1   student1_school1
## 2   student2_school1
## 3   student3_school1
## 4   student4_school1
## 5   student5_school1
## 6   student6_school1
## 7   student7_school1
## 8   student8_school1
## 9   student9_school1
## 10 student10_school1
## 
## $school[[2]]
##    subject           q1 q2 q3 q4 q5 q6 school.weight within.school.weight final.student.weight
## 1        1 -0.242559707  1  1  1  1  2             1                    1                    1
## 2        2  2.187119161  1  2  2  2  2             1                    1                    1
## 3        3 -0.581727450  1  1  2  1  1             1                    1                    1
## 4        4  0.700080227  2  1  2  1  1             1                    1                    1
## 5        5  1.492176579  1  2  1  1  1             1                    1                    1
## 6        6  0.526553441  1  1  2  1  2             1                    1                    1
## 7        7  1.037772101  2  2  2  2  2             1                    1                    1
## 8        8 -1.860716351  1  1  2  1  1             1                    1                    1
## 9        9 -0.426574240  2  1  2  1  1             1                    1                    1
## 10      10 -0.001137045  1  1  2  1  1             1                    1                    1
##             uniqueID
## 1   student1_school2
## 2   student2_school2
## 3   student3_school2
## 4   student4_school2
## 5   student5_school2
## 6   student6_school2
## 7   student7_school2
## 8   student8_school2
## 9   student9_school2
## 10 student10_school2

Notice how cluster_gen prints the cluster structure as well as other important information before showing the background data itself. This can be disabled by inserting verbose = FALSE into the cluster_gen call.

By default, cluster_gen will determine the number of continuous (X) and categorical (W) background questions. In this case, X = {X1} (represented in the output by q1) and W = {W1, …, W5} (represented in the output by q2 through q6). This can be customized, and for the sake of simplicity, we will have only one categorical background variable and no continuous variables. This time, the output will also be assigned to data, which is finally printed for us to see what it looks like.

set.seed(2345)
data <- cluster_gen(n1, n_X = 0, n_W = list(1))

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for schools

## Total respondents: 20 (10 + 10)

## school1 (10 students)
## school2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (20)

data

## $school
## $school[[1]]
##    subject q1 school.weight within.school.weight final.student.weight          uniqueID
## 1        1  3             1                    1                    1  student1_school1
## 2        2  3             1                    1                    1  student2_school1
## 3        3  1             1                    1                    1  student3_school1
## 4        4  1             1                    1                    1  student4_school1
## 5        5  3             1                    1                    1  student5_school1
## 6        6  3             1                    1                    1  student6_school1
## 7        7  3             1                    1                    1  student7_school1
## 8        8  4             1                    1                    1  student8_school1
## 9        9  4             1                    1                    1  student9_school1
## 10      10  2             1                    1                    1 student10_school1
## 
## $school[[2]]
##    subject q1 school.weight within.school.weight final.student.weight          uniqueID
## 1        1  1             1                    1                    1  student1_school2
## 2        2  4             1                    1                    1  student2_school2
## 3        3  1             1                    1                    1  student3_school2
## 4        4  4             1                    1                    1  student4_school2
## 5        5  4             1                    1                    1  student5_school2
## 6        6  2             1                    1                    1  student6_school2
## 7        7  4             1                    1                    1  student7_school2
## 8        8  2             1                    1                    1  student8_school2
## 9        9  3             1                    1                    1  student9_school2
## 10      10  4             1                    1                    1 student10_school2

Notice how n_W is defined as a list where each element—only one in this case—corresponds to the number of variables at a particular level. This is done so that n_W can support more complex calls such as n_W = list(list(2, 2), 5), which corresponds to telling cluster_gen that the first level will have two binary categorical variables and the second level will have 5 categorical variables (the number of categories being randomly determined).
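The nested list can be hard to read at first. The sketch below spells out, in base R, one way to interpret each element of such a list according to the description above. The function describe_level is made up for illustration and says nothing about how cluster_gen parses n_W internally.

```r
n_W <- list(list(2, 2), 5)

# Hypothetical reader: a nested list gives the number of categories of each
# variable; a single number gives only the number of variables
describe_level <- function(x) {
  if (is.list(x)) {
    sprintf("%d variables with category counts: %s",
            length(x), paste(unlist(x), collapse = ", "))
  } else {
    sprintf("%d variables, categories chosen at random", x)
  }
}

vapply(n_W, describe_level, character(1))
```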

Three-level structures

Let us now consider a second hierarchical structure, composed of a cluster of 2 schools which are divided into 3 classes each; each class contains 5 students:

n2 <- c(2, 3, 5)
set.seed(2345)
cluster_gen(n2)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for schools, classes

## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)

## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.teacher.weight should add up to the number of teachers in the population (6)

## - Calculating SRS weights at the class level

##   class.weight should add up to the number of classes in the population (6, counting once per class)

## $school
## $school[[1]]
##   subject          q1 q2 q3 school.weight within.school.weight final.teacher.weight       uniqueID
## 1       1 -0.16566248  1  2             1                    1                    1 class1_school1
## 2       2 -0.88234450  1  1             1                    1                    1 class2_school1
## 3       3 -0.01332182  2  2             1                    1                    1 class3_school1
## 
## $school[[2]]
##   subject          q1 q2 q3 school.weight within.school.weight final.teacher.weight       uniqueID
## 1       1  0.07879383  1  1             1                    1                    1 class1_school2
## 2       2 -0.88209970  2  2             1                    1                    1 class2_school2
## 3       3  0.89263571  2  1             1                    1                    1 class3_school2
## 
## 
## $class
## $class[[1]]
##   subject         q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1       1  2.1527725  2  1  2  1  2            1                   1                    1
## 2       2  0.5173488  1  2  2  1  2            1                   1                    1
## 3       3 -1.2601526  2  1  2  2  2            1                   1                    1
## 4       4  0.4095549  1  1  1  2  1            1                   1                    1
## 5       5 -0.3379999  2  1  2  1  1            1                   1                    1
##                  uniqueID
## 1 student1_class1_school1
## 2 student2_class1_school1
## 3 student3_class1_school1
## 4 student4_class1_school1
## 5 student5_class1_school1
## 
## $class[[2]]
##   subject          q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1       1 -2.15015256  1  1  2  2  1            1                   1                    1
## 2       2  1.63216373  2  1  1  1  2            1                   1                    1
## 3       3  0.47573673  2  2  1  2  2            1                   1                    1
## 4       4 -1.10436289  1  2  2  2  2            1                   1                    1
## 5       5 -0.05614962  2  1  1  2  1            1                   1                    1
##                  uniqueID
## 1 student1_class2_school1
## 2 student2_class2_school1
## 3 student3_class2_school1
## 4 student4_class2_school1
## 5 student5_class2_school1
## 
## $class[[3]]
##   subject         q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1       1  1.0676829  1  1  1  1  1            1                   1                    1
## 2       2 -1.0448467  2  1  1  2  2            1                   1                    1
## 3       3  0.7418229  1  1  2  1  1            1                   1                    1
## 4       4 -0.2396375  2  2  1  1  1            1                   1                    1
## 5       5  0.5653863  1  2  2  1  1            1                   1                    1
##                  uniqueID
## 1 student1_class3_school1
## 2 student2_class3_school1
## 3 student3_class3_school1
## 4 student4_class3_school1
## 5 student5_class3_school1
## 
## $class[[4]]
##   subject          q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1       1 -0.31211831  2  2  1  1  2            1                   1                    1
## 2       2 -1.06488440  2  2  1  2  1            1                   1                    1
## 3       3  0.06095831  2  1  1  1  2            1                   1                    1
## 4       4  0.74802298  1  2  1  1  2            1                   1                    1
## 5       5  2.74479129  1  1  1  1  2            1                   1                    1
##                  uniqueID
## 1 student1_class1_school2
## 2 student2_class1_school2
## 3 student3_class1_school2
## 4 student4_class1_school2
## 5 student5_class1_school2
## 
## $class[[5]]
##   subject         q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1       1  0.6141850  2  1  1  1  1            1                   1                    1
## 2       2  1.8841624  1  1  2  1  2            1                   1                    1
## 3       3 -0.2516623  2  2  2  2  1            1                   1                    1
## 4       4  0.7501333  2  1  2  2  2            1                   1                    1
## 5       5  0.4777128  2  1  1  1  2            1                   1                    1
##                  uniqueID
## 1 student1_class2_school2
## 2 student2_class2_school2
## 3 student3_class2_school2
## 4 student4_class2_school2
## 5 student5_class2_school2
## 
## $class[[6]]
##   subject         q1 q2 q3 q4 q5 q6 class.weight within.class.weight final.student.weight
## 1       1 -0.4050786  2  1  1  2  1            1                   1                    1
## 2       2  0.4307551  2  2  1  2  1            1                   1                    1
## 3       3 -0.3358192  2  1  2  1  2            1                   1                    1
## 4       4 -0.4681827  1  2  1  2  2            1                   1                    1
## 5       5  0.5989933  1  1  2  2  2            1                   1                    1
##                  uniqueID
## 1 student1_class3_school2
## 2 student2_class3_school2
## 3 student3_class3_school2
## 4 student4_class3_school2
## 5 student5_class3_school2

Notice how the output above contains 2 school questionnaires with 3 answers each (from the teachers who answered for the classes) as well as 2 × 3 = 6 class questionnaires, each of which was answered by the 5 students in that class. Notice how the teacher questionnaires all have the same format, with one X and 2 W variables, and the student questionnaires also share a format, with one X and 5 Ws. By default, the means of the continuous variables are the same (0), and the proportions of the categorical variables are randomly determined.
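The respondent totals printed for n1 and n2 can be reproduced with a little arithmetic. The sketch below uses base R only and assumes a symmetric structure given as a plain vector; the function names are made up, and this is not the code cluster_gen uses.

```r
n1 <- c(2, 10)
n2 <- c(2, 3, 5)

# Number of units at each level: schools, then classes, then students
units_per_level <- function(n) cumprod(n)

# Respondents are the units below the top level
# (teachers answering for classes, plus students)
total_respondents <- function(n) sum(cumprod(n)[-1])

units_per_level(n2)    # 2 schools, 6 classes, 30 students
total_respondents(n1)  # 20, as in "Total respondents: 20 (10 + 10)"
total_respondents(n2)  # 36, i.e. 6 teacher answers plus 30 student answers
```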

n1 and n2 are unnamed vectors, so cluster_gen determined the names of the clusters itself using a pre-built sequence. Nonetheless, the user is free to use whatever labels they want. This can be done either by passing names to the n argument or by passing character vectors to the cluster_labels and resp_labels arguments. See the examples below:

cluster_gen(n = c(a = 2, b = 3))

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for as, bs

## Total respondents: 6 (3 + 3)

## a1 (3 bs)
## a2 (3 bs)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating SRS weights at the a level

##   a.weight should add up to the number of as in the population (2, counting once per a)

## $a
## $a[[1]]
##   subject         q1         q2 q3 q4 q5 q6 q7 q8 q9 a.weight within.a.weight final.b.weight
## 1       1  3.9427498 -3.5449571  1  1  1  1  2  1  2        1               1              1
## 2       2 -0.5029523  0.3083131  1  2  1  1  2  2  2        1               1              1
## 3       3 -0.6693996 -0.6230736  1  2  1  1  1  1  1        1               1              1
##   uniqueID
## 1    b1_a1
## 2    b2_a1
## 3    b3_a1
## 
## $a[[2]]
##   subject        q1        q2 q3 q4 q5 q6 q7 q8 q9 a.weight within.a.weight final.b.weight uniqueID
## 1       1 0.2319406 1.4823528  2  2  2  1  2  1  2        1               1              1    b1_a2
## 2       2 1.9505050 0.2887602  1  1  1  2  2  1  2        1               1              1    b2_a2
## 3       3 0.4564811 0.5387870  2  1  2  2  2  1  2        1               1              1    b3_a2

cluster_gen(n = c(2, 3), cluster_labels = c("group"), resp_labels = c("person"))

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for groups

## Total respondents: 6 (3 + 3)

## group1 (3 persons)
## group2 (3 persons)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating SRS weights at the group level

##   group.weight should add up to the number of groups in the population (2, counting once per group)

## $group
## $group[[1]]
##   subject         q1 q2 q3 q4 group.weight within.group.weight final.person.weight       uniqueID
## 1       1 -0.3544601  1  1  2            1                   1                   1 person1_group1
## 2       2  0.5769583  1  1  1            1                   1                   1 person2_group1
## 3       3 -0.6274591  1  1  2            1                   1                   1 person3_group1
## 
## $group[[2]]
##   subject         q1 q2 q3 q4 group.weight within.group.weight final.person.weight       uniqueID
## 1       1  0.5236884  1  2  2            1                   1                   1 person1_group2
## 2       2 -1.0585904  2  2  1            1                   1                   1 person2_group2
## 3       3 -0.0260521  1  2  2            1                   1                   1 person3_group2

Your data should vary from the output above (due to the lack of a fixed seed), but the labels and the hierarchical structure should be the same.
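The effect of fixing the seed can be seen with any random number generator in R. The following base-R sketch uses rnorm in place of cluster_gen, purely for illustration.

```r
# Same seed, same draws
set.seed(1234)
a <- rnorm(3)
set.seed(1234)
b <- rnorm(3)

# No seed reset before this call, so the generator simply continues
d <- rnorm(3)

identical(a, b)  # TRUE: the seed makes the draws reproducible
identical(a, d)  # FALSE: without resetting the seed the draws differ
```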

Asymmetric structures

As we said before, n corresponds to the number of sampled observations on each level. This means that each level will have the same number of sublevels, in what one could call a symmetric hierarchical structure. Asymmetric structures can also be defined, and they use the following syntax (the list below is named for convenience, but it may also be unnamed):

n3 <- list(sch = 3, cls = c(2, 1, 2), stu = c(5, 4, 2, 3, 2))

The list above corresponds to 3 schools containing 2, 1 and 2 classes, respectively. These 5 classes contain 5, 4, 2, 3 and 2 students, respectively.

This sort of structure can quickly become hard to picture. This is when the draw_cluster_structure function can be helpful:

draw_cluster_structure(n3)

## sch1
## ├─sch1_cls1 (5 stus)
## └─sch1_cls2 (4 stus)
## sch2
## └─sch2_cls1 (2 stus)
## sch3
## ├─sch3_cls1 (3 stus)
## └─sch3_cls2 (2 stus)

## NULL

As an exercise, try calling cluster_gen(n3) and see if the number of responses corresponds to your expectations.
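One way to form an expectation before running the exercise is to check the list for internal consistency and count its units by hand. The sketch below uses base R only and follows the reading of n3 given above; it does not call cluster_gen.

```r
n3 <- list(sch = 3, cls = c(2, 1, 2), stu = c(5, 4, 2, 3, 2))

# Each level must describe every unit of the level above it
length(n3$cls) == n3$sch       # one class count per school
length(n3$stu) == sum(n3$cls)  # one student count per class

# Map each class to its school, then add up students per school
school_of_class <- rep(seq_len(n3$sch), times = n3$cls)
students_per_school <- as.vector(tapply(n3$stu, school_of_class, sum))

students_per_school       # students in schools 1, 2 and 3
sum(n3$cls)               # classes in total
sum(n3$stu)               # students in total
```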

n can also be given as ranges of values, from which the function randomly draws the actual sizes. For example, if we set

n4 <- list(school = 4, class = ranges(5, 10), student = ranges(20, 50))

then, once we call cluster_gen on n4, we are telling R that each of the 4 schools has between 5 and 10 classes, and each class has between 20 and 50 students. Let us use draw_cluster_structure to see what the generated structure looks like:

set.seed(6789)
draw_cluster_structure(n4)

## school1
## ├─school1_class1 (46 students)
## ├─school1_class2 (31 students)
## ├─school1_class3 (38 students)
## ├─school1_class4 (37 students)
## ├─school1_class5 (34 students)
## ├─school1_class6 (34 students)
## ├─school1_class7 (26 students)
## ├─school1_class8 (48 students)
## └─school1_class9 (45 students)
## school2
## ├─school2_class1 (40 students)
## ├─school2_class2 (42 students)
## ├─school2_class3 (30 students)
## ├─school2_class4 (24 students)
## ├─school2_class5 (22 students)
## └─school2_class6 (48 students)
## school3
## ├─school3_class1 (32 students)
## ├─school3_class2 (35 students)
## ├─school3_class3 (41 students)
## ├─school3_class4 (45 students)
## ├─school3_class5 (35 students)
## ├─school3_class6 (21 students)
## ├─school3_class7 (29 students)
## └─school3_class8 (26 students)
## school4
## ├─school4_class1 (42 students)
## ├─school4_class2 (22 students)
## ├─school4_class3 (48 students)
## ├─school4_class4 (37 students)
## ├─school4_class5 (27 students)
## ├─school4_class6 (41 students)
## ├─school4_class7 (42 students)
## ├─school4_class8 (40 students)
## ├─school4_class9 (37 students)
## └─school4_class10 (47 students)

## NULL
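The idea of drawing cluster sizes from a range can be imitated in base R. The sketch below assumes that every integer in the range is equally likely, which may not be how lsasim draws its values, and it will not reproduce the structure printed above. The helper draw_sizes is made up for illustration.

```r
# Hypothetical helper: draw k whole numbers between lo and hi (inclusive)
draw_sizes <- function(lo, hi, k) {
  seq(lo, hi)[sample.int(hi - lo + 1, size = k, replace = TRUE)]
}

set.seed(1)
classes_per_school <- draw_sizes(5, 10, k = 4)
students_per_class <- draw_sizes(20, 50, k = sum(classes_per_school))

classes_per_school         # one value per school
length(students_per_class) # one value per class
```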

Customizing the population size

So far, we have only worked with the sampled elements, which are passed as the first argument of cluster_gen. By default, cluster_gen assumes N = n, meaning that n actually corresponds to a census (where all the elements of the population are selected). In practice, though, this is rarely the case, and cluster_gen can receive other values to indicate the population structure under the N argument. See the examples below:

n5 <- c(3, 4)
N5 <- 2

This is the most basic way to determine a different population size: by passing a single number to N. In that case, N will be interpreted as a multiplier of n. In other words, the syntax above basically says that the sample is composed of 3 schools and 4 students in each school, whereas the population is twice as large at all levels. This is all explicit when cluster_gen is called (see the hierarchical structures printed below):

data5 <- cluster_gen(n = n5, N = N5)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Population structure

## school1 (8 students)
## school2 (8 students)
## school3 (8 students)
## school4 (8 students)
## school5 (8 students)
## school6 (8 students)

## Sampled structure

## Generating questionnaires for schools

## Total respondents: 12 (4 + 4 + 4)

## school1 (4 students)
## school2 (4 students)
## school3 (4 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (48)

In the example above, the questionnaire answers are stored in data5, which is why they do not appear in the R terminal. The user messages are still printed, as they are not stored in data5.
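The numbers in the printed structures follow directly from treating N5 as a multiplier. The base-R sketch below only redoes that arithmetic; it is not the code cluster_gen runs.

```r
n5 <- c(3, 4)
N5 <- 2

# A single N multiplies the sample size at every level
N_expanded <- n5 * N5

N_expanded        # 6 schools with 8 students each
prod(N_expanded)  # 48 students in the population
prod(n5)          # 12 respondents in the sample
```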

The population structure can also be explicitly defined:

n6 <- c(3, 4)
N6 <- c(4, 5)
data6 <- cluster_gen(n = n6, N = N6)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Population structure

## school1 (5 students)
## school2 (5 students)
## school3 (5 students)
## school4 (5 students)

## Sampled structure

## Generating questionnaires for schools

## Total respondents: 12 (4 + 4 + 4)

## school1 (4 students)
## school2 (4 students)
## school3 (4 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (20)
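The claim that the final weights add up to the population size can be checked by hand for this example. The sketch below assumes that each weight is the inverse of a selection probability, so N/n at each level. Because every school in this population has the same size, PPS gives each school the same chance as SRS, which is why the same formula is used at both levels. The values are computed here, not copied from data6.

```r
n6 <- c(3, 4)
N6 <- c(4, 5)

# Inverse selection probabilities at each level (assumed formulas)
school_weight        <- N6[1] / n6[1]  # 4 schools in the population, 3 sampled
within_school_weight <- N6[2] / n6[2]  # 5 students per school, 4 sampled

# The final weight is the product of the weights along the hierarchy
final_student_weight <- school_weight * within_school_weight

# Every one of the 12 sampled students carries the same final weight
sum(rep(final_student_weight, prod(n6)))  # adds up to prod(N6) = 20
```

With N equal to n, as in the earlier census examples, the same formulas give a weight of 1 for every student.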

Just like n, N can also be defined as a list:

n7 <- list(3, c(4, 2, 3))
N7 <- list(10, c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))
data7 <- cluster_gen(n = n7, N = N7)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Population structure

## school1 (10 students)
## school2 (11 students)
## school3 (12 students)
## school4 (13 students)
## school5 (14 students)
## school6 (15 students)
## school7 (16 students)
## school8 (17 students)
## school9 (18 students)
## school10 (19 students)

## Sampled structure

## Generating questionnaires for schools

## Total respondents: 9 (4 + 2 + 3)

## school1 (4 students)
## school2 (2 students)
## school3 (3 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (145)

Mixing ranges for n and explicit lists for N is also possible.

set.seed(345)
n8 <- list(3, ranges(5, 10))
N8 <- list(10, c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))
data8 <- cluster_gen(n = n8, N = N8)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Population structure

## school1 (10 students)
## school2 (11 students)
## school3 (12 students)
## school4 (13 students)
## school5 (14 students)
## school6 (15 students)
## school7 (16 students)
## school8 (17 students)
## school9 (18 students)
## school10 (19 students)

## Sampled structure

## Generating questionnaires for schools

## Total respondents: 25 (9 + 7 + 9)

## school1 (9 students)
## school2 (7 students)
## school3 (9 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (145)

n9 <- list(3, ranges(5, 10))
N9 <- list(10, ranges(50, 100))
data9 <- cluster_gen(n = n9, N = N9)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Population structure

## school1 (79 students)
## school2 (65 students)
## school3 (59 students)
## school4 (59 students)
## school5 (64 students)
## school6 (98 students)
## school7 (82 students)
## school8 (63 students)
## school9 (97 students)
## school10 (56 students)

## Sampled structure

## Generating questionnaires for schools

## Total respondents: 23 (5 + 8 + 10)

## school1 (5 students)
## school2 (8 students)
## school3 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (722)

Some combinations of n and N produce output that could confuse a user. One example, illustrated below, is a population that is smaller than the sample. If you find other misbehaving or otherwise noteworthy examples, please report them.

set.seed(345)
n10 <- c(3, 4)
N10 <- c(2, 3)
cluster_gen(n = n10, N = N10)  # notice the missing weights

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Population structure

## school1 (3 students)
## school2 (3 students)

## Sampled structure

## Generating questionnaires for schools

## Total respondents: 12 (4 + 4 + 4)

## school1 (4 students)
## school2 (4 students)
## school3 (4 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the school level

##   final.student.weight should add up to the number of students in the population (6)

## $school
## $school[[1]]
##   subject         q1 q2 q3 q4 q5 school.weight within.school.weight final.student.weight
## 1       1  1.7863951  2  2  1  2     0.6666667                 0.75                  0.5
## 2       2  0.1956436  1  2  1  2     0.6666667                 0.75                  0.5
## 3       3  0.7482214  2  2  1  2     0.6666667                 0.75                  0.5
## 4       4 -0.2938856  1  2  1  1     0.6666667                 0.75                  0.5
##           uniqueID
## 1 student1_school1
## 2 student2_school1
## 3 student3_school1
## 4 student4_school1
## 
## $school[[2]]
##   subject         q1 q2 q3 q4 q5 school.weight within.school.weight final.student.weight
## 1       1 -0.4178153  2  2  1  2     0.6666667                 0.75                  0.5
## 2       2 -1.2977486  2  1  1  2     0.6666667                 0.75                  0.5
## 3       3 -1.0928678  2  1  1  2     0.6666667                 0.75                  0.5
## 4       4 -0.5008169  2  2  1  2     0.6666667                 0.75                  0.5
##           uniqueID
## 1 student1_school2
## 2 student2_school2
## 3 student3_school2
## 4 student4_school2
## 
## $school[[3]]
##   subject         q1 q2 q3 q4 q5 school.weight within.school.weight final.student.weight
## 1       1 -1.8998894  2  2  2  1            NA                   NA                   NA
## 2       2  0.2750867  1  2  2  1            NA                   NA                   NA
## 3       3  1.8452385  1  1  1  2            NA                   NA                   NA
## 4       4 -0.2048277  1  2  1  2            NA                   NA                   NA
##           uniqueID
## 1 student1_school3
## 2 student2_school3
## 3 student3_school3
## 4 student4_school3
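
The printed weights can be checked by hand. The sketch below is not part of the original test script. It assumes the school weight is the ratio of population schools to sampled schools and the within-school weight is the ratio of population students to sampled students, which is what the printed values suggest (with equally sized schools this ratio is the same under SRS and PPS).

```r
# Sizes taken from the example above: n10 <- c(3, 4), N10 <- c(2, 3)
n_schools <- 3; N_schools <- 2    # sampled vs. population schools
n_students <- 4; N_students <- 3  # sampled vs. population students per school

school_weight <- N_schools / n_schools           # 0.6666667, as printed
within_school_weight <- N_students / n_students  # 0.75, as printed
final_student_weight <- school_weight * within_school_weight  # 0.5, as printed

# Only schools 1 and 2 received weights (4 students each); school 3 shows NA.
printed_final <- c(rep(final_student_weight, 8), rep(NA, 4))
sum(printed_final, na.rm = TRUE)  # compare with the 6 announced in the message
sum(is.na(printed_final))         # number of students without a weight
```

The non-missing final weights add up to 4 rather than the announced 6, which is one concrete way of seeing why this case is flagged as confusing.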

Other useful function arguments

Understanding the commands above is all that you need to start checking the sampling weights. However, you might be interested in knowing some other things that cluster_gen can already do.

For instance, the user might want to keep this cluster structure but generate questionnaires only at the student level. This can be done by running:

set.seed(2345)
df <- cluster_gen(n2, separate_questionnaires = FALSE)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for students

## Total respondents: 30 (5 + 5 + 5 + 5 + 5 + 5)

## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating SRS weights at the class level

##   class.weight should add up to the number of classes in the population (6, counting once per class)

Print df and notice that the generated data differs from the previous run even though both calls share the same seed. The earlier call generated the teacher questionnaires first and the student questionnaires afterwards, so the random number generator was already in a different state by the time the student questionnaires were generated.
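
The effect can be reproduced in base R without lsasim. In this minimal sketch, runif merely stands in for whatever random draws cluster_gen performs internally, and the variable names are hypothetical.

```r
# Same seed, five draws in one go.
set.seed(2345)
all_at_once <- runif(5)

# Same seed, but two draws are consumed before the three we care about.
set.seed(2345)
first_block <- runif(2)   # stands in for the teacher questionnaires
second_block <- runif(3)  # stands in for the student questionnaires

# Drawing three numbers straight after seeding gives a different result...
set.seed(2345)
students_only <- runif(3)
identical(students_only, second_block)  # FALSE

# ...because second_block continues where first_block stopped.
identical(second_block, all_at_once[3:5])  # TRUE
```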

Back to the case of separate questionnaires, the user may want to collapse the questionnaires per level, so that all the questionnaires on the same level are put together; alternatively, all the questionnaires can be collapsed into one data frame, with answers from higher levels being repeated at the lowest level. Perhaps this can be better understood in the example below. The relevant argument here is collapse; n_X = 0, n_W = 1 and calc_weights = FALSE were set to make the output shorter, thus making it easier to understand the effect of different collapse options.

set.seed(1); cluster_gen(n2, n_X = 0, n_W = 1, calc_weights = FALSE, collapse = "none")  # default behavior

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for schools, classes

## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)

## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)

## $school
## $school[[1]]
##   subject q1       uniqueID
## 1       1  2 class1_school1
## 2       2  2 class2_school1
## 3       3  2 class3_school1
## 
## $school[[2]]
##   subject q1       uniqueID
## 1       1  1 class1_school2
## 2       2  2 class2_school2
## 3       3  3 class3_school2
## 
## 
## $class
## $class[[1]]
##   subject q1                uniqueID
## 1       1  2 student1_class1_school1
## 2       2  2 student2_class1_school1
## 3       3  2 student3_class1_school1
## 4       4  2 student4_class1_school1
## 5       5  1 student5_class1_school1
## 
## $class[[2]]
##   subject q1                uniqueID
## 1       1  2 student1_class2_school1
## 2       2  2 student2_class2_school1
## 3       3  2 student3_class2_school1
## 4       4  2 student4_class2_school1
## 5       5  1 student5_class2_school1
## 
## $class[[3]]
##   subject q1                uniqueID
## 1       1  2 student1_class3_school1
## 2       2  1 student2_class3_school1
## 3       3  1 student3_class3_school1
## 4       4  1 student4_class3_school1
## 5       5  1 student5_class3_school1
## 
## $class[[4]]
##   subject q1                uniqueID
## 1       1  2 student1_class1_school2
## 2       2  3 student2_class1_school2
## 3       3  3 student3_class1_school2
## 4       4  1 student4_class1_school2
## 5       5  3 student5_class1_school2
## 
## $class[[5]]
##   subject q1                uniqueID
## 1       1  1 student1_class2_school2
## 2       2  1 student2_class2_school2
## 3       3  1 student3_class2_school2
## 4       4  1 student4_class2_school2
## 5       5  4 student5_class2_school2
## 
## $class[[6]]
##   subject q1                uniqueID
## 1       1  2 student1_class3_school2
## 2       2  2 student2_class3_school2
## 3       3  1 student3_class3_school2
## 4       4  2 student4_class3_school2
## 5       5  1 student5_class3_school2

set.seed(1); cluster_gen(n2, n_X = 0, n_W = 1, calc_weights = FALSE, collapse = "partial")

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for schools, classes
## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)

## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)

## $school
##   subject q1       uniqueID
## 1       1  2 class1_school1
## 2       2  2 class2_school1
## 3       3  2 class3_school1
## 4       4  1 class1_school2
## 5       5  2 class2_school2
## 6       6  3 class3_school2
## 
## $class
##    subject q1                uniqueID
## 1        1  2 student1_class1_school1
## 2        2  2 student2_class1_school1
## 3        3  2 student3_class1_school1
## 4        4  2 student4_class1_school1
## 5        5  1 student5_class1_school1
## 6        6  2 student1_class2_school1
## 7        7  2 student2_class2_school1
## 8        8  2 student3_class2_school1
## 9        9  2 student4_class2_school1
## 10      10  1 student5_class2_school1
## 11      11  2 student1_class3_school1
## 12      12  1 student2_class3_school1
## 13      13  1 student3_class3_school1
## 14      14  1 student4_class3_school1
## 15      15  1 student5_class3_school1
## 16      16  2 student1_class1_school2
## 17      17  3 student2_class1_school2
## 18      18  3 student3_class1_school2
## 19      19  1 student4_class1_school2
## 20      20  3 student5_class1_school2
## 21      21  1 student1_class2_school2
## 22      22  1 student2_class2_school2
## 23      23  1 student3_class2_school2
## 24      24  1 student4_class2_school2
## 25      25  4 student5_class2_school2
## 26      26  2 student1_class3_school2
## 27      27  2 student2_class3_school2
## 28      28  1 student3_class3_school2
## 29      29  2 student4_class3_school2
## 30      30  1 student5_class3_school2

set.seed(1); cluster_gen(n2, n_X = 0, n_W = 1, calc_weights = FALSE, collapse = "full")

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for schools, classes
## Total respondents: 36 (3 + 3 + 5 + 5 + 5 + 5 + 5 + 5)

## school1
## ├─school1_class1 (5 students)
## ├─school1_class2 (5 students)
## └─school1_class3 (5 students)
## school2
## ├─school2_class1 (5 students)
## ├─school2_class2 (5 students)
## └─school2_class3 (5 students)

##    subject q1.student        uniqueID.student q1.teacher
## 1        1          2 student1_class1_school1          2
## 2        2          2 student2_class1_school1          2
## 3        3          2 student3_class1_school1          2
## 4        4          2 student4_class1_school1          2
## 5        5          1 student5_class1_school1          2
## 6        6          2 student1_class1_school2          1
## 7        7          3 student2_class1_school2          1
## 8        8          3 student3_class1_school2          1
## 9        9          1 student4_class1_school2          1
## 10      10          3 student5_class1_school2          1
## 11      11          2 student1_class2_school1          2
## 12      12          2 student2_class2_school1          2
## 13      13          2 student3_class2_school1          2
## 14      14          2 student4_class2_school1          2
## 15      15          1 student5_class2_school1          2
## 16      16          1 student1_class2_school2          2
## 17      17          1 student2_class2_school2          2
## 18      18          1 student3_class2_school2          2
## 19      19          1 student4_class2_school2          2
## 20      20          4 student5_class2_school2          2
## 21      21          2 student1_class3_school1          2
## 22      22          1 student2_class3_school1          2
## 23      23          1 student3_class3_school1          2
## 24      24          1 student4_class3_school1          2
## 25      25          1 student5_class3_school1          2
## 26      26          2 student1_class3_school2          3
## 27      27          2 student2_class3_school2          3
## 28      28          1 student3_class3_school2          3
## 29      29          2 student4_class3_school2          3
## 30      30          1 student5_class3_school2          3
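
To make the difference between the "partial" and "full" outputs above more concrete, here is a toy base R analogy. It is only a sketch of the idea (stacking per-cluster data frames, then repeating higher-level answers on lower-level rows) and is not claimed to be how cluster_gen implements collapse. The data and the class_id key are invented for illustration.

```r
# Toy data: teacher answers stored per school, student answers in one table.
teachers <- list(
  data.frame(q1 = c(2, 2), class_id = c("class1_school1", "class2_school1")),
  data.frame(q1 = c(1, 3), class_id = c("class1_school2", "class2_school2"))
)
students <- data.frame(
  q1 = c(2, 1, 3, 1),
  class_id = c("class1_school1", "class1_school1",
               "class2_school2", "class2_school2")
)

# "partial"-like: stack the per-school data frames into one per level.
teachers_stacked <- do.call(rbind, teachers)

# "full"-like: repeat each teacher's answer on the rows of their students.
full <- merge(students, teachers_stacked, by = "class_id",
              suffixes = c(".student", ".teacher"))
full
```

Each student row in full carries a copy of the matching teacher's q1, mirroring the q1.student and q1.teacher columns in the output above.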

Checking sampling weights

This is the final section of this document. At this point you are assumed to be familiar with how cluster_gen works, but before proceeding there is one last argument you should know: sampling_method.

Changing the sampling method

Consider the example below. The n* numbering is reset for convenience, and print_pop_structure was set to FALSE to suppress the otherwise lengthy output of the population structure (the tree would list 5 × 9 × 6 = 270 classes). You can check draw_cluster_structure(N1) for yourself if you’re interested:

n1 <- c(2, 3, 2, 10)
N1 <- c(5, 9, 6, 50)
data1 <- cluster_gen(n = n1, N = N1, print_pop_structure = FALSE)

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for states, schools, classes

## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)

## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
##   ├─state1_school3_class1 (10 students)
##   └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
##   ├─state2_school3_class1 (10 students)
##   └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating SRS weights at the state level

##   state.weight should add up to the number of states in the population (5, counting once per state)

## - Calculating PPS weights at the school level

##   final.teacher.weight should add up to the number of teachers in the population (270)

## - Calculating SRS weights at the class level

##   class.weight should add up to the number of classes in the population (270, counting once per class)

Since N != n, the output of cluster_gen included some information on sampling weights. This information is crucial for understanding how the weights were calculated and for checking whether they were calculated correctly. The default behavior of cluster_gen is to use PPS (Probability Proportional to Size) whenever it detects “school” as a label and SRS (Simple Random Sampling) otherwise. This can, however, be changed. See the following examples (also notice how print_pop_structure was abbreviated to print_pop; this is OK as long as the abbreviation is still unambiguous to R):

data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = "mixed")  # default

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for states, schools, classes

## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)

## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
##   ├─state1_school3_class1 (10 students)
##   └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
##   ├─state2_school3_class1 (10 students)
##   └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating SRS weights at the state level

##   state.weight should add up to the number of states in the population (5, counting once per state)

## - Calculating PPS weights at the school level

##   final.teacher.weight should add up to the number of teachers in the population (270)

## - Calculating SRS weights at the class level

##   class.weight should add up to the number of classes in the population (270, counting once per class)

data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = "SRS")  # always SRS

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for states, schools, classes

## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)

## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
##   ├─state1_school3_class1 (10 students)
##   └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
##   ├─state2_school3_class1 (10 students)
##   └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating SRS weights at the state level

##   state.weight should add up to the number of states in the population (5, counting once per state)

## - Calculating SRS weights at the school level

##   school.weight should add up to the number of schools in the population (45, counting once per school)

## - Calculating SRS weights at the class level

##   class.weight should add up to the number of classes in the population (270, counting once per class)

data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = "PPS")  # always PPS

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for states, schools, classes

## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)

## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
##   ├─state1_school3_class1 (10 students)
##   └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
##   ├─state2_school3_class1 (10 students)
##   └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the state level

##   final.principal.weight should add up to the number of principals in the population (45)

## - Calculating PPS weights at the school level

##   final.teacher.weight should add up to the number of teachers in the population (270)

## - Calculating PPS weights at the class level

##   final.student.weight should add up to the number of students in the population (13500)

data1 <- cluster_gen(n = n1, N = N1, print_pop = FALSE, sampling_method = c("PPS", "PPS", "SRS"))  # customized

## ── Hierarchical structure ──────────────────────────────────────────────────────────────────────────

## Generating questionnaires for states, schools, classes

## Total respondents: 138 (3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10)

## state1
## ├─state1_school1
## │ ├─state1_school1_class1 (10 students)
## │ └─state1_school1_class2 (10 students)
## ├─state1_school2
## │ ├─state1_school2_class1 (10 students)
## │ └─state1_school2_class2 (10 students)
## └─state1_school3
##   ├─state1_school3_class1 (10 students)
##   └─state1_school3_class2 (10 students)
## state2
## ├─state2_school1
## │ ├─state2_school1_class1 (10 students)
## │ └─state2_school1_class2 (10 students)
## ├─state2_school2
## │ ├─state2_school2_class1 (10 students)
## │ └─state2_school2_class2 (10 students)
## └─state2_school3
##   ├─state2_school3_class1 (10 students)
##   └─state2_school3_class2 (10 students)
## ── Information on sampling weights ─────────────────────────────────────────────────────────────────

## - Calculating PPS weights at the state level

##   final.principal.weight should add up to the number of principals in the population (45)

## - Calculating PPS weights at the school level

##   final.teacher.weight should add up to the number of teachers in the population (270)

## - Calculating SRS weights at the class level

##   class.weight should add up to the number of classes in the population (270, counting once per class)
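
The totals in parentheses in the messages above can be cross-checked against N1 with a cumulative product. This is a small sketch. The level names are taken from the printed labels and are otherwise an assumption.

```r
N1 <- c(5, 9, 6, 50)

# Number of units at each level = product of all sizes down to that level.
level_totals <- cumprod(N1)
names(level_totals) <- c("states", "schools", "classes", "students")
level_totals  # 5 states, 45 schools, 270 classes, 13500 students
```

These match the 5, 45, 270 and 13500 reported in the "Information on sampling weights" blocks.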

Calculating sampling weights

As a tester, your main task is to check the calculation of all sampling weights. The weights are calculated following the PISA Data Analysis Manual, which has a chapter explaining how such weights are obtained; you may of course consult other valid references on the subject.

When validating the output of cluster_gen, please check:

  1. If the “information on sampling weights” is correct (especially the totals in parentheses).
  2. If the labels of the *.weight columns are correct.
  3. If the values of the *.weight columns are correct.

If an error is found in the weight columns, its origin is likely in either the *.weight or the within.*.weight column. The final.*.weight is calculated as the product of those two, so errors found there are merely propagated from them.
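
As a reference point for manual checks, the sketch below computes two-stage weights using textbook formulas: PPS selection of schools followed by SRS of students within each school. The population and sample sizes are hypothetical, and whether cluster_gen reproduces these formulas exactly is precisely what the tests are meant to establish.

```r
# Hypothetical population of 10 schools with 10 to 19 students (145 in total).
school_sizes <- 10:19
total_students <- sum(school_sizes)

# Suppose schools 1, 2 and 3 were sampled, with 9, 7 and 9 students drawn.
sampled <- 1:3
n_schools <- length(sampled)
n_students <- c(9, 7, 9)

# Stage 1 (PPS): selection probability proportional to school size.
school_prob <- n_schools * school_sizes[sampled] / total_students
school_weight <- 1 / school_prob

# Stage 2 (SRS within each school).
within_school_weight <- school_sizes[sampled] / n_students

# Final weight is the product of the two stages.
final_student_weight <- school_weight * within_school_weight

# Summed over every sampled student, the final weights recover the population.
sum(final_student_weight * n_students)  # 145
```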

Thank you for also reporting any other errors found. Please read the “How to give feedback” section for more information about how to report errors.

Testing given examples

Here are some examples of pairs of n and N which can be used to get you started. Use each (n*, N*) pair as input to cluster_gen.

n1 <- 1:4
N1 <- 5

n2 <- c(5, 1, 3)
N2 <- list(6, c(2, 4, 2, 1, 6, 7), rep(10, sum(c(2, 4, 2, 1, 6, 7))))

n3 <- list(3, c(4, 2, 4), c(8, 2, 1, 3, 4, 6, 9, 10, 2, 10))
N3 <- c(3, 4, 10)
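
When checking these examples, it can help to total a weight column across all clusters and compare it with the number announced in the messages. The helper below is hypothetical (it is not part of lsasim) and assumes the output is a list of data frames like the $school element printed earlier in this document.

```r
# Hypothetical helper: total of one weight column across a list of data frames.
weight_total <- function(dfs, column) {
  values <- unlist(lapply(dfs, function(df) df[[column]]))
  c(total = sum(values, na.rm = TRUE), missing = sum(is.na(values)))
}

# Toy input so the sketch runs without lsasim installed.
toy <- list(
  data.frame(subject = 1:2, final.student.weight = c(0.5, 0.5)),
  data.frame(subject = 1:2, final.student.weight = c(NA, NA))
)
weight_total(toy, "final.student.weight")  # total of 1, with 2 missing values
```

With real output, a call such as weight_total(data8$school, "final.student.weight") could then be compared with the total reported for that example (145), provided the returned object keeps the shape assumed here.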

Coming up with new examples

The true power of having multiple collaborators is that several brains can come up with more examples than one alone. Use your imagination, try to break the package, and find as many examples as you can that don’t work as they should.
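
When trying many inputs in a row, it is convenient to record errors and warnings instead of stopping at the first failure. The wrapper below is a hypothetical sketch in base R. The fragile function is only a stand-in so that the example runs without lsasim installed.

```r
# Hypothetical wrapper: run a function and record errors and warnings.
try_case <- function(fun, ...) {
  warnings_seen <- character(0)
  result <- withCallingHandlers(
    tryCatch(fun(...), error = function(e) e),
    warning = function(w) {
      warnings_seen <<- c(warnings_seen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(
    ok = !inherits(result, "error"),
    error = if (inherits(result, "error")) conditionMessage(result) else NA_character_,
    warnings = warnings_seen
  )
}

# Stand-in for the function under test.
fragile <- function(n, N) {
  if (any(N < n)) warning("population smaller than sample")
  if (length(n) != length(N)) stop("n and N differ in length")
  "done"
}

try_case(fragile, n = c(3, 4), N = c(2, 3))$warnings
try_case(fragile, n = 1:4, N = 5:6)$error
```

With lsasim loaded, the same wrapper could be called as try_case(cluster_gen, n = n10, N = N10), and the recorded messages pasted into a bug report.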
