diff --git a/06-effectsize.qmd b/06-effectsize.qmd
index 266e5a9..01be76a 100644
--- a/06-effectsize.qmd
+++ b/06-effectsize.qmd
@@ -431,8 +431,8 @@ cat(longmcq(opts_p))
 ```{r, echo = FALSE, results = 'asis'}
 opts_p <- c(
-  answer = "*r*; epsilon-squared",
-  "Hedges’ *g*; omega-squared",
+  "*r*; epsilon-squared",
+  answer = "Hedges’ *g*; omega-squared",
   "Cohen’s $d_s$; generalized eta-squared"
 )
 cat(longmcq(opts_p))
 ```
diff --git a/docs/03-likelihoods.html b/docs/03-likelihoods.html
index 8acd82a..88c0c66 100644
--- a/docs/03-likelihoods.html
+++ b/docs/03-likelihoods.html
@@ -458,8 +458,8 @@

< 3.4.1 Questions about likelihoods

Q1: Let’s assume that you flip what you believe to be a fair coin. What is the binomial probability of observing 8 heads out of 10 coin flips, when p = 0.5? (You can use the functions in the chapter, or compute it by hand).
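A minimal way to check this in R, using the base dbinom() function with the numbers taken from the question:

```r
# Probability of exactly 8 heads in 10 flips of a fair coin (p = 0.5)
dbinom(x = 8, size = 10, prob = 0.5)
```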

-
- +
+

Q2: The likelihood curve rises and falls, except in the extreme cases where 0 heads or only heads are observed. Copy the code below (remember that you can click the ‘clipboard’ icon on the top right of the code section to copy all the code to your clipboard), and plot the likelihood curves for 0 heads (x <- 0) out of 10 flips (n <- 10) by running the script. What does the likelihood curve look like?
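The chapter's script is not reproduced in this diff; a minimal sketch of the same idea, with illustrative variable names, is:

```r
# Likelihood of the observed data across all possible values of p
n <- 10                          # number of flips
x <- 0                           # number of heads observed
p <- seq(0, 1, length.out = 1000)
L <- dbinom(x, n, p)             # binomial likelihood at each value of p
plot(p, L, type = "l",
     xlab = "p (probability of heads)", ylab = "Likelihood")
```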

@@ -483,8 +483,8 @@

< title(paste("Likelihood Ratio H0/H1:", round(dbinom(x, n, H0) / dbinom(x, n, H1), digits = 2), " Likelihood Ratio H1/H0:", round(dbinom(x, n, H1) / dbinom(x, n, H0), digits = 2)))

-
- +
+

Q3: Get a coin out of your pocket or purse. Flip it 13 times, and count the number of heads. Using the code above, calculate the likelihood of your observed results under the hypothesis that your coin is fair, compared to the hypothesis that the coin is not fair. Set the number of successes (x) to the number of heads you observed. Change \(H_1\) to the proportion of heads you observed (or leave it at 0 if you didn’t observe any heads at all!). For example, if you observed 4 heads, you can just use 4/13, or enter 0.3077. Leave \(H_0\) at 0.5. Run the script to calculate the likelihood ratio. What is the likelihood ratio of a fair compared to a non-fair coin (or \(H_0\)/\(H_1\)) that flips heads as often as you have observed, based on the observed data? Round your answer to 2 digits after the decimal.
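As a sketch, assuming purely for illustration that you observed 4 heads in 13 flips (the example mentioned above), the likelihood ratio is:

```r
# Likelihood ratio for H0: p = 0.5 vs H1: p = observed proportion
n  <- 13
x  <- 4
H0 <- 0.5
H1 <- x / n                            # 4/13, about 0.3077
dbinom(x, n, H0) / dbinom(x, n, H1)    # likelihood ratio H0/H1
```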

@@ -497,8 +497,8 @@

<

Q7: When comparing two hypotheses (p = X vs p = Y), a likelihood ratio of:

-
- +
+

@@ -506,21 +506,21 @@

<

A Shiny app to perform the calculations is available here.

Q8: Which statement is correct when you perform 3 studies?

-
- +
+

Q9: Sometimes in a set of three studies, you’ll find a significant effect in one study, but there is no effect in the other two related studies. Assume the two related studies were not exactly the same in every way (e.g., you changed the manipulation, or the procedure, or some of the questions). It could be that the two other studies did not work because of minor differences that had some effect that you do not fully understand yet. Or it could be that the single significant result was a Type 1 error, and \(H_0\) was true in all three studies. Which statement below is correct, assuming a 5% Type 1 error rate and 80% power?

-
- +
+

The idea that most studies have 80% power is slightly optimistic. Examine the correct answer to the previous question across a range of power values (e.g., 50% power, and 30% power).
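The app itself is not shown here, but assuming it uses the same binomial logic as the rest of this chapter (an assumption on my part), the comparison for 1 significant result out of 3 studies can be sketched across power values as:

```r
# Probability of exactly 1 significant result in 3 studies, when H1 is true
# (given power) vs. when H0 is true in all three (alpha = 0.05)
power <- c(0.80, 0.50, 0.30)
p_H1  <- dbinom(1, size = 3, prob = power)   # true effect, given power
p_H0  <- dbinom(1, size = 3, prob = 0.05)    # three true nulls, alpha = .05
data.frame(power, p_H1, p_H0, ratio_H1_H0 = p_H1 / p_H0)
```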

Q10: Several papers suggest it is a reasonable assumption that the power in the psychological literature might be around 50%. Set the number of studies to 4, the number of successes also to 4, and the assumed power slider to 50%, and look at the table at the bottom of the app. How likely is it that you will observe 4 significant results in 4 studies, assuming there is a true effect?

-
- +
+

Imagine you perform 4 studies, and 3 show a significant result. Change these numbers in the online app. Leave the power at 50%. The output in the text tells you:

@@ -529,16 +529,16 @@

<

These calculations show that, assuming you have observed three significant results out of four studies, and assuming each study had 50% power, you are 526 times more likely to have observed these data when the alternative hypothesis is true, than when the null hypothesis is true. In other words, you are 526 times more likely to find a significant effect in three out of four studies when you have 50% power, than to find three Type 1 errors in a set of four studies.
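The 526 figure can be reproduced with a one-line binomial calculation, assuming (as the text does) 50% power and a 5% Type 1 error rate:

```r
# 3 significant results out of 4 studies:
# likelihood under H1 (power = .50) relative to H0 (alpha = .05)
dbinom(3, size = 4, prob = 0.50) / dbinom(3, size = 4, prob = 0.05)
# approximately 526
```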

Q11: Maybe you don’t think 50% power is a reasonable assumption. How low can the power be (rounded to 2 digits), for the likelihood to remain higher than 32 in favor of \(H_1\) when observing 3 out of 4 significant results?

-
- +
+

The main take-home message of these calculations is to understand that 1) mixed results are supposed to happen, and 2) mixed results can contain strong evidence for a true effect, across a wide range of plausible power values. The app also tells you how much evidence, in a rough dichotomous way, you can expect. This is useful for our educational goal. But when you want to evaluate results from multiple studies, the formal way to do so is by performing a meta-analysis.

The above calculations make a very important assumption, namely that the Type 1 error rate is controlled at 5%. If you try out many different tests in each study, and only report the result that yielded p < 0.05, these calculations no longer hold.

Q12: Go back to the default settings of 2 out of 3 significant results, but now set the Type 1 error rate to 20%, to reflect a modest amount of p-hacking. Under these circumstances, what is the highest likelihood in favor of \(H_1\) you can get if you explore all possible values for the true power?
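Assuming the app again compares binomial probabilities (a sketch of the logic, not the app's actual code), the exploration in this question amounts to a grid search over possible power values:

```r
# 2 significant results out of 3 studies, with an inflated Type 1 error rate
# of 20%; find the power that maximizes the likelihood ratio H1/H0
power <- seq(0.01, 0.99, by = 0.01)
ratio <- dbinom(2, size = 3, prob = power) / dbinom(2, size = 3, prob = 0.20)
max(ratio)               # highest attainable evidence in favor of H1
power[which.max(ratio)]  # power value at which that maximum occurs
```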

-
- +
+

As the scenario above shows, p-hacking makes studies extremely uninformative. If you inflate the error rate, you quickly destroy the evidence in the data. You can no longer determine whether the data are more likely when there is no effect, than when there is an effect. Sometimes researchers complain that people who worry about p-hacking and try to promote better Type 1 error control are missing the point, and that other things (better measurement, better theory, etc.) are more important. I fully agree that these aspects of scientific research are at least as important as better error control. But better measures and theories will require decades of work. Better error control could be accomplished today, if researchers would stop inflating their error rates by flexibly analyzing their data. And as this assignment shows, inflated rates of false positives very quickly make it difficult to learn what is true from the data we collect. Because of the relative ease with which this part of scientific research can be improved, and because we can achieve this today (and not in a decade), I think it is worth stressing the importance of error control, and of publishing more realistic-looking sets of studies.

diff --git a/docs/04-bayes.html b/docs/04-bayes.html
index f421785..e431861 100644
--- a/docs/04-bayes.html
+++ b/docs/04-bayes.html
@@ -541,14 +541,14 @@

Q1: The true believer had a prior of Beta(1,0.5). After observing 10 heads out of 20 coin flips, what is the posterior distribution, given that \(\alpha\) = \(\alpha\) + x and \(\beta\) = \(\beta\) + n – x?

-
- +
+

Q2: The extreme skeptic had a prior of Beta(100,100). After observing 50 heads out of 100 coin flips, what is the posterior distribution, given that \(\alpha\) = \(\alpha\) + x and \(\beta\) = \(\beta\) + n – x?

-
- +
+

Copy the R script below into R. This script requires 5 input parameters (identical to the Bayes Factor calculator website used above). These are the hypothesis you want to examine (e.g., when evaluating whether a coin is fair, p = 0.5), the total number of trials (e.g., 20 flips), the number of successes (e.g., 10 heads), and the \(\alpha\) and \(\beta\) values for the Beta distribution for the prior (e.g., \(\alpha\) = 1 and \(\beta\) = 1 for a uniform prior). Run the script. It will calculate the Bayes Factor, and plot the prior (grey), likelihood (dashed blue), and posterior (black).
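The full script is not reproduced in this diff; a minimal sketch of what it computes, with illustrative parameter names, is:

```r
# Sketch of a binomial prior/likelihood/posterior plot (illustrative names)
H0     <- 0.5   # hypothesized value of p
n      <- 20    # number of trials
x      <- 10    # number of successes
aprior <- 1     # alpha of the Beta prior
bprior <- 1     # beta of the Beta prior

p          <- seq(0, 1, length.out = 1000)
prior      <- dbeta(p, aprior, bprior)
likelihood <- dbinom(x, n, p)
posterior  <- dbeta(p, aprior + x, bprior + n - x)

plot(p, posterior, type = "l", ylab = "Density")
lines(p, prior, col = "grey")
# likelihood rescaled only so it is visible on the same axes
lines(p, likelihood / max(likelihood) * max(posterior), lty = 2, col = "blue")

# Savage-Dickey density ratio: posterior vs. prior density at H0
# (support for the point hypothesis relative to the prior-based alternative)
dbeta(H0, aprior + x, bprior + n - x) / dbeta(H0, aprior, bprior)
```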

@@ -588,14 +588,14 @@

We see that for the newborn baby, p = 0.5 has become more probable, but so has p = 0.4.

Q3: Change the hypothesis in the first line from 0.5 to 0.675, and run the script. If you were testing the idea that this coin returns 67.5% heads, which statement is true?

-
- +
+

Q4: Change the hypothesis in the first line back to 0.5. Let’s look at the increase in the belief of the hypothesis p = 0.5 for the extreme skeptic after 10 heads out of 20 coin flips. Change the \(\alpha\) for the prior in line 4 to 100 and the \(\beta\) for the prior in line 5 to 100. Run the script. Compare the figure from R to the increase in belief for the newborn baby. Which statement is true?

-
- +
+

Copy the R script below and run it. The script will plot the mean for the posterior when 10 heads out of 20 coin flips are observed, given a uniform prior (as in Figure 4.6). The script will also use the ‘binom’ package to calculate the posterior mean, the credible interval, and the highest density interval, which is an alternative to the credible interval.
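The script is not shown here, but if I recall the binom package's interface correctly (treat the argument names as an assumption), the central credible interval and the highest density interval can be requested as follows:

```r
library(binom)
# Central 95% credible interval, 10 heads out of 20 flips, uniform Beta(1,1) prior
binom.bayes(x = 10, n = 20, type = "central", prior.shape1 = 1, prior.shape2 = 1)
# Highest density interval with the same data and prior
binom.bayes(x = 10, n = 20, type = "highest", prior.shape1 = 1, prior.shape2 = 1)
```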

@@ -697,8 +697,8 @@

The posterior mean is identical to the Frequentist mean, but this is only the case when the mean of the prior equals the mean of the likelihood.

Q5: Assume the outcome of 20 coin flips had been 18 heads. Change x to 18 in line 2 and run the script. Remember that the mean of the prior Beta(1,1) distribution is \(\alpha\) / (\(\alpha\) + \(\beta\)), or 1/(1+1) = 0.5. The Frequentist mean is simply x/n, or 18/20=0.9. Which statement is true?

-
- +
+

Q6: What is, today, your best estimate of the probability that the sun will rise tomorrow? Assume you were born with a uniform Beta(1,1) prior. The sun can either rise, or not. Assume you have seen the sun rise every day since you were born, which means there has been a continuous string of successes for every day you have been alive. It is OK to estimate the days you have been alive by just multiplying your age by 365 days. What is your best estimate of the probability that the sun will rise tomorrow?
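This is Laplace's rule of succession: with a Beta(1,1) prior and an unbroken string of n successes, the posterior mean is (n + 1) / (n + 2). A quick sketch, assuming an age of 25 purely for illustration:

```r
# Rule of succession with a uniform Beta(1,1) prior:
# posterior mean after n successes in n trials is (n + 1) / (n + 2)
age <- 25            # assumed age, for illustration only
n   <- age * 365     # approximate number of observed sunrises
(n + 1) / (n + 2)    # estimated probability the sun rises tomorrow
```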

diff --git a/docs/06-effectsize.html b/docs/06-effectsize.html
index f0d39a5..1e66004 100644
--- a/docs/06-effectsize.html
+++ b/docs/06-effectsize.html
@@ -551,76 +551,76 @@

Q1: One of the largest effect sizes in the meta-meta analysis by Richard and colleagues from 2003 is that people are likely to perform an action if they feel positively about the action and believe it is common. Such an effect is (with all due respect to all of the researchers who contributed to this meta-analysis) somewhat trivial. Even so, the correlation was r = .66, which equals a Cohen’s d of 1.76. What, according to the online app at https://rpsychologist.com/cohend/, is the probability of superiority for an effect of this size?
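As a cross-check of the numbers stated in the question, the r-to-d conversion can be run with the effectsize package, and a common formula for the probability of superiority of two independent normal groups is the standard normal CDF evaluated at d / sqrt(2); the app may make slightly different assumptions:

```r
# Check the conversion reported in the question: r = .66 -> d of about 1.76
effectsize::r_to_d(0.66)

# Probability of superiority (common language effect size) for d = 1.76,
# assuming two independent, normally distributed groups with equal SDs
d <- 1.76
pnorm(d / sqrt(2))
```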

-
- +
+

Q2: Cohen’s d is to ______ as eta-squared is to ________

-
- +
+

Q3: A correlation of r = 1.2 is:

-
- +
+

Q4: Let’s assume the difference between two means we observe is 1, and the pooled standard deviation is also 1. If we simulate a large number of studies with those values, what, on average, happens to the t-value and Cohen’s d, as a function of the sample size in these simulations?

-
- +
+

Q5: Go to http://rpsychologist.com/d3/correlation/ to look at a good visualization of the proportion of variance that is explained by group membership, and the relationship between r and \(r^2\). Look at the scatterplot and the shared variance for an effect size of r = .21 (Richard et al., 2003). Given that r = 0.21 was their estimate of the median effect size in psychological research (not corrected for bias), how much variance in the data do variables in psychology on average explain?

-
- +
+

Q6: By default, the sample size for the online correlation visualization linked to above is 50. Click on the cogwheel to access the settings, change the sample size to 500, and click the button ‘New Sample’. What happens?

-
- +
+

Q7: In an old paper you find a statistical result reported as t(36) = 2.14, p < 0.05 for an independent t-test without a reported effect size. Using the online MOTE app https://doomlab.shinyapps.io/mote/ (choose “Independent t -t” from the Mean Differences dropdown menu) or the MOTE R function d.ind.t.t, what is the Cohen’s d effect size for this effect, given 38 participants (e.g., 19 in each group, leading to N – 2 = 36 degrees of freedom) and an alpha level of 0.05?
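A sketch of the corresponding R call, assuming the MOTE argument names are as I remember them (treat this as a sketch rather than verified usage):

```r
library(MOTE)
# Cohen's d from a t-value for an independent t-test:
# t(36) = 2.14, with 19 participants per group, alpha = .05
d.ind.t.t(t = 2.14, n1 = 19, n2 = 19, a = 0.05)
```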

-
- +
+

Q8: In an old paper you find a statistical result from a 2x3 between-subjects ANOVA reported as F(2, 122) = 4.13, p < 0.05, without a reported effect size. Using the online MOTE app https://doomlab.shinyapps.io/mote/ (choose Eta – F from the Variance Overlap dropdown menu) or the MOTE R function eta.F, what is the effect size expressed as partial eta-squared?
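A sketch of the corresponding MOTE call (argument names assumed, not verified):

```r
library(MOTE)
# Partial eta-squared from F(2, 122) = 4.13, alpha = .05
eta.F(dfm = 2, dfe = 122, Fvalue = 4.13, a = 0.05)
```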

-
- +
+

Q9: You realize that computing omega-squared corrects for some of the bias in eta-squared. For the old paper with F(2, 122) = 4.13, p < 0.05, and using the online MOTE app https://doomlab.shinyapps.io/mote/ (choose Omega – F from the Variance Overlap dropdown menu) or the MOTE R function omega.F, what is the effect size in partial omega-squared? HINT: The total sample size is the \(df_{error} + k\), where k is the number of groups (which is 6 for the 2x3 ANOVA).
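A sketch of the corresponding MOTE call (argument names assumed, not verified), using the total sample size from the hint:

```r
library(MOTE)
# Partial omega-squared from F(2, 122) = 4.13;
# n = df_error + k = 122 + 6 = 128, as noted in the hint
omega.F(dfm = 2, dfe = 122, Fvalue = 4.13, n = 128, a = 0.05)
```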

-
- +
+

Q10: Several times in this chapter the effect size Cohen’s d was converted to r, or vice versa. We can use the effectsize R package (which can also be used to compute effect sizes when you analyze your data in R) to convert the median r = 0.21 observed in Richard and colleagues’ meta-meta-analysis to d: effectsize::r_to_d(0.21), which yields d = 0.43 (the conversion assumes equal sample sizes in each group). Which Cohen’s d corresponds to an r = 0.1?

-
- +
+

Q11: It can be useful to convert effect sizes to r when performing a meta-analysis where not all effect sizes that are included are based on mean differences. Using the d_to_r() function in the effectsize package, what does a d = 0.8 correspond to (again assuming equal sample sizes per condition)?

-
- +
+

Q12: From questions 10 and 11 you might have noticed something peculiar. The benchmarks typically used for ‘small’, ‘medium’, and ‘large’ effects for Cohen’s d are d = 0.2, d = 0.5, and d = 0.8, and for a correlation are r = 0.1, r = 0.3, and r = 0.5. Using the d_to_r() function in the effectsize package, check whether the benchmarks for a ‘large’ effect size correspond between d and r.
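The check itself is a single call to the function named in the question; whether the result lands anywhere near the r = 0.5 benchmark is exactly what the McGrath and Meyer quote below addresses:

```r
# Does a 'large' d of 0.8 convert to the 'large' r benchmark of 0.5?
effectsize::d_to_r(0.8)
```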

As McGrath & Meyer (2006) write: “Many users of Cohen’s (1988) benchmarks seem unaware that those for the correlation coefficient and d are not strictly equivalent, because Cohen’s generally cited benchmarks for the correlation were intended for the infrequently used biserial correlation rather than for the point biserial.”

Download the paper by McGrath and Meyer, 2006 (you can find links to the pdf here), and on page 390, right column, read which solution the authors prefer.

-
- +
+
diff --git a/docs/10-sequential.html b/docs/10-sequential.html
index e6c8faf..6f3ce17 100644
--- a/docs/10-sequential.html
+++ b/docs/10-sequential.html
@@ -1009,17 +1009,17 @@

-[PROGRESS] Stage results calculated [0.0326 secs] 
-[PROGRESS] Conditional power calculated [0.0266 secs] 
-[PROGRESS] Conditional rejection probabilities (CRP) calculated [0.0011 secs] 
-[PROGRESS] Repeated confidence interval of stage 1 calculated [0.5782 secs] 
-[PROGRESS] Repeated confidence interval of stage 2 calculated [0.5611 secs] 
-[PROGRESS] Repeated confidence interval calculated [1.14 secs] 
-[PROGRESS] Repeated p-values of stage 1 calculated [0.2358 secs] 
-[PROGRESS] Repeated p-values of stage 2 calculated [0.235 secs] 
-[PROGRESS] Repeated p-values calculated [0.4721 secs] 
-[PROGRESS] Final p-value calculated [0.0014 secs] 
-[PROGRESS] Final confidence interval calculated [0.0696 secs] 
+[PROGRESS] Stage results calculated [0.0554 secs] 
+[PROGRESS] Conditional power calculated [0.0408 secs] 
+[PROGRESS] Conditional rejection probabilities (CRP) calculated [0.0015 secs] 
+[PROGRESS] Repeated confidence interval of stage 1 calculated [0.7934 secs] 
+[PROGRESS] Repeated confidence interval of stage 2 calculated [0.6943 secs] 
+[PROGRESS] Repeated confidence interval calculated [1.49 secs] 
+[PROGRESS] Repeated p-values of stage 1 calculated [0.5374 secs] 
+[PROGRESS] Repeated p-values of stage 2 calculated [0.4709 secs] 
+[PROGRESS] Repeated p-values calculated [1.01 secs] 
+[PROGRESS] Final p-value calculated [0.0029 secs] 
+[PROGRESS] Final confidence interval calculated [0.1217 secs] 

diff --git a/docs/changelog.html b/docs/changelog.html
index a4a399a..6c6ece0 100644
--- a/docs/changelog.html
+++ b/docs/changelog.html
@@ -254,8 +254,8 @@

Change Log

The current version of this textbook is 1.4.2.

-This version has been compiled on October 15, 2023.
-This version was generated from Git commit #f1caacf1. All version controlled changes can be found on GitHub.

+This version has been compiled on October 19, 2023.
+This version was generated from Git commit #49748dd2. All version controlled changes can be found on GitHub.

This page documents the changes to the textbook that were more substantial than fixing a typo.

Updates

September 6, 2023:

diff --git a/docs/search.json b/docs/search.json index 26458b4..71d5c13 100644 --- a/docs/search.json +++ b/docs/search.json @@ -809,7 +809,7 @@ "href": "10-sequential.html#reporting-the-results-of-a-sequential-analysis", "title": "10  Sequential Analysis", "section": "\n10.13 Reporting the results of a sequential analysis", - "text": "10.13 Reporting the results of a sequential analysis\nGroup sequential designs have been developed to efficiently test hypotheses using the Neyman-Pearson approach for statistical inference, where the goal is to decide how to act, while controlling error rates in the long run. Group sequential designs do not have the goal to quantify the strength of evidence, or provide accurate estimates of the effect size (Proschan et al., 2006). Nevertheless, after having reached a conclusion about whether a hypothesis can be rejected or not, researchers will often want to also interpret the effect size estimate when reporting results.\nA challenge when interpreting the observed effect size in sequential designs is that whenever a study is stopped early when \\(H_0\\) is rejected, there is a risk that the data analysis was stopped because, due to random variation, a large effect size was observed at the time of the interim analysis. This means that the observed effect size at these interim analyses over-estimates the true effect size. As Schönbrodt et al. (2017) show, a meta-analysis of studies that used sequential designs will yield an accurate effect size, because studies that stop early have smaller sample sizes, and are weighted less, which is compensated by the smaller effect size estimates in those sequential studies that reach the final look, and are weighted more because of their larger sample size. However, researchers might want to interpret effect sizes from single studies before a meta-analysis can be performed, and in this case, reporting an adjusted effect size estimate can be useful. Although sequential analysis software only allows one to compute adjusted effect size estimates for certain statistical tests, we recommend reporting both the adjusted effect size where possible, and to always also report the unadjusted effect size estimate for future meta-analyses.\nA similar issue is at play when reporting p values and confidence intervals. When a sequential design is used, the distribution of a p value that does not account for the sequential nature of the design is no longer uniform when \\(H_0\\) is true. A p value is the probability of observing a result at least as extreme as the result that was observed, given that \\(H_0\\) is true. It is no longer straightforward to determine what ‘at least as extreme’ means a sequential design (Cook, 2002). The most widely recommended procedure to determine what “at least as extreme” means is to order the outcomes of a series of sequential analyses in terms of the look at which the study was stopped, where earlier stopping is more extreme than later stopping, and where studies with higher z values are more extreme, when different studies are stopped at the same time (Proschan et al., 2006). This is referred to as stagewise ordering, which treats rejections at earlier looks as stronger evidence against \\(H_0\\) than rejections later in the study (Wassmer & Brannath, 2016). Given the direct relationship between a p value and a confidence interval, confidence intervals for sequential designs have also been developed.\nReporting adjusted p values and confidence intervals, however, might be criticized. 
After a sequential design, a correct interpretation from a Neyman-Pearson framework is to conclude that \\(H_0\\) is rejected, the alternative hypothesis is rejected, or that the results are inconclusive. The reason that adjusted p values are reported after sequential designs is to allow readers to interpret them as a measure of evidence. Dupont (1983) provides good arguments to doubt that adjusted p values provide a valid measure of the strength of evidence. Furthermore, a strict interpretation of the Neyman-Pearson approach to statistical inferences also provides an argument against interpreting p values as measures of evidence (Lakens, 2022). Therefore, it is recommended, if researchers are interested in communicating the evidence in the data for \\(H_0\\) relative to the alternative hypothesis, to report likelihoods or Bayes factors, which can always be reported and interpreted after the data collection has been completed. Reporting the unadjusted p-value in relation to the alpha level communicates the basis to reject hypotheses, although it might be important for researchers performing a meta-analysis based on p-values (e.g., a p-curve or z-curve analysis, as explained in the chapter on bias detection) that these are sequential p-values. Adjusted confidence intervals are useful tools to evaluate the observed effect estimate relative to its variability at an interim or the final look at the data. Note that the adjusted parameter estimates are only available in statistical software for a few commonly used designs in pharmaceutical trials, such as comparisons of mean differences between groups, or survuval analysis.\nBelow, we see the same sequential design we started with, with 2 looks and a Pocock-type alpha spending function. After completing the study with the planned sample size of 95 participants per condition (where we collect 48 participants at look 1, and the remaining 47 at look 2), we can now enter the observed data using the function getDataset. 
The means and standard deviations are entered for each stage, so at the second look, only the data from the second 95 participants in each condition are used to compute the means (1.51 and 1.01) and standard deviations (1.03 and 0.96).\n\ndesign <- getDesignGroupSequential(\n kMax = 2,\n typeOfDesign = \"asP\",\n sided = 2,\n alpha = 0.05,\n beta = 0.1\n)\n\ndataMeans <- getDataset(\n n1 = c(48, 47), \n n2 = c(48, 47), \n means1 = c(1.12, 1.51), # for directional test, means 1 > means 2\n means2 = c(1.03, 1.01),\n stDevs1 = c(0.98, 1.03), \n stDevs2 = c(1.06, 0.96)\n )\n\nres <- getAnalysisResults(\n design, \n equalVariances = TRUE,\n dataInput = dataMeans\n )\n\nres\n\n\n\n[PROGRESS] Stage results calculated [0.0326 secs] \n[PROGRESS] Conditional power calculated [0.0266 secs] \n[PROGRESS] Conditional rejection probabilities (CRP) calculated [0.0011 secs] \n[PROGRESS] Repeated confidence interval of stage 1 calculated [0.5782 secs] \n[PROGRESS] Repeated confidence interval of stage 2 calculated [0.5611 secs] \n[PROGRESS] Repeated confidence interval calculated [1.14 secs] \n[PROGRESS] Repeated p-values of stage 1 calculated [0.2358 secs] \n[PROGRESS] Repeated p-values of stage 2 calculated [0.235 secs] \n[PROGRESS] Repeated p-values calculated [0.4721 secs] \n[PROGRESS] Final p-value calculated [0.0014 secs] \n[PROGRESS] Final confidence interval calculated [0.0696 secs]" + "text": "10.13 Reporting the results of a sequential analysis\nGroup sequential designs have been developed to efficiently test hypotheses using the Neyman-Pearson approach for statistical inference, where the goal is to decide how to act, while controlling error rates in the long run. Group sequential designs do not have the goal to quantify the strength of evidence, or provide accurate estimates of the effect size (Proschan et al., 2006). Nevertheless, after having reached a conclusion about whether a hypothesis can be rejected or not, researchers will often want to also interpret the effect size estimate when reporting results.\nA challenge when interpreting the observed effect size in sequential designs is that whenever a study is stopped early when \\(H_0\\) is rejected, there is a risk that the data analysis was stopped because, due to random variation, a large effect size was observed at the time of the interim analysis. This means that the observed effect size at these interim analyses over-estimates the true effect size. As Schönbrodt et al. (2017) show, a meta-analysis of studies that used sequential designs will yield an accurate effect size, because studies that stop early have smaller sample sizes, and are weighted less, which is compensated by the smaller effect size estimates in those sequential studies that reach the final look, and are weighted more because of their larger sample size. However, researchers might want to interpret effect sizes from single studies before a meta-analysis can be performed, and in this case, reporting an adjusted effect size estimate can be useful. Although sequential analysis software only allows one to compute adjusted effect size estimates for certain statistical tests, we recommend reporting both the adjusted effect size where possible, and to always also report the unadjusted effect size estimate for future meta-analyses.\nA similar issue is at play when reporting p values and confidence intervals. When a sequential design is used, the distribution of a p value that does not account for the sequential nature of the design is no longer uniform when \\(H_0\\) is true. 
A p value is the probability of observing a result at least as extreme as the result that was observed, given that \\(H_0\\) is true. It is no longer straightforward to determine what ‘at least as extreme’ means a sequential design (Cook, 2002). The most widely recommended procedure to determine what “at least as extreme” means is to order the outcomes of a series of sequential analyses in terms of the look at which the study was stopped, where earlier stopping is more extreme than later stopping, and where studies with higher z values are more extreme, when different studies are stopped at the same time (Proschan et al., 2006). This is referred to as stagewise ordering, which treats rejections at earlier looks as stronger evidence against \\(H_0\\) than rejections later in the study (Wassmer & Brannath, 2016). Given the direct relationship between a p value and a confidence interval, confidence intervals for sequential designs have also been developed.\nReporting adjusted p values and confidence intervals, however, might be criticized. After a sequential design, a correct interpretation from a Neyman-Pearson framework is to conclude that \\(H_0\\) is rejected, the alternative hypothesis is rejected, or that the results are inconclusive. The reason that adjusted p values are reported after sequential designs is to allow readers to interpret them as a measure of evidence. Dupont (1983) provides good arguments to doubt that adjusted p values provide a valid measure of the strength of evidence. Furthermore, a strict interpretation of the Neyman-Pearson approach to statistical inferences also provides an argument against interpreting p values as measures of evidence (Lakens, 2022). Therefore, it is recommended, if researchers are interested in communicating the evidence in the data for \\(H_0\\) relative to the alternative hypothesis, to report likelihoods or Bayes factors, which can always be reported and interpreted after the data collection has been completed. Reporting the unadjusted p-value in relation to the alpha level communicates the basis to reject hypotheses, although it might be important for researchers performing a meta-analysis based on p-values (e.g., a p-curve or z-curve analysis, as explained in the chapter on bias detection) that these are sequential p-values. Adjusted confidence intervals are useful tools to evaluate the observed effect estimate relative to its variability at an interim or the final look at the data. Note that the adjusted parameter estimates are only available in statistical software for a few commonly used designs in pharmaceutical trials, such as comparisons of mean differences between groups, or survuval analysis.\nBelow, we see the same sequential design we started with, with 2 looks and a Pocock-type alpha spending function. After completing the study with the planned sample size of 95 participants per condition (where we collect 48 participants at look 1, and the remaining 47 at look 2), we can now enter the observed data using the function getDataset. 
The means and standard deviations are entered for each stage, so at the second look, only the data from the second 95 participants in each condition are used to compute the means (1.51 and 1.01) and standard deviations (1.03 and 0.96).\n\ndesign <- getDesignGroupSequential(\n kMax = 2,\n typeOfDesign = \"asP\",\n sided = 2,\n alpha = 0.05,\n beta = 0.1\n)\n\ndataMeans <- getDataset(\n n1 = c(48, 47), \n n2 = c(48, 47), \n means1 = c(1.12, 1.51), # for directional test, means 1 > means 2\n means2 = c(1.03, 1.01),\n stDevs1 = c(0.98, 1.03), \n stDevs2 = c(1.06, 0.96)\n )\n\nres <- getAnalysisResults(\n design, \n equalVariances = TRUE,\n dataInput = dataMeans\n )\n\nres\n\n\n\n[PROGRESS] Stage results calculated [0.0554 secs] \n[PROGRESS] Conditional power calculated [0.0408 secs] \n[PROGRESS] Conditional rejection probabilities (CRP) calculated [0.0015 secs] \n[PROGRESS] Repeated confidence interval of stage 1 calculated [0.7934 secs] \n[PROGRESS] Repeated confidence interval of stage 2 calculated [0.6943 secs] \n[PROGRESS] Repeated confidence interval calculated [1.49 secs] \n[PROGRESS] Repeated p-values of stage 1 calculated [0.5374 secs] \n[PROGRESS] Repeated p-values of stage 2 calculated [0.4709 secs] \n[PROGRESS] Repeated p-values calculated [1.01 secs] \n[PROGRESS] Final p-value calculated [0.0029 secs] \n[PROGRESS] Final confidence interval calculated [0.1217 secs]" }, { "objectID": "10-sequential.html#analysis-results-means-of-2-groups-group-sequential-design", @@ -1145,6 +1145,6 @@ "href": "changelog.html", "title": "Change Log", "section": "", - "text": "The current version of this textbook is 1.4.2.\nThis version has been compiled on October 15, 2023.\nThis version was generated from Git commit #f1caacf1. All version controlled changes can be found on GitHub.\nThis page documents the changes to the textbook that were more substantial than fixing a typo.\nUpdates\nSeptember 6, 2023:\nIncorporated extensive edits by Nick Brown in CH 1-3.\nAugust 27, 2023:\nAdd CH 16 on confirmation bias and organized skepticism. Add Bakan 1967 quote to CH 13.\nAugust 12, 2023:\nAdded section on why standardized effect sizes hinder the interpretation of effect sizes in CH 6. Added Spanos 1999 to CH 1. Split up the correct interpretation of p values for significant and non-significant results CH 1. Added new Statcheck study CH 12. Added Platt quote CH 5.\nJuly 21, 2023:\nAdded “Why Effect Sizes Selected for Significance are Inflated” section to CH 6, moved main part of “The Minimal Statistically Detectable Effect” from CH 8 to CH 6, replaced Greek characters by latex, added sentence bias is expected for papers that depend on main hypothesis test in CH 12.\nJuly 13, 2023:\nUpdated Open Questions in CH 1, 2, 3, 4, 6, 7, 8 and 9. Added a figure illustrating how confidence intervals become more narrow as N increases in CH 7.\nJuly 7, 2023:\nAdded this change log page.\nJune 12, 2023:\nAdded an updated figure from Carter & McCullough, 2014, in the chapter in bias detection, now generated from the raw data.\nMay 5, 2023:\nAdded the option to download a PDF and epub version of the book.\nMarch 19, 2023:\nUpdated CH 5 with new sections on falsification, severity, and risky predictions, and a new final section on verisimilitude.\nMarch 3, 2023:\nUpdated book to Quarto. Added webexercises to all chapters.\nFebruary 27, 2023:\nAdded a section “Dealing with Inconsistencies in Science” to CH 5.\nOctober 4, 2022:\nAdded CH 15 on research integrity." 
+ "text": "The current version of this textbook is 1.4.2.\nThis version has been compiled on October 19, 2023.\nThis version was generated from Git commit #49748dd2. All version controlled changes can be found on GitHub.\nThis page documents the changes to the textbook that were more substantial than fixing a typo.\nUpdates\nSeptember 6, 2023:\nIncorporated extensive edits by Nick Brown in CH 1-3.\nAugust 27, 2023:\nAdd CH 16 on confirmation bias and organized skepticism. Add Bakan 1967 quote to CH 13.\nAugust 12, 2023:\nAdded section on why standardized effect sizes hinder the interpretation of effect sizes in CH 6. Added Spanos 1999 to CH 1. Split up the correct interpretation of p values for significant and non-significant results CH 1. Added new Statcheck study CH 12. Added Platt quote CH 5.\nJuly 21, 2023:\nAdded “Why Effect Sizes Selected for Significance are Inflated” section to CH 6, moved main part of “The Minimal Statistically Detectable Effect” from CH 8 to CH 6, replaced Greek characters by latex, added sentence bias is expected for papers that depend on main hypothesis test in CH 12.\nJuly 13, 2023:\nUpdated Open Questions in CH 1, 2, 3, 4, 6, 7, 8 and 9. Added a figure illustrating how confidence intervals become more narrow as N increases in CH 7.\nJuly 7, 2023:\nAdded this change log page.\nJune 12, 2023:\nAdded an updated figure from Carter & McCullough, 2014, in the chapter in bias detection, now generated from the raw data.\nMay 5, 2023:\nAdded the option to download a PDF and epub version of the book.\nMarch 19, 2023:\nUpdated CH 5 with new sections on falsification, severity, and risky predictions, and a new final section on verisimilitude.\nMarch 3, 2023:\nUpdated book to Quarto. Added webexercises to all chapters.\nFebruary 27, 2023:\nAdded a section “Dealing with Inconsistencies in Science” to CH 5.\nOctober 4, 2022:\nAdded CH 15 on research integrity." } ] \ No newline at end of file