The example I provided is the simplest case I can think of that is germane to experimental psychologists. We ask 25 people to perform 50 trials in each of 2 conditions, and we ask: what is the effect size of the condition effect? Think Stroop if you need a context.

The answer, by the way, is \(+\infty\). I'll get to it.

### The good news about effect sizes

Effect sizes have revolutionized how we compare and understand experimental results. Nobody knows whether a 3% change in error rate is big or small or comparable across experiments; everybody knows what an effect size of .3 means. And our understanding is not associative or mnemonic: we can draw a picture like the one below and talk about overlap and difference. It is this common meaning and portability that licenses a modern emphasis on estimation. Sorry estimators, I think you are stuck with standardized effect sizes.
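To make the "common meaning" point concrete, here is a minimal sketch (my illustration, not from the original post) of two quantities a reader can attach to a standardized effect size of .3 under two unit-variance normal distributions: their overlap, and the probability that a random draw from the treatment distribution beats a random draw from the control distribution.

```python
# A minimal sketch: two interpretable quantities implied by a standardized
# effect size d, assuming two unit-variance normal distributions d apart.
from scipy.stats import norm

d = 0.3  # standardized effect size

# Overlapping coefficient of the two distributions.
overlap = 2 * norm.cdf(-d / 2)

# Probability that a random treatment draw exceeds a random control draw.
p_superiority = norm.cdf(d / 2**0.5)

print(f"overlap = {overlap:.3f}, P(superiority) = {p_superiority:.3f}")
```

For d = .3, the two distributions overlap about 88% and the probability of superiority is about .58, and those same numbers attach to d = .3 regardless of the dependent measure.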

Below is a graph from Many Labs 3 that makes the point. Here, the studies have vastly different designs and dependent measures. Yet, they can all be characterized in unison with effect size.

### The bad news about effect sizes

Even for the simplest experiment above, there is a lot of confusion. Jake Westfall provides 5 different possibilities and claims that perhaps 4 of these 5 are reasonable, at least under certain circumstances. The following comments were provided on Twitter and Facebook: Daniel Lakens makes recommendations as to which one we should consider the preferred effect-size measure. Tal Yarkoni and Uli Schimmack wonder about the appropriateness of effect size in within-subject designs and prefer unstandardized effects (see Jan Vanhove's blog). Rickard Carlson prefers effect sizes in physical units where possible, say in milliseconds in my Effect Size Puzzler. Sanjay Srivastava needs the goals and contexts first before weighing in. If I got this wrong, please let me know.

From an experimental perspective, The Effect Size Puzzler is as simple as it gets. Surely we can do better than to abandon the concept of standardized effect sizes or to be mired in arbitrary choices.

### Modeling: the only way out

Psychologists often think of statistics as procedures, which, in my view, is the most direct path to statistical malpractice. Instead, statistical reasoning follows from statistical models. And if we have a few guidelines and a model, then standardized effect sizes are well defined and useful. Showing off the power of model thinking rather than procedure thinking is why I came up with the puzzler.

### Effect-size guidelines

*#1: Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population.*

*#2: Measures of true effect and true amount of variability are only defined in statistical models. They don't really exist except within the context of a model. The model is important. It needs to be stated.*

*#3: The true effect size should not be tied to the number of participants or the number of trials per participant. True effect sizes characterize a state of nature independent of our design.*

### The Puzzler Model

I generated the data to be realistic. They had the right amount of skew and offset, and the tails fell like real RTs do. Here is a graph of the generating model for the fastest and slowest individuals:

All data had a lower shift of .3s (see green arrow), because we typically trim these out as being too fast for a choice RT task. The scale was influenced by both an overall participant effect and a condition effect, and the influence was multiplicative. So faster participants had smaller effects; slower participants had bigger effects. This pattern too is typical of RT data. The best way to describe these data is in terms of percent-scale change. The effect was to change the scale by 10.5%, and this amount was held constant across all people. And because it was held constant, that is, there was no variability in the effect, the standardized effect size in this case is infinitely large.
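The description above pins down the shape of a generating model: a .3 s lower shift plus a lognormal whose scale carries both a person effect and a constant .1 log-scale (10.5%) condition effect. Here is a sketch of one way to simulate data of this form; the mean and spread of the person scales are values I chose for illustration, not the ones used in the original simulation.

```python
# A sketch of a generating model consistent with the description above:
# shifted lognormal RTs with a .3 s lower shift, person-specific scale,
# and a constant .1 log-scale condition effect for everyone.
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 25, 2, 50                      # people, conditions, replicates
shift = 0.3                              # lower shift in seconds
sigma = 0.4                              # lognormal shape (log-scale noise)
beta = 0.1                               # condition effect, same for everyone
alpha = rng.normal(-1.0, 0.3, size=I)    # person scales (assumed values)

x = np.array([0, 1])                     # dummy code for condition
mu = alpha[:, None] + x[None, :] * beta  # log-scale cell means
rt = shift + np.exp(mu[:, :, None] + sigma * rng.standard_normal((I, J, K)))

print(np.exp(beta) - 1)                  # scale change of about 10.5%
```

Because the effect enters multiplicatively on the scale, slower people (larger \(\alpha_i\)) show larger effects in milliseconds even though the log-scale effect is identical for everyone.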

Now, let's go explore the data. I am going to skip over all the exploratory stuff that would lead me to the following transform, Y = log(RT-.3), and just apply it. Here is a view of the transformed generating model:

So, let's put plain-old vanilla normal models on Y. First, let's take care of replicates.

\[ Y_{ijk} \sim \mbox{Normal} (\mu_{ij},\sigma^2)\]

where \(i\) indexes individuals, \(j=1,2\) indexes conditions, and \(k\) indexes replicates. Now, let's model \(\mu_{ij}\). A general formulation is

\[\mu_{ij} = \alpha_i+x_j\beta_i,\]

where \(x_j\) is a dummy code of 0 for Condition 1 and 1 for Condition 2. The term \(\beta_i\) is the ith individual's effect. We can model it as

\[\beta_i \sim \mbox{Normal}(\beta_0,\delta^2)\]

where \(\beta_0\) is the mean effect across people and \(\delta^2\) is the variation of the effect across people.

With this model, the true effect size is \[ d_t = \frac{\beta_0}{\delta}. \] Here, by true, I just mean that it is a parameter rather than a sample statistic. And that's it; there is not much more to say, in my opinion. In my simulations, the true value of each individual's effect was .1. So the mean, \(\beta_0\), is .1, and the standard deviation, \(\delta\), is, well, zero. Consequently, the true standardized effect size is \(d_t=+\infty\). I can't justify any other standardized measure that captures the above principles.

### Analysis

Could a good analyst have found this infinite value? That is a fair question. The plot below shows individuals' effects, and I have ordered them from smallest to largest. A key question is whether these are spread out more than expected from within-cell sample noise alone. If these individual sample effects are more spread out, then there is evidence for true individual variation in \(\beta_i\). If they stay as clustered as predicted by sample noise alone, then there is evidence that people's effects do not vary. The solid line is the prediction from within-cell noise alone. It is pretty darn good. (The dashed line is the null that people have the same, zero-valued true effect.) I also computed a one-way random-effects F statistic to see if there is a common effect or many individual effects. It was F(24, 2450) = 1.03. Seems like one effect.

These one-effect results should be heeded. It is a structural element that I would not want to miss in any data set. We should hold plausible the idea that the standardized effect size is exceedingly high, as the variation across people seems very small if not zero.
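For readers who want the mechanics, here is a sketch of how such a check could be computed from trial-level data: the person-by-condition interaction F, which reproduces the 24 and 2450 degrees of freedom quoted above for this design. The decomposition is my illustration, not code from the post, and it reuses the `rt` array from the earlier simulation sketch.

```python
# A sketch of the variability check above: does the person-by-condition
# interaction exceed what within-cell noise predicts? Uses `rt` from the
# earlier generating-model sketch.
import numpy as np

y = np.log(rt - 0.3)                   # the transform from above
I, J, K = y.shape

cell = y.mean(axis=2)                  # I x J cell means
person = cell.mean(axis=1)             # person means
cond = cell.mean(axis=0)               # condition means
grand = cell.mean()

# Interaction: cell means minus the additive person + condition fit.
inter = cell - person[:, None] - cond[None, :] + grand
ms_inter = K * (inter ** 2).sum() / ((I - 1) * (J - 1))              # df = 24
ms_within = ((y - cell[:, :, None]) ** 2).sum() / (I * J * (K - 1))  # df = 2450

print(f"F({(I - 1) * (J - 1)}, {I * J * (K - 1)}) = {ms_inter / ms_within:.2f}")
```

An F near 1 says the spread in individuals' sample effects is just what within-cell noise predicts, which is the fingerprint of a constant true effect.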

To estimate effect sizes, we need a hierarchical model. You can use Mplus, AMOS, LME4, WinBugs, JAGS, or whatever you wish. Because I am an old dog and don't learn new tricks easily, I will do what I always do and program these models from scratch.

I used the general model above in the Bayesian context. The key specification is the prior on \(\delta^2\). In the log-normal, the variance is a shape parameter, and it is somewhere around \(.4^2\). Effects across people are usually about 1/5th of this, say \(.08^2\). To capture variances in this range, I would use a \(\delta^2 \sim \mbox{Inverse Gamma}(.1, .01)\) prior for general estimation. This is a flexible prior tuned for the 10 to 100 millisecond range of variation in effects across people.
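The post codes these models from scratch; as a hedged stand-in, here is what the same specification might look like in PyMC (my choice of tool, not the post's), with the Inverse Gamma(.1, .01) prior on \(\delta^2\). The priors on \(\alpha_i\), \(\beta_0\), and \(\sigma\) are placeholders I picked for illustration.

```python
# A sketch of the hierarchical model in PyMC (the post programs it from
# scratch; this tool choice and the alpha/beta0/sigma priors are mine).
import numpy as np
import pymc as pm

y = np.log(rt - 0.3)                          # rt from the earlier sketch
I, J, K = y.shape
person_idx = np.repeat(np.arange(I), J * K)   # person index per observation
cond_idx = np.tile(np.repeat([0, 1], K), I)   # dummy-coded condition

with pm.Model():
    alpha = pm.Normal("alpha", mu=-1.0, sigma=1.0, shape=I)    # baselines
    beta0 = pm.Normal("beta0", mu=0.0, sigma=1.0)              # mean effect
    delta2 = pm.InverseGamma("delta2", alpha=0.1, beta=0.01)   # effect variance
    beta = pm.Normal("beta", mu=beta0, sigma=pm.math.sqrt(delta2), shape=I)
    sigma = pm.HalfNormal("sigma", sigma=1.0)                  # within-cell sd
    mu = alpha[person_idx] + cond_idx * beta[person_idx]
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y.reshape(-1))
    trace = pm.sample(1000, tune=1000)

# Posterior of the standardized effect size d_t = beta0 / delta
d_t = trace.posterior["beta0"] / np.sqrt(trace.posterior["delta2"])
```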

The following plot shows the resulting estimates of individual effects as a function of the sample effect values. The noteworthy feature is the lack of variation in the model estimates of individuals' effects! This type of pattern, where variation in model estimates is attenuated compared to sample statistics, is called shrinkage, and it occurs because hierarchical models don't chase within-cell sample noise. Here the shrinkage is nearly complete, leading again to the conclusion that there is no real variation across people, or an infinitely large standardized effect size. For the record, the estimated effect size here is 5.24, which, in effect-size units, is getting quite large!
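Why the shrinkage is nearly complete can be seen in closed form: with \(\sigma^2\) and \(\delta^2\) treated as known, each person's estimated effect in this normal hierarchical model is a precision-weighted average of their sample effect and the group mean. A minimal sketch with plug-in values (my illustration):

```python
# A sketch of shrinkage in the normal hierarchical model: each person's
# estimate is a weighted average of their sample effect and the group mean,
# with weight delta^2 / (delta^2 + 2*sigma^2/K) on their own data.
import numpy as np

y = np.log(rt - 0.3)                                   # rt from earlier
b = y[:, 1, :].mean(axis=1) - y[:, 0, :].mean(axis=1)  # sample effects
sigma2, K = 0.4**2, 50                                 # values from the text
beta0 = b.mean()                                       # plug-in group mean

for delta2 in (0.08**2, 1e-6):
    w = delta2 / (delta2 + 2 * sigma2 / K)             # weight on own data
    shrunk = w * b + (1 - w) * beta0
    print(f"delta^2={delta2:.0e}: weight={w:.3f}, spread={shrunk.std():.4f}")
```

As the posterior for \(\delta^2\) concentrates near zero, the weight on each person's own data vanishes and every estimate collapses toward \(\beta_0\), which is exactly the flat pattern in the plot.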

The final step for me is comparing this variable-effect model to a model with no variation, say \( \beta_i = \beta_0 \) for all people. I would do this comparison with a Bayes factor. But I am out of energy and you are out of patience, so we will save it for another post.

### Back To Jake Westfall

Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure. He does this to stay true to Cohen's formulae. I think it is a conceptual mistake.

I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people. In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design. Within-subject designs are superior, so why go backwards and mash the sources of variance together when you don't have to? This advice strikes me as crazy. To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d. Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures. Just don't forget the hierarchical shrinkage when you use them!