**8/30, DON'T READ THIS POST. I am now convinced the numbers are all wrong! I am going to leave it up for a while in the interest of transparency.**

**8/27, After some wonderful exchanges (thanks to those of you who have emailed me), I no longer have enough faith in the computations and, consequently, in the corresponding claims to support this post at this point in time. Consider it retracted, at least for now. Julia and I have been working feverishly this week to clarify our understanding. We are converging on the notion that our computational approach is flawed. I will endeavor to collect and publish our full analysis code here today, and there will be a follow-up blog post.**

This post is coauthored by Julia Haaf (@JuliaHaaf).

There is no more belabored topic than the behavior of Bayes factors vs. p-values. The commonplace practical wisdom goes something like this: Bayes factors require just a bit more evidence to reject nulls than p-values do, perhaps a threshold of .005 rather than .05 (see Val Johnson's PNAS paper). We should look for t-values closer to 3 than 2. It doesn't change things that much if you have big effects.

*Move on, Bayesian cheerleaders, and do some good science (if you can).*

### The Dramatic Difference

So, with this common wisdom as backdrop, it took our breath away when we found cases that led to dramatically different conclusions. We found data sets where F-tests yielded significant results, \(p<.003\), but the Bayes factor favored constancy by \(10^{16}\). If you do inference by p-values, you conclude there is reasonable evidence for variability; if you do it by Bayes factor, you conclude there is overwhelming evidence for constancy.

*Wow!*

We think this paradox is important because it forces you to choose. Are you a Bayesian or a frequentist? It seems like you can't be both or neither.

*No room for ecumenical, conflict-averse do-gooders.*

### The Setup

The setup is one-way random-effects ANOVA. We ask whether there is a constant Stroop effect across individuals. A similar application comes from the recent personality work of Katie Corker, Brent Donnellan and colleagues, who ask whether base rates of Big 5 personality traits are constant across research sites (universities). In the Stroop case, there are trials nested within people, and we ask about the constancy of the effect across people; in the personality case, people are nested within research sites, and we ask about the constancy of the base rate of a trait across sites. We will illustrate the paradox with the Stroop setup.

The figure below shows the Stroop effect for 121 participants who each performed 48 congruent and 48 incongruent trials (the data are real, from von Bastian et al., and available here; also, see cleaning code here). The effects are ordered from smallest (most negative) to largest (most positive). The 70 ms effect should be obvious. We wish to know if this effect is constant across all people. We have included 95% CIs for each person (these CIs serve as eye candy, ooohh la la). We have also included a constancy curve in red; this curve shows how individuals' sample means should trend when ordered under the null that each individual has the same true Stroop effect.

The appropriate frequentist test is the random-effects one-way ANOVA. In this case, the result is F(120, 11003) = 1.30, and the p-value is about .016. Significance! Given this significance, we might perhaps expect a Bayes factor that favors the alternative. Not even close. The Bayes factor is in favor of constancy with a mind-boggling value of about \(10^{60}\).

Here is a second example. It is Simon interference data collected in my lab about a decade ago and reported in Pratte et al. (2010). The one-way ANOVA yields F(37, 17259) = 1.78, p < .003, which indicates evidence for variability across people. The Bayes factor in favor of constancy is \(10^{16}\), which is strong evidence for constancy.
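
For readers who want to see the mechanics of the F-test, here is a minimal sketch in Python. It is not the analysis code behind the numbers above. It assumes a balanced design (every participant has the same number of trials in each condition, which cleaned data usually will not satisfy), and the function name `constancy_f` is ours. The numerator is the participant-by-condition interaction mean square, which for two conditions reduces to the spread of the per-person effects; the denominator is the within-cell error mean square.

```python
from statistics import mean


def constancy_f(congruent, incongruent):
    """F statistic for constancy of an effect across participants.

    congruent[i] and incongruent[i] are lists of response times for
    participant i. Assumes a balanced design: n trials in every cell.
    Returns (F, df1, df2).
    """
    n_people = len(congruent)
    n = len(congruent[0])
    cells = list(congruent) + list(incongruent)
    if len(incongruent) != n_people or any(len(c) != n for c in cells):
        raise ValueError("this sketch needs the same n in every cell")

    # Observed effect per participant: incongruent mean minus congruent mean.
    effects = [mean(y) - mean(x) for x, y in zip(congruent, incongruent)]
    grand = mean(effects)

    # Interaction sum of squares; with two conditions it is
    # (n / 2) * sum of squared deviations of the per-person effects.
    ss_effect = (n / 2) * sum((d - grand) ** 2 for d in effects)
    df1 = n_people - 1

    # Error sum of squares: trial-level deviations from each cell mean.
    ss_error = 0.0
    for cell in cells:
        m = mean(cell)
        ss_error += sum((v - m) ** 2 for v in cell)
    df2 = 2 * n_people * (n - 1)

    return (ss_effect / df1) / (ss_error / df2), df1, df2


# Toy data, made up for illustration: 2 participants, 2 trials per cell.
F, df1, df2 = constancy_f([[1, 3], [2, 4]], [[2, 4], [5, 7]])
print(F, df1, df2)
```

If SciPy is available, `scipy.stats.f.sf(F, df1, df2)` converts the statistic to a p-value.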

### Bayesian Modeling

Here is the Bayesian model. Let \(X_{ij}\) be the response time for the ith participant, jth replicate in the congruent condition, and let \(Y_{ij}\) be the same for the incongruent condition. We start with

\[
\begin{aligned}
X_{ij} &\sim \mbox{Normal}(\alpha_i,\sigma^2),\\
Y_{ij} &\sim \mbox{Normal}(\alpha_i+\theta_i,\sigma^2).
\end{aligned}
\]

Here, the \(\alpha_i\) are the base RTs and the \(\theta_i\) are the Stroop effects for each individual.

A constant Stroop-effect model specifies \(\theta_i=\nu\) for all participants. The alternative, the variation model, specifies

\[
\theta_i \sim \mbox{Normal}(\nu,\eta^2),
\]

where \(\eta^2\) is the variance across individuals.
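
To get a feel for what the two models predict, here is a small simulation sketch in Python (standard library only). The 121 participants, 48 trials per condition, and 70 ms average effect mirror the Stroop example; the trial noise `sigma`, the between-person spread `eta`, and the base-RT distribution are hypothetical values we picked for illustration, not estimates from the data. Setting `eta=0` gives the constancy model.

```python
import random
from statistics import mean, stdev


def simulate_effects(n_people, n_trials, nu, eta, sigma, seed=0):
    """Simulate observed per-person effects under the model in the text.

    eta = 0 gives the constancy model (theta_i = nu for everyone);
    eta > 0 gives the variation model (theta_i ~ Normal(nu, eta^2)).
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_people):
        # Base RT alpha_i; this distribution is a hypothetical choice.
        alpha = rng.gauss(700, 100)
        theta = nu if eta == 0 else rng.gauss(nu, eta)
        # Congruent trials X_ij and incongruent trials Y_ij.
        x = [rng.gauss(alpha, sigma) for _ in range(n_trials)]
        y = [rng.gauss(alpha + theta, sigma) for _ in range(n_trials)]
        effects.append(mean(y) - mean(x))
    return effects


# sigma and eta below are illustrative assumptions, in milliseconds.
constant = simulate_effects(121, 48, nu=70, eta=0, sigma=200)
varying = simulate_effects(121, 48, nu=70, eta=30, sigma=200)
print("sd of observed effects, constancy:", round(stdev(constant), 1))
print("sd of observed effects, variation:", round(stdev(varying), 1))
```

Even under constancy the observed effects vary from person to person, because each is a difference of two noisy sample means. The question is whether the observed spread exceeds what trial noise alone would produce.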

Priors are needed for all parameters. The easy part is specifying priors for the parameters common to the constancy and variation models, namely all \(\alpha_i\), \(\sigma^2\), and \(\nu\). These priors can be broad, and because these parameters play the same role in both models, the resulting Bayes factor depends little on them. They are truly ancillary. The key specification is the prior on \(\eta^2\). This parameter defines the difference between the models, and the Bayes factor depends on this prior. So we put a realistic, reasonably informed prior on this parameter. The graph below shows this prior on the standard deviation, \(\eta\). There is no mass below 25 ms, the highest density falls between 25 and 35 ms, and a smoothly decreasing tail follows. Therefore, under this model, the standard deviation of individual variation has to be at least 25 ms and can be larger. Such a model is quite reasonable for Stroop effects, where average effects are often between 50 ms and 100 ms.
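
To see why the prior on \(\eta^2\) matters so much, it helps to write the Bayes factor out. The notation here is ours, introduced only for this sketch: \(\mathbf{D}\) stands for all the data, \(M_c\) for the constancy model, \(M_v\) for the variation model, and \(\pi(\eta^2)\) for the prior on \(\eta^2\). The Bayes factor in favor of constancy is the ratio of marginal likelihoods,

\[
B_{cv} = \frac{p(\mathbf{D} \mid M_c)}{p(\mathbf{D} \mid M_v)},
\qquad
p(\mathbf{D} \mid M_v) = \int p(\mathbf{D} \mid \eta^2, M_v)\,\pi(\eta^2)\,d\eta^2,
\]

where the common parameters have already been integrated out of each term. The denominator is an average of the likelihood over the prior on \(\eta^2\), so a prior that puts its mass on values of \(\eta\) the data do not support pulls that average down and pushes the Bayes factor toward constancy.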

And it is with this reasonable setup that the Bayes factor differs so dramatically from the one-way ANOVA F-test.

In subsequent blog posts, we are going to explain why we think the Bayesian analysis is more profitable and more useful. For now, we just want to leave you with the paradox.

Our paper on the issue is submitted and can be read here.

*Our minds are blown; we hope your appetite is whetted.*