## Saturday, August 20, 2016

### Where Bayes and Classical Inference Yield Radically Different Conclusions---A Case Study

8/27: After some wonderful exchanges (thanks to those of you who have emailed me), I no longer have enough faith in the computations and, consequently, in the corresponding claims to support this post at this point in time.  Consider it retracted, at least for now.  Julia and I have been working feverishly this week to clarify our understanding.  We are converging on the notion that our computational approach is flawed.  I will endeavor to collect and publish our full analysis code here today, and there will be a follow-up blog post.

This blog is coauthored by Julia Haaf (@JuliaHaaf).

There is no more belabored topic than the behavior of Bayes factors vs p-values.  The common-place practical wisdom goes something like this:  Bayes factor requires just a bit more evidence to reject nulls than p-values, perhaps at .005 rather than .05 (see Val Johnson's PNAS paper).  We should look for t-values closer to 3 than 2. It doesn't change things that much if you have big effects.    Move on Bayesian cheerleaders and do some good science (if you can).

### The Dramatic Difference

So, with this common wisdom as backdrop,  it took our breath away when we found cases that led to dramatically different conclusions.   We found data sets where F-tests yielded significant results, $$p<.003$$ but the Bayes factor favored constancy by $$10^{16}$$.  If you do inference by p-values you conclude there is reasonable evidence for variability; if you do it by Bayes factor you conclude there is overwhelming evidence for constancy.  Wow!

We think this paradox is important because it forces you to choose.  Are you a Bayesian or a frequentist?  It seems like you can't be both or neither.  No room for ecumenical, conflict-averse do-gooders.

### The Setup

The setup is one-way random effects ANOVA.  We ask whether there is a constant Stroop effect across individuals.  A similar application comes from the recent personality work of Katie Corker, Brent Donnellan and colleagues who ask whether base rates of Big 5 personality traits are constant across research sites (universities).  In the Stroop case, there are trials nested within people, and we ask about the constancy of the effect across people; in the personality case, people are nested within research sites, and we ask about the constancy of the base rate of a trait across sites.  We will illustrate the paradox with the Stroop setup.

The figure below shows the Stroop effect for 121 participants who each performed 48 congruent and 48 incongruent trials (the data are real, from von Bastian et al., and available here; also, see cleaning code here).  The effects are ordered from smallest (most negative) to largest (most positive).  The 70 ms effect should be obvious.  We wish to know if this effect is constant across all people.  We have included 95% CIs for each person (these CIs serve as eye candy,  ooohh la la).  We have also included a constancy curve in red---this curve shows how individuals' sample means should trend when ordered under the null that each individual has the same true Stroop effect.

The appropriate frequentist test is the random-effect one-way ANOVA.  In this case, the result is F(120, 11003)=1.30, and the p-value is about .016.  Significance!  Given this significance, we might perhaps expect a Bayes factor that favored the alternative.  Not even close.  The Bayes factor is in favor of constancy with a mind-boggling value of about $$10^{60}$$.

A second example is provided.  It is Simon interference data collected in my lab about a decade ago and reported in Pratte et al., (2010).   The one-way ANOVA yields F(37,17259)=1.78, p<.003, which indicates evidence for variability across people.   The Bayes factor in favor of constancy is $$10^{16}$$, which is strong evidence for constancy.
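As a quick check on the frequentist side of the story, the upper-tail p-values can be recomputed from the reported F statistics and degrees of freedom.  This is only a sketch using base R's `pf()`; the variable names are ours, and the F values are the rounded ones quoted above, so the p-values will match the reported ones only approximately.

```r
# Upper-tail p-values for the two reported F statistics.
# pf() is the distribution function of the F distribution in base R.
p.stroop <- pf(1.30, df1 = 120, df2 = 11003, lower.tail = FALSE)  # Stroop data
p.simon  <- pf(1.78, df1 = 37,  df2 = 17259, lower.tail = FALSE)  # Simon data
print(c(stroop = p.stroop, simon = p.simon))
```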

### Bayesian Modeling

Here is the Bayesian model.  Let $$X_{ij}$$ be the response time for the ith participant, jth replicate in the congruent condition, and let $$Y_{ij}$$ be the same for the incongruent condition.  We start with

$X_{ij} \sim \mbox{Normal}(\alpha_i,\sigma^2),\\ Y_{ij} \sim \mbox{Normal}(\alpha_i+\theta_i,\sigma^2).$
Here, the $$\alpha$$'s are the base RTs and the $$\theta$$s are the Stroop effect for each individual.

A constant Stroop-effect model specifies $$\theta_i=\nu$$ for all participants.  The variation alternative model specifies
$\theta_i \sim \mbox{Normal}(\nu,\eta^2),$
where $$\eta^2$$ is the variance across individuals.
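To get a feel for the two models, here is a minimal simulation sketch.  It is not our analysis code.  The numerical values (a 70 ms mean effect, 200 ms trial noise, 30 ms of individual variation, and the distribution of base RTs) are assumptions chosen for illustration, and `simEffects` is a hypothetical helper.  The constancy curve is approximated by normal quantiles at `ppoints()`, which is one reasonable way to draw it, not necessarily the one behind the figure.

```r
set.seed(1)
I <- 121; J <- 48            # participants and trials per condition, as in the Stroop data
nu <- 0.070; sigma <- 0.200  # assumed mean effect and trial-level sd, in seconds
eta <- 0.030                 # assumed sd of true effects under the variation model

# Simulate one sample effect per participant, given a vector of true effects theta
simEffects <- function(theta) {
  alpha <- rnorm(length(theta), 0.700, 0.100)   # assumed base RTs
  sapply(seq_along(theta), function(i) {
    x <- rnorm(J, alpha[i], sigma)              # congruent trials
    y <- rnorm(J, alpha[i] + theta[i], sigma)   # incongruent trials
    mean(y) - mean(x)                           # sample Stroop effect
  })
}

d.const <- simEffects(rep(nu, I))          # constancy model: theta_i = nu
d.vary  <- simEffects(rnorm(I, nu, eta))   # variation model: theta_i ~ Normal(nu, eta^2)

# Approximate constancy curve: expected ordered sample effects when all theta_i = nu
se <- sigma * sqrt(2 / J)                  # standard error of one sample effect
curve.const <- nu + se * qnorm(ppoints(I))

print(c(sd.const = sd(d.const), sd.vary = sd(d.vary), se = se))
```

Under constancy the spread of the sample effects should sit near `se`; under the variation model it should be larger, near the square root of `se^2 + eta^2`.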

Priors are needed for all parameters.  The easy part is the specification of priors for the parameters that are common in the constancy and variation model, namely all $$\alpha_i$$, $$\sigma^2$$, and $$\nu$$.  These priors can be specified to be broad, and because these parameters play the same role in both models, the resultant Bayes factor is not much a function of them.  They are truly ancillary.  The key specification is the prior on $$\eta^2$$.  This parameter defines the difference between the models, and the Bayes factor depends on this prior.  So we put a realistic, reasonably-informed prior on this parameter.  The graph below shows this prior on the standard deviation, $$\eta$$.  It has no mass below 25 ms, its highest density between 25 and 35 ms, and a smoothly decreasing tail above that.  Therefore, under this model, the standard deviation of individual variation had to be at least 25 ms and could be larger.  Such a model is quite reasonable for Stroop effects where average effects are often between 50 ms and 100 ms.

And it is with this reasonable setup that the Bayes factor differs so dramatically from the one-way ANOVA F-test.

In subsequent blog posts, we are going to explain why we think the Bayesian analysis is more profitable and more useful.  For now, we just want to leave you with the paradox.

Our paper on the issue is submitted and can be read here.

Our minds are blown; we hope your appetite is whetted.

## Sunday, April 10, 2016

### This Summer's Challenge: Share Your Data

"It would take me weeks of going through my data and coordinating them, documenting them, and cleaning them if I were to share them." anonymous senior faculty member

"Subject 7 didn't show. There is an empty file. Normally the program would label the next person Subject 8 and we would just exclude Subject 7 in analysis. But now that we are automatically posting data, what should I do? Should I delete the empty file so the next person is Subject 7?" anonymous student in my lab

"Why? Data from a bad study is, by definition, no good." @PsychScienctists, in response to my statement that all data should be curated and available.

All three of the above quotes illustrate a common way of thinking about data. Our data reflect something about us. When we share them, we are sharing something deep and meaningful about ourselves. Our data may be viewed as statements about our competence, our organizational skills, our meticulousness, our creativity, and our lab culture. Even the student in my lab feels this pressure. This student is worried that our shared data won't be viewed as sufficiently systematic because we have no data for Subject 7. Maybe we want to present a better image.

#### The Data-Are-The-Data Mindset

I don't subscribe to the Judge-Me-By-My-Data mindset. Instead, I think of data as follows:
• Scientific data are precious resources collected for the common good.
• We should think in terms of stewardship rather than ownership. Be good stewards.
• Data are neither good nor bad, nor are they neat nor messy. They just are.
• We should judge each other by the authenticity of our data.

### Mistake-Free Data Stewardship through Born-Open Data

To be good stewards and to ensure authentic data, we upload everything, automatically, every night. Nobody has to remember anything, nobody makes decisions---it all just happens. Data are uploaded to GitHub where everyone can see them. In fact, I don't even use locally stored data for analysis; I point my analyses to the copy on GitHub. We upload data from well-thought-out experiments. We upload data from poorly-thought-out-bust experiments. We upload pilot data. We upload incomplete data. If we collected it, it is uploaded. We have an accurate record of what happened in the lab, and you all are welcome to look in at our GitHub account. I call this approach born-open data, and have an in-press paper coming out about it. We have been doing born-open data for about a year.

So far, the main difference I have noticed is an increase in quality control with no energy or time spent to maintain this quality. Nothing ever gets messed up, and there is no after-the-fact reconstruction of what had happened. There is only one master copy of data---the one on GitHub. Analysis code points to the GitHub version. We never analyze the wrong or incomplete data. And it is trivially easy to share our analyses among lab members and others. In fact, we can build the analyses right into our papers with Knitr and Markdown. Computers are so much more meticulous than we will ever be. They never take a night off!

#### This Summer's Challenge: Automatic Data Curation

I'd like to propose a challenge: Set up your own automatic data curation system for new data that you collect. Work with your IT people. Set up the scripts. Hopefully, when next Fall rolls around, you too are practicing born-open data!

## Tuesday, April 5, 2016

### The Bayesian Guarantee And Optional Stopping.

Frequentist intuitions run so deep in us that we often mistakenly interpret Bayesian statistics in frequentist terms. Optional stopping has always been a case in point.  Bayesian quantities, when interpreted correctly, are not affected by optional stopping.  This fact is guaranteed by Bayes' Theorem.  Previously, I have shown how this guarantee works for Bayes factors.  Here, let's consider the simple case of estimating an effect size.

For demonstration purposes, let's generate data from a normal with unknown mean, $$\mu$$, but known variance of 1.  I am going to use a whacky optional stopping rule that favors sample means near .5 over others.  Here is how it works:  I. As each observation comes in, compute the running sample mean. II. Compute a probability of stopping that is dependent on the sample mean according to the figure below.  The probability favors stopping for sample means near .5.  III. Flip a coin with sides labeled "STOP" and "GO ON" with the below probability.  IV. Do what the coin says (up to a maximum of 50 observations, then stop no matter what).
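Steps I through IV can be written as a plain loop for a single replicate.  This is a sketch for readability, not the vectorized code used for the figures; `oneRun` is a hypothetical helper name.  The stopping function is the one from the R code at the end of this post, where the stopping probability peaks at .5 for a running mean of .6.

```r
set.seed(1)

# One replicate of the optional stopping rule for a given true mean mu
oneRun <- function(mu, maxN = 50) {
  y <- numeric(0)
  for (n in 1:maxN) {
    y <- c(y, rnorm(1, mu, 1))                  # a new observation comes in
    ybar <- mean(y)                             # I. running sample mean
    pStop <- 1 - plogis((ybar - .6)^2, 0, .2)   # II. probability of stopping
    if (runif(1) < pStop) break                 # III and IV. flip the coin and obey it
  }
  c(ybar = ybar, N = n)                         # n equals maxN if the coin never said STOP
}

res <- t(replicate(1000, oneRun(mu = 0)))
print(colMeans(res))
```

The loop makes the cap explicit: if the coin never says STOP, the run ends at `maxN` observations.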

The result of this rule is a bias toward sample means near .5.  I ran a simulation with a true mean of zero for ten thousand replicates (blue histogram below).  The key property is a biasing of the observed sample means higher than the true value of zero.  Bayesian estimation seems biased too.  The green histogram shows the posterior means when the prior on $$\mu$$ is a normal with mean of zero and a standard deviation of .5.  The bias is less, but that just reflects the details of the situation where the true value, zero, is also favored by the prior.

So it might seem I have proved the opposite of my point---namely that optional stopping affects Bayesian estimation.

Nope.  The above case offers a frequentist interpretation, and that interpretation entered when we examined the behavior on a true value, the value zero.  Bayesians don't interpret analyses conditional on unknown "truths".

### The Bayesian Guarantee

Bayes' Theorem provides a guarantee.  If you start with your prior and observed data, then Bayes' Theorem guarantees that the posterior is the optimal set of probability statements about the parameter at hand.  It is a bit subtle to see this in simulation because one needs to condition on data rather than on some unknown truth.

Here is how a Bayesian uses simulation to show the Bayesian Guarantee.

I. On each replicate, sample a different true value from the prior.  In my case, I just draw from a normal centered at zero with standard deviation of .5 since that is my prior on effect size for this  post.  Then, on each replicate, simulate data from that truth value for that replicate.  I have chosen data of 25 observations (from a normal with variance of 1).  A histogram of the sample mean across these varying true values is provided below, left panel.   I ran the simulation for 100,000 replicates.

II. The histogram is that of data (sample means) we expect under our prior.  We need to condition on data, so let's condition on an observed sample mean of .3.  I have highlighted a small bin between .25 and .35 with red.  Observations fall in this bin about 6% of the time.

III.  Look at all the true values that generated those sample means in the bin with .3.  These true values are shown in the yellow histogram.  This histogram is the target of Bayes' Theorem, that is, we can use Bayes' Theorem to describe this distribution without going through the simulations.   I have computed the posterior distribution for a sample mean of .3 and 25 observations under my prior, and plotted it as the line.  Notice the correspondence.  This correspondence is the simulation showing that Bayes' Theorem works.  It works, by the way, for every bin though I have just shown it for the one centered on .3.
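The posterior line in step III comes from the standard conjugate update for a normal mean with known data variance of 1.  Here is that calculation on its own, as a small sketch; it uses the same quantities as the `v` and `c` variables in the R code at the end of the post, with `m` added here for the posterior mean.

```r
m0 <- 0; v0 <- .5^2     # prior mean and prior variance on mu
N <- 25; ybar <- .3     # the data we condition on

v <- 1 / (N + 1 / v0)            # posterior variance: 1/(N + 1/v0)
m <- v * (N * ybar + m0 / v0)    # posterior mean: precision-weighted average
print(c(mean = m, sd = sqrt(v)))
```

With these values the posterior mean is 7.5/29, about .26, and the posterior standard deviation is about .19, so the posterior is pulled slightly from the sample mean of .3 toward the prior mean of zero.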

TAKE HOME I: Bayes' Theorem tells you the distribution of true values given your prior and the data.

### Is The Bayesian Guarantee Affected By Optional Stopping?

So, we come to the crux move.  Let's simulate the whacky optional stopping rule that favors sample means near .5.  Once again, we start with the prior, and for each replicate we choose a different truth value as a sample from the prior.  Then we simulate data using optional stopping, and the resulting sample means are shown in the histogram on the left.  Optional stopping has affected these data dramatically.  No matter, we choose our bin, again around .3, and plot the true values that led to these sample means.  These true values are shown as the yellow histogram on the right.  They are far more spread out than in the previous simulation without optional stopping, primarily because stopping often occurred with fewer than 25 observations.  Now, is this spread predicted?  Yes.  On each replication we obtain a posterior distribution, and these vary from replication to replication because the sample size is random.  I averaged these posteriors (as I should), and the result is the line that corresponds well to the histogram.

TAKE HOME II: Bayes' Theorem tells you where the true values are given your prior and the data, and it doesn't matter how the data were sampled!

And this should be good news.

*****************

R code

set.seed(123)

m0=0
v0=.5^2

runMean=function(y) cumsum(y)/(1:length(y))
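#index of the first "STOP" (a 0); if the coin never says stop, stop at the last observation
minIndex=function(y) if (any(y==0)) which(y==0)[1] else length(y)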

mySampler=function(t.mu,topN)
{
M=length(t.mu)
mean=rep(t.mu,topN)
y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
ybar=t(apply(y,1,runMean))
prob=plogis((ybar-.6)^2,0,.2)
another=matrix(nrow=M,ncol=topN,rbinom(M*topN,1,prob))
stop=apply(another,1,minIndex)
return(list("ybar"=ybar[cbind(1:M,stop)],"N"=stop))
}

goodSampler=function(t.mu,topN){
M=length(t.mu)
mean=rep(t.mu,topN)
y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
return(apply(y,1,mean))}

M=10000

png('freqResults.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))
t.mu=rep(0,M)
out=mySampler(t.mu,50)
ybar=out$ybar
N=out$N
v=1/(N+1/v0)
c=(N*ybar+m0/v0)
hist(ybar,col='lightblue',main="",xlab="Sample Mean",breaks=50,xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
abline(v=mean(ybar),lwd=3,lty=2)
hist(v*c,col='lightgreen',main="",xlab="Posterior Mean",xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
abline(v=mean(v*c),lwd=3,lty=2)
dev.off()

###############################
set.seed(456)
png('bayesGuarantee.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))

M=100000
N=25
t.mu=rnorm(M,m0,sqrt(v0))
ybar=goodSampler(t.mu,N)
myBreak=seq(-2.45,2.45,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="")
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")

v=1/(N+1/v0)
c=(N*mid+m0/v0)

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',
ylim=c(0,2.2),main="",xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")
dev.off()

########################
set.seed(790)
png('moneyShot.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))

M=100000
t.mu=rnorm(M,m0,sqrt(v0))
out=mySampler(t.mu,50)
ybar=out$ybar
N=out$N
myBreak=seq(-5.95,5.95,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="",xlim=c(-4,3))

v=1/(N[good]+1/v0)
c=(N[good]*ybar[good]+m0/v0)

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',main="",
ylim=c(0,2.2),xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")

dev.off()

######################
#stop probability

png(file="probStop.png",width=480,height=480)
par(cex=1.3,mar=c(4,4,1,1),mgp=c(2,1,0))
ybar=seq(-2,3,.01)
prob=plogis((ybar-.6)^2,0,.2)
plot(ybar,1-prob,typ='l',lwd=2,ylab="Stopping Probability",xlab="Sample Mean",ylim=c(0,.55))
mtext("Optional Stopping Depends on Sample Mean",side=3,adj=.5,line=-1,cex=1.3)
dev.off()

## Monday, March 28, 2016

### The Effect-Size Puzzler, The Answer

I wrote the Effect-Size Puzzler because it seemed to me that people have reduced the concept of effect size to a few formulas on a spreadsheet.  It is a useful concept that deserves a bit more thought.

The example I provided is the simplest case I can think of that is germane to experimental psychologists.  We ask 25 people to perform 50 trials in each of 2 conditions, and ask what is the effect size of the condition effect.  Think Stroop if you need a context.

The answer, by the way, is $$+\infty$$.  I'll get to it.

### The good news about effect sizes

Effect sizes have revolutionized how we compare and understand experimental results.  Nobody knows whether a 3% change in error rate is big or small or comparable across experiments; everybody knows what an effect size of .3 means.  And our understanding is not merely associative or mnemonic; we can draw a picture like the one below and talk about overlap and difference.  It is this common meaning and portability that licenses a modern emphasis on estimation.  Sorry estimators, I think you are stuck with standardized effect sizes.

Below is a graph from Many Labs 3 that makes the point.  Here, the studies have vastly different designs and dependent measures.  Yet, they can all be characterized in unison with effect size.

Even for the simplest experiment above, there is a lot of confusion.  Jake Westfall provides 5 different possibilities and claims that perhaps 4 of these 5 are reasonable at least under certain circumstances.  The following comments were provided on Twitter and Facebook: Daniel Lakens makes recommendations as to which one we should consider the preferred effect size measure.  Tal Yarkoni and Uli Schimmack wonder about the appropriateness of effect size in within-subject designs and prefer unstandardized effects (see Jan Vanhove's blog).  Rickard Carlsson prefers effect sizes in physical units where possible, say in milliseconds in my Effect Size Puzzler.  Sanjay Srivastava needs the goals and contexts first before weighing in.  If I got this wrong, please let me know.

From an experimental perspective, The Effect Size Puzzler is as simple as it gets.  Surely we can do better than to abandon the concept of standardized effect sizes or to be mired in arbitrary choices.

### Modeling: the only way out

Psychologists often think of statistics as procedures, which, in my view, is the most direct path to statistical malpractice.  Instead, statistical reasoning follows from statistical models.  And if we have a few guidelines and a model, then standardized effect sizes are well defined and useful.  Showing off the power of model thinking rather than procedure thinking is why I came up with the puzzler.

### Effect-size guidelines

#1:  Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population.

#2:  Measures of true effect and true amount of variability are only defined in statistical models.  They don't really exist except within the context of a model.  The model is important.  It needs to be stated.

#3: The true effect size should not be tied to the number of participants nor the number of trials per participant.  True effect sizes characterize a state of nature independent of our design.

### The Puzzler Model

I generated the data to be realistic.  They had the right amount of skew and offset, and the tails fell like real RTs do.   Here is a graph of the generating model for the fastest and slowest individuals:

All data had a lower shift of .3s (see green arrow), because we typically trim these out as being too fast for a choice RT task.  The scale was influenced by both an overall participant effect and a condition effect, and the influence was multiplicative.  So faster participants had smaller effects; slower participants had bigger effects.  This pattern too is typical of RT data.   The best way to describe these data is in terms of percent-scale change.  The effect was to change the scale by 10.5%, and this amount was held constant across all people.  And because it was held constant, that is, there was no variability in the effect,  the standardized effect size in this case is infinitely large.

Now, let's go explore the data.  I am going to skip over all the exploratory stuff that would lead me to the following transform, Y = log(RT-.3), and just apply it.  Here is a view of the transformed generating model:

So, let's put plain-old vanilla normal models on Y.  First, let's take care of replicates.
$Y_{ijk} \sim \mbox{Normal} (\mu_{ij},\sigma^2)$
where $$i$$ indexes individuals, $$j=1,2$$ indexes conditions, and $$k$$ indexes replicates.  Now, let's model $$\mu_{ij}$$.  A general formulation is
$\mu_{ij} = \alpha_i+x_j\beta_i,$
where $$x_j$$ is a dummy code of 0 for Condition 1 and 1 for Condition 2.  The term $$\beta_i$$ is the ith individual's effect.  We can model it as
$\beta_i \sim \mbox{Normal}(\beta_0,\delta^2)$
where $$\beta_0$$ is the mean effect across people and $$\delta^2$$ is the variation of the effect across people.

With this model, the true effect size is
$d_t = \frac{\beta_0}{\delta}.$
Here, by true, I just mean that it is a parameter rather than a sample statistic.  And that's it, and there is not much more to say in my opinion.

In my simulations the true value of each individual's effect was .1.  So the mean, $$\beta_0$$, is .1 and the standard deviation, $$\delta$$, is, well, zero.  Consequently, the true standardized effect size is $$d_t=+\infty$$.  I can't justify any other standardized measure that captures the above principles.

### Analysis

Could a good analyst have found this infinite value?  That is a fair question.  The plot below shows individuals' effects, and I have ordered them from smallest to largest.  A key question is whether these are spread out more than expected from within-cell sample noise alone.  If these individual sample effects are more spread out, then there is evidence for true individual variation in $$\beta_i$$.  If these stay as clustered as predicted by sample noise alone, then there is evidence that people's effects do not vary.  The solid line is the prediction from within-cell noise alone.  It is pretty darn good.  (The dashed line is the null that people have the same, zero-valued true effect).

I also computed a one-way random-effects F statistic to see if there is a common effect or many individual effects.  The result was F(24,2450) = 1.03.  Seems like one effect.

These one-effect results should be heeded.
It is a structural element that I would not want to miss in any data set.  We should hold plausible the idea that the standardized effect size is exceedingly high as the variation across people seems very small if not zero.

To estimate effect sizes, we need a hierarchical model.  You can use Mplus, AMOS, LME4, WinBugs, JAGS, or whatever you wish.  Because I am old and don't learn new tricks easily, I will do what I always do and program these models from scratch.

I used the general model above in the Bayesian context.  The key specification is the prior on $$\delta^2$$.  In the log-normal, the variance is a shape parameter, and it is somewhere around $$.4^2$$.  Effects across people are usually about 1/5th of this, say $$.08^2$$.  To capture variances in this range, I would use a $$\delta^2 \sim \mbox{Inverse Gamma}(.1,.01)$$ prior for general estimation.  This is a flexible prior tuned for the 10 to 100 millisecond range for variation in effects across people.  The following plot shows the resulting estimates of individual effects as a function of the sample effect values.  The noteworthy feature is the lack of variation in model estimates of individuals' effects!  This type of pattern, where variation in model estimates is attenuated compared to sample statistics, is called shrinkage, and it occurs because hierarchical models don't chase within-cell sample noise.  Here the shrinkage is nearly complete, leading again to the conclusion that there is no real variation across people, or an infinitely large standardized effect size.  For the record, the estimated effect size here is 5.24, which, in effect size units, is getting quite large!

The final step for me is comparing this variable effect model to a model with no variation, say $$\beta_i = \beta_0$$ for all people.  I would do this comparison with Bayes factor.  But, I am out of energy and you are out of patience, so we will save it for another post.
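For readers who want to try the one-effect check themselves, here is a sketch on simulated data.  It is not the code that generated the puzzler data.  The shift of .3 s, the constant effect of .1 on the log scale, and the 25 by 50 by 2 design come from the post; the participant means `alpha` and the log-scale noise `sigma` are assumed values for illustration.  The F statistic is the participant-by-condition interaction, which for two conditions reduces to the variance of the individual sample effects relative to the pooled within-cell variance.

```r
set.seed(1)
I <- 25; K <- 50                  # people and trials per condition
beta <- 0.1                       # constant effect on the log scale (from the post)
sigma <- 0.4                      # assumed within-cell sd on the log scale
alpha <- rnorm(I, -0.5, 0.2)      # assumed participant means on the log scale

dat <- expand.grid(k = 1:K, cond = 1:2, id = 1:I)
mu <- alpha[dat$id] + (dat$cond - 1) * beta
dat$rt <- 0.3 + exp(rnorm(nrow(dat), mu, sigma))   # shifted log-normal RTs
dat$y <- log(dat$rt - 0.3)                         # the transform from the post

cellMean <- tapply(dat$y, list(dat$id, dat$cond), mean)
cellVar  <- tapply(dat$y, list(dat$id, dat$cond), var)
d <- cellMean[, 2] - cellMean[, 1]   # individual sample effects
mse <- mean(cellVar)                 # pooled within-cell variance (balanced design)

# Interaction F: spread of individual effects relative to within-cell noise
Fstat <- (K / 2) * var(d) / mse
p <- pf(Fstat, I - 1, 2 * I * (K - 1), lower.tail = FALSE)
print(c(F = Fstat, df1 = I - 1, df2 = 2 * I * (K - 1), p = p))
```

Because the simulated effect is constant across people, the F statistic should hover around 1 from run to run.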
### Back To Jake Westfall

Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure.  He does this to stay true to Cohen's formulae.  I think it is a conceptual mistake.

I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people.  In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design.  Within-subject designs are superior, so why go backwards and mash the sources of variances together when you don't have to?  This advice strikes me as crazy.

To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d.  Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures.  Just don't forget the hierarchical shrinkage when you use it!

## Thursday, March 24, 2016

### The Effect-Size Puzzler

Effect sizes are bandied about as useful summaries of the data.  Most people think they are straightforward and obvious.  So if you think so, perhaps you won't mind a bit of a challenge?  Let's call it "The Effect-Size Puzzler," in homage to NPR's CarTalk.  I'll buy the first US winner a nice Mizzou sweatshirt (see here).  Standardized effect size please.

I have created a data set with 25 people each observing 50 trials in 2 conditions.  It's from a priming experiment.  It looks about like real data.  Here is the download.  The three columns are:
• id (participant: 1...25)
• cond (condition: 1,2)
• rt (response time in seconds).

There are a total of 2500 rows.  I think it will take you just a few moments to load it and tabulate your effect size for the condition effect.

Have fun.  Write your answer in a comment or write me an email.  I'll provide the correct answer in a blog next week.
HINT: If you wish to get rid of the skew and stabilize the variances, try the transform y=log(rt-.3)

## Monday, March 21, 2016

### Roll Your Own II: Bayes Factors With Null Intervals

The Bayes factors we develop compare the null model to an alternative model.  This null model is almost always a single point---the true effect is identically zero.  People sometimes confuse our advocacy for Bayes factor with that for point-null-hypothesis testing.  They even critique Bayes factor with the Cohenesque claim that the point null is never true.

Bayes factor is a general way of measuring the strength of evidence from data for competing models.  It is not tied to the point null.  We develop for the point null because we think it is a useful, plausible, theoretically meaningful model.  Others might disagree, and these disagreements are welcome as part of the exchange of viewpoints in science.

In the blog post Roll Your Own: How to Compute Bayes Factors for Your Priors, I provided R code to compute a Bayes factor between a point-null and a user-specified alternative for a simple setup motivated by the one-sample t-test.  I was heartened by the reception and I hope a few of you are using the code (or the comparable code provided by Richard Morey).  There have been some requests to generalize the code for non-point nulls.  Here, let's explore the Bayes factor for any two models in a simple setup.  As it turns out, the generalization is instructive and computationally trivial.  We have all we need from the previous posts.

### Using Interval Nulls: An Example

Consider the following two possibilities:

I. Perhaps you feel the point null is too constrained and would rather adopt a null model with mass on a small region around zero rather than at the point.  John Kruschke calls these regions ROPEs (regions of practical equivalence).

II. Perhaps you are more interested in the direction of an effect rather than whether it is zero or not.
In this case, you might consider testing two one-sided models against each other.

For this blog, I am going to retain four different priors.  Let's start with a data model.  Data are independent normal draws with mean $$\mu$$ and variance $$\sigma^2$$.  It is more convenient to re-express the normal as a function of effect size, $$\delta$$, and $$\sigma^2$$, where $$\delta=\mu/\sigma$$.  Here is the formal specification:
$Y_i \mid \delta,\sigma^2 \stackrel{iid}{\sim} \mbox{Normal}(\sigma\delta,\sigma^2).$
Now, the substantive positions as prior models on effect size:
1. $$M_0$$, A Point Null Model: $$\delta=0$$
2. $$M_1$$, A ROPE Model: $$\delta \sim \mbox{Unif}(-.25,.25)$$
3. $$M_2$$, A Positive Model: $$\delta \sim \mbox{Gamma}(3,2.5)$$
4. $$M_3$$, A Negative Model: $$-\delta \sim \mbox{Gamma}(3,2.5)$$

Here are these four models expressed graphically as distributions:

I picked these four models, but you can pick as many as you wish.  For example, you can include a normal if you wish.

Oh, let's look at some data.  Suppose the observed effect size is .35 for an N of 60.

### Going Transitive

Bayes factors are the comparison between two models.  Hence we would like to compute the Bayes factors between any of these models.  Let $$B_{ij}$$ be the comparison between the ith and jth model.  We want a table like this:
$\begin{array}{cccc} B_{00} & B_{01} & B_{02} & B_{03} \\ B_{10} & B_{11} & B_{12} & B_{13} \\ B_{20} & B_{21} & B_{22} & B_{23} \\ B_{30} & B_{31} & B_{32} & B_{33} \end{array}$
Off the bat, we know the Bayes factor between a model and itself is 1 and that $$B_{ij} = 1/B_{ji}$$.  So we only need to worry about the lower corner.
$\begin{array}{cccc} 1 & & & \\ B_{10} & 1 & & \\ B_{20} & B_{21} & 1 & \\ B_{30} & B_{31} & B_{32} & 1 \end{array}$
We can use the code below, from the previous post, to figure out the null vs. all the other models.
$B_{10} = 4.9, \quad B_{20} = 4.2, \quad B_{30} = .0009$
Here we see that the point null is not as attractive as the ROPE null or the positive model.
It is more attractive, however, than the negative model.

Suppose, however, that you are most interested in the ROPE null and its comparison to the positive and negative models.  The missing Bayes factors are $$B_{12}$$, $$B_{13}$$, and $$B_{23}$$.  The key application of transitivity is as follows:
$B_{ij} = B_{ik} \times B_{kj}.$

So, we can compute $$B_{12}$$ as follows: $$B_{12} = B_{10} \times B_{02} = B_{10}/B_{20} = 4.9/4.2 = 1.2$$.  The other two Bayes factors are computed likewise: $$B_{13} = B_{10}/B_{30} = 5444$$ and $$B_{23} = B_{20}/B_{30} = 4667$$.

So what have we learned?  Clearly, if you were pressed to choose a direction, it is the positive direction.  That said, the data barely discriminate between the positive model and the ROPE null: $$B_{12} = 1.2$$, which, if anything, leans slightly toward the ROPE.

### Snippets of R Code

#First, Define Your Models as a List
#lo, lower bound of support
#hi, upper bound of support
#fun, density function

#here are Models M1, M2, M3
#add or change here for your models
mod1=list(lo=-.25,hi=.25,fun=function(x,lo,hi) dunif(x,lo,hi))
mod2=list(lo=0,hi=Inf,fun=function(x,lo,hi) dgamma(x,shape=3,rate=2.5))
mod3=list(lo=-Inf,hi=0,fun=function(x,lo,hi) dgamma(-x,shape=3,rate=2.5))

#note, we dont need to specify the point null, it is built into the code

#Lets make sure the densities are proper, here is a function to do so:
normalize=function(mod) return(c(mod,K=1/integrate(mod$fun,lower=mod$lo,upper=mod$hi,lo=mod$lo,hi=mod$hi)$value))

#and now we normalize the three models
mod1=normalize(mod1)
mod2=normalize(mod2)
mod3=normalize(mod3)

#Observed Data
es=.35
N=60

#Here is the key function that computes the Bayes factor between a model and the point null
BF.mod.0=function(mod,es,N) {
f= function(delta) mod$fun(delta,mod$lo,mod$hi)*mod$K
pred.null=dt(sqrt(N)*es,N-1)
altPredIntegrand=function(delta,es,N) dt(sqrt(N)*es,N-1,sqrt(N)*delta)*f(delta)
pred.alt= integrate(altPredIntegrand,lower=mod$lo,upper=mod$hi,es=es,N=N)$value
return(pred.alt/pred.null)
}

B10=BF.mod.0(mod1,es,N)
B20=BF.mod.0(mod2,es,N)
B30=BF.mod.0(mod3,es,N)

print(paste("B10=",B10,"   B20=",B20,"   B30=",B30))

B12=B10/B20
B13=B10/B30
B23=B20/B30

print(paste("B12=",B12,"   B13=",B13,"   B23=",B23))
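
Because every Bayes factor here is taken against the same point null, the whole table follows from the three values in the null column.  Here is a minimal sketch of that bookkeeping.  The names `b0` and `B` are mine, and the inputs are the rounded values reported above rather than fresh output from `BF.mod.0`, so the entries will match the text only to rounding.

```r
# Bayes factors of each model against the point null, M0 first.
# These are the rounded values reported in the post, not recomputed.
b0 <- c(M0 = 1, M1 = 4.9, M2 = 4.2, M3 = 0.0009)

# Transitivity: B_ij = B_i0 * B_0j = B_i0 / B_j0,
# so the full table is the outer ratio of the null column with itself.
B <- outer(b0, b0, "/")

# Row i, column j holds the Bayes factor for model i relative to model j.
print(signif(B, 3))
```

Adding a fifth model only requires one more entry in `b0`; the rest of the table comes for free.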

## Tuesday, March 15, 2016

### Statistical Difficulties from the Outer Limits

You would think that the more data we collect, the closer we should be to the truth.

This blog post falls into the "I may be wrong" category.  I hope many of you comment.

### ESP: God's Gift To Bayesians?

It seems like ESP is God's gift to Bayesians.  We use it like a club to reinforce the plausibility of null hypotheses and to point out the difficulties of frequentist analysis.

In the 1980s, a group of Princeton University engineers set out to test ESP by asking people to use their minds to change the outcome of a random noise generator (check out their website).   Over the course of a decade, these engineers collected an astounding 104,490,000 trials.  On each trial, the random noise generator flipped a gate with known probability of exactly .5.  The question was whether a human operator using only the power of his or her mind could increase this rate.  Indeed, they found 52,263,471 gate flips, or 0.5001768 of all trials.  This proportion, though only slightly larger than .5, is nonetheless significantly larger with a damn low p-value of .0003.   The figure below shows the distribution of successes under the null, and the observation is far to the right.  The green interval is the 99% CI, and it does not include the null.
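
For readers who want to check the arithmetic, here is a rough sketch of the test from the reported counts.  I am assuming a two-sided test based on the normal approximation to the binomial; that is my guess at how a p-value near .0003 arises, not a statement about what the original investigators computed.  The variable names are mine.

```r
# Counts as reported above
N <- 104490000   # total trials
y <- 52263471    # gate flips ("successes")

# Observed proportion of successes
p_hat <- y / N

# Normal approximation to the binomial under the null theta = .5:
# mean N/2, standard deviation sqrt(N/4)
z <- (y - N / 2) / sqrt(N / 4)

# Two-sided p-value (an assumption; a one-sided test would halve it)
p_two_sided <- 2 * pnorm(-abs(z))

print(c(proportion = p_hat, z = z, p = p_two_sided))
```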

Let's assume these folks have a decent set up and the true probability should be .5 without human ESP intervention.  Did they show ESP?

What do you think?  Their data are numerous, but do you feel closer to the truth?  Impressed by the low p-value?  Bothered by the wafer-thin effect?  Form an opinion; leave a comment.

Bayesians love this example because we can't really fathom what a poor frequentist would do.  The p-value is certainly lower than .05, even lower than .01, and even lower than .001.  So, it seems like a frequentist would need to buy in.  The only way out is to lower the Type I error rate in response to the large sample size.  But to what value and why?

### ESP: The Trojan Horse?

ESP might seem like God's gift to Bayesians, but maybe it is a Trojan Horse.  A Bayes factor model comparison analysis goes as follows.  The no-ESP null model is
$M_0: Y \sim \mbox{Binomial}(.5,N)$

The ESP alternative is
$M_1: Y|\theta \sim \mbox{Binomial}(\theta,N)$
A prior on $$\theta$$ is needed to complete the specification.  For the moment, let's use a flat one, $$\theta \sim \mbox{Unif}(0,1)$$.

It is pretty easy to calculate a Bayes factor here, and the answer is 12-to-1 in favor of the null.   What a relief.

ESP proponents might rightly criticize this prior as too dispersed.  We may reasonably assume that $$\theta$$ should not be less than .5 as we can assume the human operators are following the direction to increase rather than decrease the proportion of gate flips.   Also, the original investigators might argue that it is unreasonable to expect anything more than a .1% effect, so the top might be .501.  In fact, they might argue they ran such a large experiment because they expected, a priori, such a small effect.  If the prior is $$\theta \sim \mbox{Unif}(.5,.501)$$, then the Bayes factor is 84-to-1 for an ESP effect.
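
Here is one way to roll these two Bayes factors yourself.  It is a sketch rather than the code behind the numbers above.  It leans on the fact that the binomial likelihood integrated over an interval equals the Beta(y+1, N-y+1) probability of that interval divided by N+1, so a uniform prior on (a, b) needs only `pbeta`.  The function name `bf_unif_vs_point` and its arguments are mine.

```r
# Bayes factor for M1: theta ~ Unif(a, b) against M0: theta = .5,
# given y successes in N binomial trials.
bf_unif_vs_point <- function(y, N, a, b) {
  # Log predictive probability of y under the point null
  log_null <- dbinom(y, N, 0.5, log = TRUE)
  # Integral of the binomial likelihood over (a, b):
  # Beta(y + 1, N - y + 1) probability of (a, b), divided by (N + 1)
  mass <- pbeta(b, y + 1, N - y + 1) - pbeta(a, y + 1, N - y + 1)
  # Dividing by (b - a) applies the uniform prior density
  log_alt <- log(mass) - log(N + 1) - log(b - a)
  exp(log_alt - log_null)
}

N <- 104490000
y <- 52263471

bf_flat   <- bf_unif_vs_point(y, N, 0, 1)        # flat prior vs. null
bf_narrow <- bf_unif_vs_point(y, N, 0.5, 0.501)  # Unif(.5, .501) vs. null

# Report each in the direction used in the text
print(c(null_over_flat = 1 / bf_flat, narrow_over_null = bf_narrow))
```

By my arithmetic the two printed values should land near the 12-to-1 and 84-to-1 figures quoted above.  Working on the log scale keeps the tiny predictive probabilities from underflowing.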

The situation seems tenuous.  The figure below shows the Bayes factors for both priors as a function of the number of trials.  To draw these curves, I simply kept the proportion of successes constant at 0.5001768.  The line is for the observed number of trials.  With this proportion, the Bayes factors not only depend on the prior, they also depend in unintuitive ways on sample size.  For instance, if we doubled the number of trials and successes, the Bayes factors become about 40-to-1 and 40,000-to-1 in favor of an ESP effect for the flat prior and the very small interval one, respectively.
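
The dependence on sample size can be traced with a few lines.  This is a sketch under my own assumptions: the observed proportion is held fixed, the number of successes is rounded to a whole number, and the grid of trial counts (`N_grid`) is an arbitrary choice of mine, not the grid used for the figure.

```r
# Bayes factor for theta ~ Unif(a, b) against the point null theta = .5.
# Vectorized over y and N.
bf_unif_vs_point <- function(y, N, a, b) {
  log_null <- dbinom(y, N, 0.5, log = TRUE)
  mass <- pbeta(b, y + 1, N - y + 1) - pbeta(a, y + 1, N - y + 1)
  exp(log(mass) - log(N + 1) - log(b - a) - log_null)
}

# Hold the observed proportion fixed and vary the number of trials
p_obs  <- 52263471 / 104490000
N_grid <- c(1e4, 1e6, 104490000, 2 * 104490000)  # hypothetical grid
y_grid <- round(p_obs * N_grid)

flat   <- bf_unif_vs_point(y_grid, N_grid, 0, 1)
narrow <- bf_unif_vs_point(y_grid, N_grid, 0.5, 0.501)

# Values above 1 favor the ESP alternative; values below 1 favor the null
print(data.frame(N = N_grid, y = y_grid,
                 flat = signif(flat, 3), narrow = signif(narrow, 3)))
```

Reading down the `flat` column, the evidence starts out for the null and ends up against it, even though the proportion never changes.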

Oh, I can see the anti-Bayes crowd getting ready to chime in should they read this.   Sanjay Srivastava may take the high road and discuss the apparent lack of practicality of the Bayes factor.  Uri Simonsohn may boldly declare that Bayesians can't find truth.   And perhaps Uli Schimmack will create a new index, the M-Index, where M stands for moron.  Based on his analysis of my advocacy, he may declare I have the second highest known M-Index, perhaps surpassed only by E.-J. Wagenmakers.

Seems like ESP was a bit of a Trojan Horse.  It looked all good, and then turned on us.

### But What Happened?

Bayes' rule is ok of course.  The problem is us.  We tend to ask too much of statistics.   Before I get to my main points, I need to address one issue: what is the model?  Many will call the data model, the binomial specification in this case, "the model."  The other part, the priors on parameters, is not part of "the model"; it is the prior.  Yet, it is better to think of "the model" as the combination of the binomial and prior specification.  It's all one model, and this one model provides an a priori predictive distribution about where the data should fall (see my last blog post).  The binomial is a conditional specification, and the prior completes the model.

With this in mind, the above figure strikes me as quite reasonable.  Consider the red line, the one that compares the null to the model where the underlying probability ranges across the full interval.  Take the point for 10,000 trials.   The number of successes is 5,002, which is almost 1/2 of all trials.  Not surprisingly,  this value is evidence for the null compared to this diffuse alternative.  But the same value is not evidence for the null compared to the more constrained alternative model where $$.5<\theta<.501$$.  Both the null and this alternative are about the same for 10,000 trials, and each predicts 5,002 successes out of 10,000 trials equally well.  Hence,  the Bayes factor is equivocal.  This alternative and the null are so similar that it takes way more data to discriminate among them.   As we gain more and more data, say 100,000,000 trials, the  slight discrepancy from 1/2 can be resolved, and the Bayes factors start to favor the alternative models.  As the sample size is increased further, the discrepancy becomes more pronounced.  Everything in that figure makes beautiful sense to me; it all is as it should be.  Bayes' rule is ok.

Having more and more data doesn't get us closer to the truth.  It does, however, give us greater resolution to more finely discriminate among models.

### Loose Ends

The question, "is there an effect" strikes me as ill formed.   Yet, we answer the question affirmatively daily.  Sometimes, effects are obvious, and they hit you between the eyes.  How can that be if the question is not well formed?

I think when there are large effects, just about any diffuse alternative model will do.  As long as the alternative is diffuse, data with large effects easily discriminate this diffuse alternative from the null.  It is in this sense that effects are obviously large.

What this example shows is that if one tries to resolve small effects with large sample sizes, there is intellectual tension.  Models matter.  Models are all that matter.  Large data gives you greater resolution to discriminate among similar models.  And perhaps little else.

### The Irony Is...

This ESP example is ironic.  The data are so numerous that they are capable of finely discriminating among just about any set of models we wish, even the difference between a point null and a uniform null subtending .001 in width on the probability scale.  The irony is that we have no bona-fide competing models to discriminate.  ESP by definition seemingly precludes any scientific explanation, and without such explanation, all alternatives to the null are a bit contrived.  So while we can discriminate among models, there really is only one plausible one, the null, and no need for discrimination at all.

If forced to do inference here (which means someone buys me a beer),  I would choose the full-range uniform as the alternative model and state the 12-to-1 ratio for the null.  ESP is such a strange proposition that why would values of $$\theta$$ near .5 be any more a priori plausible than those away from it?