Psychology – Prefrontal.org

Scientific Consensus On Brain Training

prefrontal — Tue, 21 Oct 2014 05:25:56 +0000

A lot of big cognitive scientists and neuroscientists endorsed the following statement:

“We object to the claim that brain games offer consumers a scientifically grounded avenue to reduce or reverse cognitive decline when there is no compelling scientific evidence to date that they do. The promise of a magic bullet detracts from the best evidence to date, which is that cognitive health in old age reflects the long-term effects of healthy, engaged lifestyles. In the judgment of the signatories below, exaggerated and misleading claims exploit the anxieties of older adults about impending cognitive decline. We encourage continued careful research and validation in this field.”

Couldn’t have said it better myself.

Read the full statement here.
The complete list of signatories can be found on the press release.
Don’t forget the prefrontal.org post from last month on brain training.

Thoughts on Brain Training

prefrontal — Sun, 28 Sep 2014 01:33:58 +0000

Last Monday I had the opportunity to participate in a response panel as part of the Cottage Rehabilitation Hospital series on Empowerment Through Medical Rehabilitation. This year they brought Dr. Michael Merzenich in from UC San Francisco. Dr. Merzenich is an expert on neuroplasticity and is also the founder of several companies dedicated to cognitive enhancement through training. He is a research powerhouse, with hundreds of publications and a CV that is probably taller than I am. His most famous startup is likely PositScience, which you can find at the site brainhq.com.

Dr. Merzenich gave a great 80-minute talk to a crowded room. His talk was pretty similar to others that I have watched on YouTube, so if you are curious about his research you might head over there. Here is a link to his Google TechTalk, which was quite good: http://www.youtube.com/watch?v=UyPrL0cmJRs.

I had a whole series of points ready to bring up to try and foster some lively discussion during the panel. Due to time issues the panel was only about 20-25 minutes long. This didn’t leave much room for lively discussion after introductions were done. Good material for a blog post though…

What follows are some elaborations on the points that I had prepared for the response panel. They are all my opinion, occasionally backed by specific citations and research. They are deliberately intended to be controversial, which will hopefully make for interesting reading.

* Our brain function and structure certainly reflect their environment

Developmental psychologists and neuroscientists have known for some time that the brain is plastic and responds to its environment. After all, we research a period of life that is highly plastic, with a four-year-old brain having 10x as many connections as it will later in life. Changes in the brain of a child are sculpted by a mixture of genetic predisposition and environmental context. That is, some changes are hard-wired and will happen no matter what, while other changes are in response to the environment.

Even later in life it is recognized that the brain can change its structure and function and adapt to its surroundings. Among others, milestone neuroimaging studies investigating structural differences in the brains of London taxi drivers (Maguire et al., 2000) and changes following learning to juggle (Boyke et al., 2008) validated everything that we already knew: the brain is not a static organ. In that context brain training isn’t a revolution in our understanding of the brain, but a new method to mold that plasticity in the direction of our choosing.

* Everything, even your breakfast, changes your brain

The vast majority of functional neuroimaging is done with a defined cognitive task. The investigator will look at the relative difference in activity between different conditions of the task to determine indices of activity. A more recent investigative method, resting state imaging, is different. The subject is given no instructions, other than to sit quietly in the scanner and clear their head to the best of their ability. From the data collected during this scan you can begin to look at which brain areas have correlated activity, indicating that they tend to work together. By doing scans before and after a specific intervention, you can then learn about which functional networks are affected by treatment. So, what have we learned from these studies?

Everything changes the brain. Everything.

The Neuroskeptic had a great argument in a recent blog post, which is that even the act of opening your eyes can have a dramatic influence on the state of the brain. This can be seen even in commodity EEG equipment as a change in power in the alpha band. He was discussing a recent study that found a single dose of an antidepressant can have a significant effect on functional connectivity in the brain (Schaefer et al., 2014).

The challenge to neuroscientists is to separate the wheat from the chaff. If everything changes the brain, then which changes are actually relevant? Much of the brain training research does a good job of this, looking for measurable changes in behavior following training. As for the rest of neuroimaging, well, I have a feeling that we are going to see a lot more “crossing the street changes the brain” type of articles…
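To make the resting state idea concrete, here is a minimal sketch of what "correlated activity" means in practice. It assumes each brain region has already been reduced to a single time series, and it uses simulated data: the region names A, B, and C and the noise levels are invented for illustration. Real pipelines involve far more preprocessing than this.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def connectivity(series):
    """Region-by-region correlation table for a dict of time series."""
    names = sorted(series)
    return {(a, b): pearson(series[a], series[b]) for a in names for b in names}

# Simulated scan (hypothetical): regions A and B share a slow signal, C does not.
rng = random.Random(0)
shared = [math.sin(t / 5.0) for t in range(200)]
scan = {
    "A": [s + rng.gauss(0, 0.3) for s in shared],
    "B": [s + rng.gauss(0, 0.3) for s in shared],
    "C": [rng.gauss(0, 1.0) for _ in range(200)],
}
fc = connectivity(scan)
print(round(fc[("A", "B")], 2), round(fc[("A", "C")], 2))
```

Regions A and B come out strongly correlated and would be grouped into one "network", while C would not. Comparing a table like this before and after an intervention is the basic shape of the analysis described above.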

* Cognitive capacity is a mix of genetic and environmental factors

There seems to be an approximate 50/50 split in terms of what determines your intelligence. About half comes from your parents in the form of your DNA. The other half comes from the sum total of all experiences you have ever had in your life, including brain training. This is great, because it means that we have a lot of leeway to push our cognitive abilities around. Make positive changes and you will see improvements, make negative changes and you will see degradation. A second point should be made here though, which is that each of us carries genetic predispositions that will be difficult or impossible to change through training.

One extreme example of genetic factors at play is in psychopathy. There is evidence that it is difficult or impossible to rehabilitate true clinical psychopaths. I never got a chance to ask Dr. Merzenich about this counterexample. I have to believe that he would say that even clinical psychopaths could benefit from brain training. I have a different view, which is that a certain percentage of our cognition is hard-wired, and may or may not be able to be changed. Some of us have burdens that cannot be trained away.

* Individual differences can create very different training outcomes

We typically look at the results of an fMRI study, see the blobs, and immediately assume that the entire group of research subjects must have been using those areas of the brain. This is somewhat true in terms of the statistics, but there is a richer story to be understood. First, some subjects do not show increases in activity, and some even show reductions in activity! Further, the patterns of activity across the rest of the brain will often be wildly different from one individual to another. My postdoc advisor, Michael Miller, has done a lot of research on this, going so far as to call each person’s pattern of activity a “neural fingerprint”.

What does this mean for brain training? Well, if everyone is using a highly varied mix of functional networks as they navigate a cognitive task then we would likely expect to see a highly varied response to brain training as a consequence. For example, if you leverage executive working memory to a greater degree than I do while completing a task and we then complete a working memory brain training program then you would likely demonstrate a greater benefit than I would.

One issue I had with Dr. Merzenich’s talk was his assertion that every ability could be improved at any age. This is a very strong argument. He cited an animal study which showed that cognitive training in rats led to a range of behavioral and physical improvements, right down to improved immune function. Knowing that each individual human could have a different response to each form of brain training, I would put forward that most cognitive abilities could be improved using brain training. I would further argue that there will be a varied generalization of brain training to other tasks that remain untrained.

* The relevant networks for training might be counter-intuitive

It will not always be immediately apparent which functional networks in the brain should be targeted with brain training to derive a practical cognitive benefit. This will be particularly true as you try to train cognitive abilities that lie higher and higher in the hierarchy of function. For example, how would you train an adolescent to better handle risky situations? The common refrain here would be to train inhibitory control, so that the teen could better inhibit a prepotent response. That may have some benefit, but evidence is accumulating that one representation of social risk is modeled by an ‘as-if’ loop, which attempts to predict body and emotional state following a decision (Preuschoff et al., 2008). It may be that something like mindfulness meditation would be more beneficial here than brain training (but I am only guessing).

* Why aren’t more people using brain training?

Dr. Merzenich suggested that there were immediate benefits following brain training that lasted for years, transferred to other cognitive abilities, and had a significant effect size. He argued that brain training was the logical path to enhanced cognitive capacity and cognitive rehabilitation. Sounds persuasive, yes? If all of that is true, then why isn’t every hospital and school clamoring for access to the training programs? Insurance companies aren’t fools, contrary to the opinion of most Americans. If they can obtain the same benefit from $1k of brain training instead of $10k of traditional treatment, do you not think that they would be pushing doctors to prescribe it? What’s the hangup?

* Social interaction as a highly complex cognitive task

I thought that one subject that didn’t get a lot of play in Dr. Merzenich’s talk was the cognitive requirements of social interaction. Social interaction is not frequently viewed as a highly complex cognitive task. That is largely because we are so damned good at it. You can model the current internal state of multiple individuals while simultaneously predicting how that internal state will change based on some tidbit of information that you are about to communicate. That is simply incredible. There is a reason that it takes us over two decades to become ‘adult’ in terms of our behavior, and a large part of that is having the processing capacity and social experience necessary to interact with others as an adult. It takes way more than 10,000 hours…

I bring this up mostly because I think that social interaction is a critical part of our daily lives, and I see it as critical during the rehabilitation process as well. Which patient has a better overall outcome: the one who does brain training alone at home twice a week or the one who sees a therapist for training twice a week? My money is on the latter. I say this not only for the obvious emotional benefit of social interaction, but because navigating the social landscape is one of the most cognitively difficult tasks that we, as humans, accomplish.

* The need to define a ‘challenging’ cognitive task

I think that much of the benefit that comes from brain training has relatively little to do with the tasks that are presented and more to do with the progressive difficulty of the software that drives them. One major point in Dr. Merzenich’s talk that deserved additional attention was that there was very little benefit to be had with brain training if the participant is not adequately challenged. That means that the task cannot be too difficult, causing frustration, and cannot be too easy, meaning no benefit. Instead, you want the participant to get the majority of trials correct (~80%) and still have room to grow.
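One standard way to hold someone near that ~80% mark is an adaptive staircase borrowed from psychophysics. The sketch below uses a 3-down/1-up rule, which settles where three correct answers in a row are a coin flip, or about 79% correct per trial. To be clear, this is a swapped-in textbook technique and not a description of what any particular brain training product does. The simulated trainee and the level-to-accuracy mapping are made up.

```python
import random

def run_staircase(p_correct_at, levels=20, trials=5000, seed=1):
    """3-down/1-up staircase: harder after 3 correct in a row, easier after any miss."""
    rng = random.Random(seed)
    level, streak, hits = 0, 0, 0
    for _ in range(trials):
        correct = rng.random() < p_correct_at(level)
        hits += correct
        if correct:
            streak += 1
            if streak == 3:  # three in a row: step difficulty up
                level = min(level + 1, levels - 1)
                streak = 0
        else:  # any error: step difficulty down
            level = max(level - 1, 0)
            streak = 0
    return hits / trials

# Hypothetical trainee: accuracy starts at 99% and drops 4 points per level.
def trainee(level):
    return max(0.5, 0.99 - 0.04 * level)

accuracy = run_staircase(trainee)
print(f"overall accuracy: {accuracy:.2f}")
```

The difficulty level drifts up as long as the trainee keeps succeeding and backs off after a miss, so the task tracks the person's current ability instead of staying fixed.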

When I was in grad school my roommate dropped me in an MRI scanner for 2.5 hours to complete a mildly complex task. To answer your questions, yes, I did that task for over two hours and, no, it was not fun at all. What the data showed at the end was fairly amazing though. At first, when the task was difficult, I had to utilize a range of frontal and parietal resources to complete the task. It took effort. Two hours later, when I didn’t even have to think about the task anymore, the majority of activity was subcortical. It had become automatic.

We may not get a lot of cognitive benefit out of social interaction because we are already experts in social interaction. Likewise, when you first start doing crosswords every day you will likely see a cognitive bump, but as you get better at crossword puzzles the marginal increase you see will diminish. This is the secret sauce of brain training. If you had a set of crossword puzzles that increased in difficulty as your skills improved I think that you would continue to observe improvements over time.

* Brain training seems to have mixed results

First, I haven’t read the entire body of literature on brain training, so I am limited to the handful of papers that I have reviewed. One consistent thread that I have observed is that the training can offer improvements that confer long-term benefits on a scale of many years. Still, the effect size of the studies I reviewed isn’t always huge.

A recent study by Wolinsky et al. (2013) showed an effect size of around 0.25 for most of their training, as indexed by the Cohen’s d statistic. This was across several experimental conditions one year after the initial training.

Another, larger study from the NIH found that cognitive benefits can stay around for over a decade. The Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) study followed 2,832 participants for ten years following an initial six-week brain training program (Rebok et al., 2014). After ten years approximately 70% of subjects in the training group were above their initial performance baseline. This is relative to 50-60% in the control group. The effect sizes varied depending on the condition, but ranged from approximately 0.20 to 0.50.

Head on over to the Cohen’s d effect size interpreter at http://rpsychologist.com/d3/cohend/ and look at the overlap of the treatment and control group distributions for a d of 0.2. Yeah, there is a benefit, but there is not a dramatic shift. Effect sizes of 0.5, which were observed in the ACTIVE study, are better. Still, you are not jumping several standard deviations up the curve.
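If you would rather see the numbers than the picture, the overlap can be computed directly. This is a small sketch that assumes both groups are normally distributed with equal variance, which is the usual idealization behind Cohen's d and not something the studies themselves guarantee. The function names are mine.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def overlap(d):
    """Overlapping coefficient of two unit-variance normals whose means differ by d."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)

def prob_superiority(d):
    """Chance that a random treated person outscores a random control."""
    return normal_cdf(d / math.sqrt(2.0))

for d in (0.2, 0.25, 0.5):
    print(f"d={d}: overlap={overlap(d):.1%}, superiority={prob_superiority(d):.1%}")
```

Under those assumptions a d of 0.2 leaves the two distributions overlapping by about 92%, and a d of 0.5 by about 80%. Put another way, at d = 0.5 a randomly chosen trained person beats a randomly chosen control roughly 64% of the time, against 50% for no effect at all.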

The authors of both studies discuss that the practical effect of brain training is a 2-7 year delay in certain aspects of cognitive decline. Again, not super huge. You are not turning a 70-year-old brain into that of a 30-year-old. Still, when I am 70 I will take any improvement I can get.

* Exercise and nutrition as global cognitive benefits

It is no secret that I am a huge fan of exercise and mindfulness meditation as ways to improve your brain. Many of the points in my impromptu post on running still ring true, and there is an increasing body of evidence that exercise, nutrition, meditation, sleep, and medication can improve the function of the brain. Dr. Merzenich accepted that these had positive influences, and mentioned them early in his talk. Still, he was somewhat dismissive of their ability to significantly improve cognition or aid in rehabilitation, saying that they formed part of a “complete package” of rehabilitation.

I took strong issue with this stance. He stated that exercise and nutrition provide “essentially no benefit” relative to brain training on the tasks that they had investigated. I have a hard time completely believing that, especially given the range of benefits that have been published in scientific literature.

* Warning against false prophets of brain training

Brain training has a huge buzz right now, and a number of startups have been founded to provide brain training tools. PositScience is not alone in a market that includes Lumosity, Cogmed, CogniFit, LearningRx, and many others. Which ones really work?

How many programs are just peddling mildly difficult games with an advertisement that it improves the brain? I sleep better at night knowing that Dr. Merzenich is the Chief Scientific Officer of PositScience, but how many companies do not have a solid scientist at the helm, or the research to back up their claims?

* Realtime neuroimaging techniques could play a role in the future

Brain training certainly has some effect. Further, there is very little downside to using brain training tools. The only possible negative I could see is the opportunity cost. We all have a finite amount of time during the day, and if you are in your back office engaged in brain training instead of exercising or interacting socially then you may be missing out on other ways to improve your cognitive function.

I am most excited about the market for realtime feedback on cognitive performance based on neuroimaging data. Christopher deCharms has been doing a bit of this using realtime fMRI to help control pain in patients. He founded a startup, Omneuron, to investigate this technique and capitalize on it. What I would like to see is a consumer-grade EEG system (like the Emotiv headgear) that integrates realtime brain state with brain training. I think that this is really the Holy Grail, because you are no longer basing the progressive difficulty of a task on the behavior of the trainee, but instead you could potentially increase difficulty based on the actual cognitive load.

Perhaps it is time for me to start up a company of my own…

Response Panel: A New Path to Brain Health

prefrontal — Mon, 08 Sep 2014 00:37:50 +0000

Cottage Rehabilitation Hospital is a local Santa Barbara institution dedicated to providing care for survivors of stroke, brain and spinal cord injury, orthopedic injury and other disabling conditions. To support this mission they have an annual presentation series titled “Empowerment through Medical Rehabilitation”. Each year they bring in amazing speakers to present on a topic related to medical rehabilitation. This year they are having Dr. Michael Merzenich present.

If you are anywhere close to Santa Barbara on Monday, September 22 I would highly encourage you to swing by and see Dr. Merzenich’s presentation. He is a world-renowned expert on brain plasticity, particularly in the context of cognitive rehabilitation. I have the honor of serving on a response panel following the lecture, which I am particularly looking forward to. It should be a really fun evening.

See the below web pages for more information on the event:
http://www.sbch.org/
http://www.eventbrite.com/

Hope to see you there!

Adiós Academia

prefrontal — Wed, 06 Nov 2013 07:57:02 +0000

You may have noticed that I haven’t posted anything of neuro-substance since, oh, January 2012.

Well, between you and me, there is a perfectly good reason why that is the case.

That was about the time I completed the paperwork to end my neuroscience career and begin a new adventure in private industry as a software engineer.

Wait, what?

The short version is that my wife and I have decided to make Santa Barbara, California our permanent home. Given that the postdoc is a temporary position usually lasting less than five years, the time was coming soon to move on up to being a professor or move on out of academia. After much deliberation, I have chosen the latter.

I opted not to announce the transition publicly for some time in case I changed my mind, or if the new position didn’t work out. I figured I could always head back to research as long as nobody noticed I had temporarily left. Now, here we are, almost two years later and I have neglected to let you, my closest internet-friends, in on the big news. How terrible!

Some important factors in the decision:

(a) Funding. There were plans afoot for me to make the transition from postdoctoral researcher to research faculty. This would have potentially kept the lights on in the Bennett household for many years to come. I easily could have enjoyed a long career focused uniquely on research in cognitive neuroscience. The only risk? Funding. Funding rates have been dropping year over year for decades. The average age at which a new investigator finally gets their first R01 grant has risen to 43+. The percent of funded research project grants has fallen from 32% in 1999 to 24% in 2007. The success rate for funding on your first grant submission is now 12%. [see Broken-Pipeline.pdf]

It’s not that I can’t handle the continuous funding chase, but it is a bitter struggle. The situation isn’t getting better either. Would you want to place a bet on whether success rates are going to fall below 20% or rise back to 30% as we move forward? Further, if a grant falls through, and there happens to be a gap in funding coverage, then my paycheck would simply cease to exist. Working for a private employer may be ‘at will’ employment, but the chances of a negative outcome may be far lower than rolling the dice with the NIH, NSF, or other funding agency every few years.

The lab that I was working in experienced just such a shortfall this last year as the federal sequester hit the campus. The lab was funded primarily by funds that, filtered through other agencies, came from the U.S. Army. When the sequester arrived, lab funding took a double-digit drop. My postdoc advisor was hard pressed to pay people’s salaries. Data acquisition stopped and everyone just had to hope that they had enough quality data to publish/graduate. That is a scary proposition.

(b) Lifestyle. Part and parcel of academia is the fact that everyone is constantly moving around. Your best friend today may or may not be in the same town with you next year. While this stabilizes a bit when/if you finally land that coveted assistant professor position, the time leading up to that point is defined by flittering around the country. First you head off to grad school, then off to a postdoc, then to who knows where. I know some individuals who are on their fourth city in ten years, with another move on the way. For some this is exhilarating, as you are essentially getting paid to see the world. For me it has been harder, constantly making incredible friends and then having to say goodbye. I now have friends stretching from Washington, to Maine, to Florida, and back to California. It was finally time to stop flying around and put down some roots.

The other lifestyle point is that my wife and I have really fallen in love with Santa Barbara, California. The weather is amazing. We are able to live two blocks from the beach and an eight minute drive to hiking in the mountains. I am able to commute by bike almost every day of the year. These geographical and climate features are fantastic, but beyond that is the fact that we have met some amazing people in Santa Barbara, and it just feels like home now. I don’t want to say goodbye, and now I don’t have to.

(c) Compensation. The postdoctoral researcher is seen as a training position where you are learning important new skills and scientific perspectives from the principal investigator that you work for. Based on the salary data I collected as part of my job search, postdocs are typically getting paid a (literal) fraction of their value in private industry. This, again, tends to get better when you become an assistant professor, but it can take a while to climb the ladder. I would have received a bit of a bump if I had gone into a research faculty position, but it paled in comparison to the, um, significant increase I negotiated as part of my new software development position.

Things aren’t all bad as a postdoc, however. I would be remiss if I didn’t highlight some incredible benefits of the old job:

(a) Schedule. Being able to set your own schedule is an incredible benefit of the postdoc. I am a nightowl by nature, so working 10am-6pm with another hour or two of reading in the late evening suited me perfectly. Working an 8-5pm job is not really my cup of tea. The flexibility to take time away from work was also fantastic as a postdoc. I had some genuinely awesome travel and volunteer experiences that I might have never experienced in a regular job, including trips to Hawaii and several excursions to Europe. Now I have to hoard my PTO days like they are bars of gold.

(b) People. I have met some of the most genuinely awesome people in my years as a researcher. There is something truly magical about a group of people who have similar interests and come together for a common cause. I found that special kinship in graduate school, my postdoc, and also with the larger community of cognitive neuroscientists. Even though I am leaving neuroscience professionally, I will always identify with this cadre of individuals who brand themselves ‘neuroscience geeks’. While there is some of this kinship in my new position, there is an overriding sense that many folks are just punching the clock.

(c) Academic Freedom. You are at the mercy of your postdoctoral advisor to a point, but at the end of the day you have an awesome job where, for all intents and purposes, you are getting paid to think. How absolutely freaking incredible is that? You know that one thing that you are most curious about? Here is some money to learn everything you can about that one thing and then here is some more money to find out the answers to questions we don’t even know to ask yet. I won’t lie to you – I miss it all. Frequently.

So, what am I doing now?

Well, I am working for a small scientific instrumentation company in Goleta, California that goes by the name [redacted]. Well, I could tell you their name, but I would need to obtain prior approval from my corporate overlords (no joke). I work as a firmware engineer in the software group, helping to maintain the embedded applications that drive the instruments. As a neuroscientist, the best part of my week was spending an entire afternoon writing a gnarly MATLAB script. Now, I get to do that every day. I could not have hoped for a better place to end up after falling out of the Ivory Tower.

I have had a really, really great run as a professional neuroscience researcher. Great friends. Great research. Amazing conferences. Deep debates on the nature of the brain and cognition. An Ig Nobel award! I credit any success I might have had to the wonderful mentors who took me under their wing, and to the long list of quality people I have encountered on my academic journey. I love cognitive neuroscience and scientific exploration more than I can adequately put into words. Still, I was not content to let the sunk costs of a graduate education dictate my future actions, especially when other options are now a better fit for my personal and professional goals.

~Craig [Prefrontal]

Quote of the Week – Cameron

prefrontal — Fri, 06 Jan 2012 22:06:20 +0000

“It would be nice if all of the data which sociologists require could be enumerated because then we could run them through IBM machines and draw charts as the economists do. However, not everything that can be counted counts, and not everything that counts can be counted.” – William Bruce Cameron, Informal Sociology: A Casual Introduction to Sociological Thinking (1963)

Hot, Hot iPhone Love (More Terrible Neuromarketing)

prefrontal — Mon, 03 Oct 2011 22:41:05 +0000

I hate being late to a party. You finally arrive after the festivities have begun and you know that your friends have already been there for hours, having a grand time doing what they do best. So it is with the latest neuromarketing debacle involving the New York Times and the pseudoscience that appeared on the op-ed page. All the best stuff has already been written.

Summary:

A branding consultant (Martin Lindstrom) commissions a neuromarketing company (MindSign) to do a neuroimaging study. Sixteen subjects underwent fMRI data acquisition while being shown audio and video of ringing iPhones. Visual and auditory cortex were active across all conditions. There was also activity in the insula. The authors interpret the sensory cortex activity as a kind of cross-modal synesthesia experience. The authors further interpret the insula activity as the subjects experiencing feelings of love and compassion. Headlines around the web ring loudly: “YOU LOVE YOUR iPHONE”.

Web points of interest:

1) Read the original op-ed piece by Martin Lindstrom to give yourself some context regarding what was said and the arguments that were made. It will probably make your skin crawl with tales of babies wanting cell phones to be iPhones and terrible definitions of synesthesia. Stick with it anyway.
http://www.nytimes.com/2011/10/01/opinion/you-love-your-iphone-literally.html

2) Start at Russ Poldrack’s weblog and read his first post on the topic. He called it crap, and he was being direct and truthful.
http://www.russpoldrack.org/2011/10/nyt-editorial-fmri-complete-crap.html

3) Now read Tal Yarkoni’s excellent in-depth discussion of the problem. If you read nothing else today, go and check this one out.
http://www.talyarkoni.org/blog/2011/10/01/the-new-york-times-blows-it-big-time-on-brain-imaging/

4) Next, read the post by Vaughan Bell at Mind Hacks, which is also a nice follow-up. Double points for using the term “facepalm jamboree”.
http://mindhacks.com/2011/10/02/the-new-york-times-wees-itself-in-public/

5) Finally, see the list of people who support Poldrack’s position on the Lindstrom article. Many of the best minds in neuroscience are agreed that the Op-Ed piece is not representative of good science:
http://www.russpoldrack.org/2011/10/signers-of-letter-to-editor-of-new-york.html

To be honest, I don’t have a whole lot to add to the conversation. On the topic of reverse inference you really can’t do better than Russ Poldrack and Tal Yarkoni. The Yarkoni blog post is particularly good, effectively nuking the Lindstrom piece from orbit. It is, in a way, poetic since Poldrack and Yarkoni are working on the databases and methods that will enable probabilities to be put on arguments such as Lindstrom’s. That is, if insula activation is observed how likely is it that the emotion of ‘love’ is being experienced. To give their technology a try surf on over to http://neurosynth.org/ and check it out.
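The reverse inference problem is easy to see once you write it down as Bayes' rule. The sketch below is only an illustration: all three input probabilities are invented, and the function is a plain textbook calculation, not the method Neurosynth actually uses.

```python
def posterior(p_act_given_state, p_state, p_act_given_other):
    """P(state | activation) from Bayes' rule."""
    p_act = p_act_given_state * p_state + p_act_given_other * (1.0 - p_state)
    return p_act_given_state * p_state / p_act

# All three inputs are made up for illustration, not taken from any database.
p = posterior(
    p_act_given_state=0.8,   # insula active when 'love' is experienced
    p_state=0.1,             # prior chance the task evokes 'love' at all
    p_act_given_other=0.3,   # insula active for any other reason
)
print(f"P(love | insula active) = {p:.2f}")
```

With these made-up numbers the posterior is only about 0.23, even though the insula lights up 80% of the time when "love" is present. The more often a region is active for other reasons, the less its activation tells you about any one mental state.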

One aspect of the debate that I am particularly interested in is the purported role of the insula in the experience of love and affection. Unfortunately, Lindstrom provided very little detail in terms of the spatial location of their insula activity, effectively preventing anyone from criticizing the work on that basis. But, for the sake of argument, let’s put the insular question forward. Does it matter where in the insula the activity was observed? The short answer is: absolutely.

There is an excellent paper by A. D. “Bud” Craig entitled “Forebrain emotional asymmetry: a neuroanatomical basis?” that details how the left and right insula have a different pattern of connectivity to the homeostatic afferents that provide information on our current body state. Craig describes how the right insula is preferentially involved in sympathetic nervous system activity geared toward engaging with the environment, energy use, and even “fight or flight” responses. Conversely, the left insula is preferentially involved in parasympathetic activity geared toward contentment, energy conservation, and “rest and digest” responses.

In our evolution, humans seem to have bolted on social components to this underlying insular emotional asymmetry. The right insula seems to be associated with the experience of social disgust and social avoidance. This has been seen in work such as the original Phillips et al. (1997) paper, showing prominent right anterior insula activity during disgust. The left insula seems to be associated with the experience of social compassion and social approach. There is less evidence for this, but meta-analyses such as Ortigue et al. (2010) have reported this pattern.

In short, leaving out which hemisphere the results occurred in is a huge faux pas on the part of Lindstrom. It is not the greatest sin of the piece, and probably not even the greatest sin of the insula argument. Still, it is certainly a prominent FAIL from the perspective of a researcher with an interest in the insula.

One final point of discussion I would like to raise is with regard to an earlier prefrontal.org post on the Seven Sins of Neuromarketing. Let’s see which ones are most prominent in the current discussion:

1) The curtain of proprietary analysis methods limits our knowledge of how effective neuromarketing can be.

We have no idea what methods Lindstrom and his colleagues used to arrive at their findings. It could be the best study in the history of ever, or it could be riddled with common statistical flaws. We have no idea because the work isn’t peer-reviewed. As before, we don’t even know where in the insula the results were located!

3) Most people’s introduction to neuromarketing is through press releases, not peer-reviewed studies.

Let’s just establish this as a rule: the New York Times editorial page is not the right place to introduce the world to your cutting-edge, unproven fMRI methods. Period. In fact, we should come up with a verb for what always happens afterward: you get Poldrack’d.

4) Neuromarketing methods are not immune to subjectivity and bias.

In a way, scientific claims are guilty until proven innocent by empirical evidence. Honestly, can I trust a man who has written books with titles like Buyology, Brandwashed, and Brand Sense to be objective with regard to a neuromarketing study with a sensational headline? If this work were peer-reviewed then we could evaluate his evidence in a balanced manner, but an Op-Ed piece does not allow for this luxury and leaves the question of bias open.

6) People are rushing the field to make a quick buck, and not everyone is trustworthy.

I think that this represents the case in point.

Ortigue S, Bianchi-Demicheli F, Patel N, Frum C, Lewis JW. (2010). Neuroimaging of love: fMRI meta-analysis evidence toward new perspectives in sexual medicine. J Sex Med. 7(11): 3541-3552.

Phillips ML, Young AW, Senior C, Brammer M, Andrew C, Calder AJ, Bullmore ET, Perrett DI, Rowland D, Williams SC, Gray JA, David AS. (1997). A specific neural substrate for perceiving facial expressions of disgust. Nature. 389(6650): 495-498.

Significant Differences

prefrontal — Mon, 26 Sep 2011 20:07:45 +0000

One of the first things you learn in an introductory psychology class is the topic of cognitive bias. These are situations or contexts in which human beings cannot reliably make effective judgements or discriminations. For instance, information that tends to confirm our own assumptions is generally judged to be correct (Confirmation Bias). Another example is the disproportionate attention given to negative experiences relative to positive experiences (Negativity Bias). In each situation perception and decision making is distorted even though we should know better. It may be the case that we need to come up with a new bias to explain investigator behavior. Significance Bias anyone?

There is a great article by Nieuwenhuis, Forstmann, and Wagenmakers in this month’s edition of Nature Neuroscience. Entitled “Erroneous analyses of interactions in neuroscience: a problem of significance”, the paper discusses the problem of how to gauge when two effects differ in neuroscience. It turns out that many papers misjudge the difference between effects by basing their judgement on significance values, even though they should know better.

The crux of the issue is that it is improper to judge the difference between two effects by looking at their relative significance. Finding that one effect is significant (i.e. p < 0.05) while another is not (i.e. p > 0.05) does not necessarily mean that the two effects are themselves significantly different. You have to explicitly test for that.
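
A quick worked example shows how this plays out. The sketch below uses illustrative numbers (two effect estimates of 25 and 10, each with a standard error of 10) and assumes the two estimates are independent and approximately normal. The function names are my own.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def compare_effects(est_a, se_a, est_b, se_b):
    """Explicitly test the difference between two independent estimates."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # independence assumed
    return diff, se_diff, two_sided_p(diff / se_diff)

# Illustrative numbers only.
p_a = two_sided_p(25.0 / 10.0)   # effect A on its own
p_b = two_sided_p(10.0 / 10.0)   # effect B on its own
diff, se_diff, p_diff = compare_effects(25.0, 10.0, 10.0, 10.0)

print(f"Effect A: p = {p_a:.3f}")
print(f"Effect B: p = {p_b:.3f}")
print(f"A minus B: {diff:.0f} +/- {se_diff:.1f}, p = {p_diff:.3f}")
```

Effect A clears the 0.05 threshold and effect B does not, yet the direct test of A minus B is nowhere near significant.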

In fMRI, this could mean relating one brain area that is significant to another brain area that is not significant. The temptation is to discuss the significant region as being more active than the nonsignificant region based on the fact that the latter region was below the significance threshold. This actually may or may not be the case.

Andrew Gelman and Hal Stern wrote a similar article on the problem a few years ago. The focus of their piece was simply to draw attention to the issue through the use of several theoretical and real life examples. While they were able to say that the problem existed, they were unable to say how prevalent the problem was across any particular scientific discipline. The power of the Nieuwenhuis, Forstmann, and Wagenmakers paper is that it extends the Gelman & Stern work through an analysis of the existing literature to put concrete numbers on how widespread the problem is in neuroscience.

The authors conducted a survey of 513 articles in major neuroscience journals. They identified 157 papers containing an analysis where the authors would be tempted to make an inferential error by focusing on significance. They found that in 78 out of 157 cases (50%) the authors did indeed make an error. That is far higher than I would have guessed, and one of the reasons I felt compelled to write about it today. I mean, come on, fifty percent? Really?

In the next to last paragraph the authors specifically state that the error of comparing significance levels is particularly acute in neuroimaging. From my perspective we are almost set up for failure in this regard, as significant regions are visualized as a range of attention-grabbing colors while regions that are not significant are visualized as completely blank.

I could rail on a bit longer, but that is time you could be using to go and read this article. There is a lot of good information in the text – it is short, punchy, and well worth your time.

Some additional discussion on the topic:
http://andrewgelman.com/2011/09/the-difference-between-significant-and-not-significant/

Gelman A and Stern H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician 60(4), 328-331.

Nieuwenhuis S, Forstmann BU, and Wagenmakers EJ. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience 14, 1105-1107.

Neuromarketing Debate, May 23rd

prefrontal — Sat, 21 May 2011 20:49:11 +0000

Do you feel like neuromarketing is a disruptive new technology, or just another example of neurohype? Regardless of where you stand on the issue you might be interested in a debate I will be participating in next Monday, the 23rd of May, at Stanford Medical School.

The Stanford Interdisciplinary Group on Neuroscience and Society (SIGNS) is hosting the debate, which is focused on neuroscience in the marketplace. Jim Sullivan, the CEO of NeuroSky, Uma Karmarkar from the Stanford Graduate School of Business, and I will all weigh in on the topic of whether neuroscience is being used to manipulate consumers.

I think you might already know where I stand based on my ‘Seven Sins’ neuromarketing post, but the event promises to be a lively affair with a diverse array of perspectives. Come check it out if you are in the Bay Area next week!

Grab some details on the event, or check out the event poster for more information.

The Seven Sins of Neuromarketing

prefrontal — Sat, 23 Apr 2011 00:59:44 +0000

I got quoted in a random neuromarketing article recently. In the flurry of people I have been chatting with about statistics and functional neuroimaging I often neglect to ask what organizations people are associated with. In this case it was Forbes magazine.

http://www.forbes.com/forbes/2009/1116/marketing-hyundai-neurofocus-brain-waves-battle-for-the-brain.html

In the online version of the article there was a user comment from a neuromarketing company CEO defending the honor of his business and the field in which they operate. He went so far as to compare the launch of neuromarketing with the initial steps of market research in the early 20th century. He further argued that neuromarketing would bring about the next revolution in understanding consumer behavior.

I have to admit, my gut reaction on first reading this statement was one of mild disgust. This got me thinking about why neuromarketing hangs in a cloud of disdain among many scientists. Below are some of the ‘sins’ which I feel currently plague the field of neuromarketing. This is all just my opinion of course, but I do think that it raises some interesting points for discussion.

1) The curtain of proprietary analysis methods limits our knowledge of how effective neuromarketing can be.

Neuromarketing seems to be primarily driven by private industry, not academia. This is not to say that research into consumer behavior has not occurred at the university level. There has been a lot of good neuroeconomics research in the last several years. Still, it is mostly companies in private industry that are driving the application of these findings to practical consumer behaviors. Because these companies are in competition with each other they are reluctant to give others the recipe to their secret analysis sauce. From the outside this means that the analysis pipeline of every neuromarketing company is a black box, with data going in one end and the results-you-need coming out the other.

My colleagues and I hold the position that fMRI research utilizing incorrect statistics can generate a large number of false positives. That is, many of the results will be there simply because of noise. Because so much of the current neuromarketing data is hidden behind the veil of proprietary analysis methods it is impossible to judge how successful their methods actually are, and to what degree their findings are false positives.

2) There is little peer-reviewed literature that is specific to neuromarketing.

Neuromarketing is an emerging discipline that will, in time, give us new insight into human behavior. Unfortunately, little peer-reviewed research has currently been published in this area. Search for ‘neuromarketing’ in the PubMed database of abstracts (www.pubmed.com) and you will find all of ten publications. This must change for neuromarketing to mature.

Again, without peer-reviewed results on the effectiveness of neuromarketing experiments all we have to rely on are self-reports from the neuromarketing firms themselves. An issue similar to the file-drawer problem then exists. The file-drawer problem is when only positive results get published in journals while negative results sit unpublished in the file drawer. Neuromarketing companies will be likely to report positive results while negative results sit undistributed. Either way, the end result is a biased understanding.

3) Most people’s introduction to neuromarketing is through press releases, not peer-reviewed studies.

In 2006 there was an “instant-science” article released online by Marco Iacoboni et al. revealing their analysis of fMRI data obtained while subjects were watching Super Bowl advertisements. The much-discussed post, entitled “Who Really Won the Super Bowl?”, tried to determine the most effective commercial by judging which one activated regions involved in reward and empathy to the greatest degree. They determined that a commercial from Disney fared the best when evaluated by these measures. Many neuroscientists shook their heads and moved on.

In 2007 the same group published an op-ed in the New York Times detailing the results of scanning 20 individuals while they looked at pictures and videos of leading political candidates. They drew conclusions on candidate evaluations by examining activity in areas like the amygdala and anterior cingulate. For example, they concluded that amygdala activity indicated a state of anxiety and cingulate activity indicated cognitive conflict. These oversimplifications were so well publicized and widely distributed that a number of leading neuroscientists were compelled to publish a letter in the New York Times calling the Iacoboni results into question.

Let’s put it this way: when many of the top minds in neuroimaging feel compelled to assemble a letter to the New York Times regarding your non-peer-reviewed neuromarketing/neuropolitics results then the field has a problem.

There are a handful of peer-reviewed neuromarketing papers that do deliver. One recent paper by Michael Schaefer was a very interesting investigation into the representation of brand associations. However, these types of studies are rare, and the signal-to-noise ratio of information in the press remains very low.

4) Neuromarketing methods are not immune to subjectivity and bias.

One of the most highly touted aspects of neuromarketing methods is that they are free from subjectivity and bias on the part of the participant. For example, asking a subject what they thought of a particular brand introduces the muddying waters of conscious consideration. The person’s response will be colored by a complex web of tangential cognitive factors and contextual biases. The promise of neuromarketing is that you can bypass these confounding factors to get at the heart of the matter – the real representation of the brand. While this is true to a degree, an entirely new set of confounding factors is introduced during the analysis of neuromarketing data.

While many neuromarketing measures are indeed more objective than verbal reports, I must disagree with the observation that they are unfiltered, true reports of the underlying representation. While the signals are not filtered by the consciousness of the research subject, a great deal of manipulation and filtering of the data is done by the researcher. This does introduce the potential for bias, simply by a different avenue.

Small changes in processing pipelines can have a huge impact on the power of fMRI to detect relevant signals. Some excellent papers by Stephen Strother come to mind with regard to this point. With no knowledge of what is going on we have no idea how objective the analyses by these companies can be.

5) The value per dollar of neuromarketing methods has yet to be determined.

Neuromarketing studies are expensive. The Forbes article says that an average EEG or fMRI marketing study costs in the neighborhood of $50,000. Immediately this number can trigger a ‘more expensive = better’ response, especially if you have a large budget to support such studies. What rarely gets discussed is what kind of value you obtain in return for the huge amount of money that is spent.

The key question in neuromarketing is what information you can get with EEG / fMRI / eye tracking / biometrics that you cannot obtain using other methods. If I can spend $1000 to do a traditional market study that gets me 85% of what a $50,000 fMRI study does then the return on my neuromarketing investment is not great. Thinking about it another way, how much less or more could I get across 50 traditional studies relative to the value of one neuromarketing study?

Many companies are not limited by the extreme cost of neuromarketing studies, and a significant fraction of them are not afraid to take the risk to try something new. Perhaps part of the motivation is also the fear of being left behind – that a competitor will take the risk and gain a competitive advantage in consumer understanding. Whatever the motivation, there will always be a market for neuromarketing methods. Still, we must acknowledge that the value of neuromarketing is an open question.

6) People are rushing the field to make a quick buck, and not everyone is trustworthy.

The emergence of neuromarketing represents a modern day gold rush in terms of buzz and promises. Brilliant researchers will be attracted to this opportunity and will significantly advance the field of neuromarketing. Morally questionable individuals will also be drawn to the opportunity, and will end up giving the field a black eye. Reputations will build up over time and trustworthy companies will emerge from the fray, but the current situation is more akin to the wild west than a civilized exchange.

7) The true value of neuromarketing is obscured by the above-mentioned problems.

I thought I would end on a high note. There is certainly significant value to using neuromarketing methods in consumer research. Why else would companies like Nielsen Holdings be investing in neuromarketing firms like NeuroFocus? One of the biggest problems is that the true value of these methods is obscured by those who treat it as a gimmick and have the loudest voice. The next ten years will represent a true shakedown of the neuromarketing industry. Companies that are able to provide real value to their customers will live on while those who simply seek to make pretty pictures will fall by the wayside. It will be a fascinating time to be an observer of the business and politics in this emerging field.

Conclusions.

The above points ignore many other issues facing neuromarketing. I have completely bypassed a discussion of the ethics of neuromarketing. Many people worry that technologies like fMRI will help marketers find the ‘buy button’ in the brain, stripping away people’s free will in product choice. I am not terribly worried about that discussion, perhaps because I am ignoring the problem or perhaps because I know too much about brain function or neuroimaging methods. Regardless, there are other issues and hurdles that neuromarketing must address to grow as a field.

In the end I do wish neuromarketing great success. I simply fear that those individuals who are seeking to profit on the popularity will tarnish the reputation of neuromarketing before it is able to legitimize itself.

PAPER: An Argument For Proper Multiple Comparisons Correction

prefrontal — Wed, 03 Nov 2010 22:39:49 +0000

It has been a long road, but our multiple comparisons paper including the salmon has been published. See below for more details, including the abstract and a link to the download page of the journal. If you have any questions or comments please post them below or send me an email directly.

– – – – – – – – – – – – – – – – – – – – – – – – – – –

Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction

Craig M. Bennett(1), Abigail A. Baird(2), Michael B. Miller(1) and George L. Wolford(3)
1)Department of Psychology, University of California at Santa Barbara, Santa Barbara, CA 93106
2)Department of Psychology, Blodgett Hall, Vassar College, Poughkeepsie, NY 12604
3)Department of Psychological and Brain Sciences, Moore Hall, Dartmouth College, Hanover, NH 03755

Journal of Serendipitous and Unexpected Results, 2010. 1(1):1-5
Early Access: Oct 20, 2010

With the extreme dimensionality of functional neuroimaging data comes extreme risk for false positives. Across the 130,000 voxels in a typical fMRI volume the probability of at least one false positive is almost certain. Proper correction for multiple comparisons should be completed during the analysis of these datasets, but is often ignored by investigators. To highlight the danger of this practice we completed an fMRI scanning session with a post-mortem Atlantic Salmon as the subject. The salmon was shown the same social perspective-taking task that was later administered to a group of human subjects. Statistics that were uncorrected for multiple comparisons showed active voxel clusters in the salmon’s brain cavity and spinal column. Statistics controlling for the family-wise error rate (FWER) and false discovery rate (FDR) both indicated that no active voxels were present, even at relaxed statistical thresholds. We argue that relying on standard statistical thresholds (p < 0.001) and low minimum cluster sizes (k > 8) is an ineffective control for multiple comparisons. We further argue that the vast majority of fMRI studies should be utilizing proper multiple comparisons correction as standard practice when thresholding their data.
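
The arithmetic behind the abstract is easy to check. This is a rough sketch that treats every voxel as an independent test, which real fMRI data are not, so take the numbers as ballpark figures. The variable names are my own, and the Bonferroni threshold is shown as one simple family-wise correction. It is not necessarily the specific correction used in the paper.

```python
n_voxels = 130_000      # voxels in a typical fMRI volume, from the abstract
alpha = 0.001           # common uncorrected voxelwise threshold

# Expected number of voxels passing threshold by chance alone.
expected_false_positives = n_voxels * alpha

# Probability of at least one false positive, assuming independent tests.
p_at_least_one = 1.0 - (1.0 - alpha) ** n_voxels

# Per-voxel threshold needed to hold the family-wise error rate at 0.05
# using a Bonferroni correction.
bonferroni_threshold = 0.05 / n_voxels

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one false positive): {p_at_least_one:.6f}")
print(f"Bonferroni per-voxel threshold: {bonferroni_threshold:.2e}")
```

Even at p < 0.001, chance alone is expected to hand you on the order of a hundred "active" voxels in a volume of this size.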

Download a PDF of the article here:
http://prefrontal.org/files/papers/Bennett-Salmon-2010.pdf

Riverside Presentation Slides

prefrontal — Thu, 28 Oct 2010 07:28:14 +0000

Just wanted to take a second to thank the kind folks in the Psychology Department at UC Riverside for hosting me this afternoon. I gave a neuroimaging stats talk for their cognitive brown bag series, and it was a really great time!

For anyone who is interested a copy of the slides from my presentation can be downloaded at the link below. If you have any questions or comments feel free to email me – I would love to chat more. Take care UCR!

http://prefrontal.org/files/presentations/Bennett-Riverside-2010.pdf

Spring 2010 Conference Posters

prefrontal — Wed, 20 Oct 2010 23:26:13 +0000

I have been remiss in uploading copies of my spring conference posters. October seems like a fine month to rectify that. Below are links to the research I presented at the Cognitive Neuroscience Society meeting in Montreal and at the Organization for Human Brain Mapping meeting in Barcelona. Both meetings were fantastic – I got to meet a lot of new people and experience all the awesomeness that Montreal and Barcelona have to offer.

* How reliable are the results from fMRI?
Conference Poster: [PDF] [JPEG]

* A device for the parametric application of thermal and tactile stimulation during fMRI
Conference Poster: [PDF] [JPEG]

APS Conference – Presentation Slides

admin — Sat, 29 May 2010 19:34:17 +0000

I have wanted to attend the Association for Psychological Science annual convention for a number of years, but I was always frustrated by the number of other conferences I had to attend during the spring. All that changed early this year when I was offered the opportunity to give a presentation on interoceptive development. I suddenly had a very good reason to free up some time and hop on a plane!

I want to thank everyone who attended my address this morning. After untold amounts of airline trouble getting to Boston it was a real pleasure to have the chance to talk about the insula and interoceptive development.

If you are interested you can download a copy of my presentation slides here. Send me an email if you have any questions or comments. Thanks!

PAPER: How reliable are the results from functional magnetic resonance imaging?

prefrontal — Fri, 26 Feb 2010 05:02:51 +0000

– Current Citation:
Bennett CM, Miller MB. (in press). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences.

– Abstract:
Functional magnetic resonance imaging is one of the most important methods for in vivo investigation of cognitive processes in the human brain. Within the last two decades an explosion of research has emerged using fMRI, revealing the underpinnings of everything from motor and sensory processes to the foundations of social cognition. While these results have revealed the potential of neuroimaging, important questions regarding the reliability of these results remain unanswered. In this chapter we take a close look at what is currently known about the reliability of fMRI findings. First, we examine the many factors that influence the quality of acquired fMRI data. We also conduct a review of the existing literature to determine if some measure of agreement has emerged regarding the reliability of fMRI. Finally, we provide commentary on ways to improve fMRI reliability and what questions remain unanswered. Reliability is the foundation on which scientific investigation is based. How reliable are the results from fMRI?

– Downloadable Versions:
[Manuscript PDF]
[Link to Journal PDF]

– Full Text:

Reliability is the cornerstone of any scientific enterprise. Issues of research validity and significance are relatively meaningless if the results of our experiments are not trustworthy. It is the case that reliability can vary greatly depending on the tools being used and what is being measured. Therefore, it is imperative that any scientific endeavor be aware of the reliability of its measurements.

Surprisingly, most fMRI researchers have only a vague idea of how reliable their results are. Reliability is not a typical topic of conversation between most investigators and only a small fraction of papers investigating fMRI reliability have been published. This became an important issue in 2009 as a paper by Vul, Harris, Winkielman, and Pashler set the stage for debate (2009). Their paper, originally entitled “Voodoo Correlations in Social Neuroscience”, was focused on a statistical problem known as the ‘non-independence error’. Critical to their argument was the reliability of functional imaging results. Vul et al. argued that test-retest variability of fMRI results placed an ‘upper bound’ on the strength of possible correlations between fMRI data and behavioral measures:

$$r_{\mathrm{Observed}A,\,\mathrm{Observed}B} = r_{A,B} \times \sqrt{\mathrm{reliability}_A \times \mathrm{reliability}_B}$$

This calculation reflects that the strength of a correlation between two measures is a product of the measured relationship and the reliability of the measurements (Nunnally, 1970; Vul et al., 2009). Vul et al. specified that behavioral measures of personality and emotion have a reliability of around 0.8 and that fMRI results have a reliability of around 0.7. Not everyone agreed. Across several written exchanges multiple research groups debated what the “actual reliability” of fMRI was. Jabbi et al. stated that the reliability of fMRI could be as high as 0.98 (2009). Lieberman et al. split the difference and argued that fMRI reliability was likely around 0.90 (2009). While much ink was spilled debating the reliability of fMRI results, very little consensus was reached regarding an appropriate approximation of its value.
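
To see what these competing reliability estimates imply, here is a minimal sketch of the upper bound calculation. It assumes the standard attenuation relationship, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The function name is my own, and the true correlation is set to a perfect 1.0 to obtain the ceiling.

```python
import math

def max_observable_correlation(reliability_a, reliability_b, true_r=1.0):
    """Observed correlation implied by a true correlation and two reliabilities."""
    return true_r * math.sqrt(reliability_a * reliability_b)

behavioral_reliability = 0.8  # value specified by Vul et al.

# fMRI reliability estimates quoted in the exchange above.
fmri_estimates = {
    "Vul et al.": 0.70,
    "Lieberman et al.": 0.90,
    "Jabbi et al.": 0.98,
}

bounds = {
    source: max_observable_correlation(behavioral_reliability, fmri_reliability)
    for source, fmri_reliability in fmri_estimates.items()
}

for source, bound in bounds.items():
    print(f"{source}: maximum observable r = {bound:.2f}")
```

The ceiling moves from roughly 0.75 to roughly 0.89 depending on whose reliability estimate you accept, which is why the disagreement mattered so much to the debate.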

The difficulty of detecting signal (what we are trying to measure) from amongst a sea of noise (everything else we don’t care about) is a constant struggle for all scientists. It influences what effects can be examined and is directly tied to the reliability of research results. What follows in this chapter is a multifaceted examination of fMRI reliability. We examine why reliability is a critical metric of fMRI data, discuss what factors influence the quality of the blood oxygen level dependent (BOLD) signal, and investigate the existing reliability literature to determine if some measure of agreement has emerged across studies. Fundamentally, there is one critical question that this chapter seeks to address: if you repeat your fMRI experiment, what is the likelihood you will get the same result?

Pragmatics of Reliability

Why worry about reliability at all? As long as investigators are following accepted statistical practices and being conservative in the generation of their results, why should the field be bothered with how reproducible the results might be? There are, at least, four primary reasons why test-retest reliability should be a concern for all fMRI researchers.

Scientific truth. While it is a simple statement that can be taken straight out of an undergraduate research methods course, an important point must be made about reliability in research studies: it is the foundation on which scientific knowledge is based. Without reliable, reproducible results no study can effectively contribute to scientific knowledge. After all, if a researcher obtains a different set of results today than they did yesterday, what has really been discovered? To ensure the long-term success of functional neuroimaging it is critical to investigate the many sources of variability that impact reliability. It is a strong statement, but if results do not generalize from one set of subjects to another or from one scanner to another then the findings are of little value scientifically.

Clinical and Diagnostic Applications. The longitudinal assessment of changes in regional brain activity is becoming increasingly important for the diagnosis and treatment of clinical disorders. One potential use of fMRI is for the localization of specific cognitive functions before surgery. A good example is the localization of language function prior to tissue resection for epilepsy treatment (Fernandez et al., 2003). This is truly a case where an investigator does not want a slightly different result each time they conduct the scan. If fMRI is to be used for surgical planning or clinical diagnostics then any issues of reliability must be quantified and addressed.

Evidentiary Applications. The results from functional imaging are increasingly being submitted as evidence into the United States legal system. For example, results from a commercial company called No Lie MRI (San Diego, CA; http://www.noliemri.com/) were introduced into a juvenile sex abuse case in San Diego during the spring of 2009. The defense was attempting to introduce the fMRI results as scientific justification of their client’s claim of innocence. A concerted effort from imaging scientists, including in-person testimony from Marc Raichle, eventually forced the defense to withdraw the request. While the fMRI results never made it into this case, it is clear that fMRI evidence will be increasingly common in the courtroom. What are the larger implications if the reliability of this evidence is not as trustworthy as we assume?

Scientific Collaboration. A final pragmatic dimension of fMRI reliability is the ability to share data between researchers. This is already a difficult challenge, as each scanner has its own unique sources of error that become part of the data (Jovicich et al., 2006). Early evidence has indicated that the results from a standard cognitive task can be quite similar across scanners (Casey et al., 1998; Friedman et al., 2008). Still, concordance of results remains an issue that must be addressed for large-scale, collaborative inter-center investigations. The ultimate level of reliability is the reproducibility of results from any equivalent scanner around the world and the ability to integrate this data into larger investigations.

– What Factors Influence fMRI Reliability? –

The ability of fMRI to detect meaningful signals is limited by a number of factors that add error to each measurement. Some of these factors include thermal noise, system noise in the scanner, physiological noise from the subject, non-task related cognitive processes, and changes in cognitive strategy over time (Huettel et al., 2008; Kruger and Glover, 2001). The concept of reliability is, at its core, a representation of the ability to routinely detect relevant signals from this background of meaningless noise. If a voxel timeseries contains a large amount of signal then the primary sources of variability are actual changes in blood flow related to neural activity within the brain. Conversely, in a voxel containing a large amount of noise the measurements are dominated by error and would not contain meaningful information. By increasing the amount of signal, or decreasing the amount of noise, a researcher can effectively increase the quality and reliability of acquired data.

The quality of data in magnetic resonance imaging is typically measured using the signal-to-noise ratio (SNR) of the acquired images. The goal is to maximize this ratio. Two kinds of SNR are important for functional MRI. The first is the image SNR. It is related to the quality of data acquired in a single fMRI volume. Image SNR is typically computed as the mean signal value of all voxels divided by the standard deviation of all voxels in a single image:

$$\mathrm{SNR}_{image} = \frac{\mu_{image}}{\sigma_{image}}$$

Increasing the image SNR will improve the quality of data at a single point in time. However, most important for functional neuroimaging is the amount of signal present in the data across time. This makes the temporal SNR (tSNR) perhaps the most important metric of data for functional MRI. It represents the signal-to-noise ratio of the timeseries at each voxel:

$$\mathrm{tSNR} = \frac{\mu_{timeseries}}{\sigma_{timeseries}}$$

The tSNR is not the same across all voxels in the brain. Some regions will have higher or lower tSNR depending on location and constitution. For example, there are documented differences in tSNR between gray matter and white matter (Bodurka et al., 2005). The typical tSNR of fMRI can also vary depending on the same factors that influence image SNR.
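As a rough illustration of how tSNR can differ from voxel to voxel, the sketch below computes it for two made-up timeseries. It assumes tSNR is the temporal mean divided by the sample standard deviation; the voxel names and intensity values are hypothetical and are not taken from any dataset discussed here.

```python
import statistics


def tsnr(timeseries):
    """Temporal SNR of one voxel: mean over time divided by SD over time."""
    return statistics.mean(timeseries) / statistics.stdev(timeseries)


# Hypothetical signal intensities for two voxels across five volumes.
voxels = {
    "stable_voxel": [100.0, 102.0, 98.0, 101.0, 99.0],
    "noisy_voxel": [100.0, 110.0, 90.0, 105.0, 95.0],
}

# One tSNR value per voxel, as in a whole-brain tSNR map.
tsnr_map = {name: tsnr(ts) for name, ts in voxels.items()}

for name, value in tsnr_map.items():
    print(f"{name}: tSNR = {value:.1f}")
```

Both voxels have the same mean intensity, so the difference in tSNR comes entirely from how much each one fluctuates over time.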

Another metric of data quality is the contrast-to-noise ratio (CNR). This refers to the ability to maximize differences between signal intensity in different areas in an image (image CNR) or to maximize differences between different points in time (temporal CNR). With regard to functional neuroimaging, the temporal CNR represents the maximum relative difference in signal intensity that is represented within a single voxel. In a voxel with low CNR there would be very little difference between two conditions of interest. Conversely, in a voxel with high CNR there would be relatively large differences between two conditions of interest. The image CNR is not critical to fMRI, but having a high temporal CNR is very important for detecting task effects.

It is generally accepted that fMRI is a rather noisy measurement with a characteristically low tSNR, requiring extensive signal averaging to achieve effective signal detection (Murphy et al., 2007). The following sections provide greater detail on the influence of specific factors on the SNR/tSNR of functional MRI data. We break these factors down by the influence of differences in image acquisition, the image analysis pipeline, and the contribution of the subjects themselves.

SNR influences of MRI acquisition

The typical high-field MRI scanner is a precision superconducting device constructed to very exact manufacturing tolerances. Still, the images it produces can be somewhat variable depending on a number of hardware and software variables. With regard to hardware, one well-known influence on the signal-to-noise ratio of MRI is the strength of the primary B0 magnetic field (Bandettini et al., 1994; Ogawa et al., 1993). Doubling this field, such as moving from 1.5 Tesla to a 3.0 Tesla field strength, can theoretically double the SNR of the data. The B0 field strength is especially important for fMRI, which relies on magnetic susceptibility effects to create the blood oxygen level dependent (BOLD) signal (Turner et al., 1993). Hoenig et al. showed that, relative to a 1.5 Tesla magnet, a 3.0 Tesla fMRI acquisition had 60-80% more significant voxels (2005). They also demonstrated that the CNR of the results was 1.3 times higher than those obtained at 1.5 Tesla. The strength and slew rate of the gradient magnets can have a similar impact on SNR. Advances in head coil design are also notable, as parallel acquisition head coils have increased radiofrequency reception sensitivity.

It is important to note that there are negative aspects of higher field strength as well. Artifacts due to physiological effects and susceptibility are all increasingly pronounced at higher fields. The increased contribution of physiological noise reduces the expected gains in SNR at high field (Kruger and Glover, 2001). The increasing contribution of susceptibility artifacts can virtually wipe out areas of orbital prefrontal cortex and inferior temporal cortex (Jezzard and Clare, 1999). Also, in terms of tSNR there are diminishing returns with each step up in B0 field strength. At typical fMRI spatial resolution values tSNR approaches an asymptotic limit between 3 Tesla and 7 Tesla (Kruger and Glover, 2001; Triantafyllou et al., 2005).

Looking beyond the scanner hardware, the parameters of the fMRI acquisition can also have a significant impact on the SNR/CNR of the final images. For example, small changes in the voxel size of a sequence can dramatically alter the final SNR. Moving from 1.5 mm isotropic voxels to 3.0 mm isotropic voxels increases the voxel volume eightfold and can potentially increase the acquisition SNR by a factor of eight, but at a cost of spatial resolution. Some other acquisition variables that will influence the acquired SNR/CNR are: repetition time (TR), echo time (TE), flip angle, bandwidth, slice gap, and k-space trajectory. For example, Moser et al. found that optimizing the flip angle of their acquisition could approximately double the SNR of their data in a visual stimulation task (1996). Further, the effect of each parameter varies according to the field strength of the magnet (Triantafyllou et al., 2005). The optimal parameter set for a 3 Tesla system may not be optimal with a 7 Tesla system.
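The factor of eight follows from simple geometry if the voxel sizes are read as edge lengths of isotropic voxels and if SNR is assumed to scale in proportion to voxel volume. Both are assumptions made for this sketch, and the function name is hypothetical.

```python
def voxel_volume_mm3(edge_mm):
    """Volume of an isotropic voxel with the given edge length in mm."""
    return edge_mm ** 3


small = voxel_volume_mm3(1.5)  # volume of a 1.5 mm isotropic voxel
large = voxel_volume_mm3(3.0)  # volume of a 3.0 mm isotropic voxel

# Under the proportionality assumption, this ratio is also the SNR gain.
volume_ratio = large / small
print("Volume ratio:", volume_ratio)
```

Doubling the edge length doubles each of the three dimensions, which is why the gain is 2 x 2 x 2 rather than 2.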

The ugly truth is that any number of factors in the control room or magnet suite can increase noise in the images. A famous example from one imaging center was when the broken filament from a light bulb in a distant corner of the magnet suite started causing visible sinusoidal striations in the acquired EPI images. This is an extreme example, but it makes the point that the scanner is a precision device that is designed to operate in a narrow set of well-defined circumstances. Any deviation from those circumstances will increase noise, thereby reducing SNR and reliability.

SNR considerations of analysis methods

The methods used to analyze fMRI data will affect the reliability of the final results. In particular, those steps taken to reduce known sources of error are critical to increasing the final SNR/CNR of preprocessed images. For example, spatial realignment of the EPI data can have a dramatic effect on lowering movement-related variance and has become a standard part of fMRI preprocessing (Oakes et al., 2005; Zhilkin and Alexander, 2004). Recent algorithms can also help remove remaining signal variability due to magnetic susceptibility induced by movement (Andersson et al., 2001). Temporal filtering of the EPI timeseries can reduce undesired sources of noise by frequency. The use of a high-pass filter is a common method to remove low-frequency noise, such as signal drift due to the scanner (Kiebel and Holmes, 2007). Spatial smoothing of the data can also improve the SNR/CNR of an image. There is some measure of random noise added to the true signal of each voxel during acquisition. Smoothing across voxels can help to average out error across the area of the smoothing filter (Mikl et al., 2008). It can also help account for local differences in anatomy across subjects. Smoothing is most often done using a Gaussian kernel of approximately 6-12 mm FWHM.
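To make the smoothing step concrete, here is a minimal one-dimensional sketch. It relies on the standard relation between a Gaussian's full width at half maximum and its standard deviation, FWHM = 2 * sqrt(2 * ln 2) * sigma. The function names, the 8 mm kernel, the 3 mm voxel size, and the toy signal are all assumptions for illustration; real pipelines smooth in three dimensions with their own implementations.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian FWHM to its standard deviation (same units)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def gaussian_kernel(fwhm_mm, voxel_mm):
    """Normalized 1-D Gaussian weights sampled at voxel centers."""
    sigma_vox = fwhm_to_sigma(fwhm_mm) / voxel_mm
    radius = math.ceil(3 * sigma_vox)  # truncate at about 3 sigma
    weights = [math.exp(-0.5 * (d / sigma_vox) ** 2)
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(values, kernel):
    """Weighted average of neighbors; weights are renormalized at the edges."""
    half = len(kernel) // 2
    smoothed = []
    for i in range(len(values)):
        acc, wsum = 0.0, 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(values):
                acc += w * values[idx]
                wsum += w
        smoothed.append(acc / wsum)
    return smoothed


kernel = gaussian_kernel(fwhm_mm=8.0, voxel_mm=3.0)

# Hypothetical row of voxels: a flat signal of 10 with one noisy outlier.
row = [10.0] * 11
row[5] = 20.0
smoothed_row = smooth_1d(row, kernel)
print([round(v, 2) for v in smoothed_row])
```

The outlier is pulled toward its neighbors, which is the averaging effect described above. The same averaging also blurs real signal, so kernel width is a trade-off rather than a free gain.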

There has been some degree of standardization regarding preprocessing and statistical approaches in fMRI. For instance, Mumford and Nichols found that approximately 92% of group fMRI results were computed using an ordinary least squares (OLS) estimation of the general linear model (2009). Comparison studies with carefully standardized processing procedures have shown that the output of standard software packages can be very similar (Gold et al., 1998; Morgan et al., 2007). However, in actual practice the diversity of tools and approaches in fMRI increases the variability between sets of results. The functional imaging analysis contest (FIAC) in 2005 demonstrated that prominent differences existed between fMRI results generated by different groups using the same original dataset. On reviewing the results the organizers concluded that brain regions exhibiting robust signal changes could be quite similar across analysis techniques, but the detection of areas with lower signal was highly variable (Poline et al., 2006). It remains the case that decisions made by the researcher regarding how to analyze the data will impact what results are found.

Strother et al. have done a great deal of research into the influence of image processing pipelines using a predictive modeling framework (2004; 2002; Zhang et al., 2009). They found that small changes in the processing pipeline of fMRI images have a dramatic impact on the final statistics derived from that data. Some steps, such as slice timing correction, were found to have little influence on the results from experiments with a block design. This is logical, given the relative insensitivity of block designs to small temporal shifts. However, the steps of motion correction, high-pass filtering, and spatial smoothing were found to significantly improve the analysis. They reported that the optimization of preprocessing pipelines improved both intra-subject and between-subject reproducibility of results (Zhang et al., 2009). Identifying an optimal set of processing steps and parameters can dramatically improve the sensitivity of an analysis.

SNR influences of participants

The MRI system and fMRI analysis methods have received a great deal of attention with regard to SNR. However, one area that may have the greatest contribution to fMRI reliability is how stable/unstable the patterns of activity within a single subject can be. After all, a test-retest methodology involving human beings is akin to hitting a moving target. Any discussion of test-retest reliability in fMRI has to take into consideration the fact that the cognitive state of a subject is variable over time.

There are two important ways that a subject can influence reliability within a test-retest experimental design. The first involves within-subject changes that take place over the course of a single session. For instance, differences in attention and arousal can significantly modulate subsequent responses to sensory stimulation (Munneke et al., 2008; Peyron et al., 1999; Sterr et al., 2007). Variability can also be caused by evolving changes in cognitive strategy used during tasks like episodic retrieval (Miller et al., 2001; Miller et al., 2002). If a subject spontaneously shifts to a new decision criterion midway during a session then the resulting data may reflect the results of two different cognitive processes. Finally, learning will take place with continued task experience, shifting the pattern of activity as brain regions are engaged and disengaged during task-relevant processing (Grafton et al., 1995; Poldrack et al., 1999; Rostami et al., 2009). For studies investigating learning this is a desired effect, but for others this is an undesired source of noise.

The second influence on reliability is related to physiological and cognitive changes that may take place within a subject between the test and retest sessions. Within 24 hours an infinite variety of reliability-reducing events can take place. All of the above factors may show changes over the days, weeks, months, or years between scans. These changes may be even more dramatic depending on the amount of time between scanning sessions.

– Estimates of fMRI Reliability –

A diverse array of methods has been created for measuring the reliability of fMRI. What differs between them is the specific facet of reliability they are intended to quantify. Some methods are only concerned with significant voxels. Other methods address similarity in the magnitude of estimated activity across all voxels. The choice of how to calculate reliability often comes down to which aspect of the results is expected to remain stable over time.

Measuring stability of super-threshold extent.

Do you want the voxels that are significant during the test scan to still be significant during the retest scan? This would indicate that super-threshold voxels are to remain above the threshold during subsequent sessions. The most prevalent method to quantify this reliability is the cluster overlap method, a measure of which voxels are super-threshold during both the test and retest sessions.

Two approaches have been used to calculate cluster overlap. The first, and by far most prevalent, is a measure of similarity known as the Dice coefficient. It was first used to calculate fMRI cluster overlap by Rombouts et al. and has become a standard measure of result similarity (1997). It is typically calculated by the following equation:
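In standard form the Dice coefficient can be written as below. The symbol names are illustrative: V_1 and V_2 are the numbers of super-threshold voxels in the test and retest results, and V_overlap is the number of voxels that are super-threshold in both.

```latex
R_{\mathrm{Dice}} = \frac{2\,V_{\mathrm{overlap}}}{V_{1} + V_{2}}
```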

Results from the Dice equation can be interpreted as the number of voxels that will overlap divided by the average number of significant voxels across sessions. Another approach to calculating similarity is the Jaccard index. The Jaccard index has the advantage of being readily interpretable as the percent of voxels that are shared, but is infrequently used in the investigation of reliability. It is typically calculated by the following equation:
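In standard form the Jaccard index can be written as below, again with illustrative symbol names: V_1 and V_2 are the numbers of super-threshold voxels in the test and retest results, and V_overlap is the number that are super-threshold in both. The denominator is the size of the union of the two sets.

```latex
R_{\mathrm{Jaccard}} = \frac{V_{\mathrm{overlap}}}{V_{1} + V_{2} - V_{\mathrm{overlap}}}
```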

Results from the Jaccard equation can be interpreted as the number of overlapping voxels divided by the total number of unique voxels in all sessions. For both the Dice and Jaccard methods a value of 1.0 would indicate that all super-threshold voxels identified during the test scan were also active in the retest scan, and vice-versa. A value of 0.0 would indicate that no voxels in either scan were shared between the test and retest sessions. See Figure 1 for a graphical representation of overlapping results from two runs in an example dataset.
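Both measures are straightforward to compute once each thresholded map is represented as a set of voxel coordinates. The sketch below uses the standard set definitions of the two coefficients; the function names and coordinates are hypothetical, and returning 0.0 when both maps are empty is a convention chosen here.

```python
def dice_overlap(test_voxels, retest_voxels):
    """Twice the shared voxels divided by the sum of the two set sizes."""
    total = len(test_voxels) + len(retest_voxels)
    if total == 0:
        return 0.0  # convention for two empty maps
    return 2.0 * len(test_voxels & retest_voxels) / total


def jaccard_overlap(test_voxels, retest_voxels):
    """Shared voxels divided by all unique voxels across both sets."""
    union = test_voxels | retest_voxels
    if not union:
        return 0.0  # convention for two empty maps
    return len(test_voxels & retest_voxels) / len(union)


# Hypothetical super-threshold voxel coordinates (x, y, z) from two scans.
test_scan = {(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)}
retest_scan = {(1, 1, 1), (1, 1, 2), (3, 3, 3)}

dice = dice_overlap(test_scan, retest_scan)
jaccard = jaccard_overlap(test_scan, retest_scan)
print(f"Dice = {dice:.2f}, Jaccard = {jaccard:.2f}")
```

For the same pair of maps the Dice value is never lower than the Jaccard value, so the two should not be compared directly across papers.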

Figure 1. Visualization of cluster overlap using two runs of data from a two-back working memory task. The regions in red represent significant clusters from the first run and regions in blue represent significant clusters from the second run. The crosshatched region represents the overlapping voxels that were significant in both runs. Important to note is that not all significant voxels remained significant across the two runs. One cluster in the cerebellum did not replicate at all. Data is from Bennett, Guerin, and Miller (2009).

The main limitation of all cluster overlap methods is that they are highly dependent on the statistical threshold used to define what is ‘active’. Duncan et al. demonstrated that the reported reliability of the cluster overlap method decreases as the significance threshold is increased (2009). Similar results were reported by Rombouts et al., who found nonlinear changes in cluster overlap reliability across multiple levels of significance (1998).

These overlap statistics seek to represent the proportion of voxels that remain significant across repetitions relative to the proportion that are significant in only a subset of the results. Another, similar approach would be to conduct a formal conjunction analysis between the repetitions. The goal of this approach would be to uniquely identify those voxels that are significant in all sessions. One example of this approach would be the ‘Minimum Statistic compared to the Conjunction Null’ (MS/CN) of Nichols et al. (2005). Using this approach a researcher could threshold the results, allowing for the investigation of reliability with a statistical criterion.

A method similar to cluster overlap, called voxel counting, was reported in early papers. The use of voxel counting simply evaluated the total number of activated voxels in the test and retest images. This has proven to be a suboptimal approach for the examination of reliability, as it is done without regard to the spatial location of significant voxels (Cohen and DuBois, 1999). An entirely different set of results could be observed in each image yet they could contain the same number of significant voxels. As a result this method is no longer used.
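The weakness of voxel counting is easy to show with a toy case in which two scans have identical counts but no voxels in common. The coordinates below are hypothetical.

```python
# Hypothetical voxel coordinates (x, y, z) that passed threshold in each scan.
test_scan = {(10, 20, 30), (10, 20, 31), (11, 20, 30)}
retest_scan = {(40, 50, 12), (40, 51, 12), (41, 50, 12)}

same_count = len(test_scan) == len(retest_scan)  # what voxel counting sees
shared_voxels = test_scan & retest_scan          # what overlap methods see

print("Counts match:", same_count)
print("Shared voxels:", len(shared_voxels))
```

Voxel counting would report these two scans as perfectly consistent, while any overlap measure would report no agreement at all.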

Measuring stability of activity in significant clusters.

Do you want the estimated magnitude of activity in each cluster to be stable between the test scan and the retest scan? This is a more stringent criterion than simple extent reliability, as it is necessary to replicate the exact degree of activation and not simply what survives thresholding. The most common method to quantify this reliability is through an intra-class correlation (ICC) of the time1-time2 cluster values. The intra-class correlation is different from the traditional Pearson product-moment correlation as it is specialized for data of one type, or class. While there are many versions of the ICC, it is typically taken to be a ratio of the variance of interest divided by the total variance (Bartko, 1966; Shrout and Fleiss, 1979). The ICC can be computed as follows:
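A general form consistent with that description is given below. It assumes the variance of interest is the between-subject variance and that the total variance is the sum of the between-subject and within-subject components; the symbol names are illustrative.

```latex
\mathrm{ICC} = \frac{\sigma^{2}_{\mathrm{between}}}{\sigma^{2}_{\mathrm{between}} + \sigma^{2}_{\mathrm{within}}}
```

Specific ICC subtypes differ in how these variance components are estimated from the data.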

One of the best reviews of the ICC was completed by Shrout and Fleiss, who detailed six types of ICC calculation and when each is appropriate to use (1979). One advantage of the ICC is that it can be interpreted similarly to the Pearson correlation. A value of 1.0 would indicate perfect agreement between the values of the test and retest sessions, as there would be no influence of within-subject variability. A value of 0.0 would indicate that there was no agreement between the values of the test and retest sessions, since within-subject variability would dominate the equation.

Studies examining reliability with intra-class correlations often compute them from summary values within regions of interest (ROIs). Caceres et al. compared four methods commonly used to compute ROI reliability using intra-class correlations (2009). The median(ICC) is the median of the voxelwise ICC values from within a ROI. ICCmed is the ICC of the median contrast values within a ROI. ICCmax is the calculation of ICC values at the peak activated voxel within an activated cluster. ICCv is defined as the intra-voxel reliability, a measure of the total variability that can be explained by the intra-voxel variance.

There are several notable weaknesses to the use of ICC in calculating reliability. First, the generalization of ICC results is limited because calculation is specific to the dataset under investigation. An experiment with high inter-subject variability could have different ICC values relative to an experiment with low inter-subject variability, even if the stability of values over time is the same. As discussed later in this chapter, this can be particularly problematic when comparing the reliability of clinical disorders to that of normal controls. Second, because of the variety of ICC subtypes there can often be confusion regarding which one to use. Using an incorrect subtype can result in quite different reliability estimates (Muller and Buttner, 1994).
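The first weakness can be illustrated numerically. The sketch below uses the one-way random-effects ICC, ICC(1,1) in the Shrout and Fleiss scheme, computed from between-subject and within-subject mean squares. Choosing that subtype is an assumption made for illustration, and the subject values are hypothetical. Every subject shifts by exactly the same amount between sessions in both groups; only the spread between subjects differs.

```python
def icc_oneway(scores):
    """ICC(1,1): scores is a list of per-subject lists, one value per session."""
    n = len(scores)     # number of subjects
    k = len(scores[0])  # number of sessions
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    # Between-subject mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square.
    msw = sum((x - m) ** 2
              for row, m in zip(scores, subject_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Same test-retest change for every subject (plus or minus 1 around a baseline).
high_spread = [[b - 1.0, b + 1.0] for b in (10.0, 20.0, 30.0, 40.0)]
low_spread = [[b - 1.0, b + 1.0] for b in (24.0, 25.0, 26.0, 27.0)]

icc_high = icc_oneway(high_spread)
icc_low = icc_oneway(low_spread)
print(f"High between-subject spread: ICC = {icc_high:.2f}")
print(f"Low between-subject spread:  ICC = {icc_low:.2f}")
```

The within-subject stability is identical in the two groups, yet the ICC is far lower for the homogeneous group. This is why ICC values from samples with different between-subject variability are hard to compare.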

Measuring voxelwise reliability of the whole brain.

Do you want to know the reliability of results on a whole-brain, voxelwise basis? Completing a voxelwise calculation would indicate that the level of activity in all voxels should remain consistent between the test and retest scans. This is the strictest criterion for reliability. It yields a global measure of concordance that indicates how effectively activity across the whole brain is represented in each test-retest pairing. Very few studies have examined reliability using this approach, but it may be one of the most valuable metrics of fMRI reliability. This is one of the few methods that gives weight to the idea that the estimated activity should remain consistent between test and retest, even if the level of activity is close to zero.

Figure 2. Histogram showing the frequency of voxelwise ICC values during a two-back working memory task. The histogram was computed from a dataset of sixteen subjects using 100 bins between ICC values of 1.0 and -1.0. The distribution of values is negatively skewed, with a mean ICC of 0.44 and a most frequently occurring value (mode) of 0.57. Data is from Bennett, Guerin, and Miller (2009).

Figure 2 is an example histogram plot from our own data that shows the frequency of ICC values for all voxels across the whole brain during a two-back working memory task (Bennett et al., 2009). The mean and mode of the distribution are plotted. It is quickly apparent that there is a wide range of ICC reliability values across the whole brain, with some voxels having almost no reliability and others approaching near perfect reliability.

Other reliability methods.

Numerous other methods have also been used to measure the reliability of estimated activity. Some of these include maximum likelihood (ML), coefficient of variation (CV), and variance decomposition. While these methods are in the minority by frequency of use, this does not diminish their utility in examining reliability. This is especially true with regard to identifying the sources of test-retest variability that can influence the stability of results.

One particularly promising approach for the quantification of reliability is predictive modeling. Predictive modeling measures the ability of a training set of data to predict the structure of a testing set of data. One of the best established modeling techniques within functional neuroimaging is the nonparametric prediction, activation, influence, and reproducibility sampling (NPAIRS) approach by Strother et al. (2004; 2002). Within the NPAIRS modeling framework separate metrics of prediction and reproducibility are generated (Zhang et al., 2008). The first, prediction accuracy, evaluates classification in the temporal domain, predicting which condition of the experiment each scan belongs to. The second metric, reproducibility, evaluates the model in the spatial domain, comparing patterns of regional brain activity over time. While this approach is far more complicated than the relatively simple cluster overlap or ICC metrics, predictive modeling does not suffer from many of the drawbacks that these methods have. NPAIRS, and other predictive modeling approaches, enable a much more thorough examination of fMRI reliability.

Some studies have investigated fMRI reliability using the Pearson product-moment (r) correlation. Intuitively this is a logical method to use, as it measures the relationship between two variables. However, it is generally held that the Pearson product-moment correlation is not an ideal measure of test-retest reliability. Safrit identified three reasons why the product-moment correlation should not be used to calculate reliability (1976). First, the Pearson product-moment correlation is set up to determine the relationship between two variables, not the stability of a single variable. Second, it is difficult to measure reliability with the Pearson product-moment correlation beyond a single test-retest pair. It becomes increasingly awkward to quantify reliability with two or more retest sessions. One can try to average over multiple pairwise Pearson product-moment correlations between the multiple sessions, but it is far easier to take the ANOVA approach of the ICC and examine it from the standpoint of between- and within-subject variability. Third, the Pearson product-moment correlation cannot detect systematic error. This would be the case when the retest values deviate by a similar degree, such as adding a constant value to all of the original test values. The Pearson product-moment correlation would remain the same, while an appropriate ICC would indicate that the test-retest agreement is not exact. While the use of ICC measures has its own set of issues, it is generally a more appropriate tool for the investigation of test-retest reliability.
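The third point can be checked with a small numerical sketch. It adds a constant to every retest value and compares the Pearson correlation with a one-way random-effects ICC, ICC(1,1). Treating ICC(1,1) as the "appropriate" ICC is an assumption made here; consistency-type ICCs would not penalize a constant shift. The scores are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def icc_oneway(x, y):
    """ICC(1,1) for one test session (x) and one retest session (y)."""
    n, k = len(x), 2
    grand_mean = (sum(x) + sum(y)) / (n * k)
    subject_means = [(a + b) / k for a, b in zip(x, y)]
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, subject_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


test_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
retest_shifted = [v + 3.0 for v in test_scores]  # systematic error of +3

r_shifted = pearson_r(test_scores, retest_shifted)
icc_shifted = icc_oneway(test_scores, retest_shifted)
icc_identical = icc_oneway(test_scores, test_scores)

print(f"Pearson r with constant shift: {r_shifted:.3f}")
print(f"ICC(1,1) with constant shift:  {icc_shifted:.3f}")
print(f"ICC(1,1) with identical retest: {icc_identical:.3f}")
```

The correlation is unaffected by the shift because the ordering and spacing of subjects is preserved, while the one-way ICC drops sharply because the shift is counted as within-subject disagreement.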

– Review of Existing Reliability Estimates –

Since the advent of fMRI some results have been common and quite easily replicated. For example, activity in primary visual cortex during visual stimulation has been thoroughly studied. Other fMRI results have been somewhat difficult to replicate. What does the existing literature have to say regarding the reliability of fMRI results?

There have been a number of individual studies investigating the test-retest reliability of fMRI results, but few articles have reviewed the entire body of literature to find trends across studies. To obtain a more effective estimate of fMRI reliability we conducted a survey of the existing literature on fMRI reliability. To find papers for this investigation we searched for “test-retest fMRI” using the NCBI PubMed database (www.pubmed.gov). This search yielded a total of 183 papers, 37 of which used fMRI as a method of investigation, used a general linear model to compute their results, and provided test-retest measures of reliability. To broaden the scope of the search we then went through the reference section of the 37 papers found using PubMed to look for additional works not identified in the initial search. There were 26 additional papers added to the investigation through this secondary search method. The total number of papers retrieved was 63. Each paper was examined with regard to the type of cognitive task, kind of fMRI design, number of subjects, and basis of reliability calculation.

We have separated out the results into three groups: those that used the cluster overlap method, those that used intraclass correlation, and papers that used other calculation methods. The results of this investigation can be seen in Tables 1, 2, and 3. In the examination of cluster overlap values in the literature we attempted to only include values that were observed at a similar significance threshold across all of the papers. The value we chose as the standard was p(uncorrected) < 0.001. Deviations from this standard are noted in the tables.

Table1

Table2

Table3

Conclusions From the Reliability Review

What follows are some general points that can be taken away from the reliability survey. Some of the conclusions that follow are quantitative results from the review and some are qualitative descriptions of trends that were observed as we conducted the review.

A diverse collection of methods has been used to assess fMRI reliability. The first finding mirrors the above discussion on reliability calculation. A very diverse collection of methods has been used to investigate fMRI reliability. This list includes: intra-class correlation (ICC), cluster overlap, voxel counts, receiver operating characteristic (ROC) curves, maximum likelihood (ML), conjunction analysis, Cohen’s kappa index, coefficient of variation (CV), Kendall’s W, laterality index (LI), variance component decomposition, Pearson correlation, predictive modeling, and still others. While this diversity of methods has created converging evidence of fMRI reliability, it has also limited the ability to compare and contrast the results of existing reliability studies.

Intra-class correlation and cluster overlap methods dominate the calculation of test-retest reliability. While there have been a number of methods used to investigate reliability, the two that stand out by frequency of use are cluster overlap and intra-class correlation. One advantage of these methods is that they are easy to calculate. The equations are simple to understand, easy to implement, and fast to process. A second advantage of these methods is their easy interpretation by other scientists. Even members of the general public can understand the concept behind the overlapping of clusters and most everyone is familiar with correlation values. While these techniques certainly have limitations and caveats, they seem to be the emerging standard for the analysis of fMRI reliability.

Most previous studies of reliability and reproducibility have been done with relatively few subjects. What sample size is necessary to conduct effective reliability research? Most of the studies that were reviewed used fewer than 10 subjects to calculate their reliability measures, with 11 subjects being the overall average across the investigation. Should reliability studies have more subjects? Since a large amount of the error variance is coming from subject-specific factors it may be wise to use larger sample sizes when assessing study reliability, as a single anomalous subject could sway study reliability in either direction. Another notable factor is that a large percentage of studies using fMRI are completed with a restricted range of subjects. Most samples will typically be recruited from a pool of university undergraduates. These samples may have a different reliability than a sample pulled at random from the larger population. Because of sample restriction the results of most test-retest investigations may not reflect the true reliability of other populations, such as children, the elderly, and individuals with clinical disorders.

Reliability varies by test-retest interval. Generally, increased amounts of time between the initial test scan and the subsequent retest scan will lower reliability. Still, even back-to-back scans are not perfectly reliable. The average Jaccard overlap of studies where the test and retest scans took place within the same hour was 33%. Many studies with intervals lasting three months or more had a lower overlap percentage. This is a somewhat loose guideline though. Notably, the results reported by Aron et al. had one of the longest test-retest intervals but also possessed the highest average ICC score (2006).

Reliability varies by cognitive task and experimental design. Motor and sensory tasks seem to have greater reliability than tasks involving higher cognition. Caceres et al. found that the reliability of an N-back task was generally higher than that of an auditory target detection task (2009). Differences in the design of an fMRI experiment also seem to affect the reliability of results. Specifically, block designs appear to have a slight advantage over event-related designs in terms of reliability. This may be a function of the greater statistical power inherent in a block design and its increased SNR.

Significance is related to reliability, but it is not a strong correlation. Several studies have illustrated that super-threshold voxels are not necessarily more reliable than sub-threshold voxels. Caceres et al. examined the joint probability distribution of significance and reliability (2009). They found that there were some highly activated ROIs with low reliability and some sub-threshold regions that had high reliability. These ICC results fit in well with the data from cluster overlap studies. The average cluster overlap was 29%. This means that, across studies, the average proportion of significant voxels that will replicate is roughly one-third. This evidence speaks against the assumption that significant voxels will be far more reliable in an investigation of test-retest reliability.

An optimal threshold of reliability has not been established. There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between 0.40 and 0.58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.
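For readers who want to apply the qualitative bands quoted above, a minimal lookup might look like the following. The cut points are taken from the ranges as stated in this text; how values falling exactly on a boundary are assigned is an assumption of this sketch, and the function name is hypothetical.

```python
def describe_icc(icc):
    """Map an ICC value to the qualitative bands quoted in the text."""
    if icc > 0.75:
        return "excellent"
    if icc >= 0.59:
        return "good"  # assumption: 0.75 itself falls in this band
    if icc >= 0.40:
        return "fair"  # assumption: values between 0.58 and 0.59 fall here
    return "poor"


for value in (0.80, 0.65, 0.50, 0.30):
    print(f"ICC = {value:.2f}: {describe_icc(value)}")
```

These labels are descriptive conventions rather than validated thresholds, which is the point of the paragraph above.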

Inter-individual variability is consistently greater than intra-individual variability. Many studies reported both within-subject and between-subject reliability values in their results. In every case the within-subject reliability far exceeded the between-subjects reliability. Miller et al. explicitly examined variability across subjects and concluded that there are large-scale, stable differences between individuals on almost any cognitive task (2001; 2002). More recently, Miller et al. directly contrasted within- and between-subject variability (2009). They concluded that between-subject variability was far higher than any within-subject variability. They further demonstrated that the results from one subject completing two different cognitive tasks are typically more similar than the data from two subjects doing the same task. These results are mirrored by those of Costafreda et al. who found that well over half (57%) of the variability in their fMRI data was due to between-subject variation (2007). It seems to be the case that within-subject measurements over time may vary, but they vary far less than differences in the overall pattern of activity between individuals.

There is little agreement regarding the true reliability of fMRI results. While we mention this as a final conclusion from the literature review, it is perhaps the most important point. Some studies have estimated the reliability of fMRI data to be quite high, or even close to perfect for some tasks and brain regions (Aron et al., 2006; Maldjian et al., 2002; Raemaekers et al., 2007). Other studies have been less enthusiastic, showing fMRI reliability to be relatively low (Duncan et al., 2009; Rau et al., 2007). Across the survey of fMRI test-retest reliability we found that the average ICC value was 0.50 and the average cluster overlap value was 29% of voxels (Dice overlap = 0.45, Jaccard overlap = 0.29). This represents an average across many different cognitive tasks, fMRI experimental designs, test-retest time periods, and other variables. While these numbers may not be representative of any one experiment, they do provide an effective overview of fMRI reliability.
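The Dice and Jaccard figures quoted above are two views of the same overlap, related by Dice = 2J / (1 + J). The sketch below shows that conversion and how both measures could be computed from two sets of suprathreshold voxels. The function names and the toy voxel indices are illustrative assumptions, not part of the reviewed studies.

```python
def dice_from_jaccard(jaccard: float) -> float:
    """Convert a Jaccard overlap to the equivalent Dice overlap."""
    return 2.0 * jaccard / (1.0 + jaccard)


def jaccard_from_dice(dice: float) -> float:
    """Convert a Dice overlap to the equivalent Jaccard overlap."""
    return dice / (2.0 - dice)


def cluster_overlap(test_voxels: set, retest_voxels: set) -> tuple:
    """Return (dice, jaccard) for two sets of significant voxel indices."""
    if not test_voxels and not retest_voxels:
        raise ValueError("both voxel sets are empty; overlap is undefined")
    shared = len(test_voxels & retest_voxels)
    dice = 2.0 * shared / (len(test_voxels) + len(retest_voxels))
    jaccard = shared / len(test_voxels | retest_voxels)
    return dice, jaccard


if __name__ == "__main__":
    # The survey average: a Jaccard overlap of 0.29 corresponds to Dice of about 0.45.
    print(round(dice_from_jaccard(0.29), 2))
    # Toy example with hypothetical voxel indices.
    print(cluster_overlap({1, 2, 3, 4}, {3, 4, 5, 6}))
```

The Jaccard value is the one that reads directly as "percent of voxels that overlap", which is why the 29% figure matches the Jaccard number rather than the Dice number.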

– Other Issues and Comparisons –

Test-Retest Reliability in Clinical Disorders

There have been few examinations of test-retest reliability in clinical disorders relative to the number of studies with normal controls. A contributing factor to this problem may be that the scientific understanding of brain disorders is still in its infancy. It may be premature to examine clinical reliability if there is only a vague understanding of anatomical and functional abnormalities in the brain. Still, some investigators have taken significant steps forward in the clinical realm. These few investigations suggest that reliability in clinical disorders is typically lower than the reliability of data from normal controls. Some highlights of these results are listed below, categorized by disorder.

Epilepsy. Functional imaging has enormous potential to aid in the clinical diagnosis of epileptiform disorders. Focusing on fMRI, research by Di Bonaventura et al. found that the spatial extent of activity associated with fixation off sensitivity (FOS) was stable over time in epileptic patients (2005). Of greater research interest for epilepsy has been the reliability of combined EEG/fMRI imaging. Symms et al. reported that they could reliably localize interictal epileptiform discharges using EEG-triggered fMRI (1999). Waites et al. also reported the reliable detection of discharges with combined EEG/fMRI at levels significantly above chance (2005). Functional imaging also has the potential to assist in the localization of cognitive function prior to resection for epilepsy treatment. One possibility would be to use noninvasive fMRI measures to replace cerebral sodium amobarbital anesthetization (Wada Test). Fernandez et al. reported good reliability of lateralization indices (whole-brain test-retest r = 0.82) and cluster overlap measures (Dice overlap = 0.43, Jaccard overlap = 0.27) (2003).

Stroke. Many aspects of stroke recovery can impact the results of functional imaging data. The lesion location, size, and time elapsed since the stroke event each have the potential to alter function within the brain. These factors can also lead to increased between-subject variability relative to groups of normal controls. This is especially true when areas proximal to the lesion location contribute to specific aspects of information processing, such as speech production. Kimberley et al. found that stroke patients had generally higher ICC values relative to normal controls (2008). This mirrors the findings of Eaton et al., who showed that the average reliability of aphasia patients was approximately equal to that of normal controls as measured by ICC (2008). These results may be indicative of equivalent fMRI reliability in stroke victims, or they may be an artifact of the ICC calculation. Kimberley et al. state that increased between-subject variability of stroke patients can lead to inflated ICC estimates (2008). They argue that fMRI reliability in stroke patients likely falls within the moderate range of values (0.4 < ICC < 0.6).

Schizophrenia. Schizophrenia is a multidimensional mental disorder characterized by a wide array of cognitive and perceptual dysfunctions (Freedman, 2003; Morrison and Murray, 2005). While there have been a number of studies on the reliability of anatomical measures in schizophrenia, there have been few that have focused on function. Manoach et al. demonstrated that the fMRI results from schizophrenic patients on a working memory task were less reliable overall than those of normal controls (2001). The reliability of significant ROIs in the schizophrenic group ranged from ICC values of -0.20 to 0.57. However, the opposite effect was found by Whalley et al. in a group of subjects at high genetic risk for schizophrenia (no psychotic symptoms) (2009). The ICC values for these subjects were as reliable as those of normal controls on a sentence completion task. More research is certainly needed to find consensus on reliability in schizophrenia.

Aging. The anatomical and functional changes that take place during aging can increase the variability of fMRI results at all levels (MacDonald et al., 2006). Clement et al. reported that cluster overlap percentages and the cluster-wise ICC values were not significantly different between normal elderly controls and patients with mild cognitive impairment (MCI) (2009). On an episodic retrieval task healthy controls had ICC values averaging 0.69 while patients diagnosed with MCI had values averaging 0.70. However, they also reported that all values for the older samples were lower than those reported for younger adults on similar tasks. Marshall et al. found that while the qualitative reproducibility of results was high, the reliability of activation magnitude during aging was quite low (2004).

It is clear that the use of intra-class correlations in clinical research must be approached carefully. As mentioned by Bosnell et al. and Kimberley et al., extreme levels of between-subject variability will artificially inflate the resulting ICC reliability estimate (Bosnell et al., 2008; Kimberley et al., 2008). Increased between-subject variability is a characteristic found in many clinical populations. Therefore, comparing two populations with different levels of between-subject variability may be impossible when using an ICC measure.
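The inflation effect can be seen in a toy calculation. The sketch below computes a one-way random-effects ICC, often written ICC(1,1), from a subjects-by-sessions table. The choice of this particular ICC form is an assumption made for illustration (the studies above used a variety of ICC variants), and the two small data sets are invented so that the test-retest error is identical while the spread between subjects differs.

```python
def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for a table of n subjects by k sessions.

    ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    # Variability of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Variability of each session around that subject's own mean.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, subject_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical activation estimates: each row is one subject, columns are test and retest.
# Both groups have exactly the same within-subject (test-retest) error.
homogeneous_group = [[1.0, 2.0], [4.0, 3.0], [5.0, 6.0]]
heterogeneous_group = [[1.0, 2.0], [12.0, 11.0], [21.0, 22.0]]

if __name__ == "__main__":
    print(round(icc_oneway(homogeneous_group), 3))
    print(round(icc_oneway(heterogeneous_group), 3))
```

The second group yields the higher ICC even though no subject was measured any more consistently, which is the reason a direct ICC comparison between clinical and control populations can mislead.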

Reliability Across Scanners / Multicenter Studies

One area of increasing research interest is the ability to combine the data from multiple scanners into larger, integrative data sets (Van Horn and Toga, 2009). There are two areas of reliability that are important for such studies. The first is subject-level reliability, or how stable the activity of one person will be scan-to-scan. The second is group-level reliability, or how stable the group fMRI results will be from one set of subjects to another or from one scanner to another. Given the importance of multi-center collaboration it is critical to evaluate how results will differ when the data comes from a heterogeneous group of MRI scanners as opposed to a single machine. Generally, the concordance of fMRI results from center to center is quite good, but not perfect.

Casey et al. were among the first groups to examine the reliability of results across scanners (1998). Across four institutions they found a ‘strong similarity’ in the location and distribution of significant voxel clusters. More recently, Friedman et al. found that inter-center reliability was somewhat worse than test-retest reliability across several centers with an identical hardware configuration (2008). The median ICC of their inter-center results was 0.22. Costafreda et al. also examined the reproducibility of results from identical fMRI setups (2007). Using a variance components analysis they determined that the MR system accounted for roughly 8% of the variation in the BOLD signal. This compares favorably relative to the level of between-subject variability (57%).

The reliability of results from one scanner to another seems to be approximately equal to or slightly less than the values of test-retest reliability with the same MRI hardware. Special calibration and quality control steps can be taken to ensure maximum concordance across scanners. For instance, before conducting anatomical MRI scans in the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://www.loni.ucla.edu/ADNI/) a special MR phantom is typically scanned. This allows for correction of magnet-specific field inhomogeneity and maximizes the ability to compare data from separate scanners. Similar calibration measures are being discussed for functional MRI (Chiarelli et al., 2007; Friedman and Glover, 2006; Thomason et al., 2007). It may be the case that as calibration becomes standardized it will lead to increased inter-center reliability.

Other Statistical Issues in fMRI

It is important to note that a number of important fMRI statistical issues have gone unmentioned in this chapter. First, there is the problem of conducting thousands of statistical comparisons without an appropriate threshold adjustment. Correction for multiple comparisons is a necessary step in fMRI analysis that is often skipped or ignored (Bennett et al., in press). Another statistical issue in fMRI is temporal autocorrelation in the acquired timeseries. This refers to the fact that any single timepoint of data is not necessarily independent of the acquisitions that came before and after (Smith et al., 2007; Woolrich et al., 2001). Autocorrelation correction is widely available, but is not implemented by most investigators. Finally, throughout the last year the ‘non-independence error’ has been discussed at length. Briefly, this refers to selecting a set of voxels to create a region of interest (ROI) and then using the same measure to evaluate some statistical aspect of that region. Ideally, an independent data set should be used after the ROI has been initially defined. It is important to address these issues because they are still debated within the field and often ignored in fMRI analysis. Their correction can have a dramatic impact on how reproducible the results will be from study to study.
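To make the multiple comparisons problem concrete, here is a minimal sketch of the arithmetic. The voxel count and thresholds are hypothetical, and the calculation assumes independent tests, which real fMRI voxels are not (neighboring voxels are spatially correlated), so treat the numbers as a rough guide. Bonferroni is shown only as the simplest form of correction.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, familywise_alpha: float) -> float:
    """Per-test threshold that keeps the familywise error rate at or below the target."""
    return familywise_alpha / n_tests


if __name__ == "__main__":
    n_voxels = 50_000          # hypothetical number of brain voxels tested
    uncorrected_alpha = 0.001  # a commonly seen uncorrected threshold
    print(expected_false_positives(n_voxels, uncorrected_alpha))
    print(familywise_error_rate(n_voxels, uncorrected_alpha))
    print(bonferroni_threshold(n_voxels, 0.05))
```

With these assumed numbers an uncorrected threshold of p < 0.001 would be expected to pass about fifty voxels by chance alone, and voxels that pass by chance have no reason to reappear in a retest session.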

– Conclusions –

How can a researcher improve fMRI reliability?

The generation of highly reliable results requires that sources of error be minimized across a wide array of factors. An issue within any single factor can significantly reduce reliability. Problems with the scanner, a poorly designed task, or an improper analysis method could each be extremely detrimental. Conversely, elimination of all such issues is necessary for high reliability. A well maintained scanner, well designed tasks, and effective analysis techniques are all prerequisites for reliable results.

There are a number of practical ways that fMRI researchers can improve the reliability of their results. For example, Friedman and Glover reported that simply increasing the number of fMRI runs improved the reliability of their results from ICC = 0.26 to ICC = 0.58 (2006). That is quite a large jump for an additional ten or fifteen minutes of scanning. Below are some general areas where reliability can be improved.

Increase the SNR and CNR of the acquisition. One area of attention is to improve the signal-to-noise and contrast-to-noise ratios of the data collection. An easy way to do this would be to simply acquire more data. It is a zero-sum game, as increasing the number of TRs that are acquired will help improve the SNR but will also increase the task length. Subject fatigue, scanner time limitations, and the diminishing returns with each duration increase will all play a role in limiting the amount of time that can be dedicated to any one task. Still, a researcher considering a single six-minute EPI scan for their task might add further data collection to improve the SNR of the results. With regard to the magnet, every imaging center should verify acquisition quality before scanning. Many sites conduct quality assurance (QA) scans at the beginning of each day to ensure stable operation. This has proven to be an effective method of detecting issues with the MR system before they cause trouble for investigators. It is a hassle to cancel a scanning session when there are subtle artifacts present, but this is a better option than acquiring noisy data that does not make a meaningful contribution to the investigation. As a final thought, research groups can always start fundraising to purchase a new magnet with improved specifications. If data acquisition is being done on a 1.5 Tesla magnet with a quadrature head coil, enormous gains in SNR can be made by moving to 3.0 Tesla or higher and using a parallel-acquisition head coil (Simmons et al., 2009; Zou et al., 2005).
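The diminishing returns from longer scans can be sketched with the usual square-root rule for averaging. This assumes the noise is independent from volume to volume, which is optimistic for fMRI given temporal autocorrelation, so the real gain is likely smaller. The 2-second TR used to turn minutes into volumes is a hypothetical value.

```python
import math


def relative_snr_gain(base_volumes: int, new_volumes: int) -> float:
    """SNR of the longer acquisition relative to the shorter one.

    Assumption: independent noise across volumes, so SNR grows with sqrt(N).
    """
    return math.sqrt(new_volumes / base_volumes)


if __name__ == "__main__":
    tr_seconds = 2.0                              # hypothetical repetition time
    six_minutes = int(6 * 60 / tr_seconds)        # 180 volumes
    for minutes in (6, 12, 18, 24):
        volumes = int(minutes * 60 / tr_seconds)
        print(minutes, round(relative_snr_gain(six_minutes, volumes), 2))
```

Under this assumption, doubling the scan from six to twelve minutes buys roughly a 41% gain in SNR, while it takes a full twenty-four minutes to double it.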

Minimize individual differences in cognitive state, both across subjects and over time. Because magnet time is expensive and precious the critical component of effective task instruction can often be overlooked. Researchers would rather be acquiring data as opposed to spending additional time giving detailed instructions to a subject. However, this is a very easy way to improve the quality of the final data set. If it takes ten trials for the participant to really ‘get’ the task then those trials have been wasted, adding unnecessary noise to the final results. Task training in a separate laboratory session in conjunction with time in a mock MRI scanner can go a long way toward homogenizing the scanner experience for subjects. It may not always be possible to fully implement these steps, but they should not be avoided simply to reduce the time spent per subject.

For multi-session studies, steps can be taken to help stabilize intra-subject changes over time. Scanning test and retest sessions at the same time of day can help due to circadian changes in hormone levels and cognitive performance (Carrier and Monk, 2000; Huang et al., 2006; Salthouse et al., 2006). A further step to consider is minimizing the time between sessions to help stabilize the results. Much more can change over the course of a month than over the course of a week.

Maximize the experiment’s statistical power. Power represents the ability of an experiment to reject the null hypothesis when the null hypothesis is indeed false (Cohen, 1977). For fMRI this ability is often discussed in terms of the number of subjects that will be scanned and the design of the task that will be administered, including how many volumes of data will be acquired from each subject. More subjects and more volumes almost always contribute to increasing power, but there are occasions when one may improve power more than the other. For example, Mumford and Nichols demonstrated that, when scanner time was limited, different combinations of subjects and trials could be used to achieve high levels of power (2008). For their hypothetical task it would take only five 15-second blocks to achieve 80% power if there were 23 subjects, but it would take 25 blocks if there were only 18 subjects. These kinds of power estimations are quite useful in determining the best use of available scanner time. Tools like fmripower (http://fmripower.org) can utilize data from existing experiments to yield new information on how many subjects and scans a new experiment will require to reach a desired power level (Mumford and Nichols, 2008; Mumford et al., 2007; Van Horn et al., 1998).
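For readers who want a feel for this kind of calculation, here is a deliberately simplified sketch. It is not the Mumford and Nichols method and it does not use fmripower. It treats the group analysis as a two-sided one-sample test on a standardized effect size and uses a normal approximation. The function names and the effect size of 0.5 are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def approx_power(effect_size: float, n_subjects: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided one-sample test (normal approximation).

    effect_size is the mean activation divided by its between-subject
    standard deviation (Cohen's d).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n_subjects)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def subjects_needed(effect_size: float, target_power: float = 0.80,
                    alpha: float = 0.05, max_subjects: int = 10_000):
    """Smallest group size that reaches the target power, or None if not reached."""
    for n in range(2, max_subjects + 1):
        if approx_power(effect_size, n, alpha) >= target_power:
            return n
    return None


if __name__ == "__main__":
    for n in (10, 20, 30, 40):
        print(n, round(approx_power(0.5, n), 2))
    print(subjects_needed(0.5))
```

A real fMRI power analysis also has to account for within-subject variance, the number of blocks or trials, and the multiple comparisons threshold, which is why dedicated tools are worth using.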

The structure of the stimulus presentation has a strong influence on an experiment’s statistical power. The dynamic interplay between stimulus presentation and inter-stimulus jitter is important, as is knowing what contrasts will be completed once the data has been acquired. Each of these parameters can influence the power and efficiency of the experiment, later impacting the reliability of the results. Block designs tend to have greater power relative to event-related designs. One can also increase power by increasing block length, but care should be exercised not to make blocks so long that they approach the low frequencies associated with scanner drift. There are several good software tools available that will help researchers create an optimal design for fMRI experiments. OptSeq is a program that helps to maximize the efficiency of an event-related fMRI design (Dale, 1999). OptimizeDesign is a set of Matlab scripts that utilize a genetic search algorithm to maximize specific aspects of the design (Wager and Nichols, 2003). Researchers can separately weight statistical power, HRF estimation efficiency, stimulus counterbalancing, and maintenance of stimulus frequency. These two programs, and others like them, are valuable tools for ensuring that the ability to detect meaningful signals is effectively maximized.

It is important to state that the reliability of a study in no way implies that an experiment has accurately assessed a specific cognitive process. The validity of a study can be quite orthogonal to its reliability – it is possible to have very reliable results from a task that mean little with regard to the cognitive process under investigation. No increase in SNR or optimization of event timing can hope to improve an experiment that is testing for the wrong thing. This makes task selection of paramount importance in the planning of an experiment. It also places a burden on the researcher in terms of effective interpretation of fMRI results once the analysis is done.

Where does neuroimaging go next?

In many ways cognitive neuroscience is still at the beginning of fMRI as a research tool. Looking back on the last two decades it is clear that functional MRI has made enormous gains in both statistical methodology and popularity. However, there is still much work to do. With regard to reliability, there are some specific next steps that must be taken for the continued improvement of this method.

Better Characterization of the Factors that Influence Reliability. Additional research is necessary to effectively understand what factors influence the reliability of fMRI results. The field has a good grasp of the acquisition and analysis factors that influence SNR. Still, there is relatively little knowledge regarding how stable individuals are over time and what influences that stability. Large-scale studies specifically investigating reliability and reproducibility should therefore be conducted across several cognitive domains. The end goal of this research would be to better characterize the reliability of fMRI across multiple dimensions of influence within a homogeneous set of data. Such a study would also create greater awareness of fMRI reliability in the field as a whole. The direct comparison of reliability analysis methods, including predictive modeling, should also be completed.

Meta/Mega Analysis. The increased pooling of data from across multiple studies can give a more generalized view of important cognitive processes. One method, meta-analysis, refers to pooling the statistical results of numerous studies to identify those results that are concordant and discordant with others. For example, one could obtain the MNI coordinates of significant clusters from several studies having to do with response inhibition and plot them in the same stereotaxic space to determine their concordance. One popular method of performing such an analysis is the creation of an Activation Likelihood Estimate, or ALE (Eickhoff et al., 2009; Turkeltaub et al., 2002). This method allows for the statistical thresholding of meta-analysis results, making it a powerful tool to examine the findings of many studies at once. Another method, mega-analysis, refers to reprocessing the raw data from numerous studies in a new statistical analysis with much greater power. Using this approach any systematic error introduced by any one study will contribute far less to the final statistical result (Costafreda, in press). Mega-analyses are far more difficult to implement since the raw imaging data from multiple studies must be obtained and reprocessed. Still, the increase in detection power and the greater generalizability of the results are strong reasons to engage in such an approach.

One roadblock to collaborative multi-center studies is the lack of data provenance in functional neuroimaging. Provenance refers to complete detail regarding the origin of a dataset and the history of operations that have been performed on the data. Having a complete history of the data enables analysis by other researchers and provides information that is critical for replication studies (Mackenzie-Graham et al., 2008). Moving forward there will be an additional focus on provenance to enable increased understanding of individual studies and facilitate integration into larger analyses.

New Emphasis on Replication. The non-independence debate of 2009 was less about effect sizes and more about reproducibility. The implicit argument made about studies that were ‘non-independent’ was that if researchers ran a non-independent study over again the resulting correlation would be far lower with a new, independent dataset. There should be a greater emphasis on the replicability of studies in the future. This can be frustrating because it is expensive and time consuming to acquire and process a replication study. However, moving forward this may become increasingly important to validate important results and conclusions.

General Conclusions

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate. Still, over time some measure of truth will accrue. This chapter is not intended to be an accusation against fMRI as a method. Quite the contrary, it is meant to increase the understanding of how much each fMRI result can contribute to scientific knowledge. If only 30% of the significant voxels in a cluster will replicate then that value represents an important piece of contextual information to be aware of. Likewise, if the magnitude of a voxel is only reliable at a level of ICC = 0.50 then that value represents important information when examining scatter plots comparing estimates of activity against a behavioral measure.

There are a variety of methods that can be used to evaluate reliability, and each can provide information on unique aspects of the results. Our findings speak strongly to the question of why there is no agreed-upon average value for fMRI reliability. There are so many factors spread out across so many levels of influence that it is almost impossible to summarize the reliability of fMRI with a single value. While our average ICC value of 0.50 and our average overlap value of 30% are effective summaries of fMRI as a whole, these values may be higher or lower on a study-to-study basis. The best characterization of fMRI reliability would be to give a window within which fMRI results are typically reliable. Breaking up the range of 0.0 to 1.0 into thirds, it is appropriate to say that most fMRI results are reliable in the ICC = 0.33 to 0.66 range.

To conclude, functional neuroimaging with fMRI is no longer in its infancy. Instead it has reached a point of adolescence, where knowledge and methods have made enormous progress but there is still much development left to be done. Our growing pains from this point forward are going to be a more complete understanding of its strengths, weaknesses, and limitations. A working knowledge of fMRI reliability is key to this understanding. The reliability of fMRI may not be high relative to other scientific measures, but it is presently the best tool available for the in vivo investigation of brain function.

– References –

Andersson, J.L., Hutton, C., Ashburner, J., Turner, R., Friston, K., 2001. Modeling geometric deformations in EPI time series. Neuroimage 13, 903-919.

Aron, A.R., Gluck, M.A., Poldrack, R.A., 2006. Long-term test-retest reliability of functional MRI in a classification learning task. Neuroimage 29, 1000-1006.

Bandettini, P.A., Wong, E.C., Jesmanowicz, A., Hinks, R.S., Hyde, J.S., 1994. Spin-echo and gradient-echo EPI of human brain activation using BOLD contrast: a comparative study at 1.5 T. NMR Biomed 7, 12-20.

Bartko, J., 1966. The intraclass correlation coefficient as a measure of reliability. Psychological Reports 19, 3-11.

Bennett, C.M., Guerin, S.A., Miller, M.B., 2009. The impact of experimental design on the detection of individual variability in fMRI. Cognitive Neuroscience Society, San Francisco, CA.

Bennett, C.M., Wolford, G.L., Miller, M.B., in press. The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience.

Bodurka, J., Ye, F., Petridou, N., Bandettini, P.A., 2005. Determination of the brain tissue-specific temporal signal to noise limit of 3 T BOLD-weighted time course data. Proc. Intl. Soc. Mag. Reson. Med., Miami.

Bosnell, R., Wegner, C., Kincses, Z.T., Korteweg, T., Agosta, F., Ciccarelli, O., De Stefano, N., Gass, A., Hirsch, J., Johansen-Berg, H., Kappos, L., Barkhof, F., Mancini, L., Manfredonia, F., Marino, S., Miller, D.H., Montalban, X., Palace, J., Rocca, M., Enzinger, C., Ropele, S., Rovira, A., Smith, S., Thompson, A., Thornton, J., Yousry, T., Whitcher, B., Filippi, M., Matthews, P.M., 2008. Reproducibility of fMRI in the clinical setting: implications for trial designs. Neuroimage 42, 603-610.

Caceres, A., Hall, D.L., Zelaya, F.O., Williams, S.C., Mehta, M.A., 2009. Measuring fMRI reliability with the intra-class correlation coefficient. Neuroimage 45, 758-768.

Carrier, J., Monk, T.H., 2000. Circadian rhythms of performance: new trends. Chronobiol Int 17, 719-732.

Casey, B.J., Cohen, J.D., O’Craven, K., Davidson, R.J., Irwin, W., Nelson, C.A., Noll, D.C., Hu, X., Lowe, M.J., Rosen, B.R., Truwitt, C.L., Turski, P.A., 1998. Reproducibility of fMRI results across four institutions using a spatial working memory task. Neuroimage 8, 249-261.

Chen, E.E., Small, S.L., 2007. Test-retest reliability in fMRI of language: group and task effects. Brain Lang 102, 176-185.

Chiarelli, P.A., Bulte, D.P., Wise, R., Gallichan, D., Jezzard, P., 2007. A calibration method for quantitative BOLD fMRI based on hyperoxia. Neuroimage 37, 808-820.

Cicchetti, D., Sparrow, S., 1981. Developing criteria for establishing interrater reliability of specific items: Applications to assessment of adaptive behavior. Am J Ment Defic 86, 127-137.

Clement, F., Belleville, S., 2009. Test-retest reliability of fMRI verbal episodic memory paradigms in healthy older adults and in persons with mild cognitive impairment. Hum Brain Mapp.

Cohen, J., 1977. Statistical power analysis for the behavioral sciences, revised ed. Academic Press, New York, NY.

Cohen, M.S., DuBois, R.M., 1999. Stability, repeatability, and the expression of signal magnitude in functional magnetic resonance imaging. J Magn Reson Imaging 10, 33-40.

Costafreda, S.G., in press. Pooling fMRI data: meta-analysis, mega-analysis and multi-center studies. Frontiers in Neuroinformatics.

Costafreda, S.G., Brammer, M.J., Vencio, R.Z., Mourao, M.L., Portela, L.A., de Castro, C.C., Giampietro, V.P., Amaro, E., Jr., 2007. Multisite fMRI reproducibility of a motor task using identical MR systems. J Magn Reson Imaging 26, 1122-1126.

Dale, A., 1999. Optimal Experimental Design for Event-Related fMRI. Human Brain Mapping 8, 109-114.

Di Bonaventura, C., Vaudano, A.E., Carni, M., Pantano, P., Nucciarelli, V., Garreffa, G., Maraviglia, B., Prencipe, M., Bozzao, L., Manfredi, M., Giallonardo, A.T., 2005. Long-term reproducibility of fMRI activation in epilepsy patients with Fixation Off Sensitivity. Epilepsia 46, 1149-1151.

Duncan, K.J., Pattamadilok, C., Knierim, I., Devlin, J.T., 2009. Consistency and variability in functional localisers. Neuroimage 46, 1018-1026.

Eaton, K.P., Szaflarski, J.P., Altaye, M., Ball, A.L., Kissela, B.M., Banks, C., Holland, S.K., 2008. Reliability of fMRI for studies of language in post-stroke aphasia subjects. Neuroimage 41, 311-322.

Eickhoff, S.B., Laird, A.R., Grefkes, C., Wang, L.E., Zilles, K., Fox, P.T., 2009. Coordinate-based activation likelihood estimation meta-analysis of neuroimaging data: a random-effects approach based on empirical estimates of spatial uncertainty. Hum Brain Mapp 30, 2907-2926.

Feredoes, E., Postle, B.R., 2007. Localization of load sensitivity of working memory storage: quantitatively and qualitatively discrepant results yielded by single-subject and group-averaged approaches to fMRI group analysis. Neuroimage 35, 881-903.

Fernandez, G., Specht, K., Weis, S., Tendolkar, I., Reuber, M., Fell, J., Klaver, P., Ruhlmann, J., Reul, J., Elger, C.E., 2003. Intrasubject reproducibility of presurgical language lateralization and mapping using fMRI. Neurology 60, 969-975.

Freedman, R., 2003. Schizophrenia. N Engl J Med 349, 1738-1749.

Freyer, T., Valerius, G., Kuelz, A.K., Speck, O., Glauche, V., Hull, M., Voderholzer, U., 2009. Test-retest reliability of event-related functional MRI in a probabilistic reversal learning task. Psychiatry Res.

Friedman, L., Glover, G.H., 2006. Reducing interscanner variability of activation in a multicenter fMRI study: controlling for signal-to-fluctuation-noise-ratio (SFNR) differences. Neuroimage 33, 471-481.

Friedman, L., Stern, H., Brown, G.G., Mathalon, D.H., Turner, J., Glover, G.H., Gollub, R.L., Lauriello, J., Lim, K.O., Cannon, T., Greve, D.N., Bockholt, H.J., Belger, A., Mueller, B., Doty, M.J., He, J., Wells, W., Smyth, P., Pieper, S., Kim, S., Kubicki, M., Vangel, M., Potkin, S.G., 2008. Test-retest and between-site reliability in a multicenter fMRI study. Hum Brain Mapp 29, 958-972.

Gold, S., Christian, B., Arndt, S., Zeien, G., Cizadlo, T., Johnson, D.L., Flaum, M., Andreasen, N.C., 1998. Functional MRI statistical software packages: a comparative analysis. Hum Brain Mapp 6, 73-84.

Gountouna, V.E., Job, D.E., McIntosh, A.M., Moorhead, T.W., Lymer, G.K., Whalley, H.C., Hall, J., Waiter, G.D., Brennan, D., McGonigle, D.J., Ahearn, T.S., Cavanagh, J., Condon, B., Hadley, D.M., Marshall, I., Murray, A.D., Steele, J.D., Wardlaw, J.M., Lawrie, S.M., 2009. Functional Magnetic Resonance Imaging (fMRI) reproducibility and variance components across visits and scanning sites with a finger tapping task. Neuroimage.

Grafton, S., Hazeltine, E., Ivry, R., 1995. Functional mapping of sequence learning in normal humans. Journal of Cognitive Neuroscience 7, 497-510.

Harrington, G.S., Buonocore, M.H., Farias, S.T., 2006a. Intrasubject reproducibility of functional MR imaging activation in language tasks. AJNR Am J Neuroradiol 27, 938-944.

Harrington, G.S., Tomaszewski Farias, S., Buonocore, M.H., Yonelinas, A.P., 2006b. The intersubject and intrasubject reproducibility of FMRI activation during three encoding tasks: implications for clinical applications. Neuroradiology 48, 495-505.

Havel, P., Braun, B., Rau, S., Tonn, J.C., Fesl, G., Bruckmann, H., Ilmberger, J., 2006. Reproducibility of activation in four motor paradigms. An fMRI study. J Neurol 253, 471-476.

Hoenig, K., Kuhl, C.K., Scheef, L., 2005. Functional 3.0-T MR assessment of higher cognitive function: are there advantages over 1.5-T imaging? Radiology 234, 860-868.

Huang, J., Katsuura, T., Shimomura, Y., Iwanaga, K., 2006. Diurnal changes of ERP response to sound stimuli of varying frequency in morning-type and evening-type subjects. J Physiol Anthropol 25, 49-54.

Huettel, S.A., Song, A.W., McCarthy, G., 2008. Functional Magnetic Resonance Imaging, 2nd ed. Sinauer Associates, Sunderland, MA.

Jabbi, M., Keysers, C., Singer, T., Stephan, K.E., 2009. Response to “Voodoo Correlations in Social Neuroscience” by Vul et al.

Jansen, A., Menke, R., Sommer, J., Forster, A.F., Bruchmann, S., Hempleman, J., Weber, B., Knecht, S., 2006. The assessment of hemispheric lateralization in functional MRI – robustness and reproducibility. Neuroimage 33, 204-217.

Jezzard, P., Clare, S., 1999. Sources of distortion in functional MRI data. Hum Brain Mapp 8, 80-85.

Johnstone, T., Somerville, L.H., Alexander, A.L., Oakes, T.R., Davidson, R.J., Kalin, N.H., Whalen, P.J., 2005. Stability of amygdala BOLD response to fearful faces over multiple scan sessions. Neuroimage 25, 1112-1123.

Jovicich, J., Czanner, S., Greve, D., Haley, E., van der Kouwe, A., Gollub, R., Kennedy, D., Schmitt, F., Brown, G., Macfall, J., Fischl, B., Dale, A., 2006. Reliability in multi-site structural MRI studies: effects of gradient non-linearity correction on phantom and human data. Neuroimage 30, 436-443.

Kiebel, S., Holmes, A., 2007. The general linear model. In: Friston, K., Ashburner, J., Kiebel, S., Nichols, T., Penny, W. (Eds.), Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press, London.

Kiehl, K.A., Liddle, P.F., 2003. Reproducibility of the hemodynamic response to auditory oddball stimuli: a six-week test-retest study. Hum Brain Mapp 18, 42-52.

Kimberley, T.J., Khandekar, G., Borich, M., 2008. fMRI reliability in subjects with stroke. Exp Brain Res 186, 183-190.

Kong, J., Gollub, R.L., Webb, J.M., Kong, J.T., Vangel, M.G., Kwong, K., 2007. Test-retest study of fMRI signal change evoked by electroacupuncture stimulation. Neuroimage 34, 1171-1181.

Kruger, G., Glover, G.H., 2001. Physiological noise in oxygenation-sensitive magnetic resonance imaging. Magn Reson Med 46, 631-637.

Leontiev, O., Buxton, R.B., 2007. Reproducibility of BOLD, perfusion, and CMRO2 measurements with calibrated-BOLD fMRI. Neuroimage 35, 175-184.

Lieberman, M.D., Berkman, E.T., Wager, T.D., 2009. Correlations in social neuroscience aren’t voodoo: Commentary on Vul et al. (2009). Perspectives on Psychological Science 4.

Liou, M., Su, H.R., Savostyanov, A.N., Lee, J.D., Aston, J.A., Chuang, C.H., Cheng, P.E., 2009. Beyond p-values: averaged and reproducible evidence in fMRI experiments. Psychophysiology 46, 367-378.

Liu, J.Z., Zhang, L., Brown, R.W., Yue, G.H., 2004. Reproducibility of fMRI at 1.5 T in a strictly controlled motor task. Magn Reson Med 52, 751-760.

Loubinoux, I., Carel, C., Alary, F., Boulanouar, K., Viallard, G., Manelfe, C., Rascol, O., Celsis, P., Chollet, F., 2001. Within-session and between-session reproducibility of cerebral sensorimotor activation: a test–retest effect evidenced with functional magnetic resonance imaging. J Cereb Blood Flow Metab 21, 592-607.

MacDonald, S.W., Nyberg, L., Backman, L., 2006. Intra-individual variability in behavior: links to brain structure, neurotransmission and neuronal activity. Trends Neurosci 29, 474-480.

Machielsen, W.C., Rombouts, S.A., Barkhof, F., Scheltens, P., Witter, M.P., 2000. FMRI of visual encoding: reproducibility of activation. Hum Brain Mapp 9, 156-164.

Mackenzie-Graham, A.J., Van Horn, J.D., Woods, R.P., Crawford, K.L., Toga, A.W., 2008. Provenance in neuroimaging. Neuroimage 42, 178-195.

Magon, S., Basso, G., Farace, P., Ricciardi, G.K., Beltramello, A., Sbarbati, A., 2009. Reproducibility of BOLD signal change induced by breath holding. Neuroimage 45, 702-712.

Maitra, R., 2009. Assessing certainty of activation or inactivation in test-retest fMRI studies. Neuroimage 47, 88-97.

Maitra, R., Roys, S.R., Gullapalli, R.P., 2002. Test-retest reliability estimation of functional MRI data. Magn Reson Med 48, 62-70.

Maldjian, J.A., Laurienti, P.J., Driskill, L., Burdette, J.H., 2002. Multiple reproducibility indices for evaluation of cognitive functional MR imaging paradigms. AJNR Am J Neuroradiol 23, 1030-1037.

Manoach, D.S., Halpern, E.F., Kramer, T.S., Chang, Y., Goff, D.C., Rauch, S.L., Kennedy, D.N., Gollub, R.L., 2001. Test-retest reliability of a functional MRI working memory paradigm in normal and schizophrenic subjects. Am J Psychiatry 158, 955-958.

Marshall, I., Simonotto, E., Deary, I.J., Maclullich, A., Ebmeier, K.P., Rose, E.J., Wardlaw, J.M., Goddard, N., Chappell, F.M., 2004. Repeatability of motor and working-memory tasks in healthy older volunteers: assessment at functional MR imaging. Radiology 233, 868-877.

Mayer, A.R., Xu, J., Pare-Blagoev, J., Posse, S., 2006. Reproducibility of activation in Broca’s area during covert generation of single words at high field: a single trial FMRI study at 4 T. Neuroimage 32, 129-137.

McGonigle, D.J., Howseman, A.M., Athwal, B.S., Friston, K.J., Frackowiak, R.S., Holmes, A.P., 2000. Variability in fMRI: an examination of intersession differences. Neuroimage 11, 708-734.

Meindl, T., Teipel, S., Elmouden, R., Mueller, S., Koch, W., Dietrich, O., Coates, U., Reiser, M., Glaser, C., 2009. Test-retest reproducibility of the default-mode network in healthy individuals. Hum Brain Mapp.

Miki, A., Liu, G.T., Englander, S.A., Raz, J., van Erp, T.G., Modestino, E.J., Liu, C.J., Haselgrove, J.C., 2001. Reproducibility of visual activation during checkerboard stimulation in functional magnetic resonance imaging at 4 Tesla. Jpn J Ophthalmol 45, 151-155.

Miki, A., Raz, J., van Erp, T.G., Liu, C.S., Haselgrove, J.C., Liu, G.T., 2000. Reproducibility of visual activation in functional MR imaging and effects of postprocessing. AJNR Am J Neuroradiol 21, 910-915.

Mikl, M., Marecek, R., Hlustik, P., Pavlicova, M., Drastich, A., Chlebus, P., Brazdil, M., Krupa, P., 2008. Effects of spatial smoothing on fMRI group inferences. Magn Reson Imaging 26, 490-503.

Miller, M.B., Donovan, C.L., Van Horn, J.D., German, E., Sokol-Hessner, P., Wolford, G.L., 2009. Unique and persistent individual patterns of brain activity across different memory retrieval tasks. Neuroimage 48, 625-635.

Miller, M.B., Handy, T.C., Cutler, J., Inati, S., Wolford, G.L., 2001. Brain activations associated with shifts in response criterion on a recognition test. Can J Exp Psychol 55, 162-173.

Miller, M.B., Van Horn, J.D., Wolford, G.L., Handy, T.C., Valsangkar-Smyth, M., Inati, S., Grafton, S., Gazzaniga, M.S., 2002. Extensive individual differences in brain activations associated with episodic retrieval are reliable over time. J Cogn Neurosci 14, 1200-1214.

Morgan, V.L., Dawant, B.M., Li, Y., Pickens, D.R., 2007. Comparison of fMRI statistical software packages and strategies for analysis of images containing random and stimulus-correlated motion. Comput Med Imaging Graph 31, 436-446.

Morrison, P.D., Murray, R.M., 2005. Schizophrenia. Curr Biol 15, R980-984.

Moser, E., Teichtmeister, C., Diemling, M., 1996. Reproducibility and postprocessing of gradient-echo functional MRI to improve localization of brain activity in the human visual cortex. Magn Reson Imaging 14, 567-579.

Muller, R., Buttner, P., 1994. A critical discussion of intraclass correlation coefficients. Stat Med 13, 2465-2476.

Mumford, J.A., Nichols, T., 2009. Simple group fMRI modeling and inference. Neuroimage 47, 1469-1475.

Mumford, J.A., Nichols, T.E., 2008. Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. Neuroimage 39, 261-268.

Mumford, J.A., Poldrack, R.A., Nichols, T., 2007. FMRIpower: A Power Calculation Tool for 2-Stage fMRI models. Human Brain Mapping, Chicago, IL.

Munneke, J., Heslenfeld, D.J., Theeuwes, J., 2008. Directing attention to a location in space results in retinotopic activation in primary visual cortex. Brain Res 1222, 184-191.

Murphy, K., Bodurka, J., Bandettini, P.A., 2007. How long to scan? The relationship between fMRI temporal signal to noise ratio and necessary scan duration. Neuroimage 34, 565-574.

Neumann, J., Lohmann, G., Zysset, S., von Cramon, D.Y., 2003. Within-subject variability of BOLD response dynamics. Neuroimage 19, 784-796.

Nichols, T., Brett, M., Andersson, J., Wager, T., Poline, J.B., 2005. Valid conjunction inference with the minimum statistic. Neuroimage 25, 653-660.

Nunnally, J., 1970. Introduction to psychological measurement. McGraw Hill, New York.

Oakes, T.R., Johnstone, T., Ores Walsh, K.S., Greischar, L.L., Alexander, A.L., Fox, A.S., Davidson, R.J., 2005. Comparison of fMRI motion correction software tools. Neuroimage 28, 529-543.

Ogawa, S., Menon, R.S., Tank, D.W., Kim, S.G., Merkle, H., Ellermann, J.M., Ugurbil, K., 1993. Functional brain mapping by blood oxygenation level-dependent contrast magnetic resonance imaging. A comparison of signal characteristics with a biophysical model. Biophys J 64, 803-812.

Peelen, M.V., Downing, P.E., 2005. Within-subject reproducibility of category-specific visual activation with functional MRI. Hum Brain Mapp 25, 402-408.

Peyron, R., Garcia-Larrea, L., Gregoire, M.C., Costes, N., Convers, P., Lavenne, F., Mauguiere, F., Michel, D., Laurent, B., 1999. Haemodynamic brain responses to acute pain in humans: sensory and attentional networks. Brain 122 (Pt 9), 1765-1780.

Phan, K.L., Liberzon, I., Welsh, R.C., Britton, J.C., Taylor, S.F., 2003. Habituation of rostral anterior cingulate cortex to repeated emotionally salient pictures. Neuropsychopharmacology 28, 1344-1350.

Poldrack, R.A., Prabhakaran, V., Seger, C.A., Gabrieli, J.D., 1999. Striatal activation during acquisition of a cognitive skill. Neuropsychology 13, 564-574.

Poline, J.B., Strother, S.C., Dehaene-Lambertz, G., Egan, G.F., Lancaster, J.L., 2006. Motivation and synthesis of the FIAC experiment: Reproducibility of fMRI results across expert analyses. Hum Brain Mapp 27, 351-359.

Raemaekers, M., Vink, M., Zandbelt, B., van Wezel, R.J., Kahn, R.S., Ramsey, N.F., 2007. Test-retest reliability of fMRI activation during prosaccades and antisaccades. Neuroimage 36, 532-542.

Ramsey, N., Tallent, K., van Gelderen, P., Frank, J., Moonen, C., Weinberger, D., 1996. Reproducibility of human 3D fMRI brain maps acquired during a motor task. Hum Brain Mapp 4, 113-121.

Rau, S., Fesl, G., Bruhns, P., Havel, P., Braun, B., Tonn, J.C., Ilmberger, J., 2007. Reproducibility of activations in Broca area with two language tasks: a functional MR imaging study. AJNR Am J Neuroradiol 28, 1346-1353.

Rombouts, S.A., Barkhof, F., Hoogenraad, F.G., Sprenger, M., Scheltens, P., 1998. Within-subject reproducibility of visual activation patterns with functional magnetic resonance imaging using multislice echo planar imaging. Magn Reson Imaging 16, 105-113.

Rombouts, S.A., Barkhof, F., Hoogenraad, F.G., Sprenger, M., Valk, J., Scheltens, P., 1997. Test-retest analysis with functional MR of the activated area in the human visual cortex. AJNR Am J Neuroradiol 18, 1317-1322.

Rostami, M., Hosseini, S.M., Takahashi, M., Sugiura, M., Kawashima, R., 2009. Neural bases of goal-directed implicit learning. Neuroimage 48, 303-310.

Rutten, G.J., Ramsey, N.F., van Rijen, P.C., van Veelen, C.W., 2002. Reproducibility of fMRI-determined language lateralization in individual subjects. Brain Lang 80, 421-437.

Safrit, M., 1976. Reliability theory. American Alliance for Health, Physical Education, and Recreation, Washington, DC.

Salli, E., Korvenoja, A., Visa, A., Katila, T., Aronen, H.J., 2001. Reproducibility of fMRI: effect of the use of contextual information. Neuroimage 13, 459-471.

Salthouse, T.A., Nesselroade, J.R., Berish, D.E., 2006. Short-term variability in cognitive performance and the calibration of longitudinal change. J Gerontol B Psychol Sci Soc Sci 61, P144-151.

Schunck, T., Erb, G., Mathis, A., Jacob, N., Gilles, C., Namer, I.J., Meier, D., Luthringer, R., 2008. Test-retest reliability of a functional MRI anticipatory anxiety paradigm in healthy volunteers. J Magn Reson Imaging 27, 459-468.

Shehzad, Z., Kelly, A.M., Reiss, P.T., Gee, D.G., Gotimer, K., Uddin, L.Q., Lee, S.H., Margulies, D.S., Roy, A.K., Biswal, B.B., Petkova, E., Castellanos, F.X., Milham, M.P., 2009. The resting brain: unconstrained yet reliable. Cereb Cortex 19, 2209-2229.

Shrout, P., Fleiss, J., 1979. Intraclass Correlations: Uses in Assessing Rater Reliability. Psychological Bulletin 86, 420-428.

Simmons, W.K., Reddish, M., Bellgowan, P.S., Martin, A., 2009. The Selectivity and Functional Connectivity of the Anterior Temporal Lobes. Cereb Cortex.

Smith, A.T., Singh, K.D., Balsters, J.H., 2007. A comment on the severity of the effects of non-white noise in fMRI time-series. Neuroimage 36, 282-288.

Smith, S.M., Beckmann, C.F., Ramnani, N., Woolrich, M.W., Bannister, P.R., Jenkinson, M., Matthews, P.M., McGonigle, D.J., 2005. Variability in fMRI: a re-examination of inter-session differences. Hum Brain Mapp 24, 248-257.

Specht, K., Willmes, K., Shah, N.J., Jancke, L., 2003. Assessment of reliability in functional imaging studies. J Magn Reson Imaging 17, 463-471.

Stark, R., Schienle, A., Walter, B., Kirsch, P., Blecker, C., Ott, U., Schafer, A., Sammer, G., Zimmermann, M., Vaitl, D., 2004. Hemodynamic effects of negative emotional pictures – a test-retest analysis. Neuropsychobiology 50, 108-118.

Sterr, A., Shen, S., Zaman, A., Roberts, N., Szameitat, A., 2007. Activation of SI is modulated by attention: a random effects fMRI study using mechanical stimuli. Neuroreport 18, 607-611.

Strother, S., La Conte, S., Kai Hansen, L., Anderson, J., Zhang, J., Pulapura, S., Rottenberg, D., 2004. Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis. Neuroimage 23 Suppl 1, S196-207.

Strother, S.C., Anderson, J., Hansen, L.K., Kjems, U., Kustra, R., Sidtis, J., Frutiger, S., Muley, S., LaConte, S., Rottenberg, D., 2002. The quantitative evaluation of functional neuroimaging experiments: the NPAIRS data analysis framework. Neuroimage 15, 747-771.

Swallow, K.M., Braver, T.S., Snyder, A.Z., Speer, N.K., Zacks, J.M., 2003. Reliability of functional localization using fMRI. Neuroimage 20, 1561-1577.

Symms, M.R., Allen, P.J., Woermann, F.G., Polizzi, G., Krakow, K., Barker, G.J., Fish, D.R., Duncan, J.S., 1999. Reproducible localization of interictal epileptiform discharges using EEG-triggered fMRI. Phys Med Biol 44, N161-168.

Tegeler, C., Strother, S.C., Anderson, J.R., Kim, S.G., 1999. Reproducibility of BOLD-based functional MRI obtained at 4 T. Hum Brain Mapp 7, 267-283.

Thomason, M.E., Foland, L.C., Glover, G.H., 2007. Calibration of BOLD fMRI using breath holding reduces group variance during a cognitive task. Hum Brain Mapp 28, 59-68.

Triantafyllou, C., Hoge, R.D., Krueger, G., Wiggins, C.J., Potthast, A., Wiggins, G.C., Wald, L.L., 2005. Comparison of physiological noise at 1.5 T, 3 T and 7 T and optimization of fMRI acquisition parameters. Neuroimage 26, 243-250.

Turkeltaub, P.E., Eden, G.F., Jones, K.M., Zeffiro, T.A., 2002. Meta-Analysis of the Functional Neuroanatomy of Single-Word Reading: Method and Validation. Neuroimage 16, 765-780.

Turner, R., Jezzard, P., Wen, H., Kwong, K.K., Le Bihan, D., Zeffiro, T., Balaban, R.S., 1993. Functional mapping of the human visual cortex at 4 and 1.5 tesla using deoxygenation contrast EPI. Magn Reson Med 29, 277-279.

Van Horn, J.D., Ellmore, T.M., Esposito, G., Berman, K.F., 1998. Mapping voxel-based statistical power on parametric images. Neuroimage 7, 97-107.

Van Horn, J.D., Toga, A.W., 2009. Multisite neuroimaging trials. Curr Opin Neurol 22, 370-378.

Vul, E., Harris, C., Winkielman, P., Pashler, H., 2009. Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science 4.

Wager, T.D., Nichols, T., 2003. Optimization of experimental design in fMRI: a general framework using a genetic algorithm. Neuroimage 18, 293-309.

Wagner, K., Frings, L., Quiske, A., Unterrainer, J., Schwarzwald, R., Spreer, J., Halsband, U., Schulze-Bonhage, A., 2005. The reliability of fMRI activations in the medial temporal lobes in a verbal episodic memory task. Neuroimage 28, 122-131.

Waites, A.B., Shaw, M.E., Briellmann, R.S., Labate, A., Abbott, D.F., Jackson, G.D., 2005. How reliable are fMRI-EEG studies of epilepsy? A nonparametric approach to analysis validation and optimization. Neuroimage 24, 192-199.

Waldvogel, D., van Gelderen, P., Immisch, I., Pfeiffer, C., Hallett, M., 2000. The variability of serial fMRI data: correlation between a visual and a motor task. Neuroreport 11, 3843-3847.

Wei, X., Yoo, S.S., Dickey, C.C., Zou, K.H., Guttmann, C.R., Panych, L.P., 2004. Functional MRI of auditory verbal working memory: long-term reproducibility analysis. Neuroimage 21, 1000-1008.

Whalley, H.C., Gountouna, V.E., Hall, J., McIntosh, A.M., Simonotto, E., Job, D.E., Owens, D.G., Johnstone, E.C., Lawrie, S.M., 2009. fMRI changes over time and reproducibility in unmedicated subjects at high genetic risk of schizophrenia. Psychol Med 39, 1189-1199.

White, T., O’Leary, D., Magnotta, V., Arndt, S., Flaum, M., Andreasen, N.C., 2001. Anatomic and functional variability: the effects of filter size in group fMRI data analysis. Neuroimage 13, 577-588.

Woolrich, M.W., Ripley, B.D., Brady, M., Smith, S.M., 2001. Temporal autocorrelation in univariate linear modeling of FMRI data. Neuroimage 14, 1370-1386.

Yetkin, F.Z., McAuliffe, T.L., Cox, R., Haughton, V.M., 1996. Test-retest precision of functional MR in sensory and motor task activation. AJNR Am J Neuroradiol 17, 95-98.

Yoo, S.S., O’Leary, H.M., Lee, J.H., Chen, N.K., Panych, L.P., Jolesz, F.A., 2007. Reproducibility of trial-based functional MRI on motor imagery. Int J Neurosci 117, 215-227.

Yoo, S.S., Wei, X., Dickey, C.C., Guttmann, C.R., Panych, L.P., 2005. Long-term reproducibility analysis of fMRI using hand motor task. Int J Neurosci 115, 55-77.

Zandbelt, B.B., Gladwin, T.E., Raemaekers, M., van Buuren, M., Neggers, S.F., Kahn, R.S., Ramsey, N.F., Vink, M., 2008. Within-subject variation in BOLD-fMRI signal changes across repeated measurements: quantification and implications for sample size. Neuroimage 42, 196-206.

Zhang, J., Anderson, J.R., Liang, L., Pulapura, S.K., Gatewood, L., Rottenberg, D.A., Strother, S.C., 2009. Evaluation and optimization of fMRI single-subject processing pipelines with NPAIRS and second-level CVA. Magn Reson Imaging 27, 264-278.

Zhang, J., Liang, L., Anderson, J.R., Gatewood, L., Rottenberg, D.A., Strother, S.C., 2008. A Java-based fMRI processing pipeline evaluation system for assessment of univariate general linear model and multivariate canonical variate analysis-based pipelines. Neuroinformatics 6, 123-134.

Zhilkin, P., Alexander, M.E., 2004. Affine registration: a comparison of several programs. Magn Reson Imaging 22, 55-66.

Zou, K.H., Greve, D.N., Wang, M., Pieper, S.D., Warfield, S.K., White, N.S., Manandhar, S., Brown, G.G., Vangel, M.G., Kikinis, R., Wells, W.M., 3rd, 2005. Reproducibility of functional MR imaging: preliminary results of prospective multi-institutional study performed by Biomedical Informatics Research Network. Radiology 237, 781-789.

Quote of the Week – Pashler

prefrontal — Wed, 06 Jan 2010 02:38:48 +0000

“It’s hellishly complicated, this data analysis, and that creates great opportunity for inadvertent mischief.” – Hal Pashler (As seen in Science News)