Tuesday, May 21, 2013

He said, she said, then they said...



Conflicting studies can make life tough. A good systematic review could sort it out. It might be possible for the studies to be pooled into a meta-analysis. That can show you the spread of individual study results and what they add up to, at the same time.

But what about when systematic reviews disagree? When the "he said, she said" of conflicting studies goes meta, it can be even more confusing. New layers of disagreement get piled onto the layers from the original research. Yikes! This post is going to be tough going...

A group of us defined this discordance among reviews as: the review authors disagree about whether or not there is an effect, or the direction of effect differs between reviews. A difference in direction of effect can mean one review gives a "thumbs up" and another a "thumbs down."

Some people are surprised that this happens. But it's inevitable. Sometimes you need to read several systematic reviews to get your head around a body of evidence. Different groups of people approach even the same question in different but equally legitimate ways. And there are lots of different judgment calls people can make along the way. Those decisions can change the results the systematic review will get.

When and how reviewers searched for studies - and what types of studies and subjects they were looking for - mean that it's not at all unusual for groups of reviewers to be looking at different sets of studies for much the same question.

After all that, different groups of people can interpret evidence differently. They often make different judgments about the quality of a study or part of one - and that could dramatically affect its value and meaning to them.

It's a little like watching a game of football where there are several teams on the field at once. Some of the players are on all the teams, but some are playing for only one or two. Each team has goal posts in slightly different places - and each team isn't necessarily playing by the same rules. And there's no umpire.

Here's an example of how you can end up with controversy and people taking different positions even when there's a systematic review. The area of some disagreement in this subset of reviews is about psychological intervention after trauma to prevent post-traumatic stress disorder (PTSD) or other problems:

Published in 2002; Published in 2005; Published in 2005; Published in 2010; Published in 2012; Published in 2013.

The conclusions range from saying debriefing has a large benefit to saying there is no evidence of benefit and it seems to cause some PTSD. Most of the others, but not all, fall somewhere in between, leaning to "we can't really be sure". Most are based only on randomized trials, but one has none, and one has a mixture of study types.

The authors are sometimes big independent national or international agencies. A couple of others include authors of the studies they are reviewing. The definition of trauma isn't the same - they may or may not include childbirth, for example. The interventions aren't the same.

The quality of evidence is very low. And the biggest discordance - whether or not there is evidence of harm - hinges mostly on how much weight you put on one trial.

It's about debriefing. The debriefing group is much bigger than the control group because they stopped the trial early, and while it's complicated, that can be a source of bias.

The people in the debriefing group were at quite a lot higher risk of PTSD in the first place. Data for more than 20% of the people randomized is missing - and that biases the results too (it's called attrition bias). You can't be sure those people didn't return because they were depressed, for example. If so, that could change the results. 

It's no wonder there's still a controversy here.

If you want to read more about debriefing, here's my post in Scientific American: Dissecting the controversy about early psychological response to disasters and trauma.

Thursday, May 9, 2013

They just Google THAT?!


I admit I needed Google to quickly find out that the category for bunny-shaped clouds is "zoomorphic". And I think Google is wonderful - and so does Tess. But...

There's just been another study published about the latest generation of doctors and their information and searching habits. Like Tess' friend, they rely pretty heavily on Googling. We could all be over-estimating, though, just how good people are at finding things with Google - including the biomedically trained.

Many of us assume that the "Google generation" or "digital natives" are as good at finding information as they are at using technology. A review in 2008 came to the conclusion that this was "a dangerous myth" and those things don't go hand in hand. It may not have gotten any better since then either.

Information literacy is about knowing when you need information, and knowing how to find and evaluate it. Google leads us to information that the crowd is basically endorsing. If the crowd has poor information literacy in health, then that can reinforce the problem.

This is an added complication for health consumers. While there's an increasing expectation that healthcare system decisions and clinical decisions be based on rigorous assessments of evidence, that's not really trickling down very fast. Patient information is generally still pretty old school.

What would it mean for patient information to be really evidence-based? I believe it includes using methods to minimize bias in finding and evaluating research to base the information on, and using evidence-based communication. Those ideas are gaining ground, for example in standards in England and Germany, and this evaluation by WHO Europe of one group of us putting these concepts into practice.

Missing critical information that can shift the picture is one of the most common ways that reviews of research can get it wrong. For systematic reviews of evidence, searching for information well is a critical and complex task.

This brings us to why Tess' talents, passions and chosen career are so important. We need health information specialists and librarians to link us with good information in many ways.

This week at the excellent annual meeting of the Medical Library Association in Boston (think lots of wonderful Tess'es!), there was a poster by Whitney Townsend and her colleagues at the Taubman Health Sciences Library (University of Michigan). Their assessment of 368 systematic reviews suggests that even systematic reviewers need help searching.

Google's great, but it doesn't mean we don't still need to "go to the library."


(Disclosure: I work in a library these days - the world's largest medical one, at the National Institutes of Health (NIH). If this has put you in the mood for honing your searching skills, there are some tips for searching PubMed Health here.)


Tuesday, April 23, 2013

Women and children overboard



It's the Catch-22 of clinical trials: to protect pregnant women and children from the risks of untested drugs....we don't test drugs adequately for them.

In the last few decades, we've been more concerned about the harms of research than of inadequately tested treatments for everyone, in fact. But for "vulnerable populations," like pregnant women and children, the default was to exclude them.

And just in case any women might be, or might become, pregnant, it was often easier just to exclude us all from trials.

It got so bad that, by the late 1990s, the FDA realized regulations and more for pregnant women - and women generally - had to change. The NIH (National Institutes of Health) took action too. And so few drugs had enough safety and efficacy information for children that, even in official circles, children were being called "therapeutic orphans." Action began on that, too.

There is still a long way to go. But this month there was a sign that maybe times really are changing. The FDA approved Diclegis for nausea and vomiting in pregnancy. It's a new formulation of the key ingredients of Bendectin, the only other drug ever approved for that purpose in the USA. Nothing else has been shown to work.

Thirty years ago, the manufacturer withdrew Bendectin from the market because it was too expensive to keep defending it in the courts. It's a gripping story, involving the media, activists, junk science and some fraud. It had a major influence on clinical research, public opinion and more. You can read more about it in my guest blog at Scientific American, Catch-22, clinical trial edition: the double bind for women and children.

In dozens of court cases over Bendectin, judges and juries struggled with competing testimony about scientific evidence. In one hearing, a judge offered the unusual option of a "blue ribbon jury" or a "blue, blue ribbon jury": selecting only people who would be qualified to understand the complex testimony and issues of causation. The plaintiffs refused.

Ultimately, in one of the Bendectin cases, Daubert versus Merrell Dow Pharmaceuticals, the Supreme Court re-defined the rules around scientific evidence for US courts. The previous Frye Rule called for consensus. The 1972 Federal Rules of Evidence said "all relevant evidence is admissible."

The new Daubert standard determined that evidence must be "reliable" - grounded in "the methods and procedures of science" - not just relevant.

We still need everyone involved to better understand what reliable scientific evidence on clinical effects really means, though. You can read more about that here at Statistically Funny.


Tuesday, April 9, 2013

Look, Ma - straight A's!



Unfortunately, little Suzy isn't the only one falling for the temptation to dismiss or explain away inconvenient performance data. Healthcare is riddled with this, as people pick and choose studies that are easy to find or that prove their points.

In fact, most reviews of healthcare evidence don't go through the painstaking processes needed to systematically minimize bias and show a fair picture. You can read more about how it's done thoroughly in this explanation of systematic reviews at PubMed Health.

A fully systematic review very specifically lays out a question and how it's going to be answered. Then the researchers stick to that study plan, no matter how welcome or unwelcome the results. They go to great lengths to find the studies that have looked at their question, and they analyze the quality and meaning of what they find.

The researchers might do a meta-analysis - a statistical technique to combine the results of studies (explained here at Statistically Funny). But you can have a systematic review without a meta-analysis - and you can do a meta-analysis of a group of studies without doing a systematic review.
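
For the curious, here's a rough sketch of one common way results get combined: inverse-variance, fixed-effect pooling, where more precise studies count for more. The three "trials" and their numbers are made up purely for illustration, and a real meta-analysis involves far more judgment than this (including whether pooling is sensible at all).

```python
import math

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance fixed-effect pooling of study estimates."""
    # Each study is weighted by 1 / variance, so precise studies count more.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# Hypothetical log risk ratios and standard errors for three invented trials.
log_risk_ratios = [-0.30, -0.10, -0.25]
standard_errors = [0.20, 0.10, 0.15]

pooled, pooled_se, (low, high) = pool_fixed_effect(log_risk_ratios, standard_errors)
print(f"pooled risk ratio: {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(low):.2f} to {math.exp(high):.2f})")
```

The pooled result lands between the individual study results, and its standard error is smaller than any single study's. That's the "what they add up to" part.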

To help make it easier for people to sift out the fully systematic from the less thorough reviews, a group of us, led by Elaine Beller, have just published guidelines for abstracts of systematic reviews. It's part of the PRISMA Statement initiative to improve reporting of systematic reviews.

A quick way to find systematic reviews is the National Library of Medicine's PubMed Health. It's a one-stop shop of systematic reviews, information based on systematic reviews and key resources to help you understand clinical effectiveness research. You can read more about PubMed Health here.

Do systematic reviews entirely solve the problem Julie saw with those school grades? Unfortunately, not always. Many trials aren't even published at all, and no amount of searching or digging can get to them. This happens even when the trial has good news, but it happens more often with disappointing results. The "fails" can be very well-hidden. Yes, it's as bad as it sounds: Ben Goldacre explains the problem and its consequences here.

You can help by signing up to the All Trials campaign - please do, and encourage everyone you know to do it too. Healthcare interventions simply won't all be able to have reliable report cards until the trials are not just done, but easy to get at.


Interest declaration: I'm the editor of PubMed Health and on the editorial advisory board of PLOS Medicine.


Sunday, April 7, 2013

Don't worry ... it's just a standard deviation


Of course, every time Cynthia and Gregory make the 8-block downtown trip to the Stinsons, it's going to take a different amount of time, depending on traffic and so on - even if it only varies by a minute or two.

Most of the time, the trip to the Stinsons' apartment would take between 10 minutes (in the middle of the night) and 45 minutes (in peak hour). Giving a range like that is called a confidence interval (explained here).

So what's a standard deviation and what does it tell you? Well, it's not a comment on Gregory's behavior! Deviance as a term for abnormal behavior is an invention of the 1940s and '50s. Standard deviation (or SD) is a statistical term first used in 1894 by one of the key figures in modern statistics, Karl Pearson.

The standard deviation shows how far results are from the mean (or average). The standard deviation will be bigger when the numbers are more spread out, and smaller when there's not a huge amount of difference.

Lots of results will cluster within 1 standard deviation of the mean, and most will be within 2 standard deviations. Roughly like this:




When results follow a bell-shaped (normal) distribution, about 95% of them are going to be within 2 standard deviations in either direction from the mean. You can read about how 95% (or 0.05) came to have this significance here. Statistical significance is explained here at Statistically Funny.
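
If you'd like to see that in numbers, here's a small sketch. It invents 10,000 bell-shaped trip times (the mean of 25 minutes and spread of 5 minutes are assumptions, not real data) and counts how many fall within 1 and 2 standard deviations of the mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical trip times in minutes: bell-shaped around 25 with a spread of 5.
trip_times = [random.gauss(25, 5) for _ in range(10_000)]

mean = statistics.mean(trip_times)
sd = statistics.stdev(trip_times)  # sample standard deviation

def share_within(values, center, spread, k):
    """Fraction of values lying within k spreads of the center."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)

within_1sd = share_within(trip_times, mean, sd, 1)
within_2sd = share_within(trip_times, mean, sd, 2)
print(f"mean={mean:.1f} minutes, sd={sd:.1f} minutes")
print(f"within 1 SD: {within_1sd:.0%}, within 2 SD: {within_2sd:.0%}")
```

The shares only come out near the textbook two-thirds and 95% because the made-up data are bell-shaped. Lopsided data won't behave so neatly.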



Monday, March 25, 2013

Every move you make....Are you watching you?


Monitoring....There's something about getting something into numbers and targets that just makes it seem to be so controllable, isn't there? And many people - including many doctors - just love gadgets and measuring things. No wonder there is so much monitoring in health and fitness.

Actually, there's too much monitoring in some health matters. Some monitoring could cause anxiety without benefit, or lead to actions that do more harm than good.

Professor Paul Glasziou, author of Evidence-Based Monitoring, talked about this on Monday at Evidence Live. For monitoring to be effective there has to be:
  • valid and accurate measurement,
  • informed interpretation, and
  • effective action that can be taken on the results.  
Then there has to be an effective monitoring regimen.

None of that is simple. Frequent testing can mean you end up acting on random variations, not real changes in health. There's more at Statistically Funny about when statistical significance can mislead and the statistical risks of multiple testing.
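
Here's a back-of-the-envelope sketch of that first problem. It assumes someone whose true level isn't changing at all (5.0), a test with some random wobble (a standard deviation of 0.3), and an "action" threshold of 5.5. All of those numbers are invented, and the sketch assumes each reading wobbles independently of the others.

```python
from statistics import NormalDist

def chance_of_false_alarm(true_value, noise_sd, threshold, num_tests):
    """Chance that at least one reading crosses the threshold by noise alone."""
    reading = NormalDist(mu=true_value, sigma=noise_sd)
    p_single = 1.0 - reading.cdf(threshold)  # one reading lands above the line
    return 1.0 - (1.0 - p_single) ** num_tests

# Hypothetical values: stable true level 5.0, measurement noise SD 0.3,
# and a threshold of 5.5 that would trigger some action.
once_a_year = chance_of_false_alarm(5.0, 0.3, 5.5, num_tests=1)
monthly = chance_of_false_alarm(5.0, 0.3, 5.5, num_tests=12)
print(f"one test a year: {once_a_year:.0%} chance of a false alarm")
print(f"twelve tests a year: {monthly:.0%} chance of at least one")
```

Nothing about the person's health differs between the two scenarios. Only the number of looks does.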

Self-monitoring can be a path to freedom and better health in some circumstances - if you use insulin or an anticoagulant like warfarin, for instance. But constant monitoring of everything you can measure is a whole other kettle of fish. You can read more about this, monitoring apps and 'the quantified self' in my guest blog at Scientific American: 'Every breath you take, Every move you make...' How much monitoring is too much?

Saturday, March 9, 2013

Nervously approaching significance


We're deluged with claims that we should do this, that or the other thing because some study has a "statistically significant" result. But don't let this particular use of the word "significant" trip you up: when it's paired with "statistically", it doesn't mean it's necessarily important.

Statistical significance is reached when a "p" value is less than 5% (shown by <0.05 or a specific number such as p=0.001). That is critical mathematically, because it means the result would be unlikely to turn up by coincidence alone. If there were really no relationship, a result at least this strong would be a fluke less than 5% of the time (0.05 or 5 out of 100).

In the example p=0.001, the evidence is even stronger: if there were really no relationship, a result like this would come up by chance only 1 time in 1000. You can read more about statistical significance and being certain in Data Bingo! Oh no! here at Statistically Funny.

A statistically valid association, though, is not necessarily significant in the sense of "important". A sliver of a difference could reach statistical significance if a study is big enough. For example, if one group of people sleeps a tiny bit longer on average a night than another group of people, that could be statistically significant. But it wouldn't be enough for one group of people to feel more rested than the other.
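
To put some numbers on the sleep example, here's a sketch using a simple two-group z-test. The figures are assumptions for illustration: a 2-minute average difference in sleep, with a standard deviation of 60 minutes in both groups. The only thing that changes between the two runs is the size of the study.

```python
import math
from statistics import NormalDist

def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value from a z-test comparing two equal-sized groups."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical: groups differ by 2 minutes of sleep, SD of 60 minutes in each.
p_small_study = two_sided_p(mean_diff=2, sd=60, n_per_group=100)
p_huge_study = two_sided_p(mean_diff=2, sd=60, n_per_group=50_000)

print(f"100 people per group:    p = {p_small_study:.2f}")
print(f"50,000 people per group: p = {p_huge_study:.1e}")
```

Same 2 minutes, very different p-values. The difference didn't get any more important. The study just got bigger.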

This is why people will often say something was statistically significant, but clinically unimportant, or not clinically significant. Clinical significance is a value judgment, often implying a difference that would change the decision that a clinician or patient would make. Others speak of a minimal clinically important difference (MCID or MID). That can mean they are talking about the minimum difference a patient could detect - but there is a lot of confusion around these terms.

Researchers and medical journals are more likely to trumpet "statistically significant" trial results to get attention from doctors and journalists, for example. Those medical journal articles are a key part of marketing pharmaceuticals, too. Selling copies of articles to drug companies is a major part of the business of many (but not all) medical journals. 

And while I'm on the subject of medical journals, I need to declare my own relationship with one I've long admired: PLOS Medicine - an international open access journal. As well as being proud to have published there, I'm delighted to have recently joined their Editorial Board.

Friday, February 22, 2013

Well, he would say that, wouldn't he?



When a big medical conference is on, we're saturated with coverage of the presentations. This makes them a critical target for marketing and a major contributor to unrealistic expectations about what health care - and health research - can really do.

To their great credit, professional societies are developing policies to ensure disclosure of financial interests by presenters. After all, financial interests pose an obvious risk of bias to research about the effects of health care - and to what the research means for clinical practice and education. Disclosures are increasing at conferences, but compliance can be a problem.

Disclosure is made especially hard when so many people make a mockery of disclosing their potential conflicts - maybe by not pointing it out when the commercial source of their salary has been passed through a filter first - or by ignoring major hospitality from a company while at the same time declaring trinkets. The disclosure of indirect financial interests is a particular problem.

A conference just wound up here in Washington DC discussing these among other issues in Selling Sickness. It'll be addressed again this September in Dartmouth at Preventing Over-Diagnosis.

And then there's our side of it as an audience: we need to get better at interpreting people's disclosures and knowing when our antenna should really be out - without throwing the baby out with the bathwater. That's not easy either! I wrote a reader's guide to disclosure of interests in medical journals a while ago, if you're interested in this aspect of bias. It's called "They would say that, wouldn't they?"


* The origins of "He would say that, wouldn't he?"

Sunday, February 17, 2013

You will meet too much false precision


Precise numbers and claims - as though there is no margin for error - are all around us. When someone tells you that 54.3% of people with some disease will have a particular outcome, they're basically predicting the future of all groups of people based on what happened to another group of people in the past. Well, what are the chances of that, eh? 

If our fortune teller was quoting the result of a study here, it could be written like this: 67.5% (95% CI: 62%-73%). The CI stands for "confidence interval" and it's an indication of how certain the estimate is. It's showing us that 95 times out of 100, similar groups of people in similar circumstances would experience this result somewhere between 62% and 73% of the time.

The chances of the result always being precisely 67.5% can be pretty slim or very high, depending on lots of things. If there is enough data to be really sure, the confidence interval will be narrow: the best case scenario and the worst case scenario will be close together (say, 66% to 69%).
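
Here's a sketch of how the amount of data drives the width of the interval. It uses the simple normal-approximation formula for a proportion (one of several ways to calculate a confidence interval, and not necessarily the one a given study would use). The study sizes are made up: 189 out of 280 people, and 2,700 out of 4,000, both of which work out to 67.5%.

```python
import math

def proportion_ci(events, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = events / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical studies with the same 67.5% result but very different sizes.
small_low, small_high = proportion_ci(events=189, n=280)
large_low, large_high = proportion_ci(events=2700, n=4000)

print(f"280 people:   67.5% (95% CI: {small_low:.0%}-{small_high:.0%})")
print(f"4,000 people: 67.5% (95% CI: {large_low:.0%}-{large_high:.0%})")
```

With these invented numbers, the smaller study gives an interval of roughly 62% to 73%, and the larger one tightens it to roughly 66% to 69%.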

We do this all the time. If someone asks, "How long does it take to get to your house?", we don't say "39.35 minutes". We say, "Usually about half an hour to 45 minutes, depending on the traffic."

In a systematic review, you will often see an outcome of an individual study shown as a line. The length of that line is showing you the length of the confidence interval around the result. It looks something like this:


This is called a forest plot. Find more from Statistically Funny on this in The Forest Plot Trilogy.


Wednesday, January 30, 2013

Data Bingo! Oh no!



Oh boy - look what a data hunter has dragged in this time! Why is this problem so common? And who on earth is Bonferroni?

Our friend here found one "statistically significant" result when he looked at goodness knows how many differences between groups of people. He's fallen totally for a statistical illusion that's a hazard of 'multiple testing'. And a lot of headline writers and readers will fall for it, too.

Then he's made it worse by taking his unproven hypothesis (that a particular drink on a particular day in a particular group of people prevented stroke) and whacking on another unproven hypothesis (that if everyone else drinks lots of it, benefits will ensue). But it's the problem of multiple testing (also called multiplicity) where Bonferroni comes in.

It's pretty much inevitable that multiple testing will churn out some wrong answers. That's something the Italian mathematician Carlo Bonferroni (1892-1960) figured out how to analyze.

A "statistically significant" difference between groups of people means that, if there were really no difference, a result at least this big would turn up by chance less than 5 times out of 100. So it's unlikely to be just a fluke. Put another way, there's less than a 5/100 or 5% probability of getting a result like that from chance alone (a "p" value of less than 0.05).

If you test for multiple possibilities, you need to expect even your statistically significant "findings" to be wrong on average 5 times out of 100 (or 1 in 20 findings). If you test only a few things, your chances of this kind of random error are low.

But especially if you have a big dataset, the more things you look at, the higher the chance is that you'll drag total flukes out. With high-powered computers crunching big data, this becomes a big problem - large numbers of spurious findings that can't be replicated.
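
A quick sketch shows how fast this adds up. It assumes the tests are independent of each other and that nothing real is going on in any of them, which is a simplification. The function name is just for illustration.

```python
def chance_of_any_fluke(num_tests, alpha=0.05):
    """Chance of at least one 'significant' result when no effect is real.

    Assumes the tests are independent and each uses the same cut-off.
    """
    return 1.0 - (1.0 - alpha) ** num_tests

for num_tests in (1, 5, 20, 100):
    print(f"{num_tests:>3} tests: "
          f"{chance_of_any_fluke(num_tests):.0%} chance of at least one fluke")
```

By 20 tests, you're more likely than not to have dragged out at least one fluke.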

Bonferroni's name graces some statistical tests used to interpret results when doing multiple tests. There are others. Some are concerned that techniques based on Bonferroni are too conservative - too likely to throw the baby out with the bathwater, if you like. So they use tests that have a different basis, such as the False Discovery Rate (FDR).
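
Here's a minimal sketch of the two approaches side by side: the basic Bonferroni correction (divide the cut-off by the number of tests) and the Benjamini-Hochberg procedure, which is one well-known way of controlling the False Discovery Rate. The six p-values are invented, and this is an illustration rather than a substitute for a proper statistics package.

```python
def bonferroni(p_values, alpha=0.05):
    """Keep a result only if its p-value beats alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_passing_rank = rank
    keep = [False] * m
    # Everything ranked at or below the largest passing rank is kept.
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            keep[idx] = True
    return keep

# Hypothetical p-values from six comparisons in one dataset.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

kept_bonferroni = bonferroni(p_values)
kept_fdr = benjamini_hochberg(p_values)
print("Bonferroni keeps:", sum(kept_bonferroni))
print("Benjamini-Hochberg keeps:", sum(kept_fdr))
```

On these made-up numbers, Bonferroni lets fewer results through than the FDR approach does, which is the "too conservative" worry in action. Note that p=0.041, "significant" on its own, survives neither.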

Statistical tests can't totally eliminate the chance of random error, though. So you usually need more than just a single possibly random test result to be sure about something.

If you're interested in how to communicate statistics accurately and well, check out Session 2G at Science Online this week: Evelyn Lamb and I are co-moderating. Follow on Twitter with #PublicStats (#Scio13).

Getting more technical...

What about multiplicity issues in systematic reviews? As the Cochrane Handbook (section 16.7.2) points out, systematic reviews concentrate on estimating pre-specified effects - not searching for possible effects. Safeguards still matter, though. Even pre-specified analyses need to be kept to a minimum. And how many analyses were done needs to be kept in mind when interpreting results.

If you would like to read more technical information about multiple testing, here are some free slides from the University of Washington. And if you want to read more about the controversies and issues, here's a primer in Nature and an article in the Journal of Clinical Epidemiology (behind paywalls).