
Sorry, wrong number: Statistical benchmark comes under fire

Published: Monday | November 18, 2019 | 12:15 AM

NEW YORK (AP):

Earlier this fall Dr Scott Solomon presented the results of a huge heart drug study to an audience of fellow cardiologists in Paris.

The results Solomon was describing looked promising: Patients who took the medication had a lower rate of hospitalization and death than patients on a different drug.

Then he showed his audience another number.

“There were some gasps, or ‘Ooohs,’” Solomon, of Harvard’s Brigham and Women’s Hospital, recalled recently. “A lot of people were disappointed.”

One investment analyst reacted by reducing his forecast for peak sales of the drug — by $1 billion.

What happened?

The number that caused the gasps was 0.059. The audience was looking for something under 0.05.

What it meant was that Solomon’s promising results had run afoul of a statistical concept you may never have heard of: statistical significance. It’s an all-or-nothing thing. Your statistical results are either significant, meaning they are reliable, or not significant, indicating an unacceptably high chance that they were just a fluke.

The concept has been used for decades. It holds a lot of sway over how scientific results are appraised, which studies get published, and what medicines make it to drugstores.

But this year has brought two high-profile calls from critics, including from inside the arcane world of statistics, to get rid of it — in part out of concern that it prematurely dismisses results like Solomon’s.

Significance is reflected in a calculation that produces something called a p-value. Usually, if the p-value is less than 0.05, the study findings are considered significant. If not, the study has failed the test.

Solomon’s study just missed. So the apparent edge his drug was showing over the other medication was deemed insignificant. By this criterion there was no “real” difference.

Solomon believes the drug in fact produced a real benefit and that a larger or longer-lasting study could have reached statistical significance.
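To make the cutoff concrete, here is a minimal sketch in Python of one common way a p-value can be calculated when comparing event rates in two groups of patients (a pooled two-proportion z-test). The function name and the patient counts are hypothetical and chosen only for illustration. They are not figures from Solomon's study, and the study's own analysis may well have used a different method.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference in event rates (pooled z-test)."""
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    # Pooled rate under the assumption that both groups share one true rate.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

ALPHA = 0.05  # the conventional significance cutoff

# Hypothetical counts (not from the study): 18.0% vs. 21.3% event rates.
p_small = two_proportion_p_value(180, 1000, 213, 1000)

# Same event rates, but with twice as many patients in each group.
p_large = two_proportion_p_value(360, 2000, 426, 2000)

for label, p in [("smaller study", p_small), ("larger study", p_large)]:
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{label}: p = {p:.3f} ({verdict})")
```

With these made-up numbers, the smaller study lands just above 0.05 and the larger one falls below it, even though the gap between the two event rates is identical. That is the sense in which a larger study with the same apparent benefit could cross the line.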

“I’m not crying over spilled milk,” he said. “We do set the rules. The question is, is that the right way to go about it?”

He’s not alone in asking that question.

“It is a safe bet that people have suffered or died because scientists (and editors, regulators, journalists and others) have used significance tests to interpret results,” epidemiologist Kenneth Rothman of RTI Health Solutions in Research Triangle Park, N.C., and Boston University wrote in 2016.

The danger is both that a potentially beneficial medical finding can be ignored because a study doesn’t reach statistical significance, and that a harmful or fruitless medical practice can be accepted simply because it does, he said in an email.

The p-value cutoff for significance is “a measure that has gained gatekeeper status ... not only for publication but for people to take your results seriously,” says Northwestern University statistician Blake McShane.

It’s no wonder that a statistician, at a recent talk to journalists about the issue just before Halloween, displayed a slide of a jack-o’-lantern carved with this sight, obviously terrifying to anyone in science or medicine: “P = .06.”

McShane and others argue that the importance of the p-value threshold is undeserved. He co-authored a call to abolish the notion of statistical significance, which was published in the prestigious journal Nature this year. The proposal attracted more than 800 co-signers.

Even the American Statistical Association, which had never issued any formal statement on specific statistical practices, came down hard in 2016 on using any kind of p-value cutoff in this way. And this year it went further, declaring in a special issue with 43 papers on the subject, “It is time to stop using the term ‘statistically significant’ entirely.”