Consumers Checkbook What Patients Say About Their Doctors
Your Comments

Is there some reliability threshold that a doctor's scores must meet in order to be reported to the public at all?

CHECKBOOK/CSS has decided not to report to the public a doctor's score on a question unless either (1) the doctor's score on that question is significantly "Better" or "Lower" than average or (2) the doctor's sample size on that question meets a certain threshold. We have used a statistic called "reliability," which can range from 0.0 to 1.0, to set that sample size threshold. Specifically, we do not publicly report a doctor's score on a question when the number of survey responses received for that doctor falls below the number that would yield a reliability of 0.7 if every doctor in the survey had that same number of responses.

Unfortunately, even with the relatively high numbers of responses collected in this survey, there are still many cases where specific questions cannot be reported to the public consistent with our reliability criterion.

For those who are not familiar with the reliability statistic, we will explain the concept briefly here. The reason to use the reliability statistic (although other measures could also be used) is that it indicates how confidently one can distinguish among doctors on a list based on their survey scores. The higher the reliability statistic, the lower the chance that someone comparing doctors' survey scores will conclude that two doctors are different (or not different) when the difference (or lack of difference) is just a result of the "luck of the draw."

The reliability statistic takes into account three characteristics of a set of survey responses for doctors. First, how much variation is there in the ratings that each doctor's patients give that doctor? The more a doctor's patients tend to agree about their doctor (small within-doctor variance), the higher the reliability statistic. Second, how much variation is there from doctor to doctor in the mean rating each doctor gets from his or her patients? The larger the differences from doctor to doctor (large between-doctor variance), the higher the reliability statistic. Third, how large is the number of respondents for each doctor? Larger numbers of responses produce a higher reliability statistic. Where within-doctor variance is low, between-doctor variance is high, and the number of responses is high, the reliability statistic tends to be high, and your confidence in distinguishing (or not distinguishing) among doctors can be relatively high.
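
As an illustration only, the three characteristics above can be combined in one common textbook formulation of reliability: between-doctor variance divided by the sum of between-doctor variance and within-doctor variance over the number of responses. This is a sketch under that assumption, not necessarily the exact calculation used in this survey; the function name `reliability` and its parameter names are hypothetical.

```python
def reliability(between_var: float, within_var: float, n: int) -> float:
    """Reliability of a doctor's mean score if every doctor had n responses.

    Assumed formula (a common formulation, not confirmed as the survey's own):
        between_var / (between_var + within_var / n)
    """
    if n <= 0:
        raise ValueError("n must be a positive number of responses")
    if between_var < 0 or within_var < 0:
        raise ValueError("variances cannot be negative")
    denominator = between_var + within_var / n
    if denominator == 0:
        raise ValueError("reliability is undefined when both variances are zero")
    return between_var / denominator


# Made-up variances, chosen only to show the direction of each effect.
baseline = reliability(between_var=1.0, within_var=9.0, n=20)
more_responses = reliability(between_var=1.0, within_var=9.0, n=40)
more_agreement = reliability(between_var=1.0, within_var=4.0, n=20)
more_spread = reliability(between_var=2.0, within_var=9.0, n=20)

# Each change described in the text raises reliability relative to baseline.
print(baseline, more_responses, more_agreement, more_spread)
```

Under this formulation, more responses, closer agreement among a doctor's patients, and wider differences between doctors each push the statistic toward 1.0, which matches the three effects described above.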

05/07/09
05:40 PM
We have been asked what the minimum sample size for public reporting is.

We will give all physicians their own scores to review confidentially on all questions, regardless of sample size or other statistical properties.

But for many physicians, results will not be publicly reported for a substantial number of questions. In a small number of cases, no results on any question will be reported for a physician.

In Memphis, for example, we will publicly report on at least one question for 430 of the 437 doctors included in the survey, but for some of these 430 physicians, only one or a few questions will be publicly reported.

The decision of whether or not to publicly report survey results on a physician is made on a question-by-question basis and depends on the statistical properties of the question for the specific physician and on the statistical properties of the question for Memphis as a whole.

We will publicly report on any physician on any question if the t-test for that physician on that question indicates at a 95 percent confidence level that that physician's performance is better or worse than the all-physician average in the community. That is a function of (1) the size of the difference between that physician's score and the community average on that question, (2) the variation among the scores given on that question by that physician's patients, and (3) the number of patient ratings for that physician on that question. We did not have a t-test result sufficient for public reporting on any question for any physician where the number of completed surveys for that physician was fewer than 10.
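
To make the three ingredients of the t-test concrete, here is a minimal sketch of a one-sample t-test of one physician's ratings against the community average. It assumes the community average is treated as a fixed number and that the critical value is looked up separately; both are simplifications for illustration, and the function names and sample ratings are hypothetical rather than taken from the survey.

```python
import math
import statistics


def t_statistic(ratings, community_mean):
    """t statistic comparing one physician's mean rating to the community mean.

    Combines (1) the size of the difference, (2) the variation among the
    physician's ratings, and (3) the number of ratings.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two ratings are needed")
    spread = statistics.stdev(ratings)  # sample standard deviation
    if spread == 0:
        raise ValueError("t statistic is undefined when all ratings are equal")
    difference = statistics.mean(ratings) - community_mean
    return difference / (spread / math.sqrt(n))


def differs_from_average(ratings, community_mean, t_critical):
    """True if the physician's mean differs significantly from the average.

    t_critical is the two-sided critical value for len(ratings) - 1 degrees
    of freedom; for 95 percent confidence and 10 ratings it is about 2.262.
    """
    return abs(t_statistic(ratings, community_mean)) > t_critical


# Hypothetical ratings on a 0-10 scale from 10 patients of one physician.
example_ratings = [9, 10, 9, 10, 10, 9, 10, 10, 9, 10]
print(differs_from_average(example_ratings, community_mean=9.0, t_critical=2.262))
print(differs_from_average(example_ratings, community_mean=9.5, t_critical=2.262))
```

With the same ratings, a community average far from the physician's mean passes the test while one close to it does not, which is why a physician with scores near average needs the second, reliability-based standard to qualify.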

If a physician’s score for a measure failed to qualify for public reporting based on t-test results, there was another standard used to determine whether to allow his/her score to be publicly reported. Such a standard is, of course, necessary since many physicians with very large sample sizes and meaningful data should not be excluded from reporting simply because they have scores that are not far from average. This second standard is that the data must achieve a certain level on a "reliability" statistic.

The "reliability" statistic is calculated for each question by taking into account the between-physician variation in physician-average scores, the average within-physician variation in scores, and sample size. For each question, for every possible number of responses, we calculated what the "reliability" statistic of the dataset would be if all physicians had that sample size. This is a conservative approach in terms of what is publicly reported. For each question, we will publicly report results for every physician whose sample size produces a "reliability" statistic of 0.7 or higher in this calculation.
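
The search over "every possible number of responses" might be sketched as follows. This is an illustration under an assumed formula (between-physician variance divided by the sum of between-physician variance and within-physician variance over sample size), which is a common formulation but not confirmed as the exact one used here; the function name, the `max_n` cap, and the variances are hypothetical.

```python
def min_reportable_n(between_var, within_var, target=0.7, max_n=10000):
    """Smallest sample size whose assumed reliability reaches the target.

    Tries n = 1, 2, 3, ... and returns the first n for which
        between_var / (between_var + within_var / n) >= target.
    Returns None if the target is not reached by max_n (for example, when
    physicians do not differ at all, so between_var is zero).
    """
    if between_var <= 0:
        return None
    for n in range(1, max_n + 1):
        rel = between_var / (between_var + within_var / n)
        if rel >= target:
            return n
    return None


# Made-up variances for a single question in a single community.
print(min_reportable_n(between_var=0.05, within_var=0.5))  # prints 24

# Same between-physician spread, but patients of the same physician disagree
# more, so more responses are needed to reach the same reliability.
print(min_reportable_n(between_var=0.05, within_var=1.0))  # prints 47
```

Because the threshold depends only on the variances for the question in a given community, the same question can require different minimum sample sizes in different communities.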

Using this standard, which is the standard that actually comes into play for determining reportability for most physicians on most questions (since most scores are not significantly different from the average), the following are the minimum sample sizes for public reporting for illustrative questions in Memphis: for Q20 (overall rating), the minimum number of patient responses required for public reporting is 22; for Q21 (recommend to family and friends), the minimum is 30; for Q14 (did doctor give easy-to-understand instructions), the minimum is 39; for Q11 (did doctor explain things well), the minimum is 31; for Q23 (courteous and respectful office staff), the minimum is 39. In Denver, the minimums turn out to be substantially higher just because of the statistical properties of the Denver pool of responses.

--rkrughoff at CHECKBOOK/CSS
