Physician profiling methods are envisioned as a means of promoting healthcare quality by recognizing the contributions of individual physicians. Developing methods that can reliably distinguish among physicians' performance is challenging because of small sample sizes, incomplete data, and physician panel differences. In this study, we developed a hierarchical, weighted composite model to reliably compare primary care physicians across domains of care, and we demonstrated its use within a clinical system.
We evaluated 199 primary care physicians from a large integrated healthcare delivery system using 19 quality and two efficiency measures taken from the Healthcare Effectiveness Data and Information Set and existing pay-for-performance programs. Individual measures were calculated, compared to benchmarks, and grouped into two composites: one focused on quality and one on efficiency. Each composite was fitted to the model, assessed for reliability (signal-to-noise ratio), and weighted to create a single summary score for each primary care physician. The quality-of-care composite had a median reliability of .98, with 99.5% of all physician reliability estimates exceeding the .70 threshold. The efficiency composite had a median reliability of .97, with 94.9% of all physician reliability estimates exceeding the .70 threshold.
Our results demonstrate that reliable physician profiling is possible across care domains using a hierarchical composite model based on multiple data sources. The model was used to distribute incentive payouts among primary care physicians but is adaptable to many settings.
Physician profiling has been recognized as potentially valuable in healthcare reform by U.S. policy makers and commercial purchasers, who cope with the world's highest healthcare expenditures (KFF, 2009). Physician profiling requires models or methods for scoring the care that physicians have delivered to their patient panels, usually based on several individual, program-specific quality measures. Health plans use profiling extensively in pay-for-performance (P4P) programs to determine performance targets for contract negotiations and to generate financial incentives at physician and organizational levels. Performance feedback and incentives are intended to guide physicians to deliver high-quality, efficient care (Wodchis, Ross, & Detsky, 2007; Lexa, 2008) and to yield insights about the care delivery system.
When P4P programs are driven by health plans, physician groups may be unaware of specific model variables, such as case-mix adjustments and weighting schemes, which vary by health plan. In addition, healthcare organizations may need to manage the sometimes conflicting reports across health plans (Draper, 2009). While some healthcare organizations have developed their own performance tools, such as balanced scorecards (Curtright, Stolp-Smith, & Edell, 2000; Impagliazzo, Ippolito, & Zoccoli, 2009; Stewart & Greisler, 2002), no standard approach exists to evaluate measures across domains of care, practice functionality, and patient experience that allows for broader profiling of individual clinicians.
Efforts to measure performance at the individual physician level require complete and accurate data and face many challenges. Issues of misclassification and robustness in profiling models can occur when real differences in performance among physicians cannot be detected (Adams, Mehrotra, Thomas, & McGlynn, 2010; Hofer et al., 1999). Reliability measures whether the performance of one physician can be confidently differentiated from another and incorporates the variation between physicians based on their number of quality events, defined as the size of the eligible patient population for a given quality measure. Reliability can be low when there is very little physician variability in a compressed scale (Nyweide, Weeks, Gottlieb, Casalino, & Fisher, 2009). Studies have found low reliabilities associated with profiling based on individual measures in the context of case-mix adjustment (Hong et al., 2010) and costing (Adams et al., 2010). Most health plans can access only their own claims, which may not yield a representative sample of the physician's patient population and may result in insufficient sample sizes to support reliable profiling (Rodriguez et al., 2012). Scholle et al. (2009) suggest pooling physician data across payers with the ability to link patients to physicians.
Profiling methods that aggregate information across multiple measures, such as by taking the mean of the component measures (Parkerton, Smith, Belin, & Feldbau, 2003), have been proposed to address reliability concerns (Scholle et al., 2008; Smith, Sussman, Bernstein, & Hayward, 2013). Scholle et al. (2008) introduced a model that creates a composite measure for each physician using weighted z scores across measures. Because z scores measure performance in terms of standard deviations from the mean, the same measurement scale is, in effect, employed for all measures. Composites may be focused on a particular diagnosis (e.g., diabetes) or a type of care (e.g., preventive) (Lovett & Liang, 2012; Sequist, Schneider, Li, Rogers, & Safran, 2011). Such composites have been shown in some cases to create reliable measures (Smith et al., 2013; van Doorn-Klomberg et al., 2013). In this study, we present a method for creating a reliable physician profiling model by pooling patient-level data across multiple payers and data sources and combining measures into domains of care using Scholle et al.'s (2008) approach to increase sample size and variation across primary care physicians (PCPs). We demonstrate its use at a multiple-clinic healthcare organization, which we call Health Clinic (a fictitious name), at which the scores were used to calculate incentive payments. The model is expandable to include various domain types (e.g., acute, chronic, preventive, patient experience, medical home), sources of data (e.g., electronic health record [EHR] data, Medicaid and Medicare data), and stratification within domains for patient mix or clinical practice characteristics. Healthcare organizations can tailor the model to match organizational characteristics.
Case Study Setting and Data
The model presented here was used to profile PCPs in a managed care network in Massachusetts. The study population included 199 physicians across pediatric primary care, family medicine, and general internal medicine specialties from 68 Massachusetts outpatient clinics, which were part of the integrated Health Clinic system.
We illustrate the use of the model with data from 2008. The case study site, Health Clinic, provided the 2008 measure rates, numerators, and denominators per individual physician for selected measures. Data were pulled from a central claims repository for three health plans, the clinical system's lab results, and billing and scheduling information. Health plan administrative claims and enrollment data across the three payers represented roughly 75% of all paid claims. Additional information included panel size per physician, physician specialty, active employment status per physician, and externally vetted benchmark performance per measure.
To build our model, we used the iterative process shown in Figure 1, which illustrates major steps and highlights important decisions that must be made at each step. These decisions are guided by the desire to create robust and reliable profiling scores within the given organizational context and the limitations of the available data.
The process begins with model specification, which involves defining the purpose for profiling, establishing the target population, and choosing the profiling model or approach. Health Clinic sought to reward PCPs with an incentive payout based on quality and efficiency contributions to the network. The target population was physicians in primary care pediatrics, general internal medicine, and family medicine who contributed to the network rate for 2008 performance. Eligibility criteria required that the physician be in the network for all of 2008 and active at the time of distribution, with a practice panel size greater than 30. Health Clinic decided that physicians with a practice panel size lower than 30 probably had other administrative duties and would have very few quality events per measure. Data from 199 providers (110 internal medicine, 56 family medicine, and 33 pediatric) met the target population and payout criteria.
We used Scholle et al.'s (2008) z-score approach to create a composite profiling score for each physician (as shown in Figure 2). In Health Clinic's application, z scores are calculated for individual measures and physicians. For each measure, a physician rate was determined by dividing the number of patients satisfying the numerator criteria (quality hits) by the total eligible population of patients for that physician (quality events), which was then converted to a standardized z score. We then grouped measures into domains and weighted the z scores relative to the proportion of a physician's events for a particular measure out of the total number of events for that physician across all measures within a domain. The weighted z scores were summed across each domain and standardized to calculate a domain-level z score. The domain-level z scores were then combined using weights given to each domain, with a final standardization applied to determine an overall z score. At Health Clinic, the weights for each domain were chosen to align with organizational policy.
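The hierarchical scoring described above can be sketched in code. This is a minimal illustration of the weighting scheme, not Health Clinic's implementation: the function names are our own, and the sketch assumes every profiled physician has quality events for every measure in the domain.

```python
import statistics

def zscores(values):
    """Standardize a list of values to mean 0, standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def domain_composite(hits, events):
    """Compute a standardized domain-level z score per physician.

    hits[p][m] and events[p][m] hold quality hits and quality events
    for physician p on measure m within one domain.
    """
    n_phys = len(hits)
    n_meas = len(hits[0])
    # Step 1: per-measure rates (hits / events), standardized across physicians.
    rate_z = []
    for m in range(n_meas):
        rates = [hits[p][m] / events[p][m] for p in range(n_phys)]
        rate_z.append(zscores(rates))
    # Step 2: weight each measure's z score by that physician's share of
    # events for the measure out of the physician's total events in the domain.
    weighted = []
    for p in range(n_phys):
        total_events = sum(events[p])
        weighted.append(sum(rate_z[m][p] * events[p][m] / total_events
                            for m in range(n_meas)))
    # Step 3: standardize the weighted sums to yield the domain-level z score.
    return zscores(weighted)
```

Domain-level z scores produced this way can then be combined with policy-chosen domain weights and standardized once more to give the overall score.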
In model fitting, the selection criteria for including measures in the model are defined, and measures for domains are determined. Selection criteria include, for example, whether measures are case-mix adjustable, feasible with available data, part of P4P measurement, and clinically relevant. Potential measures include P4P and internal reward program metrics, medical home survey results, patient experience survey results, and Healthcare Effectiveness Data and Information Set (HEDIS) metrics. Diverse measures across PCP specialties, domains of care, and data sources (e.g., lab data, EHR entries, patient survey results) should be considered to promote fair and accurate comparisons across physicians and to support effective use of the results. Selected measures were then sorted into homogeneous domains (e.g., chronic care, resource utilization, medical home elements).
The final model design for Health Clinic is illustrated in Figure 2; boxes without shading represent the 2008 implementation. Measures were selected based on the availability of data and benchmarks and on whether the measures were part of P4P measurement, publicly reported, externally vetted, evidence based, and clinically relevant. Forty-two quality and two efficiency measures were evaluated initially. Twelve measures were dropped due to small physician coverage, and six more were eliminated due to nonstandard measurement (e.g., well-child visits ages 7-11 and diabetes composites). Twenty-six measures met the initial criteria and were grouped into domains of quality of care and efficiency. To minimize variability and multimodal distributions, a measure was included in a physician's composite score only if the physician had more than four quality events (as suggested by Scholle et al., 2008). With this threshold, three more measures were dropped due to an inadequate number of physicians with sufficient quality events to profile.
Model reliability assessment is the process of evaluating composite scores for reliability, determining minimum sample sizes, and, if necessary, iterating back to model fitting to create a domain structure to meet a minimum reliability of .70, which is considered psychometrically acceptable (Nunnally & Bernstein, 1994).
Reliability is a measure of signal to noise, analogous to the R-squared statistic in regression, reflecting the fit of the observed values to the true values. Following Adams (2009) and Scholle et al. (2008), we calculated reliability by means of the Spearman-Brown formula (Shrout & Fleiss, 1979). The estimate can be interpreted as the percentage of the total variation that can be explained by physician-to-physician variation. High reliability estimates (i.e., those that most closely approach 1) suggest that one can confidently tell physicians apart. Low reliability estimates indicate that it is difficult to determine with confidence whether physician performance can be differentiated given the sample sizes, proportions, and physician-to-physician variation.
The composite error variances per physician were calculated as the sum of the squared weights multiplied by the individual measure average error variance (Adams, 2009). The physician-to-physician variance was determined by hierarchical linear modeling (HLM) for each composite using a PROC MIXED SAS procedure (Adams, 2009; Wakeling, 2004). An SAS input table was created for each measure and included the individual physician composite rates and individual error variances. This type of HLM is called a one-way random effects model or a components-of-variance model (Petruccelli, Nandram, & Chen, 1999). To obtain the physician-to-physician variance, we inputted the weighted z scores and error variances for each physician into the PROC MIXED program (Wakeling, 2004).
The physician-to-physician variance and error variance estimates were then used to calculate reliability estimates for each physician by the Spearman-Brown formula. When the minimum reliability was less than .70, the model-fitting step was repeated by reassessing the domain structure.
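In ratio form, the reliability used here is the physician-to-physician variance divided by the total variance. The following sketch uses our own helper names and substitutes a simple binomial approximation for the measure-level error variance in place of the beta-binomial machinery of the BETABIN macro; it illustrates the calculation rather than reproducing the study's SAS procedure.

```python
def binomial_error_variance(rate, n_events):
    """Approximate sampling (error) variance of an observed rate,
    assuming a binomial model: p * (1 - p) / n."""
    return rate * (1 - rate) / n_events

def composite_error_variance(weights, measure_error_variances):
    """Error variance of a weighted composite: the sum of squared weights
    times each measure's error variance (assumes independent measures)."""
    return sum(w ** 2 * v for w, v in zip(weights, measure_error_variances))

def reliability(physician_variance, error_variance):
    """Signal-to-noise reliability: the share of total variance
    attributable to physician-to-physician differences."""
    return physician_variance / (physician_variance + error_variance)
```

For example, with a physician-to-physician variance of .04 and an error variance of .01, the reliability is .80.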
In measure verification, individual scores are compared to known benchmarks to screen for significant discrepancies due to systematic errors in data collection or measurement. Quality hits or misses can be double-checked by conducting a chart review on a random sample of patients and select measures. New calendar year data can also be compared against previous years' data, taking into account changes in standardized measurement rules such as HEDIS (NCQA, 2007). Measurement results can be benchmarked against externally vetted data, such as those collected by healthcare coalitions that pool data across payers. Discrepancies are inevitable due to differences in timing, data sources, and physician-patient assignments, but allowable deviations should be established and action taken to troubleshoot deviations above the threshold. Measures should be dropped when root causes of deviations cannot be determined; in such cases, the model-fitting and/or model-assessment steps are revisited.
For Health Clinic, the individual measure results were validated against 2007 Massachusetts Health Quality Partners network results and EHR data. Two measures--eye screening for diabetic retinal disease and well-infant visits--were dropped due to differences that could not be resolved. Measure verification was also carried out within each physician specialty and physician clinic type to search for biases indicating inaccurate or incomplete data.
The final list of 21 measures and two domains is shown in Table 1, along with the number of physicians having quality events and the mean number of events, varying by measure. For example, diabetes mellitus measures were represented by more physicians and showed a higher mean number of quality events than did coronary artery disease (CAD) measures.
The last stage of physician profiling is developing an output plan in alignment with the organization's policy and P4P or other performance goals. This step of the model is customizable and should be linked to the model-specification step by the administrating entity. For Health Clinic, the domain scores were combined using weights determined by the governing committee at Health Clinic and implemented for the 2008 rewards program. Payouts were determined from the overall composite score per physician, panel size, and allotted dollars. The payout plan included only the physicians eligible for profiling and excluded all PCPs with a final z score lower than -1. One point was added to each PCP's score, and the sum was multiplied by the PCP's panel size to yield panel points per physician; the additional point allowed all physicians scoring above -1 to receive a positive score for payout calculation. The total of available incentive dollars was then divided by the total panel points across all physicians to yield a dollar per panel point rate. This rate was multiplied by the panel size of each physician to yield individual payout amounts.
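The payout arithmetic just described can be sketched as follows. The function name and the toy inputs are illustrative only; the actual program also applied the eligibility rules established during model specification.

```python
def incentive_payouts(final_z, panel_sizes, pool_dollars):
    """Distribute an incentive pool given final composite z scores.

    Physicians with a final z score below -1 are excluded. Adding 1 to each
    remaining score keeps every eligible physician's panel points positive.
    """
    points = {}
    for phys, z in final_z.items():
        if z < -1:
            continue  # excluded from payout
        points[phys] = (z + 1) * panel_sizes[phys]
    # Dollars per panel point: total pool divided by total panel points.
    dollars_per_point = pool_dollars / sum(points.values())
    return {phys: pts * dollars_per_point for phys, pts in points.items()}
```

With a $1,000 pool, scores of 0.5, -0.5, and -1.5, and panels of 100, 200, and 150, the third physician is excluded and the first two split the pool $600/$400.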
The reliability results for each individual measure used at Health Clinic were calculated using the mean rate and mean quality events across providers and are summarized in Table 1. Reliabilities ranged from .11 to .89, with generic prescribing having the highest reliability, highest average number of quality events, and highest percentage of physicians meeting the minimum sample size required to achieve reliability of .70. Colorectal cancer screening and well-adolescent visit measures also had reliability estimates above .70. Overall, reliability results varied considerably across individual measures, typically improving with the number of quality events. As expected, as the reliability of the measures increased, the percentage of physicians meeting the minimum sample size increased.
Composite reliabilities were estimated for each physician in two domains: quality of care and efficiency. The results are summarized in Table 2. The median reliability for the quality-of-care composite was .98, and the median reliability for the efficiency composite was .97, both exceeding the recommended threshold of .70. The approximate sample size across all measures in the domain needed to achieve a reliability estimate greater than .70 was 17 for the quality-of-care composite and 63 for the efficiency composite. The percentage of total evaluated physicians (199) meeting the minimum reliability threshold was 96% for quality of care and 90% for efficiency; all physicians receiving a payout had reliability estimates greater than .70.
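The minimum sample size needed to reach a target reliability can be derived by rearranging the reliability ratio. The sketch below is our own simplification under a binomial error-variance assumption; the study's estimates came from the fitted HLM variances rather than this closed form.

```python
import math

def min_events_for_reliability(physician_variance, rate, target=0.70):
    """Smallest number of quality events n such that
    physician_variance / (physician_variance + rate*(1-rate)/n) >= target.

    Rearranging the reliability ratio gives
    n >= (target / (1 - target)) * rate * (1 - rate) / physician_variance.
    """
    n = (target / (1 - target)) * rate * (1 - rate) / physician_variance
    return math.ceil(n)
```

For instance, with a physician-to-physician variance of .01 and an average rate of .5, roughly 59 quality events are needed to reach a reliability of .70; a larger physician-to-physician variance of .04 cuts the requirement to about 15.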
The model was implemented for the 2008 rewards program at Health Clinic. The payouts calculated for physicians ranged from $0 to approximately $14,000. About 90% of physicians received payouts of between $0 and $6,000.
The results in Tables 1 and 2 demonstrate the success in using composites to address reliability issues. Our results also confirm the difficulties in using individual measures. To create reliable composite measures, less common and disease-specific measures were included, possibly introducing additional variability and multimodality in the data. We found that the z-score range across physicians per measure was highly influenced by the minimum quality event threshold and the number of physicians profiled. By definition, as the variation in a measure increases, the z-score range becomes compressed (Petruccelli et al., 1999).
A sensitivity analysis was carried out at Health Clinic to understand the impact of undetected errors on payout. In the analysis, when 20 quality hits were missed for the highest weighted measure, a 2-3% change in payout occurred for those physicians with at least average network panel size. The payout method minimized the magnitude of misclassification effects on payout by scoring physicians continuously rather than grouping them into discrete levels for payout.
The physician profiling model in Figure 2 can be expanded to include additional data sources and domains, supporting applications in diverse healthcare settings. For example, the domains in the Health Clinic model could be expanded to include internally assessed medical home survey data, patient experience survey data, and episode treatment group data. Data sources such as EHRs and satisfaction surveys could also inform additional measures, including health status and comfort (Berwick, 2009; Linder, Kaleba, & Kmetik, 2009). The quality-of-care domain could be divided into three domains of care (acute, preventive, and chronic, shown in gray in Figure 2). The hurdle for regrouping is identifying additional measures that support validity, completeness and accuracy of data, and reliability.
Several model limitations are related to the availability of data, which constrains the measures that can be used and the reliability of the profiling tool. First, lack of access and transparency across payers at the individual physician level limits completeness and accuracy; for example, Health Clinic did not have access to claims data from Medicaid and Medicare. Of the original 44 measures identified for possible inclusion in Health Clinic's model, 15 were eventually eliminated due to poor physician coverage and small denominator size.
Second, the model supports action at the domain level because quality hits and events per measure cannot be interpreted reliably at the individual measure level. Such a view of care delivery across domains might address major deficiencies in the overall healthcare process rather than in one measure at a time.
Third, the model scores physicians on the basis of annual measures that rely heavily on claims data, with a typical runout period of 6 months. Because physicians may get their report card as long as 9 months after the performance period ends, improvement from interventions might take 2 years to show in the results, limiting the usefulness of the model in evaluating rapid-cycle improvement trials.
The final data limitation in this study is its focus on reliability rather than an exploration of whether the measures used in the model are valid, that is, true indicators of individual performance or the profiling program objective. Results may be influenced by factors outside a physician's control, such as clinic functionalities and patient panel characteristics, which are significant predictors of quality and efficiency (Hofer et al., 1999; Hong et al., 2010; Krein, Hofer, Kerr, & Hayward, 2002; Solberg, Asche, Pawlson, Scholle, & Shih, 2008). Practice processes may offer more opportunities to improve quality of care than individual measures do, as reflected in the medical home concept (Rogers, 2008; Rosenthal, 2008; Vesely, 2008). After case-mix adjustment, factoring out variability caused by patient demographics, socioeconomic status, and duration of disease, Hofer et al. (1999) found that no more than 4% of total variation was caused by individual physicians.
Other limitations relate to the specific model used at Health Clinic. First, the purpose of profiling at Health Clinic was to reward physicians for their contribution to network performance. However, due to variation in physician quality events per measure, a physician who performed above the network rate could receive a below-average standardized score. Second, physicians were scored on their performance in only one cycle, without recognizing performance improvement over previous years or setting benchmarks for future performance.
Implications for Practice
The case study model was built with the input and approval of the governing body of the profiled physicians. Physicians received individualized reports showing their numerators and denominators per individual measure and composite measure along with system statistics. Physicians embraced this methodology and valued the feedback as a fair assessment of their performance. The previous reward program ranked physicians on a small set of P4P measures and organizational incentives and lacked consistency across the members of the network. Additionally, the previous program did not measure all of the physicians in the managed care network. Health Clinic continued to use the case study profiling methodology in 2010 (with 2009 data) and 2011 (with 2010 data); after that, funding was no longer available to support the rewards program.
In this article, we present a physician profiling method designed to produce reliable results, and we demonstrate its application at Health Clinic. The hierarchical composite model allows a diverse set of measures (i.e., process, outcome, structure, efficiency, and satisfaction) to be used at the individual or group level, thereby enabling broad profiling of individual clinicians. The model enables combining measures with disparate sample sizes, levels of difficulty, and variation, including measures that were individually unreliable for differentiating physician performance. The model is flexible and can accommodate change in future years, is transparent in terms of measurement, aligns with the organizational quality agenda, and is perceived as fair by targeted physician groups, thereby supporting key elements needed for useful physician profiling (Garnick et al., 1994).
We also provide a foundation for additional work. While the resulting model at Health Clinic met design expectations for reliability and purpose, future studies should examine case-mix adjustments and practice functionalities to improve validity. In addition, initiatives to create shared pools of transparent information for physician performance across insurers would provide more complete and accurate data and therefore improve the ability to profile physician performance. Finally, the model could be part of a participative weighting process in which clinics and physicians could choose their own improvement foci, increasing the relevance of the resulting model for participating physicians.
Adams, J. L. (2009). The reliability of provider profiling: A tutorial. Santa Monica, CA: RAND Corporation.
Adams, J. L., Mehrotra, A., Thomas, J. W., & McGlynn, E. A. (2010). Physician cost profiling--Reliability and risk of misclassification. New England Journal of Medicine, 362, 1014-1021.
Berwick, D. M. (2009). Measuring physicians' quality and performance: Adrift on Lake Wobegon. Journal of the American Medical Association, 302, 2485-2486.
Curtright, J. W., Stolp-Smith, S. C., & Edell, E. S. (2000). Strategic performance management: Development of a performance measurement system at the Mayo Clinic. Journal of Healthcare Management, 45(1), 58-68.
Draper, D. A. (2009, June). Physician performance measurement: A key to higher quality and lower cost growth or a lost opportunity? Center for Studying Health System Change. Commentary 3. Retrieved from http://www.hschange.com/CONTENT/1064/1064.pdf
Garnick, D. W., Fowles, J., Lawthers, A. G., Weiner, J. P., Parente, S. T., & Palmer, R. H. (1994). Focus on quality: Profiling physicians' practice patterns. Journal of Ambulatory Care Management, 17, 44-75.
Hofer, T. P., Hayward, R. A., Greenfield, S., Wagner, E. H., Kaplan, S. H., & Manning, W. G. (1999). The unreliability of individual physician "report cards" for assessing the costs and quality of care of a chronic disease. Journal of the American Medical Association, 281, 2098-2105.
Hong, C. S., Atlas, S. J., Chang, Y., Subramanian, S. V., Ashburner, J. M., Barry, M. & Grant, R. W. (2010). Relationship between patient panel characteristics and primary care physician clinical performance rankings. Journal of the American Medical Association, 304, 1107-1113.
Impagliazzo, C., Ippolito, A., & Zoccoli, P. (2009). The balanced scorecard as a strategic management tool: Its application in the regional public health system in Campania. Health Care Management, 28, 44-54.
Kaiser Family Foundation (KFF). (2009, May 19). Side-by-side comparison of major health care reform proposals. Retrieved from http://kff.org/health-reform/issue-brief/side-by-side-comparison-of-major-health-care-reform-proposals/
Krein, S. L., Hofer, T. P., Kerr, E. A., & Hayward, R. A. (2002). Whom should we profile? Examining diabetes care practice variation among primary care providers, provider groups, and health care facilities. Health Services Research, 37, 1159-1180.
Lexa, F. J. (2008). Pay for performance and the revolution in American medical culture. Journal of the American College of Radiology, 5, 168-173.
Linder, J. A., Kaleba, E. O., & Kmetik, K. S. (2009). Using electronic health records to measure physician performance for acute conditions in primary care: Empirical evaluation of the community-acquired pneumonia clinical quality measure set. Medical Care, 47, 208-216.
Lovett, K. M., & Liang, B. A. (2012). Quality care opportunities: Refining physician performance measurement in ambulatory care. American Journal of Managed Care, 18(6), e212-e216.
National Committee for Quality Assurance (NCQA). (2007). HEDIS 2007 technical specifications for physician measurement. Washington, DC: NCQA.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York, NY: McGraw-Hill.
Nyweide, D. J., Weeks, W. B., Gottlieb, D. J., Casalino, L. P., & Fisher, E. S. (2009). Relationship of primary care physicians' patient caseload with measurement of quality and cost performance. Journal of the American Medical Association, 302, 2444-2450.
Parkerton, P. H., Smith, D. G., Belin, T. R., & Feldbau, G. A. (2003). Physician performance assessment: Nonequivalence of primary care measures. Medical Care, 41, 1034-1047.
Petruccelli, I. D., Nandram, B., & Chen, M. (1999). Applied statistics for engineers and scientists. Upper Saddle River, NJ: Prentice Hall.
Rodriguez, H. P., Perry, L., Conrad, D. A., Maynard, C., Martin, D., & Grembowski, D. (2012). The reliability of medical group performance measurement in a single insurer's pay for performance program. Medical Care, 50(2), 117-123.
Rogers, J. C. (2008). The patient-centered medical home movement--Promise and peril for family medicine. Journal of the American Board of Family Practice, 21, 370.
Rosenthal, T. C. (2008). The medical home: Growing evidence to support a new approach to primary care. Journal of the American Board of Family Practice, 21, 427.
Scholle, S. H., Roski, J., Adams, J. L., Dunn, D. L., Kerr, E. A., Dugan, D. P., & Jensen, R. E. (2008). Benchmarking physician performance: Reliability of individual and composite measures. American Journal of Managed Care, 14, 833-838.
Scholle, S. H., Roski, J., Dunn, D. L., Adams, J. L., Dugan, D. P., Pawlson, L. G., & Kerr, E. A. (2009). Availability of data for measuring physician quality performance. American Journal of Managed Care, 15, 67-72.
Sequist, T. D., Schneider, E. C., Li, A., Rogers, W. H., & Safran, D. G. (2011). Reliability of medical group and physician performance measurement in the primary care setting. Medical Care, 49(2), 126-131.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Smith, K. A., Sussman, J. B., Bernstein, S. J., & Hayward, R. A. (2013). Improving the reliability of physician "report cards." Medical Care, 51(3), 266-274.
Solberg, L. I., Asche, S. E., Pawlson, L. G., Scholle, S. H., & Shih, S. C. (2008). Practice systems are associated with high-quality care for diabetes. American Journal of Managed Care, 14, 85-92.
Stewart, L.J., & Greisler, D. (2002). Measuring primary care practice performance within an integrated delivery system: A case study. Journal of Healthcare Management, 47(4), 250-261.
Van Doorn-Klomberg, A. L., Braspenning, J. C. C., Feskens, R. C. W., Bouma, M., Campbell, S. M., & Reeves, D. (2013). Precision of individual and composite performance scores: The ideal number of indicators in an indicator set. Medical Care, 51(1), 115-121.
Vesely, R. (2008). Medical home push. Align payment with quality: AHIP. Modern Healthcare, 38, 15.
Wakeling, I. (2004). BETABIN macro [Data analysis software]. Retrieved from http://www.sensory.org/library/files/SAS/betabin-v22.sas
Wodchis, W. P., Ross, J. S., & Detsky, A. S. (2007). Is P4P really FFS? Journal of the American Medical Association, 298, 1797-1799.
Harry C. Sax, MD, FACHE, FACS, executive vice chair of Surgery, Cedars Sinai Medical Center, Los Angeles, California
"Public reporting is coming."
"If my profile doesn't look good, I'll be excluded from contracts."
"But wait, my patients are sicker and the family has unrealistic expectations."
"This will lead to economic credentialing."
Pelletier et al. posit that physician profiling is envisioned as a means to promote healthcare quality. But to some--perhaps many--physicians, the subtext is more sinister. Any attempt to compare physicians provokes in them a mix of fear, excitement, and competitiveness.
I spent my early academic career at the University of Rochester, and each year, we awaited the publication of institutional and individual cardiac surgery morbidity and mortality rates by the State of New York (see, e.g., New York State Department of Health, 2012). Across the state, reactions to the rates were predictable: Those who performed well leveraged their results in promotional ad copy, and those who did not questioned the data.
Yet behind the posturing, real change took place. Programs that lagged reached out to top performers to learn. Some smaller programs realized they did not have the resources to excel and chose to close. Good surgeons got better, and struggling surgeons either improved or left the state. Over the years, the quality and safety of cardiac surgery in New York improved dramatically.
Profiling physicians in the primary care setting is much more challenging. Hard measures, such as mortality and readmission rates, are replaced with best-practice compliance. Of course, it is easy to measure the percentage of prescriptions that are filled generically or the number of diabetics who receive retinal screening, and these are reliable, measurable endpoints. But do they truly measure quality? Differentiating among individual physicians in terms of guidelines compliance does not speak to the true measures of survival in the new era of healthcare delivery: cost-effectiveness, risk mediation, and patient satisfaction.
Even when measures are in place, physicians' behaviors will not change unless the amount of incentive at risk is significant. In this study, maximum incentive payments were approximately $14,000, or less than 10% of a typical base salary. Given that this is a first pass at a profiling model, it is appropriate to limit downside risk. As the market becomes more competitive, the range of penalties will widen, and accurate assessment will be vital.
At Cedars-Sinai, we have focused profiling efforts on high-volume admitters by developing a cadre of trained physician advisers who meet one-on-one to review fellow physicians' profiles in real time. Concerns regarding attribution and risk adjustment are addressed as they emerge. This increased awareness has led to more accurate documentation of comorbidities. We also have shared performance data among the physicians practicing in one of our centers of excellence to examine variability in resource utilization for a single procedure, such as laminectomy, cholecystectomy, joint replacement, or appendectomy. Doctors are inherently competitive, and seeing where they stand relative to others is a powerful motivator.
Pelletier et al. are to be congratulated for conducting a scientific analysis of the weight and validity of measures. They took into account systematic errors and were able to quantify physician-to-physician variability. It would be interesting to correlate high performers with risk-adjusted inpatient admission rates.
Public reporting is coming--now is our time to partner with our physicians to ensure that we are truly measuring what is important ... and align incentives to drive appropriate behavior.
New York State Department of Health. (2012, October 15). Cardiovascular disease data and statistics. Retrieved from https://www.health.ny.gov/statistics/diseases/cardiovascular/
For more information about the concepts in this article, contact Dr. Pelletier at email@example.com.
Lori R. Pelletier, PhD, associate vice president, UMass Memorial Health Care, and assistant professor, Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, Massachusetts; Sharon A. Johnson, PhD, associate professor, Operations and Industrial Engineering, Worcester Polytechnic Institute School of Business, Worcester; Edward R. Westrick, MD, president of Medical Affairs, medical director, and primary care physician, PACE Organization of Rhode Island, Providence; Elaine R. Fontaine, director, Data Quality and Analytics, Rhode Island Quality Institute, Providence; Alan D. Krinsky, PhD, senior data analyst, Office of Clinical Integration, UMass Memorial Health Care; Robert A. Klugman, MD, vice president, Medical Affairs, Eastern Region, Kindred Healthcare, Andover, Massachusetts; and Arlene S. Ash, PhD, professor, Quantitative Health Sciences, University of Massachusetts Medical School
TABLE 1
Reliability Results for 19 Individual Quality Measures and 2 Efficiency Measures

Columns: (a) total number of physicians with >4 quality events; (b) mean number of quality events across physicians; (c) mean rate; (d) sample size needed for .70 reliability at the mean rate; (e) % of physicians meeting the minimum sample size for .70 reliability.

Domain       Measure                        (a)    (b)    (c)    (d)   (e)
Quality      Well-adolescent visit          207     34    0.73    29    38
of care      Well-child visit                97     24    0.19   233     0
             Pharyngitis                     82      7    0.62    11     0
             Upper respiratory infection    104     11    0.33    51     6
             Chlamydia ages 16-20           142      6    0.23    52     0
             Chlamydia ages 21-24           138      6    0.20    58     0
             Diuretics                      172     16    0.60    26    25
             ACE and ARBs                   179     21    0.68    24    44
             CAD LDL control                104      5    0.11    98     0
             CAD LDL testing                100      7    0.16    78     0
             Diabetes nephropathy           181     14    0.41    46     1
             Diabetes LDL control           163     10    0.22    82     0
             Diabetes LDL testing           181     14    0.26    90     0
             Diabetes A1C < 7 (good)        161     11    0.39    39     3
             Diabetes A1C > 9 (poor)        161     11    0.15   148     0
             Diabetes testing (2/year)      181     14    0.37    54     0
             Cervical CS                    172     68    0.53   138    22
             Breast CS                      173     60    0.54   117    20
             Colorectal CS                  154     70    0.78    45    59
Efficiency   Lower back pain                166      6    0.17    70     0
             Generic prescribing            259  1,111    0.97    77    89
Minimum across measures                      82      5    0.11    11     0
Maximum across measures                     259  1,111    0.97   233    89

TABLE 2
Reliability Results for the Composites

Columns: total physicians; mean quality events across physicians; average physician rate; median reliability across physicians; approximate sample size needed for .70 reliability; % of physicians meeting the minimum sample size for .70 reliability.

                   Total        Mean     Average    Median       n for   % Meeting
Composite          Physicians   Events   Rate       Reliability  .70     Minimum n
Quality of care    199            252    0.72       0.98          17        96
Efficiency         199          1,115    0.77       0.97          63        90
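The "sample size needed for .70 reliability" column in Table 1 follows from the signal-to-noise definition of reliability used in the study: the variance of true performance across physicians divided by that variance plus the sampling noise in each physician's observed rate. The following minimal Python sketch illustrates the arithmetic under a simplifying binomial-error assumption; the article itself fits a beta-binomial model (via the BETABIN macro, Wakeling, 2004), so its exact values differ, and the between-physician variance used below (0.05) is an assumed illustrative figure, not one reported in the paper.

```python
import math

def reliability(sigma2_between, p, n):
    """Signal-to-noise reliability of one physician's observed rate.

    sigma2_between: variance of true rates across physicians (signal)
    p: mean pass rate of the measure
    n: number of eligible events in the physician's panel
    Assumes simple binomial sampling error, p(1-p)/n, as the noise term.
    """
    error_var = p * (1 - p) / n
    return sigma2_between / (sigma2_between + error_var)

def min_events_for(target, sigma2_between, p):
    """Smallest panel size whose reliability reaches the target."""
    return math.ceil((target / (1 - target)) * p * (1 - p) / sigma2_between)
```

With an assumed between-physician variance of 0.05 and the pharyngitis mean rate of 0.62 from Table 1, `min_events_for(0.70, 0.05, 0.62)` returns 11 events, which is why measures with small mean event counts (e.g., 7 pharyngitis events per physician) leave almost no physicians individually measurable, while the pooled composites in Table 2 clear the .70 threshold easily.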