Challenges In Interpreting The Evidence On Medicare Alternative Payment Models

Medicare’s share of gross domestic product, projected to grow by over 50 percent by 2050, is among the federal government’s leading long-term fiscal challenges. Alternatives to fee-for-service reimbursement, such as accountable care organizations (ACOs) and episode-based payments, offer the promise of “bending the trend” through provider-driven reductions in low-value care rather than benefit reductions or fee cuts.

The Affordable Care Act (ACA) attempted to transform payment models through two vehicles. First, the ACA established what is now a multi-track accountable care organization program, the Medicare Shared Savings Program (MSSP). Second, it established the Center for Medicare and Medicaid Innovation (CMMI) at the Centers for Medicare & Medicaid Services (CMS). CMMI was charged with developing, launching, and evaluating new models. A CMMI model cannot qualify for nationwide expansion if it is projected to increase spending.

Assessing whether a model will save money if expanded is not an exact science. However, the statutory criteria for nationwide expansion are very strict: CMS must certify that, if expanded, the model would reduce spending without reducing quality (or improve quality without increasing spending). In practice, the Office of the Actuary at CMS interprets the law as requiring a high bar for certification such that the statutory conditions must be met with near certainty. The Office of the Actuary applies the best available evidence—especially quantitative program evaluations of relevant pre-expansion programs—to what is, by statute, a forward-looking determination.

This approach to certification does not necessarily reflect the inherent uncertainty around projected impact that remains after even the most diligent of impact evaluations. For example, if the goal were to save as much money as possible, programs expected to save money should be expanded. Requiring near-certainty of savings before expanding an initiative implies that some programs expected to save money will not be allowed to diffuse.
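To make the gap between the two decision rules concrete, the following minimal Python sketch compares an expected-value rule with a near-certainty rule. It assumes, purely for illustration, that projected savings follow a normal distribution and that "near certainty" means a 95 percent probability of savings. Neither assumption comes from the statute or from the Office of the Actuary, and the function name and dollar figures are hypothetical.

```python
from statistics import NormalDist


def expansion_decisions(mean_savings, std_error, certainty_threshold=0.95):
    """Compare an expected-value rule with a near-certainty rule.

    mean_savings and std_error describe a hypothetical normal distribution
    of projected savings (positive values mean money saved). The 0.95
    threshold is an illustrative assumption, not a statutory number.
    """
    projected = NormalDist(mu=mean_savings, sigma=std_error)
    prob_saves = 1.0 - projected.cdf(0.0)  # chance that savings exceed zero
    return {
        "prob_saves": prob_saves,
        "expand_if_expected_savings": mean_savings > 0,
        "expand_if_near_certain": prob_saves >= certainty_threshold,
    }


# Hypothetical model: expected savings of 100 with a standard error of 100.
result = expansion_decisions(mean_savings=100.0, std_error=100.0)
print(result)
```

In this hypothetical, the program is expected to save money and has roughly an 84 percent chance of doing so, yet it would not clear the stricter bar.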

Real-world evaluations, however, face several thorny conceptual and practical issues. Because these evaluations are influential in the certification process, policymakers may fail to adopt models that improve over the status quo if these issues are overlooked—particularly given the strict criteria for diffusion.

First, specifying the treatment and treatment group for purposes of evaluation can be challenging. Specifically, alternative payment models (APMs) often include multiple components and features, and key parameters may change over time. This makes it difficult to define the treatment (that is, the model that was evaluated). Moreover, with voluntary participation the treatment group can change, sometimes substantially, over the course of the evaluation.

Second, defining the counterfactual—that is, what would happen in the absence of the alternative payment model—has grown increasingly difficult as APMs have grown more numerous. And third, the optimal measure of success is not obvious. We discuss each of these issues in further detail below.

We argue that while formal program evaluations should be incorporated into formal assessments about whether or not to expand a model, these estimated impacts should be viewed as rough. Furthermore, the criteria to permit model diffusion should be sufficiently flexible to allow programs with expected savings to diffuse.

Specifying The Treatment And Treatment Group

Individual APMs Offer Multiple Components And Options

APMs often offer multiple options for participation. For example, population-based payment models typically let participants select the amount of risk assumed; until recently, Bundled Payment for Care Improvement-Advanced (BPCI-A) participants could select as many or as few episodes as desired. APM effectiveness varies by risk assumed, episodes selected, participation duration, and other program features that vary by track or episode. It would be ideal if evaluators could study every combination of participant characteristics and program parameters. However, statistically appropriate subgroup analyses generally require samples that are larger than the sample sizes needed for program-wide evaluations.
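The sample-size point can be illustrated with a standard minimum detectable effect calculation for a two-group comparison of means. This is a simplified sketch: it assumes equal-sized treatment and comparison groups, a common outcome standard deviation, and independent observations, and it ignores the clustering and difference-in-differences structure of real APM evaluations. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, outcome_sd, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable at the given power.

    Uses the textbook two-sided, two-sample normal approximation with
    equal arm sizes and a shared standard deviation.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return critical * outcome_sd * sqrt(2.0 / n_per_arm)


# Hypothetical: per-episode spending with a standard deviation of $5,000.
full_sample = minimum_detectable_effect(n_per_arm=10000, outcome_sd=5000.0)
subgroup = minimum_detectable_effect(n_per_arm=1000, outcome_sd=5000.0)
print(round(full_sample, 1), round(subgroup, 1))
```

Because the minimum detectable effect scales with the inverse square root of sample size, a subgroup one-tenth the size of the full sample can only detect effects roughly three times as large.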

Conclusions based on the aggregate impact of a program may be useful, but they mask important effects when impact varies by track, episode, or other features. For example, the CMS-commissioned evaluation of BPCI Classic reported increased expenditures overall, but this finding conceals important variation across episodes. Major joint replacement of the lower extremity, the highest-volume episode, appears to have performed well, but the aggregate evaluation is less favorable because these results are diluted by results in other episodes.
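The dilution described above is a matter of volume-weighted averaging, which a short Python sketch can show. The episode names, volumes, and dollar effects below are invented for illustration and are not results from any evaluation.

```python
def aggregate_effect(episodes):
    """Volume-weighted average of per-episode spending effects.

    episodes maps a name to (episode_volume, effect_per_episode_in_dollars).
    Negative effects are savings; positive effects are spending increases.
    """
    total_volume = sum(volume for volume, _ in episodes.values())
    weighted = sum(volume * effect for volume, effect in episodes.values())
    return weighted / total_volume


# Hypothetical figures chosen for illustration only.
episodes = {
    "joint_replacement": (4000, -500.0),  # saves $500 per episode
    "other_episodes": (6000, 600.0),      # costs $600 more per episode
}
overall = aggregate_effect(episodes)
print(overall)  # 160.0
```

Here the aggregate shows a spending increase of $160 per episode even though one episode type saves $500 per episode.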

It is hard to assess whether the episodes that generated savings did so by chance or because the program works better for some conditions. That said, encouraging results from the initial iteration of another orthopedic episode initiative, the Comprehensive Care for Joint Replacement (CJR) model, suggest that the positive effects of episode-based payment for major joint replacement of the lower extremity (MJRLE) may be generalizable.

APMs Vary Over Time

A related issue is the tendency of CMS to change models following launch. Some changes should be expected due to the inherent challenges in operationalizing any broad and complex APM. Nevertheless, substantive changes can make it difficult to draw inferences about program effectiveness.

For example, CMS made several changes to rules governing BPCI-A in January 2021. In particular, target price formulas, which generate the per-episode benchmarks that participants have to beat to share in savings or avoid repayments to CMS, were substantially modified. In addition, CMS began requiring participation in all episodes within a service line to reduce episode-level cherry-picking on the basis of favorable benchmarks. Given these changes, it is hard to draw conclusions about any specific iteration of BPCI-A. One may assess the impact of the combined set of BPCI iterations, but the success (or failure) of the latest iteration is much more difficult to assess when it is layered on past versions.

Voluntary Participation And Changing Participants

CMS often permits different types of organizations to participate in the same initiative (for example, physician-group-only ACOs and hospital-centric ACOs). However, the strength of an initiative’s incentive to save may differ by type of organization. For example, physician-led ACOs have stronger incentives to save than hospital-led ACOs, and in fact have had a greater impact on utilization.

Given this within-model variation, evaluations averaging performance across all model participants may mask effects among different types of organizations and different types of initiative participants. Analyzing program outcomes by type of participant subgroup can help determine whether to continue a program—particularly if the program can be focused on the type of organizations that have succeeded, or changed to improve performance for the types of organizations that haven’t succeeded.

This issue is made more complex by voluntary participation in any given model or “sub-model” within a broader initiative. For example, hospitals with high historical spending (and commensurately high benchmarks) are presumably those with the greatest opportunity for savings. These high-spending hospitals have been disproportionately likely to assume risk in episode-based payment. Evaluations struggle to ascertain a true treatment effect when these hospitals differ from their peers on unobservable characteristics associated with both high spending and program participation, which may often be the case.

Changes in provider participation add further complexity. Participants come and go for reasons that are difficult to ascertain. Participation may reflect expectations of success, or organizational leadership and strategy that induces participation but is less correlated with potential success. Changing program rules may also bring about changes in participation. In many cases, statistical techniques can adjust for non-random participation; however, the greater the number of program features, and the greater the velocity of program entry and exit, the harder it is to adjust for voluntary participation.

Even when evaluators use statistical methods to address voluntary participation, estimated effects pertain only to participants (the “compliers”). It is not clear if findings would generalize were the mix of participants to meaningfully change—for example, if the initiative were made mandatory.

Defining The Counterfactual

Assessing whether a program is “working” hinges on what it is being compared to. The comparison for a given APM is commonly assumed to be traditional Medicare (that is, fee-for-service Medicare). Without explicit attention, however, evaluation users may assume that the intervention is being compared to a pre-ACA world trended forward (that is, a world with little APM activity). In fact, because the MSSP was written into the ACA and is thus considered “current law,” the comparison for CMMI models most relevant for purposes of model certification and expansion is a world with MSSP but no other CMMI models.

In any case, most evaluations use a pre-post comparison group design (difference-in-differences). This raises two main challenges. First, many APM participants have experience participating in precursor programs prior to the initiative under evaluation. Thus, the “pre” period in a difference-in-differences study often includes some degree of related treatment. For example, 30 percent of March 2019 BPCI-A hospital participants previously participated in BPCI Classic. This implies that the measured BPCI-A effect is, in part, relative to BPCI Classic, which may attenuate the estimated impact of BPCI-A.
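The attenuation can be sketched with a stylized difference-in-differences calculation. Only the 30 percent share comes from the text above; the spending levels and effect sizes are hypothetical. The sketch further assumes that the comparison group has no spending trend and that every participant reaches the same full effect under the new program.

```python
def diff_in_diff(treated_pre, treated_post, comparison_pre, comparison_post):
    """Change in the treated group minus change in the comparison group."""
    return (treated_post - treated_pre) - (comparison_post - comparison_pre)


# Hypothetical spending per episode, in dollars.
untreated_level = 20000.0
precursor_effect = -400.0     # savings already achieved under a precursor program
full_effect = -1000.0         # total savings relative to no APM at all
share_with_experience = 0.30  # share of participants with precursor experience

# The "pre" period already reflects precursor savings for some participants.
treated_pre = untreated_level + share_with_experience * precursor_effect
treated_post = untreated_level + full_effect

# Comparison group spending is held flat for simplicity.
estimate = diff_in_diff(treated_pre, treated_post, untreated_level, untreated_level)
print(estimate)
```

Under these assumptions the estimate is about $880 in savings per episode, smaller than the $1,000 effect relative to a world with no APM, because part of the savings was already embedded in the baseline.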

Similarly, consider NORC’s evaluation of the Next Generation ACO Model. More than half—53 percent—of ACOs participating in the 2016 Next Generation (NextGen) ACO cohort participated in either the Pioneer ACO program or MSSP before entering NextGen. But only 12 percent of the comparison group had Medicare ACO experience. Given that NextGen participants may have secured “low-hanging” savings when they operated as Pioneer or MSSP ACOs, it stands to reason that the NextGen participants could have had substantially different opportunities than the comparison group to drive for further efficiencies. Statistical adjustments may help, but they are never perfect. Thus, the core question being answered in existing evaluations is whether NextGen ACOs achieve savings relative to pre-existing arrangements, as opposed to whether they achieve savings relative to a world with neither NextGen nor Pioneer.

A second issue is that, for good reason, CMS let “a thousand flowers bloom” by launching many models in quick succession following establishment of CMMI. This situation complicates evaluation. Any program effect is compared to control practices whose performance may be affected by participation in a different program. For example, performance of a CMMI ACO model may be judged, in part, relative to practices that are participating in episode-based models. Especially if participation in alternative payments continues to grow, this approach implies that two programs—each of which might be impactful if implemented alone—could both be judged failures because the comparison group for each program includes the other. Beneficiaries and taxpayers would be poorly served if both initiatives are deemed failures and neither is adopted.
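A stylized calculation shows how two effective programs could each look ineffective. The sketch below assumes that effects are additive, that the treated group does not participate in the other program, and that a known share of the comparison group does. The function name and percentages are hypothetical.

```python
def measured_effect(true_effect, other_program_effect, comparison_share_in_other):
    """Effect measured against a comparison group partly enrolled elsewhere.

    Effects are percent changes in spending (negative means savings).
    The comparison group's spending shifts by the other program's effect
    times the share of the comparison group enrolled in that program.
    """
    comparison_change = comparison_share_in_other * other_program_effect
    return true_effect - comparison_change


# Hypothetical: each program truly lowers spending by 2 percent.
half_contaminated = measured_effect(-2.0, -2.0, comparison_share_in_other=0.5)
fully_contaminated = measured_effect(-2.0, -2.0, comparison_share_in_other=1.0)
print(half_contaminated, fully_contaminated)  # -1.0 0.0
```

As the share of the comparison group enrolled in the other program approaches one, the measured effect of each program approaches zero even though both reduce spending relative to no APM.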

Measuring Success

Identifying politically tenable strategies to manage the long-term trajectory of health care spending growth is among health care’s most enduring challenges. To this end, it is common for policymakers and evaluators to focus on estimated net savings—that is, what was the effect of the program on spending after accounting for shared savings, bonus, or reconciliation payments? (This is described as “payer savings” in a prior Health Affairs Blog post.)

However, provider behavior is difficult to change, and long-run potential is arguably of greater policy relevance than short-term savings. For this reason, gross savings (or “utilization savings”), that is, savings before accounting for shared savings disbursements, may be a better signal of promise. Gross savings may also be less sensitive to how benchmarks are established and how downside and upside risk is shared.

Experience from the public and private sectors alike suggests the merit of avoiding short-term views. An independent evaluation of the MSSP found that early gross savings were a harbinger of net savings in later years for ACOs with the most exposure to the program. Early evaluations of the Blue Cross Blue Shield of Massachusetts Alternative Quality Contract suggested gross savings but net losses. Over time, net savings have materialized; savings from averted utilization exceed shared savings payments. While not inevitable, gross savings were indeed a harbinger of net savings. Thus, the existence of gross savings might be one factor considered in assessing future promise.
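The relationship between gross and net savings is simple arithmetic, sketched below. The yearly figures are invented to mirror the qualitative pattern described above (early gross savings with net losses, later net savings) and are not taken from any evaluation.

```python
def net_savings(gross_savings, payments_to_providers):
    """Net (payer) savings: gross savings minus shared savings,
    bonus, or reconciliation payments made to providers."""
    return gross_savings - payments_to_providers


# Hypothetical figures in millions of dollars: (gross savings, payments).
years = {
    "early_year": (10.0, 12.0),
    "later_year": (30.0, 15.0),
}
net_by_year = {year: net_savings(gross, paid) for year, (gross, paid) in years.items()}
print(net_by_year)  # {'early_year': -2.0, 'later_year': 15.0}
```

In the hypothetical early year, utilization falls (positive gross savings) but payments to providers more than offset it; in the later year, gross savings outgrow those payments.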

Moving Forward

Policymakers should look to evaluations for “directional” guidance rather than point estimates. They should emphasize syntheses that use multiple methods whenever possible. (The CMS Office of the Actuary appropriately does this already.) Specific results from any particular evaluation should be taken with a grain of salt given the issues and limitations discussed in this post. Confidence can increase when directionally consistent results are found using multiple rigorous methods. This has been possible for both the early years of MSSP and the early years of the CJR Model.

Stakeholders should be particularly careful in interpreting evidence on APMs. Negative or null results could mean:

the concept is a dud in whole, or just in part;
the concept is good, but the initiative had execution problems during some or all of the performance period; or
the evaluation is under-powered to detect meaningful effects.

It is impossible to know with certainty if an APM will save money in the future. In this spirit, demanding near certainty of savings will prevent diffusion of programs we expect will save money (without harming quality). Taxpayers and beneficiaries will be best served by continuing models expected to save money over the long run, even if we cannot be sure these expectations will be borne out.

Authors’ Note

The authors acknowledge helpful comments from François de Brantes on an early draft of this post. This post reflects the views of the authors, and not necessarily those of any organization they are affiliated with.

Jason D. Buxbaum and Michael E. Chernew acknowledge receipt of research support from Signify Health that funded preparation of this post.

Michael E. Chernew has received additional research grants from Arnold Ventures, Blue Cross Blue Shield Association, Health Care Service Corporation, National Institute on Aging, Ballad Health, Commonwealth Fund, Agency for Healthcare Research and Quality, and National Institutes of Health. Chernew has received personal fees from MJH Life Sciences (American Journal of Managed Care), MITRE, Madalena Consulting, American Medical Association, Commonwealth Fund, RTI Health Solutions, Emory University, Washington University, and University of Pennsylvania; equity in V-BID Health (partner), Virta Health, Archway Health, Curio Wellness (board of directors), Station Health, and Health(at)Scale. Chernew serves on advisory boards for Congressional Budget Office (panel of health advisors), National Institute for Health Care Management, National Academies, Blue Cross Blue Shield Association, and Blue Health Intelligence. Chernew serves as a board member for Health Care Cost Institute and Massachusetts Health Connector (Vice Chair), and serves as the current Chair of the Medicare Payment Advisory Commission.