Using Severity Adjustment Classification for Hospital Internal and External Benchmarking

P. Fontaine, RN, MHA, MBA


Governments and patients evaluate a hospital's quality of care by looking at performance data. In many countries, the data used to compare and evaluate outcomes are based on Diagnosis Related Groups (DRGs). Since 1994, the government of Belgium has been using AP-DRG version 10.0, a refinement of the HCFA-DRG concept, to compare Belgian hospitals' length of stay with the national average for the same case mix. This measurement is then used to adjust government financing of individual hospitals: hospitals with a LOS greater than the national LOS for the same case mix are penalized; hospitals with a shorter LOS are rewarded with a financial bonus.

To ensure that hospital funding is appropriate and reflects an expanded measurement of clinical performance, the Belgian government made the transition from AP-DRGs to All Patient Refined DRGs (APR-DRGs) in 2001. In this paper, we examine the difference between the APR-DRGs and the AP-DRGs for hospital benchmarking and the impact of coding quality on hospital performance. This paper is based on an ongoing project called Matrix, which evaluates the possibility of using APR-DRGs for internal and external benchmarking. The Matrix project has collected administrative, case-mix, and charge data from more than 40 Belgian hospitals since 1997.

General Concepts

All Patient Refined-Diagnosis Related Groups (APR-DRG)1

For several years now, DRGs have been used to analyze a hospital's case mix. However, clinicians, administrators, and regulators have often attached different meanings to the concept of case mix complexity, depending on their backgrounds and purposes. The term "case mix complexity" has been used to refer to an interrelated but distinct set of patient attributes that include severity of illness, risk of dying, prognosis, treatment difficulty, need for intervention, and resource complexity. The original objective of the DRGs was to develop a patient classification system that related the types of patients treated to the resources they consumed; thus, the AP-DRGs focused exclusively on resource intensity. Over time, however, the use of DRGs has evolved. Hospitals have used DRGs, for instance, to compare mortality rates, to implement and support critical pathways, to compare hospitals across a wide range of resource and outcome measures, or as a basis for internal management and planning systems. To meet those needs, the objective of the DRG system needed to be expanded.

In the APR-DRG classification, the number of DRGs is reduced to 355 base DRGs by eliminating all age, complication (CC), and major CC distinctions from the AP-DRGs. Every base APR-DRG is assigned two distinct attributes that measure the severity of illness and the risk of mortality:

  • Severity of illness: "The extent of physiologic decompensation or organ system loss of function"
  • Risk of mortality: "The likelihood of dying"

Every attribute (severity of illness or risk of mortality) is further subdivided into four subclasses, numbered sequentially from 1 to 4 to indicate minor, moderate, major, or extreme severity of illness or risk of mortality, respectively. Although the subclasses are numbered, they represent categories, not numeric variables. Since severity of illness and risk of mortality are distinct attributes, they can have different values for the same patient. For example, a patient in APR-DRG 346, Connective Tissue Disorders, can have a severity of illness of 3 (Major) and a risk of mortality of 2 (Moderate).
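Because the subclasses are ordinal categories rather than numeric values, it helps to carry the text labels alongside the numbers when handling grouper output. A minimal Python sketch (the example APR-DRG comes from the text above; the record layout and function name are our own illustration):

```python
# Subclass labels shared by both attributes; the numbers 1-4 are
# category codes, not quantities, so we keep the labels with them.
SUBCLASS_LABELS = {1: "Minor", 2: "Moderate", 3: "Major", 4: "Extreme"}

def describe_apr_drg(base_drg, severity, mortality):
    """Render an APR-DRG assignment with its two attribute subclasses."""
    return (f"APR-DRG {base_drg}: "
            f"severity of illness {severity} ({SUBCLASS_LABELS[severity]}), "
            f"risk of mortality {mortality} ({SUBCLASS_LABELS[mortality]})")

# The example from the text: the two attributes can differ for one patient.
print(describe_apr_drg(346, 3, 2))
```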

By creating base DRGs with two attributes, the APR-DRGs take into account the concepts of severity of illness, risk of mortality, and resource intensity. This can be considered a valuable addition to the AP-DRGs because additional analysis on severity of illness and risk of mortality can be performed.


To decide which DRG model to use for internal or external benchmarking, we are interested in the performance of the different models; that is, we would like to know the proportion of the observed variance (for example, in length of stay or charges) that can be explained by each model (AP-DRGs versus APR-DRGs version 15.0).

To measure the performance of the different models, ICD-9-CM coding data were used from 42 hospitals, resulting in a database of 585,742 admissions for 2002. For every admission, we determined the APR-DRG, severity of illness, length of stay, pharmacy charge, and medical charge. Outliers, characterized by extremely long lengths of stay or extremely high charges, harm the homogeneity within the DRGs. To exclude them, the data were trimmed: depending on the dependent variable under investigation, observations with a length of stay, pharmacy charge, or medical charge greater than P75 + 3*(P75 - P25) were excluded.
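The trim rule can be sketched as an upper fence of P75 + 3*(P75 - P25). A minimal Python illustration (variable names are ours, and quartiles use linear interpolation, which the paper does not specify):

```python
import statistics

def trim_outliers(values):
    """Drop observations above the upper fence P75 + 3 * (P75 - P25)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + 3 * (q3 - q1)
    return [v for v in values if v <= fence], fence

# Illustrative lengths of stay in days; the 100-day stay lies above the fence.
los = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
kept, fence = trim_outliers(los)
print(fence, kept)
```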

We used ANOVA to calculate the R2, which expresses the proportion of variance in the dependent variable that can be explained by the independent variable (the DRG system used).
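In a one-way ANOVA, this R2 is the between-group sum of squares divided by the total sum of squares, with admissions grouped by their DRG. A self-contained sketch with invented data (the two DRGs and their stay values are illustrative, not study data):

```python
from collections import defaultdict

def anova_r2(observations):
    """R2 = SS_between / SS_total for (group, value) pairs."""
    values = [v for _, v in observations]
    grand_mean = sum(values) / len(values)
    groups = defaultdict(list)
    for g, v in observations:
        groups[g].append(v)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * ((sum(vs) / len(vs)) - grand_mean) ** 2
        for vs in groups.values()
    )
    return ss_between / ss_total

# Two illustrative DRGs with well-separated lengths of stay.
stays = [("DRG_A", 1), ("DRG_A", 2), ("DRG_A", 3),
         ("DRG_B", 7), ("DRG_B", 8), ("DRG_B", 9)]
print(round(anova_r2(stays), 4))
```

The more homogeneous the stays within each DRG, the closer R2 comes to 1.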


Explained Variance by the Different DRG Models

Other studies have shown that the proportion of the variance that can be explained by the DRG system is higher for APR-DRGs than for AP-DRGs.

Based on studies performed by 3M Health Information Systems, the trimmed R2 for cost and length of stay is better for APR-DRGs than for AP-DRGs. The results are shown in Table 1.

               AP-DRG   APR-DRG   % Difference
LOS All        0.4358   0.4787    9.8%
LOS Surgical   0.4289   0.5033    17.3%
LOS Medical    0.4273   0.4419    3.4%
Cost All       0.5600   0.6009    7.3%
Cost Surgical  0.5372   0.5883    9.5%
Table 1. Trimmed R2 for Cost and Length of Stay2 (US data)
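The percent differences in Table 1 are the relative improvement of the APR-DRG R2 over the AP-DRG R2, which can be checked directly:

```python
def pct_improvement(ap, apr):
    """Relative gain of the APR-DRG R2 over the AP-DRG R2, in percent."""
    return round((apr - ap) / ap * 100, 1)

# Rows of Table 1: (AP-DRG R2, APR-DRG R2)
table1 = {"LOS All": (0.4358, 0.4787),
          "LOS Surgical": (0.4289, 0.5033),
          "Cost All": (0.5600, 0.6009)}
for row, (ap, apr) in table1.items():
    print(row, pct_improvement(ap, apr))  # 9.8, 17.3, 7.3
```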

The analysis was based on a US database that included 675 acute general hospitals and 40 freestanding acute children's hospitals. The calendar year was 1993, and there were 4,203,646 admissions in the database. More than 60 percent of the variation in cost and more than 47 percent of the variation in LOS can be explained by the APR-DRGs.

For use in a Belgian benchmarking context, we would like to know whether those results apply to Belgian hospitals. For 1997, we performed an ANOVA analysis on 317,021 inpatient stays, and for 2002 we used 585,742 observations. The analyses on trimmed 1997 and trimmed 2002 data from the "Matrix" hospitals showed the following results.

                    AP-DRG   APR-DRG   % Difference
1997 LOS All        0.3158   0.3314    4.9%
1997 LOS Surgical   0.3971   0.4180    5.3%
1997 LOS Medical    0.2429   0.2489    2.5%
2002 LOS All        0.3669   0.4330    18.0%
2002 LOS Surgical   0.4378   0.5687    29.9%
Table 2. Trimmed R2 for Length of Stay (Belgium data)

Although we used trimmed data from the Belgian hospitals, the variance explained by the DRG model is significantly lower for both systems (AP-DRG and APR-DRG) than in the results on US data. The R2 on Italian APR-DRG (Grouper revision 12.0) data over the period July 1997-June 1998 was 0.317, which is slightly lower than the result from our study for 1997.3 The question is how to explain the difference in performance of the same model between the US and the Belgian data. It could be due to practice differences (less variability in practice patterns within the US) or to coding quality, since the APR-DRGs are more vulnerable to coding errors.

For 2002, 43 percent of the variance in length of stay could be explained by the model. This relatively modest value could be due to patient characteristics that have not been taken into account. In Belgium, APR-DRGs are used to calculate the number of hospital beds to be financed based on the national average length of stay. In calculating the number of justified beds, the Belgian Ministry of Health added some additional variables, such as age category and financial category (for example, length of stay on rehabilitation wards longer than 50 percent of total stay, fewer than 30 observations for the combination of APR-DRG, severity of illness, and age category, outlier type I, outlier type II, ...).

By adding the additional variables, the untrimmed R2 on length of stay increases to 55.16 percent for all patients, 57.46 percent for the surgical stays, and 55.64 percent for the medical stays. See Table 3.

APR-DRG + age category
2002 LOS All        0.5516
2002 LOS Surgical   0.5746
2002 LOS Medical    0.5564
Table 3. R2 for Length of Stay on APRDRG + Age Category (Belgium data, 2002)
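Adding age category refines the grouping: each ANOVA cell becomes an (APR-DRG, age category) pair rather than an APR-DRG alone, so within-cell variance shrinks and R2 can only stay equal or rise. A hypothetical sketch (the age bands and stay values are invented for illustration):

```python
from collections import defaultdict

def r2_for_cells(records, key):
    """ANOVA R2 with cells defined by key(record); record = (drg, age, los)."""
    los = [r[2] for r in records]
    grand = sum(los) / len(los)
    cells = defaultdict(list)
    for r in records:
        cells[key(r)].append(r[2])
    ss_total = sum((v - grand) ** 2 for v in los)
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2
                     for vs in cells.values())
    return ss_between / ss_total

# Illustrative stays: within each DRG, older patients stay longer.
data = [("DRG_A", "<75", 2), ("DRG_A", "<75", 3),
        ("DRG_A", "75+", 6), ("DRG_A", "75+", 7),
        ("DRG_B", "<75", 4), ("DRG_B", "<75", 5),
        ("DRG_B", "75+", 9), ("DRG_B", "75+", 10)]
r2_drg = r2_for_cells(data, key=lambda r: r[0])
r2_drg_age = r2_for_cells(data, key=lambda r: (r[0], r[1]))
print(round(r2_drg, 3), round(r2_drg_age, 3))
```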

Explained Variance for Different Dependent Variables

We cannot simply assume that the result for the dependent variable length of stay applies to other dependent variables, such as pharmacy charges or medical charges. When using APR-DRGs for benchmarking on charges, there is little difference in reduction of variance between length of stay and pharmacy charges; both dependent variables have an R2 of about 43 percent, as shown in Table 4. For medical charges, such as radiology, surgery, and laboratory, the overall reduction in variance is 69.86 percent, which is significantly stronger than for length of stay or pharmacy charges. However, for the medical DRGs, the model is far weaker when medical charges are considered as the dependent variable.

Dependent variable                       R2
2002 Trimmed LOS All                     0.4330
2002 Trimmed LOS Surgical                0.5687
2002 Trimmed LOS Medical                 0.3722
2002 Trimmed Pharmacy charges All        0.4352
2002 Trimmed Pharmacy charges Surgical   0.5358
2002 Trimmed Pharmacy charges Medical    0.3748
2002 Trimmed Medical charges All         0.6986
2002 Trimmed Medical charges Surgical    0.7010
2002 Trimmed Medical charges Medical
Table 4. R2 for Length of Stay, Pharmacy Charges, and Medical Charges (Belgium data, 2002)

Effect of Coding Practices

Comparing 1997 and 2002 on the "Matrix" dataset (see Table 2) shows an increase in the R2 values for the dependent variable length of stay. The increase is stronger for the APR-DRGs than for the AP-DRGs. We assume that this difference in reduction of variance is attributable to more extensive and more specific coding practices in Belgian hospitals. The measurement of severity of illness in the APR-DRGs is essentially based on the secondary diagnoses and on the interactions among the secondary diagnoses and between the secondary diagnoses and the principal diagnosis.

As shown in Figure 1, there is a strong increase in the average number of diagnoses coded for the "Matrix" hospitals between 1997 and 2002: the average increased from 2.71 in 1997 to 4.44 by the end of 2002.

Figure 1. Average Number of Diagnoses Coded between 1997 and 2002. (Inpatient stays from 42 Belgian hospitals)

Over the same time period (1997 to 2002), the distribution of severity of illness for the same group of hospitals changed considerably (see Figure 2). While in 1997, on average, 67.1 percent of the inpatient stays had a severity of illness level of Minor, by 2002 only 56.3 percent did. Conversely, more patients were assigned a higher severity of illness level.

Figure 2. Severity of Illness Distribution 1997-2002

Figure 3. Correlation between Average Number of Diagnoses Coded and the Percentage Severity Level Minor

The increase in severity of illness level is not attributable to a heavier case mix in Belgian hospitals. Using simple regression, we found a significant relation between the average number of diagnoses coded at the hospital level and the fraction of inpatient stays with severity level Minor (see Figure 3). The more diagnoses are coded on average, the fewer patients are assigned to severity level Minor. For the higher severity levels, the opposite relationship holds: the more diagnoses are coded, the more patients are assigned to higher severity levels. Since patients in higher severity levels tend to have longer lengths of stay (see the box-whisker plot in Figure 4), extensive coding is essential when using severity-adjusted classification systems for internal and external benchmarking.
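The relation in Figure 3 is a simple least-squares regression of the fraction of Minor-severity stays on the average number of diagnoses coded per hospital. A minimal sketch with made-up hospital-level points (the negative slope mirrors the reported relation; the numbers are illustrative):

```python
def simple_regression(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

# Per-hospital: average diagnoses coded vs. % of stays at severity Minor.
avg_dx = [2.5, 3.0, 3.5, 4.0, 4.5]
pct_minor = [70.0, 66.0, 62.0, 58.0, 54.0]
intercept, slope = simple_regression(avg_dx, pct_minor)
print(intercept, slope)  # slope is negative: more coding, fewer Minor stays
```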

Figure 4. Box Plot of Length of Stay by Severity of Illness

Not only the average number of diagnoses coded but also the overall coding quality and accuracy can have an impact on the performance of a hospital when severity-adjusted classification systems are used.

To measure general coding quality on a hospital level, we carried out a pilot project with the Data Quality Editor (DQE) software from 3M HIS on ICD-9-CM coding data from 27 hospitals in the Matrix project, totalling 435,291 inpatient stays in the year 2000. The DQE detects potential coding errors by calculating 84 different edits. Potential coding errors are then classified into nine patient categories depending on the impact of the coding error (for example, a coding rule violation that does not impact DRG assignment, coding issues that could decrease DRG payment by more than 10 percent, or coding issues that could increase DRG payment by more than 10 percent).

For every hospital, we calculated the following variables:

  • Average number of diagnoses coded
  • Fraction of patients with severity level Minor
  • Length of stay
  • Percentage of patients in the nine DQE patient categories
  • Hospital performance on length of stay expressed as a percentage after case-mix normalization

We then performed a stepwise regression to investigate the impact of the different independent variables on the dependent variable, namely the hospital performance on length of stay. Only two factors were statistically significant: the length of stay and the fraction of patients in severity level Minor. The R2 of the model is 0.8215 (F = 39.36, p < 0.000001). The following relations were present: the performance of a hospital decreases when the length of stay increases, and the performance of a hospital increases when the fraction of patients in severity level Minor decreases. No statistically significant impact was found for any of the nine coding quality categories calculated by the DQE. This could be because the DQE model was built on HCFA-DRGs and not on the APR-DRGs.
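The first step of such a stepwise procedure can be sketched as ranking the candidate predictors by their squared correlation with the outcome and entering the best one. A hypothetical illustration (the predictor names follow the variable list above; all data are invented):

```python
def r_squared(x, y):
    """Squared Pearson correlation between one predictor and the outcome."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

# Per-hospital outcome: performance on length of stay (illustrative).
performance = [95.0, 100.0, 104.0, 110.0, 116.0]
candidates = {
    "length_of_stay": [8.0, 7.5, 7.0, 6.4, 5.8],  # shorter stay, better score
    "pct_minor":      [70.0, 64.0, 60.0, 55.0, 50.0],
    "avg_diagnoses":  [3.1, 2.9, 3.6, 3.0, 3.3],  # little relation
}
# Enter the predictor with the largest squared correlation first.
best = max(candidates, key=lambda k: r_squared(candidates[k], performance))
print(best, round(r_squared(candidates[best], performance), 3))
```

Subsequent steps would re-test the remaining predictors against the residuals until no addition is statistically significant.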


Conclusion

To guarantee a fair evaluation of hospital performance, we must be sure that the model used to categorize the different patients takes into account as many causes of variation as possible. Given that the proportion of the variance that can be explained by the DRG system is higher for APR-DRGs than for AP-DRGs, APR-DRGs are better suited for internal and external benchmarking. However, the performance of the classification system can evolve over time and can differ between countries. The R2 for APR-DRGs on length of stay grew from 33 percent in 1997 on a sample of Belgian hospitals to 43 percent in 2002, and adding other variables, such as age categories, can further improve the explanatory power of the model. The improvement in R2 over time for the APR-DRGs is attributable to a better coding process in the Belgian hospitals. It thus became evident that drawing valid conclusions about hospital performance, outcomes, or efficiency using APR-DRGs is highly dependent on the accuracy and the specificity of the coded data.


  1. All Patient Refined Diagnosis Related Groups, Version 15.0 Definitions Manual, Volume 1. 3M Health Information Systems, 1998.
  2. Goldfield, Norbert. Physician Profiling and Risk Adjustment. Gaithersburg, MD: Aspen Publishers, 1999, p. 413.
  3. Lorenzoni, Luca. "Use of APR-DRGs in 15 Italian Hospitals." Case-mix 2, no. 4 (2000).

Source: 2004 IFHRO Congress & AHIMA Convention Proceedings, October 2004