Re-abstraction Studies to Assess Data Quality for Use in the Development of a Grouping Methodology

Sandra Mitchell and Holly Bartoli


The purpose of this paper is to provide an overview of the data quality re-abstraction studies that have been conducted in Canada. These studies provide a baseline of reliability for clinical administrative data submitted to the Discharge Abstract Database (DAD). Canada began a staggered provincial implementation of ICD-10-CA and the Canadian Classification of Health Interventions (CCI) in 2001. With the introduction of these new classification systems, a new acute care inpatient grouper will be developed. The results of the re-abstraction study will provide valuable information on the quality of the clinical data that will be used to develop the new grouper.


An ongoing challenge for any organization producing statistical information is to ensure that the quality of the information it produces is suited for its intended uses and that data users are provided with good information about data quality. To this end, the Canadian Institute for Health Information (CIHI) has established a comprehensive and systematic data quality program that includes the implementation and ongoing monitoring of a corporate Data Quality Framework, as well as conducting special studies that focus on data quality issues.

This paper will discuss special studies at CIHI: the Discharge Abstract Database (DAD) Data Quality Study and the Case Mix Group/Complexity (CMG™/Plx™)¹ Data Quality Study. It will go on to highlight the reasons behind the studies, the process, and the findings, and then explain how the results of these studies are being used as we continue with the development of a new grouping methodology.

The Canadian Institute for Health Information (CIHI)

CIHI is a national, not-for-profit organization that plays a critical role in the development of Canada's health information system. CIHI's mandate is to coordinate the development and maintenance of a comprehensive and integrated approach to health information in Canada. CIHI's diverse data holdings are playing an increasingly important role in supporting public debate and decision making about the Canadian health system.

CIHI’s Data Quality Enhancement Program

CIHI established a comprehensive and systematic data quality program involving the implementation of a data quality framework and initiated special studies focusing on specific data quality issues. One of the objectives of the data quality program is to conduct special studies evaluating data quality of administrative data by returning to original sources of information and independently assessing them.

Redevelopment of the Case Mix Grouping Methodology

Case Mix Groups, or CMG™, are the foundation of acute inpatient grouping, length of stay and resource intensity weight methodologies. The patient's Most Responsible Diagnosis (MRDx) is used to assign the case to one of the 25 Major Clinical Categories (MCC). Within each MCC, based on the presence or absence of an operative procedure, the case is directed towards a surgical or medical hierarchy flowchart.
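The two-step routing described above can be sketched in Python as follows. This is a minimal illustration only: the lookup table, its labels, and the function name are hypothetical placeholders and do not reproduce CIHI's grouper tables or hierarchy flowcharts.

```python
from typing import Dict, Tuple

# Hypothetical MRDx-to-MCC lookup. A real grouper covers all 25 MCC and
# is driven by CIHI's own tables, which are not reproduced here.
EXAMPLE_MCC_LOOKUP: Dict[str, str] = {
    "DX_A": "MCC_EXAMPLE_1",
    "DX_B": "MCC_EXAMPLE_2",
}


def route_case(mrdx: str, has_operative_procedure: bool,
               mcc_lookup: Dict[str, str]) -> Tuple[str, str]:
    """Return (MCC, partition) for one case."""
    # Step 1: the Most Responsible Diagnosis alone selects the MCC.
    mcc = mcc_lookup[mrdx]
    # Step 2: presence or absence of an operative procedure selects the
    # surgical or medical hierarchy within that MCC.
    partition = "surgical" if has_operative_procedure else "medical"
    return mcc, partition
```

Because the MRDx alone drives the first step, a discrepancy in the MRDx can move a case to a different MCC, which is why its coding quality matters for grouper development.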

In 1997, CIHI introduced a complexity overlay called Plx™ to its CMG methodology for most CMG. The complexity overlay identifies diagnoses, over and above the MRDx used for CMG assignment, for which prolonged length of stay and more costly treatment might reasonably be expected.

One of the objectives of the re-abstraction studies was to verify the quality of certain diagnoses and interventions relevant to CMG/Plx assignment to facilitate the development of the ICD-10-CA/CCI grouping methodology. As the grouper redevelopment is being performed using ICD-10-CA/CCI data, understanding any quality issues with this new classification is paramount. The results of the third year of the DAD Data Quality Re-abstraction Study are of particular interest.

Diagnosis Typing

One of the key data elements that the CIHI Grouper redevelopment team will be assessing is the discrepancy rate of diagnosis typing. Diagnosis typing plays a key role in the development of a grouper and the later assignment of cases to a grouper.

Diagnosis typing is the mechanism utilized to identify a patient's diagnoses on the discharge abstract. Diagnosis Types and definitions utilized by CIHI clients are listed below.

Most Responsible Diagnosis

A Diagnosis Type (M) is the one diagnosis or condition that can be described as being most responsible for the patient's stay in hospital. If there is more than one such condition, the one held most responsible for the greatest portion of the length of stay or greatest use of resources should be selected. If no diagnosis was made, the main symptom, abnormal finding or problem should be selected as the MRDx.

Primary Diagnoses

Comorbidities are all conditions that coexist at the time of the hospital admission or develop subsequently and demonstrate at least one of the following:

  • Significantly affects the treatment received
  • Requires treatment beyond maintenance of the preexisting condition
  • Increases the length of stay (LOS) by at least 24 hours

Canada is the only country that distinguishes the onset of comorbidities before and after admission, although the State of California also collects pre- and post-admission type diagnosis information.

A Diagnosis Type (1) is a condition that existed pre-admission, has been assigned an ICD-10-CA code, and satisfies the requirements for determining comorbidity. A Diagnosis Type (2) is a condition that arises post-admission, has been assigned an ICD-10-CA code, and satisfies the requirements for determining comorbidity. If a post-admission comorbidity qualifies as the MRDx, it must be recorded as both the MRDx and as a diagnosis Type (2).

Secondary Diagnosis

A Diagnosis Type (3) is a condition for which a patient may or may not have received treatment, has been assigned an ICD-10-CA code, and does not satisfy the requirements for determining comorbidity. Currently, ICD-10-CA manifestation codes identified with an asterisk are typed as a 3 when the matching etiology (dagger) code is typed as Most Responsible. Some hospitals may identify specific codes that they want to capture as a type 3, but generally the capture of diagnosis type 3 conditions is inconsistent nationally.

Secondary diagnosis typing is also unique to Canada, with the exception of countries that collect the dagger/asterisk code combinations using the secondary diagnosis type for the asterisk code.
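The comorbidity test and the assignment of Types (M), (1), (2) and (3) described above can be sketched as follows. This is a minimal illustration under stated assumptions: the field names are hypothetical, the three comorbidity criteria are taken as already judged by the coder, and the dagger/asterisk rule and the other diagnosis types are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Condition:
    code: str                                  # ICD-10-CA code (placeholder values only)
    affects_treatment: bool                    # significantly affects the treatment received
    needs_treatment_beyond_maintenance: bool   # beyond maintenance of the preexisting condition
    added_los_hours: float                     # hours added to the length of stay
    post_admission: bool                       # True if the condition arose after admission


def is_comorbidity(c: Condition) -> bool:
    """A condition is a comorbidity if it meets at least one criterion."""
    return (c.affects_treatment
            or c.needs_treatment_beyond_maintenance
            or c.added_los_hours >= 24)


def diagnosis_types(c: Condition, is_mrdx: bool = False) -> List[str]:
    """Return the diagnosis type(s) to record for one condition."""
    types = ["M"] if is_mrdx else []
    if is_comorbidity(c):
        if c.post_admission:
            # A post-admission comorbidity is Type (2); if it is also the
            # MRDx it is recorded as both M and 2.
            types.append("2")
        elif not is_mrdx:
            types.append("1")
    elif not is_mrdx:
        # Coded but not meeting any comorbidity criterion.
        types.append("3")
    return types
```

The sketch makes the boundary explicit: the only difference between a Type (1) or (2) and a Type (3) is whether at least one of the three criteria is judged to hold, and that judgment is left to the coder.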

Transfer Diagnosis

A transfer diagnosis is an ICD-10-CA diagnosis code associated with the first, second or third patient service transfer on the discharge abstract.

Morphology Diagnosis

Diagnosis Type (4) identifies morphology codes (which are derived from ICD-O (ICD-Oncology) codes) that describe the type and behaviour of a neoplasm.

Provincially Defined Diagnosis

Diagnosis Type (5) is provincially defined.

External Cause of Injury Diagnosis

A Diagnosis Type (9) is an external cause of injury code. It is mandatory for use with codes in the range S00-T98 Injury, poisoning, and certain other consequences of external causes. (Category U98, Place of occurrence, is mandatory with codes in the range W00-Y34, with the exception of Y06 and Y07. Category U99, Type of activity, is optional.)
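The two mandatory-code rules above lend themselves to simple edit checks. The sketch below is illustrative only: the function names and the representation of an abstract as a list of (code, diagnosis type) pairs are assumptions, and ranges are compared on the three-character category alone.

```python
from typing import List, Tuple

Diagnosis = Tuple[str, str]  # (ICD-10-CA code, diagnosis type); hypothetical layout


def _category_in(code: str, low: str, high: str) -> bool:
    """True if the three-character category lies in [low, high]."""
    category = code[:3].upper()
    return low <= category <= high


def missing_external_cause(diagnoses: List[Diagnosis]) -> bool:
    """Flag an abstract with an S00-T98 code but no Type (9) code."""
    has_injury = any(_category_in(code, "S00", "T98") for code, _ in diagnoses)
    has_type_9 = any(dx_type == "9" for _, dx_type in diagnoses)
    return has_injury and not has_type_9


def missing_place_of_occurrence(diagnoses: List[Diagnosis]) -> bool:
    """Flag an abstract with a W00-Y34 code (except Y06, Y07) but no U98 code."""
    needs_place = any(
        _category_in(code, "W00", "Y34")
        and code[:3].upper() not in ("Y06", "Y07")
        for code, _ in diagnoses
    )
    has_place = any(code.upper().startswith("U98") for code, _ in diagnoses)
    return needs_place and not has_place
```

Checks of this kind only test for the presence of a required code; they cannot confirm that the code chosen is the right one, which is what re-abstraction against the chart is for.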

Newborn Diagnosis

Diagnosis Type (0) is reserved for newborn coding. In order for babies born by cesarean section to be grouped appropriately (per the current CMG methodology) one of these codes must either be the MRDx or a diagnosis type (0): Z38.01, Z38.31, Z38.61, Z38.63, Z38.65, Z38.67, Z38.69 or P03.4.

Diagnosis Type (0) is also used to identify insignificant conditions that do not affect the newborn's treatment or length of stay.
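The grouping requirement for babies born by cesarean section can be expressed as a presence check. The sketch below is illustrative: the code list is the one given above, while the function name and the (code, diagnosis type) layout are assumptions.

```python
from typing import List, Tuple

# Codes listed in the text for newborns delivered by cesarean section.
CESAREAN_NEWBORN_CODES = {
    "Z38.01", "Z38.31", "Z38.61", "Z38.63",
    "Z38.65", "Z38.67", "Z38.69", "P03.4",
}


def has_cesarean_newborn_code(diagnoses: List[Tuple[str, str]]) -> bool:
    """True if one of the listed codes is the MRDx (M) or a Type (0)."""
    return any(
        code in CESAREAN_NEWBORN_CODES and dx_type in ("M", "0")
        for code, dx_type in diagnoses
    )
```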

Data Quality Special Studies

CIHI introduced a three-year national DAD data quality re-abstraction study in 2000. The first two years of the study involved the re-abstraction of ICD-9/CM coded charts and the third year of the study involved the re-abstraction of ICD-10-CA/CCI coded charts. The DAD Data Quality Study was the first national study that used a statistical sampling methodology to reliably measure the accuracy of selected non-medical and clinical administrative data contained in the DAD.

The study assessed the data quality of the DAD by returning to the original sources of information (that is, patient charts) and comparing this information with what exists in the CIHI database. The study reviewed two years of ICD-9/CM/CCP data and one year of ICD-10-CA/CCI data. Data collection for year one of the study (fiscal 1999/2000 data) was conducted in the fall of 2000, and data collection for year two (fiscal 2000/2001 data) was conducted in the fall of 2001. Due to the staggered provincial implementation of ICD-10-CA/CCI, the third year of the study was postponed to fiscal year 2003/2004.

Study methodology, including scope, sampling, and data collection methods, is outlined below for the Discharge Abstract Database (DAD) Data Quality Study-Years 1 and 2, the Case Mix Groups/Complexity (CMG/Plx) Data Quality Study and the Discharge Abstract Database (DAD) Data Quality Study-Year 3.

1. Discharge Abstract Database (DAD) Data Quality Study-Years 1 and 2

The goal of the DAD Data Quality Study was to evaluate, at the national level, the accuracy of selected administrative data contained in the DAD. The specific objectives of the study were to:

  1. Evaluate and measure the overall accuracy of the DAD
  2. Evaluate and measure the impact of data collection from incomplete charts
  3. Evaluate and measure the coding quality of diagnoses and procedures relevant to specific Health Indicators represented in the CIHI Health Indicators Framework
  4. Identify and measure how often diagnoses and procedures are not coded according to CIHI guidelines and identify where additional coding guidelines may be required
  5. Assess whether any of the above evaluations have an impact on the assignment of case mix group and database expected length of stay (ELOS)
  6. Facilitate the evaluation of the change to new diagnosis and intervention standards ICD-10-CA/CCI

Sampling Methodology

The study used a multistage sampling approach to identify which charts would be re-abstracted. The first sampling stage randomly selected facilities across Canada² stratified by geography and size.

The second sampling stage randomly selected charts from each facility. All abstracts in the DAD were assigned to one CIHI health indicator (refer to objectives of study). In cases where an abstract could be assigned to more than one indicator, for selection purposes only, the condition with less prevalence was given priority. During analysis of the data, other diagnoses and procedures in the abstract were reviewed.
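The two selection rules above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function names, the fixed number of facilities drawn per stratum, and the indicator names and prevalence values used to call it are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_facilities(facilities: List[Tuple[str, str, str]],
                      per_stratum: int, seed: int = 0) -> List[str]:
    """Stage one: draw facilities at random within each (region, size) stratum.

    Each facility is (facility_id, region, size_class). Drawing a fixed
    number per stratum is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for facility_id, region, size_class in facilities:
        strata[(region, size_class)].append(facility_id)
    chosen: List[str] = []
    for key in sorted(strata):
        pool = strata[key]
        chosen.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return chosen


def selection_indicator(candidates: Iterable[str],
                        prevalence: Dict[str, float]) -> str:
    """Stage two: an abstract matching several indicators is assigned,
    for selection purposes only, to the least prevalent one."""
    return min(candidates, key=lambda name: prevalence[name])
```

Giving priority to the least prevalent indicator helps the rarer conditions reach their minimum chart counts; the abstract's other diagnoses and procedures are still reviewed at analysis.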

Eighteen facilities participated in the first year of the study, allowing for the re-abstraction of 2,737 charts. Eleven facilities participated in the second year of the study, resulting in the re-abstraction of 1,555 charts.

In order to get an optimal sample design, it was assumed that at the national level, the proportion of charts for each indicator that contained a discrepancy was 15 percent. The reliability required for the sample was a coefficient of variation (CV) of 16.5 percent (that is, a standard error of 2.5 percent). Using this assumption and reliability requirement, a minimum sample size was then determined for each indicator. This sample size was increased by 10 percent to account for chart non-response (unavailability) and a further 10 percent for possible situations of better than expected productivity by the re-abstractors. There were 150 charts randomly selected at each participating facility; however, the number of charts selected for each indicator varied among the participating facilities, so that the minimum number of charts for that indicator across all facilities was achieved as far as possible. Note that each sampled chart has an unequal probability of selection under this design.
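The arithmetic behind these figures can be reproduced approximately. A CV of 16.5 percent on an assumed proportion of 15 percent corresponds to a standard error of about 0.165 x 0.15 = 0.0248, or roughly 2.5 percent. The sketch below then solves for the sample size under simple random sampling, which is a simplifying assumption: the actual study used a multistage design, and whether the two 10 percent increases were compounded or added is not stated, so compounding is assumed here.

```python
import math


def min_sample_size(p: float, cv: float) -> int:
    """Smallest n with CV(p_hat) <= cv under simple random sampling.

    Uses SE = sqrt(p * (1 - p) / n) and CV = SE / p. The simple random
    sampling formula is an assumption, not the study's documented method.
    """
    se = cv * p
    return math.ceil(p * (1 - p) / se ** 2)


def inflate(n: int, *percents: int) -> int:
    """Apply successive percentage increases, rounding up at each step."""
    for pct in percents:
        n = math.ceil(n * (100 + pct) / 100)
    return n


base = min_sample_size(p=0.15, cv=0.165)  # charts per indicator before inflation
target = inflate(base, 10, 10)            # non-response, then productivity allowance
print(base, target)
```

Under these assumptions the minimum comes out at a little over 200 charts per indicator before inflation. A stratified multistage design would normally change this figure, so it should be read only as an order of magnitude.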

Collection of Study Data

CIHI Classification Specialists³ re-abstracted the data for the study by returning to the original source of the data on-site at each facility for a one-week period. All the original information from the DAD was downloaded to a laptop application immediately prior to the collection week. The computer-assisted application was designed and developed by CIHI to facilitate the collection of the study data. The application featured the use of pull-down lists of discrepancy codes and reasons for the discrepancy, as well as a comment field that allowed entry of additional information pertaining to the discrepancy. Additional reference material that would ordinarily be available was also loaded onto the laptop. The Classification Specialists entered all of the re-abstracted, discrepancy, and reason data directly into the application. All subjective clinical information, such as diagnoses and interventions, was re-abstracted blindly, that is, without viewing the original abstracted data.

For each discrepancy identified, both medical and non-medical, the re-abstractor assigned the type of discrepancy and a possible reason. There was no reconciliation process with the original hospital abstractor and the identity of the original abstractor was not collected for this Study.
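The comparison of original and re-abstracted data can be sketched as follows. This is an illustration under stated assumptions: the field names are hypothetical, and weighting each chart by the inverse of its selection probability is a standard way to handle unequal selection probabilities rather than a method documented in this paper.

```python
from typing import Dict, List, Tuple

Abstract = Dict[str, str]
# (original abstract, re-abstracted abstract, design weight)
ChartPair = Tuple[Abstract, Abstract, float]


def discrepant_fields(original: Abstract, reabstracted: Abstract,
                      fields: List[str]) -> List[str]:
    """List the fields whose values differ between the two abstracts."""
    return [f for f in fields if original.get(f) != reabstracted.get(f)]


def weighted_discrepancy_rate(charts: List[ChartPair], field: str) -> float:
    """Design-weighted share of charts with a discrepancy in one field.

    Each weight is assumed to be the inverse of the chart's probability
    of selection, so over-sampled indicators do not dominate the estimate.
    """
    total = sum(weight for _, _, weight in charts)
    discrepant = sum(
        weight for original, reabstracted, weight in charts
        if original.get(field) != reabstracted.get(field)
    )
    return discrepant / total
```

A call such as weighted_discrepancy_rate(charts, "mrdx") would give the estimated share of abstracts whose MRDx differs on re-abstraction; the type of discrepancy and its reason, which the re-abstractors also recorded, are not modelled here.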

A pilot test was also conducted to evaluate these measures as well as data needs, data collection methods, timing, and data processing methods.

2. Case Mix Group/Complexity (CMG/Plx) Study

With the implementation of the new classification standards, the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Canada-Canadian Modification (ICD-10-CA) and the Canadian Classification of Health Interventions (CCI), CIHI will be redeveloping the Case Mix Group Complexity (CMG/Plx) Grouper. In preparation for the Grouper redevelopment, CIHI reviewed the complexity component of the Grouper, including Grade lists. One aspect of this review was the re-abstraction of actual charts. This additional study was performed on ICD-9 data while waiting for sufficient ICD-10-CA/CCI data to perform the third year of the DAD Data Quality Study.


The scope of this project included the re-abstraction of medical records from acute care inpatient facilities across Canada. This information was then compared with what exists in the DAD. The universe of the Study includes healthcare facilities in Canada. The target population of the Study is provincial acute care facilities⁴ reporting to the DAD.

The goal of the Case Mix Complexity (CMG/Plx) Data Quality Study was to evaluate the data quality of selected clinical and administrative data for statistical purposes from CIHI's DAD. The Study assessed the data quality of the DAD by returning to the original sources of information and comparing this information with what exists in the CIHI database. While a facility level report on general findings was provided to each facility, it should be noted that the study was not an audit of an institution's coding. The primary use of the data collected will be to contribute to the assessment of the DAD data quality at a national level. The objectives of the study are to:

  1. Evaluate and measure the overall data quality of the DAD CMG Grouper variables
  2. Evaluate and measure the coding quality of diagnoses and interventions relevant to CMG/Plx assignment
  3. Facilitate the development of the ICD-10-CA/CCI CMG Grouper
  4. Facilitate the ongoing development of coding guidelines for the new classification standards (ICD-10-CA and CCI).

Study Approach

As with the DAD Data Quality Study, patient charts were re-abstracted and compared to the CIHI database information. The Study used a multi-stage sampling approach. Data collection occurred at each participating facility during the spring and summer of 2002. The first sampling stage randomly selected acute care facilities stratified by geographical region and size. The sample size was 18 facilities. The second sampling stage randomly selected charts from within the selected facilities. Charts were randomly pre-selected from the database to concentrate on selected complexity levels (refer to Goals/Objectives). Approximately 65 charts were re-abstracted from each facility. Fewer charts were re-abstracted at each facility because the higher complexity levels imply more difficult, time-consuming charts.

The results of the first two years of the DAD Data Quality Study and the CMG/Plx Study may be found at

3. Discharge Abstract Database (DAD) Data Quality Study-Year 3

The Year 3 study builds on the objectives, sampling methodology, scope, and data collection processes that were designed for the Data Quality Study-Years 1 and 2. Health indicators from the first two study years were reassessed for comparison with the new classification and additional indicators were chosen.

As with the first two years of the study, the third year also used a multistage sampling approach. The first stage randomly selected facilities across Canada stratified by size of the facility. The second sampling stage randomly selected charts within the facilities to concentrate on the diagnoses and interventions applicable to the selected health indicators. Approximately 145 charts were re-abstracted from each facility. For participating facilities that implemented ICD-10-CA/CCI in fiscal year 2001-2002, re-abstraction occurred for one week in July-August 2003, with a sample size of 9 facilities. For those that implemented ICD-10-CA/CCI in fiscal year 2002-2003, data collection occurred for one week during October-November 2003, with a sample size of 11 facilities. In total, 20 facilities participated and 2,743 charts were re-abstracted.

Study Results

Note: At time of writing, the ICD-10-CA/CCI Re-abstraction Study results were not available for publication. Results and analyses will be presented at the October presentation.

Diagnosis typing was introduced as an administrative data element to the DAD for administrative and research purposes. The CMG grouper, which was introduced in 1990, was able to take advantage of these existing data elements. The refinement of the CMG grouper with the implementation of the Complexity overlay methodology in 1997 placed more emphasis on pre- and post-admit conditions. Over the years, users of the grouped data became more knowledgeable about complexity, and the impact of certain diagnosis types with specific diagnoses.

While this was occurring, certain jurisdictions, such as the province of Ontario, incorporated CMG and Resource Intensity Weights (RIW™) in their funding formulas. The result was the optimization of coding by hospitals. Further, consulting firms were hired by hospitals to provide recommendations to hospital coders on how to maximize their CMG and RIW potential. These practices went unchallenged. CIHI at the time did not have the resources to enhance and update the Complexity overlay methodology, conduct audits, or conduct data mining exercises. Provincial Ministries of Health likewise did not conduct audits or penalize facilities for inappropriate coding.

Some consistent findings over the three-year study are noted below:

  • A number of data quality issues relating to diagnosis typing were identified.
  • Coding discrepancies for Most Responsible Diagnoses are low.
  • Diagnosis types 1 and 2 are being collected with increasingly greater frequency.
  • Diagnoses that should be typed as secondary are being typed as comorbid.
  • There has been a decreased frequency of diagnosis type 3 coding.
  • Principal Procedures are collected consistently.

Grouping Methodology Redevelopment

The development of a new acute care inpatient grouper is dependent on the quality of the data from which it is developed. The data quality re-abstraction studies have demonstrated that the Most Responsible Diagnosis codes and the Principal Procedure codes are captured in a reliable and consistent manner.

Coding is a subjective process and as such, there will never be full agreement between coders. A discussion on the discrepancy rate for principal or main diagnosis was held at the 2003 Patient Classification System/Europe, where some countries were noting discrepancy rates ranging from 15 to 25 percent. It would appear that coders submitting data to CIHI do well compared to some of their international colleagues. The challenge in Canada is in determining comorbid conditions.

The inconsistent coding practices in determining comorbid conditions are rooted in many causes:

  • Coders are not adhering to the CIHI coding standards due to not being aware of them, not reading them or not agreeing with them.
  • Coders have difficulty determining which conditions should be typed as comorbid versus secondary.
  • Coding for grouping and complexity assignment has taken precedence over coding for administrative purposes.
  • Consulting firms have promoted the coding for optimization of grouping potential.

Other Steps to Improve or Assess Data Quality

Along with the implementation of ICD-10-CA/CCI, CIHI supports several initiatives to encourage data quality. The Classifications department is continually working to improve and develop education workshops to assist Health Records specialists in the field to remain current with the classification. They also maintain an online Coding Query service. Canadian standards are developed and updated based on input from the field in consultation with a National Coding Advisory Committee. The membership of this committee includes representatives from all provinces and territories.

The Case Mix Department is currently conducting the following activities to assess the database for compliance with the coding standards:

  • Mining the ICD-10-CA/CCI data for erroneous data or trends, to ensure good data is used for the development of Resource Intensity Weights (RIW) or Expected Length of Stay (ELOS) statistics.
  • Initiating data mining exercises to identify opportunities to adjust the current database based on the identification of suspect coding and comparing these submissions with CIHI coding standards.
  • Initiating data analyses of the DAD to identify opportunities to create more diagnosis code edits.

Until the new grouping methodology can be developed with ICD-10-CA/CCI data, a conversion grouper is in place. Unfortunately, difficulties converting a very specific classification back to a relatively non-specific classification, along with new and updated coding standards, have resulted in some grouping shifts. To support CIHI clients in using the Case Mix Groups, CIHI created a document that:

  • Highlights the impact of the introduction of ICD-10-CA/CCI on the assignment of CMG and DPG
  • Informs users of potential limitations in the use of these grouping methodologies
  • Assists users in conducting time series analysis with data coded in ICD-10-CA/CCI grouped with CMG 2003

Refer to the CIHI website for the document Coping with the Introduction of ICD-10-CA and CCI-Fiscal 2002-2003.


CIHI is committed to ensuring quality data. While there is no standard definition of data quality, there are a number of dimensions of quality that can be consistently applied to the maintenance of data quality. These include accuracy, timeliness, comparability, usability, and relevance. Policy makers, healthcare leaders and the general public are dependent on quality data for decisions that affect the Canadian healthcare system. Through the ongoing data quality evaluations of CIHI's data holdings and the conduct of special data quality studies, CIHI will facilitate the continuous production of quality information.

CIHI will need to look at strengthening its coding and diagnosis typing standards and educating clients. At the same time, hospitals have to take responsibility to review and incorporate CIHI coding standards into their coding practices. Provincial ministries of health also need to take a proactive role in encouraging the appropriate use of diagnosis typing.

CIHI has adopted data quality across the corporation. Re-abstraction studies and data mining will ensure the continuous data quality review of coding practices and of data submitted to the DAD.




  1. Registered Trademark of the Canadian Institute for Health Information
  2. The target population of the Study is provincial acute care facilities reporting to the DAD. Facilities in Quebec and some in Manitoba do not submit data to the DAD. Facilities in the Yukon, Northwest, and Nunavut Territories were not included in the study population for cost reasons.
  3. CIHI Classification Specialists are certified with the Canadian College of Health Records Administrators; they are responsible for developing, interpreting, and teaching classification systems; are well experienced in various hospital settings; and have expert knowledge of medical terminology and diagnosis and procedure classification standards.
  4. Acute care facility is defined by the institution type flag of the DAD. This does not include rehabilitation, chronic care, nursing homes, psychiatric, home care, same day surgery, or emergency facilities reporting to the DAD.

Source: 2004 IFHRO Congress & AHIMA Convention Proceedings, October 2004