© 1999 American Society of Clinical Oncology
Interobserver Variability in the Detection of Cervical-Thoracic Hodgkin's Disease by Computed Tomography
From the Pediatric Oncology Group, Diagnostic Imaging Committee, and Department of Diagnostic Imaging, St Jude Children's Research Hospital, Memphis, TN; Quality Assurance Review Center, Providence, RI; and Pediatric Oncology Group, Statistical Office, Gainesville, FL. Address reprint requests to Barry D. Fletcher, MD, Department of Diagnostic Imaging, St Jude Children's Research Hospital, 332 N Lauderdale St, Memphis, TN 38105; email barry.fletcher@stjude.org
PURPOSE: Computed tomography (CT) scans of the neck and chest are obtained at diagnosis of Hodgkin's disease to establish disease extent, plan radiotherapy, and serve as baseline studies for subsequent evaluation of response to therapy. However, differences in interpretation may occur even among experienced radiologists. This study was designed to test the extent of variation among expert radiologists' interpretations and to assess how their interpretations differed from those of the primary (institutional) radiologists.
MATERIALS AND METHODS: Five radiologists independently reviewed randomly selected CT scans of 59 patients enrolled onto two Pediatric Oncology Group Hodgkin's disease treatment protocols. For each patient, 31 potential disease sites were scored as positive, negative, uncertain, or unassessable. Agreement among the reviewers and between the reviewers and the primary readers was analyzed.
RESULTS: For 58% of the sites, at least four of the five reviewers agreed in
CONCLUSION: There are disparities among radiologists' interpretations of cervical-thoracic CT imaging of patients with Hodgkin's disease. This variability may affect patient care and the performance and results of multi-institutional clinical trials. We propose that a standardized method of reporting might improve the consistency of interpretation of CT scans in these patients.
THE MAJORITY OF PATIENTS with Hodgkin's disease present with cervical and/or thoracic lymph node involvement, and computed tomography (CT) of the chest is the primary diagnostic imaging procedure for determining disease extent.1 In as many as 15% of anatomic sites examined, CT scans provide evidence of disease involvement that is not apparent on chest radiography.2 For that reason, the CT findings play an important role in disease staging and radiation therapy planning. Accurate and consistent interpretation of CT images is crucial to the effectiveness of Hodgkin's disease treatment protocols.3 Because imaging diagnoses are inherently subjective, interpretations may vary substantially even among expert observers.4-8 Several studies have shown that radiologists differ in their interpretation of CT images used to determine the mediastinal lymph node status of patients with lung cancer.9-12 In the assessment of CT scans of pediatric patients with Hodgkin's disease, Fletcher et al13 found strong agreement between two observers in the same institution but did not compare their findings at individual thoracic disease sites. We have noted that treating institutions' radiologic reports sometimes differ from the findings of central reviewers for the Pediatric Oncology Group (POG) Quality Assurance Review Center in interpreting the CT images of patients enrolled onto POG protocols for Hodgkin's disease. We therefore performed a retrospective study to estimate interobserver concordance among an expert panel of radiologists in detecting nodal and extranodal disease, to estimate the frequency with which findings of the review panel and the primary institutional readers varied, and to identify sites of nodal disease that are most prone to inconsistent interpretation.
Five radiologists were selected from the POG Diagnostic Imaging Discipline Committee for their experience in interpreting CT scans of patients with Hodgkin's disease. They independently reviewed intravenous contrast-enhanced CT images of the neck and thorax that were selected as a simple random sample of appropriate images from Quality Assurance Review Center files of all patients enrolled onto the POG 9225 (advanced stage) and POG 9226 (early stage) Hodgkin's disease clinical studies. Fifty-nine CT scans were chosen, representing 39 (61%) of the 64 patients enrolled onto POG 9225 and 20 (54%) of the 37 patients enrolled onto POG 9226. The total number of scans chosen for review (59) was deemed sufficient for meaningful statistical analysis while remaining a reasonable workload for the 1-day period allotted to each radiologist to complete the review. All CT scans were interpreted independently. The readers were blinded to the prior reader's interpretation, the assigned disease stage, and the patient's outcome. The scans were presented to all of the participating radiologists in the same order. The readers were informed only that the subjects were children aged 3 to 18 years with histologically confirmed Hodgkin's disease who had disease above the diaphragm. The readers used a data capture form to score disease involvement at a total of 31 sites (Fig 1). Each location was scored as positive (disease present), negative (disease absent), uncertain, or unassessable. No specific criteria for disease status were mandated for use by the panelists.
Reviewer proficiency was assessed for each site by calculating Williams' index of agreement,14 which expresses how well each reader's ratings agree with those of the other readers. The O'Connell-Dobson modification of the kappa statistic15 was used to measure overall agreement among the reviewers at each site. Finally, we used the kappa statistic16 to assess agreement between a consensus (simple majority) of the review panel and the primary radiologist's written report of the CT studies. A statistical package (SAS Institute, Inc, Cary, NC) was used to perform all calculations.
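The comparison between a simple-majority consensus of the panel and the primary reader can be sketched in a few lines of Python. This is an illustration only, not the SAS analysis used in the study: it assumes the ordinary two-rater (Cohen) form of kappa, and the function names `majority` and `cohen_kappa` and all ratings shown are hypothetical.

```python
from collections import Counter

def majority(ratings):
    """Return the rating given by more than half of the readers, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None

def cohen_kappa(a, b):
    """Two-rater kappa: (observed - expected) / (1 - expected)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)

# Hypothetical ratings at one site: four cases, five panel readers each.
panel = [
    ["pos", "pos", "pos", "neg", "pos"],
    ["neg", "neg", "neg", "neg", "neg"],
    ["pos", "neg", "pos", "pos", "unc"],
    ["neg", "neg", "pos", "neg", "neg"],
]
primary = ["pos", "neg", "neg", "neg"]  # hypothetical primary-reader report

consensus = [majority(case) for case in panel]
kappa = cohen_kappa(consensus, primary)
print(consensus, round(kappa, 3))
```

In this toy example every case has a majority rating; how cases without a majority were handled in the study is not stated, so the sketch simply returns `None` for them.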
Interobserver Variation
First we assessed the magnitude of variation among the members of the review panel in interpretation of the CT scans. Williams' index expresses agreement of a single reader with each of the other readers in the group. An index less than 1.0 indicates that an individual reviewer's rate of agreement was lower than that of the other reviewers. The index was calculated for each reviewer at each of the 31 sites (Table 1). One reviewer had indices less than 0.9 at 10 sites, whereas the other four reviewers' indices were less than 0.9 at no more than two sites. On average, no reviewer's index of agreement differed by more than 10% from the norm.
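As a rough illustration of how such an index behaves, the sketch below uses one common formulation of Williams' index: the mean agreement of the target reader with each other reader, divided by the mean pairwise agreement among the other readers. Whether the study computed it in exactly this form is an assumption, and the function names and ratings are hypothetical.

```python
from itertools import combinations

def agreement(a, b):
    """Proportion of cases on which two readers gave the same rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def williams_index(readers, target):
    """Ratio of the target reader's mean agreement with the others
    to the mean pairwise agreement among the others (assumed formulation)."""
    others = [r for i, r in enumerate(readers) if i != target]
    p_target = sum(agreement(readers[target], r) for r in others) / len(others)
    pairs = list(combinations(others, 2))
    p_others = sum(agreement(a, b) for a, b in pairs) / len(pairs)
    return p_target / p_others

# Hypothetical ratings at one site: four readers, four cases each.
readers = [
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "pos", "pos"],
    ["neg", "pos", "pos", "neg"],  # reader who often disagrees with the rest
]

for i in range(len(readers)):
    print(i, round(williams_index(readers, i), 3))
```

A value below 1.0 for a reader means that reader agrees with the others less often than they agree with one another, which matches the interpretation given above.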
Table 2 shows the numbers of cases in which at least four of the five reviewers agreed, listed by site. At no site did all five reviewers agree in all 59 cases. However, for approximately 58% of the sites, four or all five of the expert panelists agreed in
Agreement With the Primary Reader
This study found that expert radiologists did not agree completely in their interpretation of cervical-thoracic CT images of children with Hodgkin's disease and that institutional radiologists' interpretations agreed poorly with overall expert readings. Although these results have important implications for the care of patients with Hodgkin's disease and the analysis of treatment protocol data, they are not unprecedented, and their significance may be underestimated by clinicians. Previous studies also have shown significant interobserver variation in the detection of other intrathoracic malignancies by CT imaging.9,11,12,17 Analysis of observer variability, when applied, is an important component of studies that involve multiple readers, and it provides valuable information about their applicability.18 Our first objective was to measure variability in interpretation among five radiologists who had experience and acknowledged expertise in interpreting CT images of patients with Hodgkin's disease. Although it was not possible to determine the accuracy of any individual interpretation, we were able to compare each expert reader's interpretation with the consensus view of the other panelists. One reader's interpretations differed somewhat from those of the rest of the group. However, there was no serious outlier at any site.
The magnitude of observer variation may be expected to increase in proportion to the difficulty of interpreting images of potential disease sites.6 Therefore, our listing of potential disease sites in order of reader agreement indirectly lists those sites by increasing difficulty of interpretation. In approximately 60% of the sites, at least four of the five expert radiologists agreed in Use of the kappa statistic, which allowed us to take into account the magnitude of agreement attributable to chance alone,16 identified a number of problematic sites. The low kappa scores associated with infraclavicular sites may reflect difficulty in defining their precise anatomic location relative to the clavicles, supraclavicular fossae, and axillae. Inconsistent CT interpretation of hilar lymphadenopathy has been recognized previously19 and was apparent here. However, several low kappa values are likely to have been artifactual. The lowest values were associated with evaluation of left paratracheal nodes, which two readers did not recognize as a distinct anatomic site, rating it unassessable. Furthermore, kappa statistics tend to weight disagreements more heavily when the prevalence of a positive finding approaches zero.20 This statistical anomaly may explain the poor agreement obtained on interpretation of chest wall, pericardial, posterior mediastinal, and internal mammary node involvement. The kappa analysis also identified patients whose CT scans were difficult to interpret. In five of the 59 cases, two readers rated some sites as unassessable, whereas the other three noted most sites as negative. Our final objective was to determine the level of agreement between the review panel and the primary radiologist at the treating facility. Poor agreement was demonstrated for two thirds of the sites, a considerably higher frequency than that found among members of the review panel. 
However, the primary readers had interpreted the images prospectively in a clinical setting and did not use the structured data collection form used by the panel members. Consequently, their reports used less consistent definitions of disease sites. Furthermore, the extent and influence of their interactions with the patients' clinicians are unknown. The relatively large disparity between the panelists' and the primary readers' interpretations suggests that a uniform approach to interpretation of CT images might enhance the usefulness of these studies in patients with newly diagnosed Hodgkin's disease. The variable interpretations of the CT scans among the review panel suggest that even central review of images is an imperfect process and that collaborative treatment protocols for newly diagnosed Hodgkin's disease should incorporate scientific procedures designed to formulate a multireviewer consensus.18 A more immediate concern is the high level of disagreement found between the review panel and the primary reader. This disparity may represent reader error, but it more likely reflects inherent differences between a standardized scoring system and a conventional descriptive radiologic report. In either case, initial disease staging and subsequent delivery of radiation therapy could be adversely affected. To some extent, the variations in interpretation we observed can be mitigated by the use of additional imaging methods and clinical information. However, as shown in a previous study of chest radiograph interpretation, knowledge of the clinical status of the patient improves the diagnostic accuracy more than the interobserver concordance.7 Furthermore, CT introduces problems of a more technical nature. The quality of the image can be a significant factor in the interpretability of the study, as we found in at least five of the 59 CT scans the panel of observers were given in this study. 
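The dependence of kappa on prevalence, noted above for rarely involved sites such as the chest wall and pericardium, can be shown with a small worked example. The sketch below assumes the two-rater form of kappa and uses two invented 2 x 2 tables with identical raw agreement (90%); the counts and the function name `kappa_from_counts` are hypothetical.

```python
def kappa_from_counts(both_pos, a_only, b_only, both_neg):
    """Two-rater kappa from the four cells of a 2 x 2 agreement table."""
    n = both_pos + a_only + b_only + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_only) / n   # rater A's proportion of positives
    b_pos = (both_pos + b_only) / n   # rater B's proportion of positives
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (observed - expected) / (1 - expected)

# Same raw agreement (90 of 100 cases) at two hypothetical sites.
kappa_balanced = kappa_from_counts(45, 5, 5, 45)  # disease common
kappa_rare = kappa_from_counts(1, 5, 5, 89)       # disease rarely present

print(round(kappa_balanced, 3), round(kappa_rare, 3))
```

With the same ten disagreements, kappa falls from 0.8 to roughly 0.11 when positive findings become rare, because agreement expected by chance rises to about 89%.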
Our findings add to a growing literature documenting variations in interpretation of radiologic studies. A recent review by Robinson6 concluded that errors and variations in interpretation are "the weakest aspect of clinical imaging." In mammography, this problem is being addressed by full implementation of the American College of Radiology's recommended Breast Imaging Reporting and Data System,21 which uses a structured terminology and a systematic assessment of specified breast areas. Other systems have also been recommended, including the Radiographic Vertebral Index 6 for assessing the extent of disease in multiple myeloma.22 Standardized terminology and definitions of disease sites have been used successfully in lung cancer studies.10,11,19 A systematic assessment form such as the one used in this study may improve the accuracy and quality of cervical-thoracic CT interpretation in pediatric Hodgkin's disease, but prospective validation will be needed. Variations in interpretation are addressed infrequently in clinical investigations and are even less frequently addressed in multi-institutional clinical trials. This study assessed only the diagnostic imaging findings at initial presentation of patients treated on two POG Hodgkin's disease studies. Outcome measurements based on tumor regression/progression or the appearance of new tumor sites may be subject to similar variations in interpretation that could profoundly affect the conclusions of a clinical trial. Our study shows that the contribution of diagnostic imaging interpretations to the results of clinical trials should be appraised carefully, and that these interpretations should be integrated with the available clinical data in a central review. We also suggest that variations in image interpretation may be reduced by using a standard method to report abnormal findings. This approach may warrant widespread use after validation in a prospective clinical trial.
Supported by grants no. CA-29511 (Quality Assurance Review Center) and CA-03161, CA-69177, CA-29691, CA-33625, CA-15525, CA-31566, CA-20549, CA-28476, CA-29293, CA-33587, CA-69428, CA-28383, CA-32053, CA-25408, CA-33603, CA-15989, and CA-05587 (Pediatric Oncology Group) from the National Cancer Institute. We thank the following radiologists for their participation as members of the review panel: Elliott K. Fishman, Fredric A. Hoffer, William M. Kauffman, Jonathan L. Williams, and John C. Leonidas.
1. Cohen MD, Siddiqui A, Weetman R, et al: Hodgkin disease and non-Hodgkin lymphomas in children: Utilization of radiological modalities. Radiology 158:499-505, 1986
2. Castellino RA, Blank N, Hoppe RT, et al: Hodgkin disease: Contributions of chest CT in the initial staging evaluation. Radiology 160:603-605, 1986
3. Castellino RA: Diagnostic imaging evaluation of Hodgkin's disease and non-Hodgkin's lymphoma. Cancer 67:1177-1180, 1991
4. Espeland A, Korsbrekke K, Albrektsen G, et al: Observer variation in plain radiography of the lumbosacral spine. Br J Radiol 71:366-375, 1998
5. Jarvik JG, Haynor DR, Koepsell TD, et al: Interreader reliability for a new classification of lumbar disk disease. Acad Radiol 3:537-544, 1995
6. Robinson PJA: Radiology's Achilles heel: Error and variation in the interpretation of the Röntgen image. Br J Radiol 70:1085-1098, 1997
7. Tudor GR, Finlay D, Taub N: An assessment of inter-observer agreement and accuracy when reporting plain radiographs. Clin Radiol 52:235-238, 1997
8. Shaw NJ, Hendry M, Eden OB: Inter-observer variation in interpretation of chest x-rays. Scot Med J 35:140-141, 1990
9. Bollen ECM, Goei R, Hof-Grootenboer BE, et al: Interobserver variability and accuracy of computed tomographic assessment of nodal status in lung cancer. Ann Thorac Surg 58:158-162, 1994
10. Cascade PN, Gross BH, Kazerooni EA, et al: Variability in the detection of enlarged mediastinal lymph nodes in staging lung cancer: A comparison of contrast-enhanced and unenhanced CT. AJR Am J Roentgenol 170:927-931, 1998
11. Guyatt GH, Lefcoe M, Walter S, et al: Interobserver variation in the computed tomographic evaluation of mediastinal lymph node size in patients with potentially resectable lung cancer. Chest 107:116-119, 1995
12. Webb WR, Sarin M, Zerhouni EA, et al: Interobserver variability in CT and MR staging of lung cancer. J Comput Assist Tomogr 17:841-846, 1993
13. Fletcher BD, Kauffman WM, Kaste SC, et al: Use of Tl-201 to detect untreated pediatric Hodgkin disease. Radiology 196:851-855, 1995
14. Williams GW: Comparing the joint agreement of several raters with another rater. Biometrics 32:619-627, 1976
15. O'Connell DL, Dobson AJ: General observer-agreement measures on individual subjects and groups of subjects. Biometrics 40:973-983, 1984
16. Fleiss JL: Measurement of interrater agreement, in Statistical Methods for Rates and Proportions. New York, NY, Wiley, 1981, pp 217-225
17. Wilimas JA, Kaste SC, Kauffman WM, et al: Use of chest computed tomography in staging of pediatric Wilms' tumor: Interobserver variability and prognostic significance. J Clin Oncol 15:2631-2635, 1997
18. Obuchowski NA, Zepp RC: Simple steps for improving multiple-reader studies in radiology. AJR Am J Roentgenol 166:517-521, 1996
19. Glazer GM, Gross BH, Aisen AM, et al: Imaging of the pulmonary hilum: A prospective comparative study in patients with lung cancer. AJR Am J Roentgenol 145:245-248, 1985
20. Thompson WD, Walter SD: A reappraisal of the kappa coefficient. J Clin Epidemiol 41:949-958, 1988
21. American College of Radiology: Breast Imaging Reporting and Data System (BI-RADS). Reston, VA, American College of Radiology, 1993
22. Browman GP, Markman S, Thompson G, et al: Assessment of observer variation in measuring the radiographic vertebral index in patients with multiple myeloma. J Clin Epidemiol 43:833-840, 1990
Submitted December 3, 1998; accepted March 12, 1999.
Copyright © 1999 by the American Society of Clinical Oncology, Online ISSN: 1527-7755. Print ISSN: 0732-183X