Inter-observer variability between readers of CT images: all for one and one for all

Nikolas S. Kulberg; Roman V. Reshetnikov; Vladimir P. Novik; Alexey B. Elizarov; Maxim A. Gusev; Victor A. Gombolevskiy; Anton V. Vladzymyrskyy; Sergey P. Morozov

doi:10.17816/DD60622

Authors: Kulberg N.S.¹,², Reshetnikov R.V.¹,³, Novik V.P.¹, Elizarov A.B.¹, Gusev M.A.¹,⁴, Gombolevskiy V.A.¹, Vladzymyrskyy A.V.¹, Morozov S.P.¹
Affiliations:
1. Moscow Center for Diagnostics and Telemedicine
2. Federal Research Center “Computer Science and Control” of Russian Academy of Sciences
3. Institute of Molecular Medicine, The First Sechenov Moscow State Medical University
4. Moscow Polytechnic University
Issue: Vol 2, No 2 (2021)
Pages: 105-118
Section: Original Study Articles
URL: https://journals.rcsi.science/DD/article/view/60622
DOI: https://doi.org/10.17816/DD60622
ID: 60622

Abstract

BACKGROUND: The markup of medical image datasets is based on the subjective interpretation of the observed entities by radiologists. There is currently no widely accepted protocol for determining ground truth based on radiologists’ reports.

AIM: To assess the accuracy of radiologist interpretations and their agreement for the publicly available dataset “CTLungCa-500”, as well as the relationship between these parameters and the number of independent readers of CT scans.

MATERIALS AND METHODS: Thirty-four radiologists took part in the dataset markup. The dataset included 536 patients who were at high risk of developing lung cancer. For each scan, six radiologists worked independently to create a report. After that, an arbitrator reviewed the lesions discovered by them. The number of true-positive, false-positive, true-negative, and false-negative findings was calculated for each reader to assess diagnostic accuracy. Further, the inter-observer variability was analyzed using the percentage agreement metric.
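The two measures named above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: accuracy is taken as the share of correct findings, (TP + TN) / (TP + FP + TN + FN), and percentage agreement is taken as the mean pairwise agreement across readers. The function names `accuracy` and `percent_agreement` and the input layout are hypothetical, and the article's own computation may differ.

```python
from itertools import combinations
from typing import Sequence


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of correct findings among all findings of one reader (assumed definition)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no findings to evaluate")
    return (tp + tn) / total


def percent_agreement(ratings: Sequence[Sequence[int]]) -> float:
    """Mean pairwise agreement between readers, in percent (assumed definition).

    ratings[i][j] is the label reader j gave to candidate lesion i,
    for example 1 = marked, 0 = not marked.
    """
    agreements = 0
    comparisons = 0
    for row in ratings:
        # Compare every pair of readers on the same candidate lesion.
        for a, b in combinations(row, 2):
            agreements += int(a == b)
            comparisons += 1
    if comparisons == 0:
        raise ValueError("need at least two readers and one lesion")
    return 100.0 * agreements / comparisons
```

With three readers and two candidate lesions, `percent_agreement([[1, 1, 1], [1, 0, 1]])` counts four agreeing pairs out of six comparisons. Averaging over pairs is only one convention; another common one counts a lesion as agreed only when all readers give the same label.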

RESULTS: Increasing the number of independent readers who interpret a CT scan improves accuracy but reduces agreement. Most disagreements concerned the presence of a lung nodule at a specific site in the CT scan.

CONCLUSION: If arbitration is provided, an increase in the number of independent initial readers can improve their combined accuracy. The experience and diagnostic accuracy of individual readers have no bearing on the quality of a crowd-tagging annotation. At four independent readings per CT scan, the optimal balance of markup accuracy and cost was achieved.

Keywords

X-ray computed tomography, datasets as topic, ground truth, observer variation

About the authors

Nikolas S. Kulberg

Moscow Center for Diagnostics and Telemedicine; Federal Research Center “Computer Science and Control” of Russian Academy of Sciences

Author for correspondence.
Email: kulberg@npcmr.ru
ORCID iD: 0000-0001-7046-7157
SPIN-code: 2135-9543

Cand. Sci. (Phys.-Math.)

Russian Federation, 24 Petrovka str., 109029, Moscow; Moscow

Roman V. Reshetnikov

Moscow Center for Diagnostics and Telemedicine; Institute of Molecular Medicine, The First Sechenov Moscow State Medical University

Email: reshetnikov@fbb.msu.ru
ORCID iD: 0000-0002-9661-0254
SPIN-code: 8592-0558

Cand. Sci. (Phys.-Math.)

Russian Federation, 24 Petrovka str., 109029, Moscow; Moscow

Vladimir P. Novik

Moscow Center for Diagnostics and Telemedicine

Email: v.novik@npcmr.ru
ORCID iD: 0000-0002-6752-1375
SPIN-code: 2251-1016
Russian Federation, 24 Petrovka str., 109029, Moscow

Alexey B. Elizarov

Moscow Center for Diagnostics and Telemedicine

Email: a.elizarov@npcmr.ru
ORCID iD: 0000-0003-3786-4171
SPIN-code: 7025-1257

Cand. Sci. (Phys.-Math.)

Russian Federation, 24 Petrovka str., 109029, Moscow

Maxim A. Gusev

Moscow Center for Diagnostics and Telemedicine; Moscow Polytechnic University

Email: m.gusev@npcmr.ru
ORCID iD: 0000-0001-8864-8722
SPIN-code: 1526-1140
Russian Federation, 24 Petrovka str., 109029, Moscow; Moscow

Victor A. Gombolevskiy

Moscow Center for Diagnostics and Telemedicine

Email: g_victor@mail.ru
ORCID iD: 0000-0003-1816-1315
SPIN-code: 6810-3279

MD, Cand. Sci. (Med.)

Russian Federation, 24 Petrovka str., 109029, Moscow

Anton V. Vladzymyrskyy

Moscow Center for Diagnostics and Telemedicine

Email: a.vladzimirsky@npcmr.ru
ORCID iD: 0000-0002-2990-7736
SPIN-code: 3602-7120

Dr. Sci. (Med.), Professor

Russian Federation, 24 Petrovka str., 109029, Moscow

Sergey P. Morozov

Moscow Center for Diagnostics and Telemedicine

Email: morozov@npcmr.ru
ORCID iD: 0000-0001-6545-6170
SPIN-code: 8542-1720

Dr. Sci. (Med.), Professor

Russian Federation, 24 Petrovka str., 109029, Moscow

Supplementary files

Fig. 1. Accuracy and agreement of assessments as a function of the number of radiologists participating in the primary reading. The 95% confidence interval is shown in gray. The points correspond to different samples of primary experts. For experiments with two, three, and four experts, three different samples were selected from the original six radiologists; for five experts, two.

Fig. 2. Examples of CT studies with significant disagreement (a, b; CTLungCa-500 AN RLADD02000018919, ID RLSDD02000018855) and full agreement (c, d; CTLungCa-500 AN RLAD42D007-25151, ID RLSD42D007-25151) between experts. The studies are shown in frontal projection in lung (a, c) and soft-tissue (b, d) windows. The radiologists' marks are shown in different colors. a, b: the lesion was marked by five of the six primary experts; four classified it as solid and one as semi-solid. The arbiter disagreed, identifying the finding as a benign calcification. c, d: all six primary experts and the arbiter classified the lesion as potentially malignant and solid.

Fig. 3. Agreement between primary experts: a, for the original cohort of 15 radiologists; b, for the replacement radiologists. Data for the expert with ID 000 ++ are not shown because of the small number of lesions marked. For each radiologist, the first column shows the number of lesions marked by that specialist alone (none of the other five experts marked the finding). The subsequent columns correspond to cases in which the lesion identified by the radiologist was also marked by one, two, three, four, or five other primary experts. The plot does not take into account the arbiter's approval or differences of opinion between radiologists about the lesion type.
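The per-radiologist counts described in this caption could be tallied roughly as follows. This is a sketch under assumptions: lesions are already matched across readers, each lesion is represented by the set of reader IDs that marked it, and the function name `co_detection_histogram` and the data layout are hypothetical rather than taken from the article.

```python
from collections import Counter, defaultdict
from typing import Dict, Set


def co_detection_histogram(lesions: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """For each reader, count lesions by the number of OTHER readers who marked them.

    lesions maps a lesion ID to the set of reader IDs that marked it.
    Key 0 in a reader's Counter means the lesion was marked by that reader alone.
    """
    hist: Dict[str, Counter] = defaultdict(Counter)
    for readers in lesions.values():
        others = len(readers) - 1  # readers who marked it, excluding the reader in question
        for reader in readers:
            hist[reader][others] += 1
    return dict(hist)


# Hypothetical example: three lesions and three readers.
example = {
    "n1": {"A", "B", "C"},
    "n2": {"A"},
    "n3": {"A", "B"},
}
histogram = co_detection_histogram(example)
```

As in the figure, the tally ignores the arbiter's decision and any disagreement about lesion type; it records only whether a reader marked the lesion.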
