Deep learning for automated scoring of immunohistochemically stained tumour tissue sections – Validation across tumour types based on patient outcomes
Wanja Kildal,
Karolina Cyll,
Joakim Kalsnes,
Rakibul Islam,
Frida M. Julbø,
Manohar Pradhan,
Elin Ersvær,
Neil Shepherd,
Ljiljana Vlatkovic,
Xavier Tekpli,
Øystein Garred,
Gunnar B. Kristensen,
Hanne A. Askautrud,
Tarjei S. Hveem,
Håvard E. Danielsen,
Tone F. Bathen,
Elin Borgen,
Anne-Lise Børresen-Dale,
Olav Engebråten,
Britt Fritzman,
Olaf Johan Hartman-Johnsen,
Øystein Garred,
Jürgen Geisler,
Gry Aarum Geitvik,
Solveig Hofvind,
Rolf Kåresen,
Anita Langerød,
Ole Christian Lingjærde,
Gunhild M. Mælandsmo,
Bjørn Naume,
Hege G. Russnes,
Kristine Kleivi Sahlberg,
Torill Sauer,
Helle Kristine Skjerven,
Ellen Schlichting,
Therese Sørlie
Affiliations
Wanja Kildal
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway; Corresponding author.
Karolina Cyll
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Joakim Kalsnes
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Rakibul Islam
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Frida M. Julbø
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Manohar Pradhan
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Elin Ersvær
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Neil Shepherd
Gloucestershire Cellular Pathology Laboratory, Gloucester, GL53 7AN, UK
Ljiljana Vlatkovic
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Xavier Tekpli
Department of Medical Genetics, Institute of Clinical Medicine, Faculty of Medicine, University of Oslo and Oslo University Hospital, NO-0450, Oslo, Norway
Øystein Garred
Department of Pathology, Oslo University Hospital, NO-0424, Oslo, Norway
Gunnar B. Kristensen
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Hanne A. Askautrud
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Tarjei S. Hveem
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway
Håvard E. Danielsen
Institute for Cancer Genetics and Informatics, Oslo University Hospital, NO-0424, Oslo, Norway; Nuffield Division of Clinical Laboratory Sciences, University of Oxford, Oxford, OX3 9DU, UK
We aimed to develop deep learning (DL) models to detect protein expression in immunohistochemically (IHC) stained tissue sections, and to compare their accuracy and performance with manual scoring of clinically relevant proteins in common cancer types. Five cancer patient cohorts (colon, two prostate, breast, and endometrial) were included. We developed separate DL models for scoring IHC-stained tissue sections with nuclear, cytoplasmic, and membranous staining patterns. For training, we used images with annotations of cells with positive and negative staining from the colon cohort stained for Ki-67 and PMS2 (nuclear model), and from prostate cohort 1 stained for PTEN (cytoplasmic model) and β-catenin (membranous model). The nuclear DL model was validated for MSH6 in the colon cohort, MSH6 and PMS2 in the endometrial cohort, Ki-67 and CyclinB1 in the prostate cohorts, and oestrogen and progesterone receptors in the breast cancer cohort. The cytoplasmic DL model was validated for PTEN and Mapre2, and the membranous DL model for CD44 and Flotillin1, all in prostate cohorts. When comparing manual and DL scores in the validation sets, using manual scores as the ground truth, we observed an average correct classification rate of 91.5 % (range 76.9–98.5 %) for the nuclear model, 85.6 % (73.3–96.6 %) for the cytoplasmic model, and 78.4 % (75.5–84.3 %) for the membranous model. In survival analyses, manual and DL scores showed similar prognostic impact, with similar hazard ratios and p-values for all DL models. Our findings demonstrate that DL models offer a promising alternative to manual IHC scoring, providing efficiency and reproducibility across various data sources and markers.
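The correct classification rate reported above can be illustrated with a minimal sketch, assuming each case has been reduced to a binary class (1 = positive, 0 = negative) by both the manual and the DL scoring, and that the manual class is taken as the ground truth. The function name, the binary encoding, and the example labels are hypothetical and chosen for illustration only; they are not the study's data or implementation.

```python
from typing import Sequence


def correct_classification_rate(manual: Sequence[int], dl: Sequence[int]) -> float:
    """Return the fraction of cases where the DL class equals the manual class.

    The manual class is treated as the ground truth, so this is simply the
    proportion of agreeing cases.
    """
    if len(manual) != len(dl):
        raise ValueError("manual and dl must contain the same number of cases")
    if len(manual) == 0:
        raise ValueError("at least one case is required")
    matches = sum(m == d for m, d in zip(manual, dl))
    return matches / len(manual)


# Hypothetical binary scores for ten cases (illustration only, not study data).
manual_scores = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
dl_scores = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

rate = correct_classification_rate(manual_scores, dl_scores)
print(f"Correct classification rate: {rate:.1%}")
```

In this toy example the two scorings agree on 8 of 10 cases. An average and range per model, as quoted in the abstract, would follow from computing this rate separately for each validated marker and cohort.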