Assessment of machine learning methods for automatic tumor segmentation
Abstract
The definition of target volumes and organs at risk (OARs) is a critical part of radiotherapy planning. In routine practice, this is typically done manually by clinical experts who contour the structures in medical images prior to dosimetric planning. This is a time-consuming and labor-intensive task. Moreover, manual contouring is inherently a subjective task and substantial contour variability can occur, potentially impacting on radiotherapy treatment and image-derived biomarkers. Automatic segmentation (auto-segmentation) of target volumes and
OARs has the potential to save time and resources while reducing contouring variability. Recently, auto-segmentation of OARs using machine learning methods has been integrated into the clinical workflow by several institutions and such tools have been made commercially available by major vendors. The use of machine learning methods for auto-segmentation of target volumes including the gross tumor volume (GTV) is less mature at present but is the focus of extensive ongoing research.
The primary aim of this thesis was to investigate the use of machine learning methods for auto-segmentation of the GTV in medical images. Manual GTV contours constituted the ground truth in the analyses. Volumetric overlap and distance-based metrics were used to quantify auto-segmentation performance. Four different
image datasets were evaluated. The first dataset, analyzed in papers I–II, consisted of positron emission tomography (PET) and contrast-enhanced computed tomography (ceCT) images of 197 patients with head and neck cancer (HNC). The ceCT images of this dataset were also included in paper IV. Two datasets were analyzed separately in paper III, namely (i) PET, ceCT, and low-dose CT (ldCT) images of 86 patients with anal cancer (AC), and (ii) PET, ceCT, ldCT, and T2 and diffusion-weighted (T2W and DW, respectively) MR images of a subset (n = 36) of the aforementioned AC patients. The last dataset consisted of ceCT images
of 36 canine patients with HNC and was analyzed in paper IV.
In paper I, three approaches to auto-segmentation of the GTV in patients with HNC were evaluated and compared, namely conventional PET thresholding, classical machine learning algorithms, and deep learning using a 2-dimensional (2D) U-Net convolutional neural network (CNN). For the latter two approaches the effect of imaging modality on auto-segmentation performance was also assessed. Deep learning based on multimodality PET/ceCT image input resulted in superior agreement with the manual ground truth contours, as quantified by geometric overlap and distance-based performance evaluation metrics calculated on a per patient basis. Moreover, only deep learning provided adequate performance for segmentation based solely on ceCT images. For segmentation based on PET-only, all three approaches provided adequate segmentation performance, though deep learning ranked first, followed by classical machine learning, and PET thresholding. In paper II, deep learning-based auto-segmentation of the GTV in patients with HNC using a 2D U-Net architecture was evaluated more thoroughly by introducing new structure-based performance evaluation metrics and including qualitative expert evaluation of the resulting auto-segmentation quality. As in paper I, multimodal PET/ceCT image input provided superior segmentation performance, compared to the single modality CNN models. The structure-based metrics showed quantitatively that the PET signal was vital for the sensitivity of the CNN models, as the superior PET/ceCT-based model identified 86 % of all malignant GTV structures whereas the ceCT-based model only identified 53 % of these structures. Furthermore, the majority of the qualitatively evaluated auto-segmentations (~ 90 %) generated by the best PET/ceCT-based CNN were given a quality score corresponding to substantial clinical value. Based on papers I and II, deep learning with multimodality PET/ceCT image input would be the recommended approach for auto-segmentation of the GTV in human patients with HNC.
In paper III, deep learning-based auto-segmentation of the GTV in patients with AC was evaluated for the first time, using a 2D U-Net architecture. Furthermore, an extensive comparison of the impact of different single modality and multimodality combinations of PET, ceCT, ldCT, T2W, and/or DW image input on quantitative auto-segmentation performance was conducted. For both the 86-patient and 36-patient datasets, the models based on PET/ceCT provided the highest mean overlap with the manual ground truth contours. For this task, however, comparable auto-segmentation quality was obtained for solely ceCT-based CNN models. The CNN model based solely on T2W images also obtained acceptable auto-segmentation performance and was ranked as the second-best single modality model for the 36-patient dataset. These results indicate that deep learning could prove a versatile future tool for auto-segmentation of the GTV in patients with AC.
Paper IV investigated for the first time the applicability of deep learning-based auto-segmentation of the GTV in canine patients with HNC, using a 3-dimensional (3D) U-Net architecture and ceCT image input. A transfer learning approach where CNN models were pre-trained on the human HNC data and subsequently fine-tuned on canine data was compared to training models from scratch on canine data. These two approaches resulted in similar auto-segmentation performances, which on average was comparable to the overlap metrics obtained for ceCT-based auto-segmentation in human HNC patients. Auto-segmentation in canine HNC patients appeared particularly promising for nasal cavity tumors, as the average overlap with manual contours was 25 % higher for this subgroup, compared to the average for all included tumor sites.
In conclusion, deep learning with CNNs provided high-quality GTV autosegmentations for all datasets included in this thesis. In all cases, the best-performing deep learning models resulted in an average overlap with manual contours which was comparable to the reported interobserver agreements between human experts performing manual GTV contouring for the given cancer type and imaging modality. Based on these findings, further investigation of deep learning-based auto-segmentation of the GTV in the given diagnoses would be highly warranted. Definisjon av målvolum og risikoorganer er en kritisk del av planleggingen av strålebehandling. I praksis gjøres dette vanligvis manuelt av kliniske eksperter som tegner inn strukturenes konturer i medisinske bilder før dosimetrisk planlegging. Dette er en tids- og arbeidskrevende oppgave. Manuell inntegning er også subjektiv, og betydelig variasjon i inntegnede konturer kan forekomme. Slik variasjon kan potensielt påvirke strålebehandlingen og bildebaserte biomarkører. Automatisk segmentering (auto-segmentering) av målvolum og risikoorganer kan potensielt spare tid og ressurser samtidig som konturvariasjonen reduseres. Autosegmentering av risikoorganer ved hjelp av maskinlæringsmetoder har nylig blitt implementert som del av den kliniske arbeidsflyten ved flere helseinstitusjoner, og slike verktøy er kommersielt tilgjengelige hos store leverandører av medisinsk teknologi. Auto-segmentering av målvolum inkludert tumorvolumet gross tumor volume (GTV) ved hjelp av maskinlæringsmetoder er per i dag mindre teknologisk modent, men dette området er fokus for omfattende pågående forskning.
Hovedmålet med denne avhandlingen var å undersøke bruken av maskinlæringsmetoder for auto-segmentering av GTV i medisinske bilder. Manuelle GTVinntegninger utgjorde grunnsannheten (the ground truth) i analysene. Mål på volumetrisk overlapp og avstand mellom sanne og predikerte konturer ble brukt til å kvantifisere kvaliteten til de automatisk genererte GTV-konturene. Fire forskjellige bildedatasett ble evaluert. Det første datasettet, analysert i artikkel I–II, bestod av positronemisjonstomografi (PET) og kontrastforsterkede computertomografi (ceCT) bilder av 197 pasienter med hode/halskreft. ceCT-bildene i dette datasettet ble også inkludert i artikkel IV. To datasett ble analysert separat i artikkel III, nemlig (i) PET, ceCT og lavdose CT (ldCT) bilder av 86 pasienter med analkreft, og (ii) PET, ceCT, ldCT og T2- og diffusjonsvektet (henholdsvis T2W og DW) MR-bilder av en undergruppe (n = 36) av de ovennevnte analkreftpasientene. Det siste datasettet, som bestod av ceCT-bilder av 36 hunder med hode/halskreft, ble analysert i artikkel IV.