Designing ground truth for Machine Learning - conceptualisation of a collaborative design process between medical professionals and data scientists
European Society for Socially Embedded Technologies (EUSSET)
The development of Machine Learning (ML) models is a complex process consisting of several iterative steps, such as problem definition, data collection and processing, feature engineering, model training, and evaluation. While research on ML model development is growing, little is known about how ground truth is designed in the datasets that serve as the backbone of many ML-based systems. Design choices made before the labelling process often become invisible, and the ground truth becomes an infrastructural part of the data, which prevents it from being inspected when problems arise at later stages of the data science cycle. I observed the collaborative work of radiologists and data scientists on ground truth design. I report on the process they adopted, which is divided into three stages: Stage 1 - assessment of data requirements and labelling practices; Stage 2 - design and evaluation of the label structure; and Stage 3 - design and evaluation of the labelling tool. Moreover, I introduce two Stage 2 activities, ideation and stress testing, which are used to design high-quality labels. Finally, I pose outstanding questions to unpack the tensions and motivations observed during the ethnographic work.