Interpretable Machine Learning

4.3 Risk Factors for Cervical Cancer (Classification)

The cervical cancer dataset contains indicators and risk factors for predicting whether a woman will get cervical cancer. The features include demographics (e.g. age), habits, and medical history. The data can be downloaded from the UCI Machine Learning Repository and is described by Fernandes, Cardoso, and Fernandes (2017).

The subset of features used in the examples is:

Age in years
Number of sexual partners
First sexual intercourse (age in years)
Number of pregnancies
Smokes yes (1) or no (0)
Smokes (years)
Hormonal Contraceptives yes (1) or no (0)
Hormonal Contraceptives (years)
IUD: Intrauterine device yes (1) or no (0)
IUD (years): Number of years with an intrauterine device
STDs: Ever had a sexually transmitted disease? Yes (1) or no (0)
STDs (number): Number of sexually transmitted diseases.
STDs: Number of diagnoses
STDs: Time since first diagnosis
STDs: Time since last diagnosis
Biopsy: Biopsy results “Healthy” or “Cancer”. Target outcome.

As the biopsy serves as the gold standard for diagnosing cervical cancer, the classification task in this book uses the biopsy outcome as the target. Missing values in each column were imputed with the mode (the most frequent value), which is probably a bad solution, because the true answer might be correlated with the probability that the value is missing. There is probably a bias, because the questions are of a very private nature. But this is not a book about missing data imputation, so mode imputation will have to suffice!
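To make the imputation step concrete, here is a minimal sketch of per-column mode imputation in Python with pandas. It is an illustration under stated assumptions, not the code used to prepare the book's data: the helper `impute_mode`, the column names, and the toy values are all hypothetical, and ties between equally frequent values are broken by taking the smallest one.

```python
import numpy as np
import pandas as pd


def impute_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing values replaced by each column's mode.

    Hypothetical helper for illustration only. Series.mode() returns the
    most frequent values in sorted order, so ties resolve to the smallest.
    """
    filled = df.copy()
    for col in filled.columns:
        modes = filled[col].mode(dropna=True)
        if not modes.empty:  # a column with only missing values stays as-is
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Toy data with made-up values; the column names are assumptions.
toy = pd.DataFrame(
    {
        "age": [25, 31, 31, np.nan, 40],
        "smokes": [0, 0, np.nan, 1, 0],
        "hormonal_contraceptives": [1, np.nan, 1, 0, 1],
    }
)

# Share of missing values per column, worth inspecting before imputing.
missing_share = toy.isna().mean()

imputed = impute_mode(toy)
print(missing_share)
print(imputed)
```

Looking at the share of missing values per column before imputing gives a first hint of where the bias discussed above could be largest: the more values are replaced by the mode, the more the imputed column understates the true variation in the answers.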

Fernandes, Kelwin, Jaime S Cardoso, and Jessica Fernandes. 2017. “Transfer Learning with Partial Observability Applied to Cervical Cancer Screening.” In Iberian Conference on Pattern Recognition and Image Analysis, 243–50. Springer.