Society for Surgery of the Alimentary Tract


Back to 2025 Abstracts


SOCIAL DETERMINANTS IN HEALTH IN COLORECTAL CANCER: A MACHINE LEARNING ANALYSIS USING A BRAZILIAN POPULATION-BASED DATABASE
Felipe Mendes Delpino, Lucas Hernandes Correa, Marcelo P. Teivelis, Marina Martins Siqueira, Gabriely Rangel Pereira, Nelson Wolosker, Sergio Eduardo Alonso Araujo, Francisco Tustumi*
Sociedade Beneficente Israelita Brasileira Albert Einstein, São Paulo, São Paulo, Brazil

Introduction: Colorectal cancer (CRC) is one of the leading causes of cancer-related death worldwide. Identifying social determinants of health associated with higher mortality risk can inform healthcare management and resource allocation. Machine learning (ML) models are promising tools for this purpose because they can integrate large clinical and socioeconomic datasets. This study aimed to develop and validate an ML-based model to predict mortality in patients with CRC using a large, population-based dataset from Brazil.

Methods: We analyzed data from the Fundação Oncocentro, the primary cancer database of São Paulo, the most populous state in Brazil, with 44.4 million inhabitants. The dataset included all colorectal adenocarcinoma cases diagnosed between 2000 and 2023. The primary outcome was all-cause mortality. All analyses were performed in Python. Seven ML algorithms were tested, with the data split into 70% for training and 30% for testing. We applied RandomUnderSampler to balance the classes and RandomizedSearchCV to optimize hyperparameters. The models were interpreted using SHAP (SHapley Additive exPlanations) to identify the most influential variables. Predictor variables encompassed socioeconomic (time to initiate treatment, Human Development Index of the patient's municipality, healthcare access, and education), demographic (sex, age), and clinical (TNM stage, surgery, chemotherapy, radiotherapy, immunotherapy) features. Model performance was measured by receiver operating characteristic (ROC) curve analysis and the area under the curve (AUC).

Results: After exclusion of records with missing data, the final dataset comprised approximately 66,000 patients. The Random Forest model achieved the highest AUC (0.80), with an accuracy of 0.719 and a precision of 0.611. SHAP analysis showed that, in addition to demographic and clinical variables such as age, TNM stage, and treatment, socioeconomic variables also contributed to the prediction of mortality in CRC. Human Development Index (importance [I]: 0.066), time to initiate treatment (I: 0.067), healthcare access (private vs. public) (I: 0.048), and educational level (I: 0.042) each carried weight in this model. The Gradient Boosting, XGBoost, LightGBM, and CatBoost models each achieved an AUC of 0.79, the Decision Tree model an AUC of 0.76, and Logistic Regression an AUC of 0.72.

Conclusion: This study highlights the relevance of sociodemographic factors alongside traditional clinical variables when ML models are used to predict mortality in patients with CRC. Integrating clinical and sociodemographic variables supported predictions with an AUC of up to 0.80, with the Random Forest model performing best. This approach can be valuable for guiding clinical decision-making and health policy development.
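
The data preparation named in the Methods (a 70/30 split followed by random undersampling) can be illustrated with a minimal, standard-library sketch. It is not the authors' code: RandomUnderSampler belongs to the imbalanced-learn package and is only imitated here, the helper names and toy records are hypothetical, and undersampling the training set alone is an assumption, since the abstract does not state the order of the two steps.

```python
import random
from collections import defaultdict


def train_test_split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) with a 70/30 split."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def random_undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Made-up toy records: (patient_id, died), with 30% deaths.
toy = [(i, 1 if i % 10 < 3 else 0) for i in range(1000)]
train, test = train_test_split_70_30(toy, seed=42)
# Assumption: only the training portion is balanced; the test set keeps
# its original class proportions.
balanced_train = random_undersample(train, label_of=lambda r: r[1], seed=42)
```

Balancing only the training portion leaves the held-out 30% with its natural class mix, so metrics computed on it reflect the original outcome prevalence.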


ROC curves for the predictive models, with the area under the curve (AUC) indicated for each model.
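
For readers less familiar with the reported metrics, the sketch below shows one way to compute AUC, accuracy, and precision from predicted scores. It uses the pairwise-ranking definition of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one) instead of a library routine. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions for illustration, not values from the study.

```python
def roc_auc(y_true, y_score):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_and_precision(y_true, y_pred):
    """Accuracy = correct / all; precision = TP / (TP + FP)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return correct / len(y_true), precision


# Made-up labels (1 = died) and predicted mortality scores.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(y_true, y_score)
# Assumed decision threshold of 0.5 for the class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
accuracy, precision = accuracy_and_precision(y_true, y_pred)
```

The quadratic pairwise loop is fine for a toy example; on a dataset of the size described in the abstract, a rank-based or library implementation would be the practical choice.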

Clinical and socioeconomic features predicting mortality in the Random Forest model. "Time to treatment" is defined as the interval between diagnosis and the start of treatment. Healthcare access is categorized into private or secure insurance plans and public care. HDI represents the Human Development Index of the patient's municipality.
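
As background on what a Shapley-based importance measures, the following sketch computes exact Shapley values for a single prediction by averaging each feature's marginal contribution over all feature orderings. It is a brute-force illustration only: the toy model, its three inputs, and the all-zero baseline are hypothetical, and the SHAP library relies on much faster algorithms for tree models than this enumeration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input.

    Features not yet "switched on" in an ordering keep their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]          # switch feature j to its actual value
            new = predict(current)
            phi[j] += new - prev       # marginal contribution of feature j
            prev = new
    return [v / len(orders) for v in phi]


def toy_model(z):
    """Hypothetical risk score: one additive term plus one interaction."""
    return 0.5 * z[0] + 0.2 * z[1] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction at `x` and at the baseline, and the two interacting features share their joint effect equally. A global importance such as those listed in the Results is commonly obtained by averaging the absolute per-patient values of each feature.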