Improved Survival Prediction for Pancreatic Cancer Using Machine Learning and Regression
Stuart H. Floyd*1,2, Sergio A. Alvarez3, Carolina Ruiz2, John Hayward2, Mary Sullivan1, Jennifer F. Tseng1, Giles F. Whalen1
1Surgical Oncology, University of Massachusetts Medical School, Worcester, MA; 2Computer Science, Worcester Polytechnic Institute, Worcester, MA; 3Computer Science, Boston College, Chestnut Hill, MA
Purpose: To assess the value of novel artificial intelligence techniques for prediction of survival time of patients with pancreatic cancer when used instead of, or in conjunction with, multivariate logistic regression.
Methods: A clinical database was assembled containing retrospective records of 60 patients treated by resection for pancreatic adenocarcinoma at a single academic institution. Each patient record was described by 189 fields comprising information about preliminary outlook, personal and family medical history, diagnostic tests, tumor pathology, treatment course, surgical proceedings, and length of survival. Survival times were binned into three ranges: less than 6 months, 6 to 12 months, and over 12 months. A statistical preprocessing technique was applied to extract the 19 variables most strongly correlated with survival time. Two machine learning algorithms, Artificial Neural Networks (ANN) and Naive Bayesian Networks (NBN), were used to construct predictive models of survival based on these 19 variables. A multivariate logistic regression (LR) model was constructed independently. A smaller set containing 10 of the 19 predictive variables was identified through the machine learning technique of Support Vector Machines (SVM) and used to construct a second set of predictive models. The predictive performance of each model was measured by its correct classification rate for survival, using ten repetitions of ten-fold cross-validation. Statistical significance was assessed using a resampled t-test.
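The binning and evaluation protocol described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `bin_survival` and `repeated_kfold`, the class labels, the random seed, and the decision to place a survival time of exactly 12 months in the middle range are all assumptions made for the example.

```python
import random


def bin_survival(months):
    """Map a survival time in months to one of three classes.

    Assumption: exactly 12 months falls in the "6-12" range.
    """
    if months < 6:
        return "<6"
    if months <= 12:
        return "6-12"
    return ">12"


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation.

    Each repetition reshuffles the sample indices and splits them into
    n_folds disjoint test folds, so every sample is tested once per repeat.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for k in range(n_folds):
            test = indices[k::n_folds]  # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


# With 60 patients, each of the 100 splits trains on 54 records and tests on 6.
splits = list(repeated_kfold(60))
```

A model's correct classification rate would then be the fraction of test records whose predicted class matches `bin_survival` of the observed survival time, averaged over all splits.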
Results: The correct classification rate of NBN (57.2%) was significantly better (p < 0.05) than that of LR (42.5%) when model construction was performed over the larger set of 19 predictive variables. ANN also yielded a higher correct classification rate (49%) than LR, but this difference was not statistically significant. When the smaller set of 10 predictive variables identified by SVM was used, the correct classification rate attained by LR (61.3%) was significantly improved (p < 0.05) as compared with LR over the larger set of 19 variables. No statistically significant differences in classification rates were observed among the different predictive models constructed over the reduced set of 10 variables.
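The significance comparisons reported above rest on a paired test over per-fold classification rates. The abstract does not specify the exact form of the resampled t-test, so the sketch below assumes the plain paired t statistic computed over the differences in accuracy between two models on the same splits. The function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees_of_freedom) for paired per-split accuracies.

    acc_a and acc_b hold the correct classification rates of two models
    measured on the same cross-validation splits, in the same order.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both models need one accuracy per split")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two splits are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

The resulting statistic would be compared against a Student's t distribution with n - 1 degrees of freedom (99 for ten repetitions of ten folds). Because cross-validation splits share training data, the differences are not independent, so the test is best read as approximate.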
Conclusions: Machine learning techniques such as Naive Bayesian Networks can achieve more accurate predictions of survival time for pancreatic cancer patients than multivariate logistic regression. Furthermore, data preprocessing using the machine learning technique of Support Vector Machines can significantly boost the performance of logistic regression.
2007 Program and Abstracts | 2007 Posters