The Application of Automatic Document Classification to Cancer Staging for Esophageal Pathological Reports
Yung-Han Sun*1,3, Chih-Cheng Hsieh1,2, Chun-Hsien Chen3
1Department of Surgery, Taipei Veterans General Hospital, Taipei, Taiwan; 2School of Medicine, National Yang-Ming University, Taipei, Taiwan; 3Department of Information Management, Chang Gung University, Taoyuan, Taiwan
Background: According to statistics from the Department of Health, Taiwan, more than 40,000 Taiwan residents died of cancer in 2009. Owing to advances in medical experience and knowledge over the last decade, the prognosis of cancer patients has improved significantly, and more drugs and alternative treatments are available to help patients with relatively late-stage cancer. Cancer staging is an important indicator for assessing the effects of cancer treatment and prognosis. Its effectiveness may be affected by the interpretation proficiency of the cancer registration staff who read the pathological reports of cancer patients. Moreover, manual interpretation is inefficient and time consuming. The aim of this study was to explore the effectiveness of computationally converting pathological reports of esophageal cancer into cancer staging reports by using efficient document classification techniques.
Materials and Methods: Pathological reports of 234 patients who underwent esophagectomy from 2000 to 2008 in the Division of Thoracic Surgery, Taipei Veterans General Hospital, Taiwan, were collected in this study. Text mining techniques were used to analyze cancer staging related keywords in the reports and to convert each report computationally into a weighted frequency vector of keywords. The J48 decision tree induction algorithm, a supervised learning algorithm, was then applied to the 234 vectors to build our document classification model for automatic cancer staging and to evaluate its performance.
Results: The average prediction accuracy for cell type reached 95.3%, and those for T, N and M status reached 84.47%, 92.72% and 94.87%, respectively.
Conclusions: For esophageal cancer, the J48 decision tree induction algorithm achieved a high average prediction accuracy. The model may efficiently and effectively assist physicians and cancer registration staff in improving the accuracy of pathological cancer staging and in reducing the time-consuming work of processing large volumes of data in studies.
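The conversion of reports into weighted keyword frequency vectors described in Materials and Methods can be illustrated with a minimal sketch. The abstract does not state which weighting scheme or keyword list was used, so the TF-IDF weighting, the `KEYWORDS` list, the function names and the two example reports below are all assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical keyword list; the study's actual staging keywords are not given.
KEYWORDS = ["squamous", "adenocarcinoma", "muscularis", "adventitia", "lymph", "metastasis"]

def tokenize(text):
    """Lowercase a report and split it into alphabetic tokens."""
    return "".join(c if c.isalpha() else " " for c in text.lower()).split()

def tfidf_vectors(reports, keywords=KEYWORDS):
    """Return one weighted keyword-frequency vector per report.

    TF-IDF is an assumed weighting, not necessarily the one used in the study.
    """
    counts = [Counter(tokenize(r)) for r in reports]
    n = len(reports)
    # Document frequency: number of reports that contain each keyword.
    df = {k: sum(1 for c in counts if c[k] > 0) for k in keywords}
    # Smoothed inverse document frequency, so absent keywords never divide by zero.
    idf = {k: math.log((1 + n) / (1 + df[k])) + 1.0 for k in keywords}
    return [[c[k] * idf[k] for k in keywords] for c in counts]

# Invented example reports, not taken from the study data.
reports = [
    "Squamous cell carcinoma invading the muscularis propria; lymph nodes negative.",
    "Adenocarcinoma extending into adventitia. Lymph node metastasis present; lymph nodes 3/12.",
]
vectors = tfidf_vectors(reports)
```

Each resulting vector has one entry per keyword and could then be passed, together with its staging label, to a supervised learner such as a decision tree.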
Results of automatic document classification for the different categories of esophageal cancer staging
Category | Hold-out method | 5-fold cross validation | 10-fold cross validation |
Cell type | 95.30% | 95.73% | 95.74% |
T status | 84.47% | 85.04% | 86.14% |
N status | 92.74% | 92.75% | 92.72% |
M status | 94.87% | 94.88% | 94.89% |
The data shown are averages over 10 replications using the J48 decision tree induction algorithm.
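The cross-validation columns of the table above can be read as the following protocol: shuffle the vectors, split them into k folds, test on each fold in turn, and average the accuracy over 10 replications. The sketch below is only an illustration under stated assumptions. The `majority_fit` learner is a trivial stand-in and is not J48 (Weka's implementation of the C4.5 decision tree learner), and the function names and toy data are invented.

```python
import random
from collections import Counter

def majority_fit(X_train, y_train):
    """Stand-in learner (NOT J48): always predicts the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label

def kfold_accuracy(X, y, k, fit, rng):
    """Accuracy of one k-fold cross-validation run over shuffled indices."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, nearly equal folds
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(X)

def repeated_cv(X, y, k, fit, replications=10, seed=0):
    """Average k-fold accuracy over several reshuffled replications."""
    rng = random.Random(seed)
    runs = [kfold_accuracy(X, y, k, fit, rng) for _ in range(replications)]
    return sum(runs) / len(runs)

# Invented toy data: 20 one-feature vectors, 15 labelled "T3" and 5 labelled "T2".
X = [[float(i)] for i in range(20)]
y = ["T3"] * 15 + ["T2"] * 5
mean_acc = repeated_cv(X, y, k=5, fit=majority_fit)
print(f"mean 5-fold accuracy over 10 replications: {mean_acc:.2%}")
```

Passing the learner in as `fit` keeps the evaluation loop independent of the classifier, so a real decision tree implementation could be substituted without changing the protocol.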