ORIGINAL ARTICLE
Year : 2021  |  Volume : 5  |  Issue : 3  |  Page : 331-334

Predicting breast cancer using machine learning classifiers and enhancing the output by combining the predictions to generate optimal F1-score


Department of Computer Science, Indus Institute of Information and Communication Technology, Ahmedabad, Gujarat, India

Date of Submission: 28-Jun-2021
Date of Acceptance: 29-Jul-2021
Date of Web Publication: 7-Sep-2021

Correspondence Address:
Disha Harshadbhai Parekh
Department of Computer Science, Indus University, Ahmedabad, Gujarat
India

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/bbrj.bbrj_131_21

Abstract


Background: The biomedical field has attracted considerable interest from researchers. Treating diseases prevalent across the world is believed to bring valuable insight to current research, and advances in technology have made it easier for researchers to substantiate their work. Machine learning (ML) is an approach used by bioengineers today to predict diseases and even to aid drug discovery. Methods: With both points in mind, breast cancer, one of the most serious diseases, is predicted here using ML approaches. Breast tumors are classified as either benign or malignant, and ML classifiers are used to predict the class. The well-known Wisconsin Breast Cancer Dataset is used to train three classifiers, namely support vector machine, general linear model, and neural network (NNET), which are then evaluated against a testing dataset. The predictions were assessed on the accuracy of each classifier. Results: This study involves generic code in the R language. Conclusions: The study intends to show the use of NNETs with a single-layered structure in breast cancer prediction.

Keywords: Accuracy, breast cancer, F1-score, general linear model, machine learning classifiers, neural network, R, receiver operating characteristic, support vector machine, Wisconsin Dataset


How to cite this article:
Parekh DH, Dahiya V. Predicting breast cancer using machine learning classifiers and enhancing the output by combining the predictions to generate optimal F1-score. Biomed Biotechnol Res J 2021;5:331-4

How to cite this URL:
Parekh DH, Dahiya V. Predicting breast cancer using machine learning classifiers and enhancing the output by combining the predictions to generate optimal F1-score. Biomed Biotechnol Res J [serial online] 2021 [cited 2021 Dec 1];5:331-4. Available from: https://www.bmbtrj.org/text.asp?2021/5/3/331/325610




Introduction


In biological systems, cells are the most important part of any living being. Healthy cells divide and regenerate in a highly organized and controlled manner. The word “controlled” is of utmost importance here: when a cell divides uncontrollably, the resulting overproduction of cells leads to cancer. A tumor, a word associated with cancers, is a mass composed of a group of such abnormally produced cells. In health records, cancer is regarded as among the most abrupt yet most common diseases in humanity. Several types of cancer are observed across the world, and many more are emerging that are not yet easily diagnosed clinically. Most cancers form tumors, but not all tumors are cancerous. The terms benign and malignant make this distinction clear: benign tumors are noncancerous, while malignant tumors are cancerous. Benign tumors do not spread to other parts of the body and do not generate new tumors, whereas malignant tumors grow uncontrollably, faster than healthy cells, and interfere with the normal functioning of the body by drawing nutrients from healthy tissues. According to the study done by WebMD in 2020,[1] there are several types of cancers observed worldwide. On the basis of that study, there are five major types of cancer: carcinoma, sarcoma, melanoma, lymphoma, and leukemia.[8],[9] The types of cancer, their symptoms, and their subtypes are beyond the scope of this paper, but interested readers may find them in a paper.[2]

Breast cancer is a very commonly observed disease today among women of different age groups. Researchers and scientists are working extensively on innovations in detecting breast cancer and on discovering drugs that help overcome it in its early stages. Motivated by this interest, we have compared and formulated a few machine learning (ML) techniques that help in the early detection of breast cancer. The use of ML algorithms in healthcare and the biomedical field has grown tremendously and has drawn the interest of scientists and researchers around the globe. ML comprises three major types of learning: supervised, unsupervised, and reinforcement learning. After acquiring the required knowledge of these learning paradigms, we studied various algorithms to apply to a cancer dataset. To build our model and showcase the results of the ML algorithms, we have used the well-known Wisconsin Breast Cancer Dataset, which is openly available on the UCI ML website. The data come in two versions, diagnostic and prognostic; in line with our research goal, we used the diagnostic dataset. The algorithms were implemented in the R language using RStudio. R is a programming language and free software environment for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. It is widely used among statisticians and data miners for developing statistical software and for data analysis. Because R is a software tool, details such as manufacturer, company name, and city do not apply. The program code can be found in Appendix A of this paper.

This paper shows related work in Chapter II, research methodology in Chapter III, results in Chapter IV, and conclusion with future work in Chapter V.


Related Work


This section presents the work done by researchers and scientists on the cancer dataset. To gain insight into other novel work carried out across the world, we reviewed many articles and papers, of which five were identified as closely related to our own. Their findings and results, along with the type of dataset each was implemented on, are compared in [Table 1].
Table 1: Comparative analysis of few papers with their ML methods and accuracy



In almost all of the papers studied, the dataset was split 70/30 into training and test sets. The dataset most often used was the Wisconsin Breast Cancer Diagnostic Dataset (wdbc), available on the UCI ML repository website.[6]


Methods


This chapter mainly consists of the steps involved in the research of analyzing breast cancer dataset aided by ML models. The steps with code/snippet have also been shown and can be found on: Let us examine the steps involved in the research.

This study uses the R programming language, a statistical tool with more than 10,000 libraries. From the surveys analyzed, we found that several researchers use the Wisconsin Diagnostic Dataset. Hence, we have used the same dataset in this paper for analyzing breast cancer. The dataset contains 11 fields with approximately 700 observations and is available online at the UCI ML repository.[6]

The study focuses purely on the diagnostic dataset, and the results are based on data labeled as either malignant or benign. M in the dataset stands for malignant, which means a fair chance of cancer that spreads rapidly, while B stands for benign, which means a noncancerous tumor that grows gradually.

This section shows the code for the steps involved in collecting, cleaning, processing, and analyzing the data. The code is shown in R to help researchers understand it and implement it in practice. The steps are as follows.

Step-1: Data acquisition

The dataset we have used is wdbc. The R code to retrieve it online is shown in [Figure 1]. Alternatively, the data can be imported after saving it to a local drive. The dataset's columns had generic names, so we renamed each column, as also indicated in the code.
Figure 1: Importing Wisconsin Diagnostic Breast Cancer Dataset
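The import step can be illustrated outside R as well. The following is a minimal Python sketch, not the authors' code: it assumes the file is comma-separated text with one record per line, and the column names, the `?` missing-value marker, and the three sample rows are all invented for illustration.

```python
import csv
import io

# Hypothetical column names: the paper renames the generic columns,
# but the actual names appear only in Figure 1.
COLUMNS = ["id", "feature_1", "feature_2", "diagnosis"]

# Invented sample rows standing in for the downloaded file.
SAMPLE = "1001,5,1,B\n1002,8,?,M\n1003,3,2,B\n"


def load_rows(text, columns=COLUMNS, missing="?"):
    """Parse comma-separated text into dicts keyed by the given column names.

    Fields equal to the `missing` marker are stored as None.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        row = {}
        for name, value in zip(columns, record):
            row[name] = None if value == missing else value
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
```

In practice the text would come from the downloaded file or a local copy rather than from an inline string.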



Step-2: Data preparation

Data preparation is an essential stage in processing raw data. The dataset used here contains a few NA or null values, which need to be handled before we start training our model. After importing the data online with the code shown in [Figure 1], the records with null fields were removed; the code is shown in [Figure 2].
Figure 2: Data preparation
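As a rough illustration of the cleaning step, the Python sketch below drops every record that has a missing field. It assumes missing values are already represented as `None`; the field names and records are hypothetical and not taken from the paper.

```python
def drop_incomplete(rows):
    """Return only the rows in which no field is None (no NA value)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Invented records; the second one has a missing measurement.
records = [
    {"id": "1001", "clump": 5, "diagnosis": "B"},
    {"id": "1002", "clump": None, "diagnosis": "M"},
    {"id": "1003", "clump": 3, "diagnosis": "B"},
]
clean = drop_incomplete(records)
```

Dropping rows is only one way to handle missing values; it is reasonable when, as stated above, only a few records are affected.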



Step-3: Data training

This step splits the normalized data into a training set and a test set, which are used to prepare the data models for the algorithms we intend to use. The split was 70:30, with 70% of the data in the training set and 30% in the test set. The code in R is shown in [Figure 3].
Figure 3: Preparing train set and test set data
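A 70:30 split can be sketched in a few lines of Python. The version below is an illustration under stated assumptions, not the authors' R code: it splits each class separately so that both sets keep the same benign/malignant ratio (the paper does not say whether its split was stratified), and the label counts and seed are invented.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Invented labels: 60 benign and 40 malignant samples.
labels = ["B"] * 60 + ["M"] * 40
train_idx, test_idx = stratified_split(labels)
```

Fixing the seed makes the split reproducible, which matters when several classifiers are compared on the same test set.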



Step-4: Developing predictive data models using machine learning algorithms

In this step, we used the training set to build data models with three ML algorithms, namely support vector machine (SVM), general linear model (GLM), and neural network (NNET). The predictive models developed with these three algorithms were then used to generate confusion matrices and to compare the results of the algorithms. The code for each algorithm is shown in [Figure 4] for GLM, [Figure 5] for SVM, and [Figure 6] for NNET. Training and testing a model on the same dataset is not considered a realistic measure of model performance. Hence, we have used 10-fold stratified cross-validation. This has also helped to acquire the correlations between the algorithms.
Figure 4: Predictive Model 1 using general linear model

Figure 5: Predictive Model 2 using support vector machine

Figure 6: Predictive Model 3 using single-layered neural network
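The 10-fold stratified cross-validation mentioned above can be sketched as follows. This Python example only shows how samples might be assigned to folds so that each fold keeps the class ratio. It is an assumption-laden illustration rather than the procedure used in the paper's R code, and the label counts are invented.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Invented labels: 60 benign and 40 malignant samples.
labels = ["B"] * 60 + ["M"] * 40
folds = stratified_folds(labels)

# Each fold serves once as the validation set; the other nine form the training set.
splits = [
    (
        sorted(i for j, fold in enumerate(folds) if j != held_out for i in fold),
        sorted(folds[held_out]),
    )
    for held_out in range(len(folds))
]
```

Each classifier would be trained on the first element of a split and scored on the second, and the ten scores averaged.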



A basic SVM approach was used decades ago[7] to obtain the highest accuracy of 97.3%, and a variant known as least squares SVM, which uses linear equations for SVM, later achieved an accuracy of 98.5%. Hence, we first applied the SVM approach to the training data. Linear regression is a technique for modeling the relationship between one dependent variable and one or more independent variables; for a binary outcome such as benign versus malignant, logistic regression is the corresponding technique. Thus, as the second algorithm, we applied logistic regression using the GLM technique to the training data. Finally, we used an NNET of size 5 and compared all three to determine the best algorithm. The code for these predictive models is shown in [Figure 7].
Figure 7: Combining predictive models to gain better performance
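The text does not spell out how the three sets of predictions are combined, so the Python sketch below assumes simple majority voting, one common way to do it. The function name and the example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(*prediction_lists):
    """Combine per-model label predictions by taking the most common label per sample."""
    combined = []
    for votes in zip(*prediction_lists):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Invented predictions from the three classifiers for four samples.
glm_pred = ["M", "B", "B", "M"]
svm_pred = ["M", "M", "B", "B"]
nnet_pred = ["B", "M", "B", "M"]
ensemble = majority_vote(glm_pred, svm_pred, nnet_pred)
```

With three voters and two classes there can be no ties, which is one reason an odd number of classifiers is convenient for this scheme.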




Result


After executing the code for each algorithm, the results obtained are listed in [Table 2]. From the confusion matrices, we could compare the true-positive and true-negative values, and the accuracy of SVM was found to be the best. Although the purpose of the experiment was to analyze a single-layered NNET, its accuracy was lower; however, it could be combined with other classifiers such as LSTM or Bi-LSTM, which we expect would improve the accuracy.
Table 2: Experimental results of each classifier
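For readers unfamiliar with how accuracy and F1-score follow from a confusion matrix, here is a small Python sketch. It treats malignant (`"M"`) as the positive class, which is an assumption, and the labels are invented rather than drawn from the study's results.

```python
def classification_metrics(actual, predicted, positive="M"):
    """Compute confusion-matrix counts and the accuracy, precision, recall, and F1-score."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(actual),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels for eight samples.
actual = ["M", "M", "M", "B", "B", "B", "B", "B"]
predicted = ["M", "M", "B", "B", "B", "B", "B", "M"]
metrics = classification_metrics(actual, predicted)
```

The F1-score is the harmonic mean of precision and recall, so it penalizes a classifier that does well on one and poorly on the other.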



After obtaining the results, an R plot comparing the algorithms was generated, as shown in [Figure 8]. It includes the receiver operating characteristic (ROC) curve, which shows the trade-off between sensitivity and specificity.
Figure 8: ROC Curves for SVM, GLM and Neural Network
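To make the ROC construction concrete, the Python sketch below computes the curve's points and the area under it from predicted scores. It assumes malignant (`"M"`) is the positive class and that both classes are present. The scores are invented, and the paper's own curves were produced in R.

```python
def roc_points(actual, scores, positive="M"):
    """Return (false-positive rate, true-positive rate) pairs.

    The decision threshold is swept from the highest score to the lowest.
    Sensitivity equals the true-positive rate, and specificity equals
    one minus the false-positive rate.
    """
    positives = sum(1 for a in actual if a == positive)
    negatives = len(actual) - positives
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented labels and predicted probabilities of malignancy.
actual = ["M", "M", "B", "M", "B", "B"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(actual, scores)
area = auc(curve)
```

An area of 1.0 would mean every malignant sample scored higher than every benign one, while 0.5 corresponds to random ranking.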



Later, to improve the performance of each algorithm, the predictions of all three classifiers were combined. The resulting F1-scores were 96.66% for GLM, 97.27% for SVM, and 96.96% for NNET, as shown in the last column of [Table 2].


Conclusion


The study carried out on the Wisconsin Breast Cancer Dataset has helped us understand various classifiers for predicting breast cancer. After an extensive research survey and a thorough study of several classifiers, we selected three ML classifiers, namely GLM, SVM, and single-layered NNET. From the study, we conclude that combining the predictions of the classifiers brought a reasonable improvement in the F1-score of each classifier. The single-layered NNET can be improved further by combining it with other classifiers. In future, we intend to combine the neural network with other classifiers to enhance its accuracy and thus increase the performance of the model.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.



 
References

1. Agarap AF. On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset. ICMLSC 2018, February 2–4, 2018; 2019.

2. Can Hou MD, Xiaorong Zhong DM, Ping He DM, Bin Xu M, Sha Diao M. Predicting breast cancer in Chinese women using machine learning techniques: Algorithm Development. JMIR Med Inform June, 2020; Volume 8, Pages: 1-11.

3. Omar Ibrahim Obaid MA. Evaluating the performance of machine learning techniques in the classification of Wisconsin breast cancer. Int J Eng Technol 2018; Volume 7, No. 4.36, Pages: 160-166.

4. Sara Alghunaim HH. On the scalability of machine-learning algorithms for breast cancer prediction in big data context. IEEE Access 2019; Volume 7, Pages: 91535-46.

5. Tawseef Ayoub Shaikh RA. Applying Machine Learning Algorithms for Early Diagnosis and Prediction of Breast Cancer Risk. Springer Nature Singapore, Proceedings of 2nd International Conference on Communication, Computing and Networking; 2019. p. 589-98.

6. Wisconsin Breast Cancer Dataset. (n.d.). Retrieved from UCI Machine Learning Repository. Available from: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29, accessed on 21st March, 2020.

7. Bennett K, Blue J. A Support Vector Machine Approach to Decision Trees. In Proceedings of the IEEE International Joint Conference on Neural Networks; 1998. p. 2396-401.

8. Institute, N. C. (n.d.). Available from: https://training.seer.cancer.gov/disease/categories/classification.html, accessed on 12th June, 2021.

9. WebMD, G. P; January, 2020. Available from: https://www.webmd.com/cancer/guide/understanding-cancer-basics, accessed on 1st June, 2021.

