Biomedical and Biotechnology Research Journal (BBRJ)

ORIGINAL ARTICLE
Year : 2021  |  Volume : 5  |  Issue : 3  |  Page : 331-334

Predicting breast cancer using machine learning classifiers and enhancing the output by combining the predictions to generate optimal F1-score


Disha Harshadbhai Parekh, Vishal Dahiya 
 Department of Computer Science, Indus Institute of Information and Communication Technology, Ahmedabad, Gujarat, India

Correspondence Address:
Disha Harshadbhai Parekh
Department of Computer Science, Indus University, Ahmedabad, Gujarat
India

Abstract

Background: The biomedical field attracts considerable interest from researchers today. Research into the treatment of diseases prevalent around the world is expected to yield significant insight, and advances in technology have made it easier for researchers to validate their work. Machine learning (ML) is an approach that bioengineers now use to predict diseases and to aid drug discovery. Methods: With both points in mind, ML approaches are used here to predict breast cancer, one of the most serious diseases. ML classifiers are used to predict whether a breast tumor is benign or malignant. The well-known Wisconsin Breast Cancer Dataset is used to train three classifiers, namely support vector machine, generalized linear model, and neural network (NNET), which are then evaluated against a testing dataset. The prediction was tested with attention to the accuracy of each classifier. Results: This study involves generic code written in the R language. Conclusions: The study intends to show the use of NNETs with a single-layered structure in breast cancer prediction.



How to cite this article:
Parekh DH, Dahiya V. Predicting breast cancer using machine learning classifiers and enhancing the output by combining the predictions to generate optimal F1-score. Biomed Biotechnol Res J 2021;5:331-334


How to cite this URL:
Parekh DH, Dahiya V. Predicting breast cancer using machine learning classifiers and enhancing the output by combining the predictions to generate optimal F1-score. Biomed Biotechnol Res J [serial online] 2021 [cited 2022 Jan 18];5:331-334
Available from: https://www.bmbtrj.org/text.asp?2021/5/3/331/325610





 Introduction



In biological systems, cells are the most important part of any living being. Healthy cells divide and regenerate in an organized and controlled manner. The word “controlled” is of utmost importance here: when a cell divides uncontrollably, the resulting overproduction of cells leads to cancer. A tumor, a word associated with cancer, is a mass composed of a group of such abnormally produced cells. Health records regard cancer as an abrupt yet very common disease in humanity. Several types of cancer are observed across the world, and more continue to emerge that are not yet easily diagnosed clinically. Most cancers form tumors, but not all tumors are cancerous. This distinction is captured by the terms benign and malignant: benign tumors are noncancerous, while malignant tumors are cancerous. Benign tumors do not spread to other parts of the body and do not generate new tumors, whereas malignant tumors grow uncontrollably, outpace healthy cells, and interfere with the normal functioning of the body by drawing nutrients from healthy tissue. According to a guide published by WebMD in 2020,[9] several types of cancer are observed worldwide, of which five are major: carcinoma, sarcoma, melanoma, lymphoma, and leukemia.[8],[9] The types of cancer, their symptoms, and their subtypes are outside the scope of this paper; interested readers may refer to the literature.[2]

Breast cancer is a very common disease today among women of different age groups. Researchers and scientists are working extensively on innovations in detecting breast cancer and on discovering drugs that help overcome it in its early stages. Motivated by this interest, we compared a few machine learning (ML) techniques that help in the early detection of breast cancer. The use of ML algorithms in healthcare and the biomedical field has grown considerably and has drawn the interest of scientists and researchers around the globe. ML comprises three major types of learning: supervised, unsupervised, and reinforcement learning. After acquiring the required knowledge of these learning paradigms, we studied various algorithms to apply to a cancer dataset. To demonstrate our model and present results on ML algorithms, we used the well-known Wisconsin Breast Cancer Dataset, which is openly available on the UCI ML repository website. The data are available in two versions, diagnostic and prognostic; in line with our research aim, we used the diagnostic dataset. The algorithms were implemented in the R language using RStudio. R is a programming language and free software environment for statistical computing and graphics, supported by the R Core Team and the R Foundation for Statistical Computing. It is widely used among statisticians and data miners for developing statistical software and for data analysis. Because R is a software tool, details such as manufacturer, company name, and city do not apply. The program code is provided in Appendix A of this paper.

The rest of this paper presents related work, then the research methodology, the results, and finally the conclusion with future work.

 Related Work



This section reviews work done by researchers and scientists on the cancer dataset. To gain insight into other novel work carried out across the world, we surveyed many articles and papers, of which five were found to share our focus. Their findings and results, along with the type of dataset each was applied to, are compared in [Table 1].{Table 1}

The papers studied showed that almost all of them split the dataset 70/30 between training and testing. The dataset most often used was the Wisconsin Breast Cancer Diagnostic Dataset (wdbc), available on the UCI ML repository website.[6]

 Methods



This section describes the steps involved in analyzing the breast cancer dataset with the aid of ML models. The steps, with code snippets, have also been shown and can be found on:

This study uses the R programming language, a statistical tool with more than 10,000 packages. Our survey of the literature showed that several researchers used the Wisconsin Diagnostic Dataset. Hence, we used the same dataset for analyzing breast cancer in this paper. The dataset contains 11 fields with approximately 700 observations and is available online at the UCI ML repository.[6]

The study focuses purely on the diagnostic dataset, and the results are based on data labeled as either malignant or benign. In the dataset, M stands for malignant, indicating a fair chance of cancer that spreads rapidly, while B stands for benign, indicating a noncancerous tumor that grows gradually.

This section shows the code for the steps involved in collecting, cleaning, processing, and analyzing the data. The code is given in R to help researchers understand and implement it in practice. Let us examine the steps involved.

Step-1: Data acquisition

The dataset we used is wdbc. The R code to retrieve it online is shown in [Figure 1]. Alternatively, the data can be imported after saving it to a local drive. The dataset contained columns with generic names; hence, we renamed each column, as also indicated in the code.{Figure 1}
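The loading and renaming step can be illustrated with a minimal sketch. It is written in Python rather than the R of [Figure 1] and is not the authors' code; the column names and the three sample records are hypothetical stand-ins chosen only for illustration.

```python
import csv
import io

# Hypothetical column names; the real names are those assigned in the authors' R code.
COLUMN_NAMES = ["id", "diagnosis", "radius", "texture"]

def load_rows(text, column_names):
    """Parse header-less CSV text into a list of dicts keyed by column_names."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        if len(record) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} fields, got {len(record)}"
            )
        rows.append(dict(zip(column_names, record)))
    return rows

# Made-up sample standing in for the downloaded or locally saved file.
SAMPLE = "101,M,17.9,10.4\n102,B,13.5,14.4\n103,B,NA,15.7\n"

rows = load_rows(SAMPLE, COLUMN_NAMES)
print(rows[0]["diagnosis"])
```

Giving every field a descriptive name at load time means that later steps can refer to, for example, the diagnosis column by name rather than by position.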

Step-2: Data preparation

Data preparation is an essential stage in processing raw data. The dataset used here contains a few NA or null values, which need to be handled before we start training our model. After importing the data online with the code shown in [Figure 1], the null fields were removed using the code shown in [Figure 2].{Figure 2}
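One common way to handle such values is to drop every record that has a missing field. The sketch below, in Python rather than the R of [Figure 2], assumes that approach and assumes that a missing value is encoded as an empty string, "NA", or "?"; both assumptions are illustrative and are not taken from the authors' code.

```python
# Assumed encodings of a missing value (illustrative, not from the paper).
MISSING_MARKERS = {"", "NA", "?"}

def drop_incomplete(rows):
    """Keep only the records in which no field holds a missing-value marker."""
    return [
        row
        for row in rows
        if not any(str(value).strip() in MISSING_MARKERS for value in row.values())
    ]

# Made-up records; the second one has a missing measurement.
sample_rows = [
    {"id": "101", "diagnosis": "M", "radius": "17.9"},
    {"id": "102", "diagnosis": "B", "radius": "NA"},
    {"id": "103", "diagnosis": "B", "radius": "13.5"},
]

clean_rows = drop_incomplete(sample_rows)
print(len(clean_rows))
```

Dropping records is reasonable when only a few are affected; if many records had missing fields, imputing the values would preserve more of the data.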

Step-3: Data training

This step deals with splitting the normalized data into training and test sets. These sets are used to prepare the data models for the algorithms we intend to use. Here, the dataset was split 70:30, with 70% forming the training set and 30% the test set. The code in R is shown in [Figure 3].{Figure 3}
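A 70:30 split can be sketched as follows, in Python rather than the R of [Figure 3]. The sketch assumes the split is stratified, so that each class keeps the same proportion in both sets; the paper does not state whether its split was stratified, and the function name, seed, and label counts are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), splitting each class at about train_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within the class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 40 malignant and 60 benign observations.
labels = ["M"] * 40 + ["B"] * 60
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which matters when the accuracies of several classifiers are compared on the same test set.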

Step-4: Developing predictive data models using machine learning algorithms

In this step, we used the training set to prepare data models with three different ML algorithms, namely support vector machine (SVM), generalized linear model (GLM), and neural network (NNET). The predictive data models developed using these three algorithms were then used to generate confusion matrices and to compare the results of the algorithms. The code for each algorithm is shown in [Figure 4] for GLM, [Figure 5] for SVM, and [Figure 6] for NNET. Training and testing a model on the same data is not considered a realistic measure of model performance. Hence, we used 10-fold stratified cross-validation. This also helped to obtain the correlations between the algorithms.{Figure 4}{Figure 5}{Figure 6}
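The idea behind stratified k-fold cross-validation can be sketched in a few lines of Python. This is an illustration of the general technique, not the authors' R code; the function name, the seed, and the label counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign indices to k folds so that each fold mirrors the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Made-up labels: 30 malignant and 50 benign observations.
labels = ["M"] * 30 + ["B"] * 50
folds = stratified_folds(labels)

for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be fitted on train_idx and evaluated on test_idx here.
    assert set(train_idx).isdisjoint(test_idx)
```

Each observation is used for testing exactly once and for training in the other nine rounds, so the averaged score does not depend on a single lucky or unlucky split.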

A basic SVM approach was used decades ago[7] and obtained a highest accuracy of 97.3%; a variant known as least squares SVM, which solves a set of linear equations, later achieved an accuracy of 98.5%. Hence, we first applied the SVM approach to the training data. Linear regression models the relationship between one dependent variable and one or more independent variables; logistic regression carries this idea over to a binary outcome such as benign versus malignant. Thus, as the second algorithm, we applied logistic regression using the GLM technique to the training data. Finally, we used an NNET of size 5, so that the best of the three algorithms could be determined. The code for these predictive models is shown in [Figure 7].{Figure 7}
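To make the logistic regression step concrete, the sketch below fits a one-feature logistic model in Python. It is only an illustration: the data are made up, and plain gradient descent is used for brevity, which is not the fitting procedure of R's glm function.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=500):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

def predict(x, w, b, threshold=0.5):
    """Return 1 (e.g., malignant) when the predicted probability reaches threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Made-up, centered feature values; 1 marks the positive class.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
predictions = [predict(x, w, b) for x in xs]
print(predictions)
```

The model outputs a probability rather than a class, so the threshold of 0.5 is itself a choice; varying it trades sensitivity against specificity.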

 Result



After executing the code for each algorithm individually, the results obtained are listed in [Table 2]. From these results, we could identify the confusion matrix with the better true-positive and true-negative values, and the accuracy of SVM was found to be the best. Although the purpose of the experiment was to analyze a single-layered NNET, the accuracy it obtained was lower. It can, however, be combined with other models such as LSTM or Bi-LSTM, which we expect to improve the accuracy.{Table 2}
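The relationship between a confusion matrix and accuracy can be sketched as follows. The Python code and the eight labels are illustrative only and do not reproduce the values in [Table 2]; malignant (M) is assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive="M"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Made-up test labels and predictions.
actual = ["M", "M", "M", "B", "B", "B", "B", "B"]
predicted = ["M", "M", "B", "B", "B", "B", "M", "B"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))
```

In a diagnostic setting the two kinds of error are not equally costly: a false negative is a malignant tumor reported as benign, which accuracy alone does not reveal.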

After obtaining the results for each algorithm, an R plot was generated that compares the algorithms, as shown in [Figure 8]. It includes the receiver operating characteristic (ROC) curve, which shows the trade-off between sensitivity and specificity.{Figure 8}
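How an ROC curve is traced from classifier scores can be sketched in Python. The scores and labels below are made up and unrelated to [Figure 8]; malignant (M) is assumed to be the positive class, and the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(actual, scores, positive="M"):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold."""
    positives = sum(1 for a in actual if a == positive)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == positive)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a != positive)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and predicted probabilities of malignancy.
actual = ["M", "M", "B", "M", "B", "B"]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

points = roc_points(actual, scores)
print(round(area_under_curve(points), 4))
```

An area of 1.0 corresponds to a classifier that ranks every malignant case above every benign one, while 0.5 corresponds to random ranking.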

Later, to improve the performance of each algorithm, the predictions of all three classifiers were combined. The resulting F1-scores were 96.66% for GLM, 97.27% for SVM, and 96.96% for NNET. The result is shown in the last column of [Table 2].
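The paper does not state how the predictions were combined. The Python sketch below assumes simple majority voting, one common way to combine classifiers, and computes the F1-score with malignant (M) as the positive class; the labels are made up and do not reproduce the values in [Table 2].

```python
from collections import Counter

def majority_vote(*prediction_lists):
    """Combine per-classifier predictions by taking the most frequent label."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]

def f1_score(actual, predicted, positive="M"):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels; each classifier makes two mistakes, on different observations.
actual = ["M", "M", "M", "B", "B", "B"]
glm = ["M", "M", "B", "B", "B", "M"]
svm = ["M", "B", "M", "B", "B", "B"]
nnet = ["B", "M", "M", "B", "M", "B"]

combined = majority_vote(glm, svm, nnet)
print(f1_score(actual, glm), f1_score(actual, combined))
```

Voting helps most when the classifiers err on different observations, as in this toy example; with three voters and two classes, a tie cannot occur.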

 Conclusion



The study carried out on the Wisconsin Breast Cancer Dataset has helped us understand various classifiers for predicting breast cancer. An extensive literature survey was carried out, and after studying several classifiers thoroughly, we selected three ML classifiers, namely GLM, SVM, and single-layered NNET. From the study, we conclude that combining the predictions of the classifiers brought a reasonable improvement in the F1-score of each classifier. The single-layered NNET can be improved further by combining it with other classifiers. In the future, we intend to combine the neural network with other classifiers to improve accuracy and thus the performance of the model.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

References

1. Agarap AF. On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset. ICMLSC 2018, February 2–4, 2018; 2019.
2. Can Hou MD, Xiaorong Zhong DM, Ping He DM, Bin Xu M, Sha Diao M. Predicting breast cancer in Chinese women using machine learning techniques: Algorithm Development. JMIR Med Inform 2020;8:1-11.
3. Omar Ibrahim Obaid MA. Evaluating the performance of machine learning techniques in the classification of Wisconsin breast cancer. Int J Eng Technol 2018;7(4.36):160-166.
4. Sara Alghunaim HH. On the scalability of machine-learning algorithms for breast cancer prediction in big data context. IEEE Access 2019;7:91535-46.
5. Tawseef Ayoub Shaikh RA. Applying Machine Learning Algorithms for Early Diagnosis and Prediction of Breast Cancer Risk. Springer Nature Singapore, Proceedings of 2nd International Conference on Communication, Computing and Networking; 2019. p. 589-98.
6. Wisconsin Breast Cancer Dataset. UCI Machine Learning Repository. Available from: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29, accessed on 21st March, 2020.
7. Bennett K, Blue J. A Support Vector Machine Approach to Decision Trees. In Proceedings of the IEEE International Joint Conference on Neural Networks; 1998. p. 2396-401.
8. Institute, N. C. (n.d.). Available from: https://training.seer.cancer.gov/disease/categories/classification.html, accessed on 12th June, 2021.
9. WebMD, G. P; January, 2020. Available from: https://www.webmd.com/cancer/guide/understanding-cancer-basics, accessed on 1st June, 2021.