Machine Learning

Predicting severity (benign or malignant) from mammography screening features. Mammography is a method for detecting breast cancer.

Dataset

Project on Github

Summary

The goal of this project was to implement machine learning algorithms that I've learned from YouTube tutorials, online resources, and a Udemy course. I also wanted to practice data cleaning and building and using a deep neural network. I practiced using libraries like matplotlib and NumPy to visualize data and deal with outliers and questionable data points. I used algorithms from sklearn and tensorflow/keras (Logistic Regression, Naive Bayes, Decision Trees, Random Forest, K Nearest Neighbors, Neural Networks, etc.) to build models and make predictions. I also applied machine learning/data science concepts (like train/test splits, k-fold cross-validation, and feature engineering) to train my models on the dataset and test their accuracy.

1) I first imported the required libraries and displayed the data. After I made the data frame using pandas, I saw that the dataset's columns were not labeled, so I added a name for each column corresponding to the features (independent variables) and the label (dependent variable, i.e. the severity). I checked some basic statistics for each feature, such as the max, min, and standard deviation, and saw that the max for the BI-RADS Assessment was 55. This was an erroneous data point because the BI-RADS Assessment is on a scale from 1 to 5, so I simply dropped it. Some data points also had missing values, scattered seemingly at random; I removed these from further analysis.
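The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the "?" marker for missing values, and the five sample rows are all assumptions made up for the example.

```python
import io

import pandas as pd

# Hypothetical column names; the raw file has no header row.
COLUMNS = ["BI_RADS", "Age", "Shape", "Margin", "Density", "Severity"]

# Made-up rows for illustration only. "?" is assumed to mark a missing value.
raw = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "55,46,4,3,3,1\n"
    "4,28,1,1,3,0\n"
    "?,74,1,5,?,1\n"
)

# Label the columns while loading and turn "?" into NaN.
df = pd.read_csv(raw, names=COLUMNS, na_values="?")

# Summary statistics (max, min, std, ...) make the BI-RADS value of 55 stand out.
print(df.describe())

# Drop rows with any missing value, then drop rows outside the 1 to 5 scale.
cleaned = df.dropna()
cleaned = cleaned[cleaned["BI_RADS"].between(1, 5)]
print(cleaned)
```

Dropping rows is the simplest option and matches what is described above; imputing missing values would be an alternative if too many rows were lost.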

2) Next, I created two variables, features and labels. I set features to the data frame with only the feature columns (BI-RADS Assessment, Age, Shape, Margin, Density) and set labels to the label column, Severity. I then normalized the features using sklearn's StandardScaler so that my models would perform better.
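A minimal sketch of this split and scaling step, assuming hypothetical column names and a small made-up data frame in place of the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column names, chosen for this example.
FEATURE_COLUMNS = ["BI_RADS", "Age", "Shape", "Margin", "Density"]

# Made-up rows standing in for the cleaned dataset.
df = pd.DataFrame({
    "BI_RADS": [5, 4, 4, 3, 5, 2],
    "Age": [67, 58, 28, 57, 76, 42],
    "Shape": [3, 1, 4, 1, 2, 4],
    "Margin": [5, 1, 3, 1, 4, 5],
    "Density": [3, 3, 2, 3, 1, 3],
    "Severity": [1, 1, 0, 0, 1, 0],
})

features = df[FEATURE_COLUMNS].to_numpy(dtype=float)
labels = df["Severity"].to_numpy()

# StandardScaler rescales each column to zero mean and unit variance.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

print(features_scaled.mean(axis=0))  # approximately 0 for every column
print(features_scaled.std(axis=0))   # approximately 1 for every column
```

Scaling matters most for distance-based and gradient-based models (K Nearest Neighbors, SVMs, neural networks), since Age would otherwise dominate the smaller 1 to 5 features.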

3) Then I used k-fold cross-validation with K = 10 to split my data into folds for training and testing my models. Here are the results for each model.

Logistic Regression: Mean Score: 0.8068468997942991

Decision Tree: Mean Score: 0.7427563914193358

Random Forest: Mean Score: 0.7607846018219219

K Nearest Neighbors with 50 Neighbors (1 to 49 neighbors all gave similar scores): Score: 0.7922421392888628

Multinomial Naive Bayes: Mean Score: 0.7814722303849545

Support Vector Machine with Linear Kernel and Penalty Parameter of 1.0: Mean Score: 0.7983690861004995

Support Vector Machine with Polynomial Kernel and Penalty Parameter of 1.0: Mean Score: 0.7910960916838085

Support Vector Machine with Sigmoid Kernel and Penalty Parameter of 1.0: Mean Score: 0.7379077284748751

Support Vector Machine with RBF Kernel and Penalty Parameter of 1.0: Mean Score: 0.8019688510138113
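Scores like the scikit-learn ones above could be produced with a loop along these lines. This is a sketch under assumptions: it uses synthetic data from make_classification as a stand-in for the mammography features, so its scores will not match the ones listed, and the model settings are defaults except where the text states them (50 neighbors, penalty parameter C = 1.0). Multinomial Naive Bayes is left out because it requires non-negative inputs, which standardized features are not.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 5 features, binary label.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K Nearest Neighbors (50)": KNeighborsClassifier(n_neighbors=50),
    "SVM linear": SVC(kernel="linear", C=1.0),
    "SVM poly": SVC(kernel="poly", C=1.0),
    "SVM sigmoid": SVC(kernel="sigmoid", C=1.0),
    "SVM rbf": SVC(kernel="rbf", C=1.0),
}

mean_scores = {}
for name, model in models.items():
    # cv=10 trains on 9 folds and scores on the held-out fold, 10 times.
    scores = cross_val_score(model, X, y, cv=10)
    mean_scores[name] = scores.mean()
    print(f"{name}: Mean Score: {mean_scores[name]:.4f}")
```

Averaging over 10 folds gives a steadier estimate than a single train/test split, which is useful on a dataset of this modest size.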

Neural Network (Keras) with a 4 input layer using the relu activation function and a 2 output layer using the sigmoid activation function (which works well here since this is a binary classification), trained with the adam optimizer.

Score: 0.8032618284225463
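A small Keras binary classifier in the spirit of the description above might look like the sketch below. The layer sizes (one hidden layer of 4 relu units and a single sigmoid output unit), the loss, the epoch count, and the random stand-in data are all assumptions for illustration; they are one reading of the description, not the project's confirmed architecture.

```python
import numpy as np
from tensorflow import keras

# Random stand-in data: 200 samples, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(5,)),
    keras.layers.Dense(4, activation="relu"),     # assumed hidden layer size
    keras.layers.Dense(1, activation="sigmoid"),  # probability of class 1
])

# Binary cross-entropy pairs naturally with a sigmoid output.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

loss, accuracy = model.evaluate(X, y, verbose=0)
probabilities = model.predict(X[:3], verbose=0)
print(f"Accuracy: {accuracy:.4f}")
```

To compare this fairly with the other models, the network would need to be evaluated with the same 10-fold split rather than on its own training data as done here.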

4) Final analysis: most of the models scored between about 0.75 and 0.81 accuracy. The exceptions were the Decision Tree and the SVM with Sigmoid Kernel, which both came in at about 0.74. The Logistic Regression model, a pretty simple algorithm, did best at about 0.807, which makes sense because this is a binary classification of benign (0) or malignant (1). The Neural Network using Keras and the SVM with RBF Kernel were close behind at about 0.80.
