
PREPROCESSING

Image Flipping: We wrote code to flip all the images so they faced left to right. Our simple algorithm compared the average intensity in the left and right vertical regions of each image and flipped the image if the right side had higher intensity.
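
A minimal sketch of this check, assuming each image is a 2-D NumPy array of grayscale intensities (the function name flip_to_left is ours):

import numpy as np

def flip_to_left(image):
    """Flip a grayscale image horizontally if its right half is brighter."""
    h, w = image.shape
    left_mean = image[:, : w // 2].mean()
    right_mean = image[:, w // 2 :].mean()
    # Flip left-right so the bright breast tissue ends up on the left side.
    return np.fliplr(image) if right_mean > left_mean else image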

Gabor Filtering: A Gabor filter is a Gaussian kernel modulated by a sinusoidal plane wave, whose orientation and other parameters can be chosen by the user. This is particularly useful for this project: after tuning a few parameters, the filter can be tailored to extract roughly circularly symmetric regions from the mammograms, which generally correspond to tumors.
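
A sketch of how such a filter could be applied with scikit-image; the frequency and orientation values here are placeholders rather than the parameters we actually tuned:

import numpy as np
from skimage.filters import gabor

def gabor_response(image, frequency=0.1, theta=0.0):
    """Magnitude of the Gabor filter response at one orientation."""
    real, imag = gabor(image, frequency=frequency, theta=theta)
    return np.hypot(real, imag)

# Responses at several orientations can be combined (e.g., by taking the
# pixelwise maximum) to emphasize roughly circular, blob-like regions.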

Histogram Equalization: Each flipped image was processed using histogram equalization, which remaps its intensities through the image's cumulative distribution function. Essentially, the process increases contrast by spreading the most common intensity values over a wider range.
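
A minimal sketch of CDF-based equalization on a 2-D NumPy array (library routines such as skimage.exposure.equalize_hist implement the same idea):

import numpy as np

def equalize_hist(image, nbins=256):
    """Map each pixel through the image's intensity CDF to boost contrast."""
    counts, bin_edges = np.histogram(image.ravel(), bins=nbins)
    cdf = counts.cumsum().astype(float)
    cdf /= cdf[-1]                                   # normalize CDF to [0, 1]
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    return np.interp(image.ravel(), centers, cdf).reshape(image.shape)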



FEATURE EXTRACTION: INTENSITY HISTOGRAM

The intensity histogram gives a vector of 256 intensity bins and a corresponding vector counting the pixels that fall in each bin. The count vector was normalized by the number of pixels in the image and then used to calculate the statistics below (see the sketch after this list).

  • Mean: This feature captures the average value in the normalized count vector. Though malignant images have small regions of high intensity (the tumors), mean intensity was found to be higher for normal images. This may be because they generally have larger spreads of bright pixels, so there are fewer zeros in the count vector.

  • Variance: Also extracted from the intensity distribution, variance characterizes how the counts spread about their mean; a high variance corresponds to a greater spread. This feature was found to be higher in malignant images.

  • Kurtosis: This indicates the sharpness of the peak of the frequency distribution, since kurtosis is amplified when more outliers are present. The normal images had higher kurtosis than the malignant ones, likely because their distributions are more spread out, with more outlying values.
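
A sketch of these three features as described above, with the statistics computed on the normalized count vector itself (the bin count and function name are ours):

import numpy as np
from scipy.stats import kurtosis

def histogram_features(image, nbins=256):
    """Mean, variance, and kurtosis of the normalized histogram counts."""
    counts, _ = np.histogram(image.ravel(), bins=nbins)
    counts = counts / image.size          # normalize by the number of pixels
    return counts.mean(), counts.var(), kurtosis(counts, fisher=False)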


FEATURE EXTRACTION: GLCM MATRIX

A gray-level co-occurrence matrix (GLCM) captures the spatial distribution of different intensities, also known as "texture", in a grayscale image. Element (i, j) of a GLCM gives the number of times a pixel of intensity i appears directly to the left of a pixel of intensity j. The GLCM is a powerful tool for distinguishing malignant from healthy images because the problem is less about average intensity and more about spatial relationships between pixels. The features below were extracted from it (a sketch of the computation appears after this list).

  • Contrast: This measures the intensity difference between each pixel and its neighbor, averaged over the image, thus quantifying local variation. The contrast was found to be higher for malignant images.

  • Cluster prominence: This measures the asymmetry (skewness) of the GLCM, and was found to be higher in malignant images.

  • Energy: A measure of uniformity between pixel pairs, energy is highest for periodic or repetitive images. It was found to be slightly higher for normal images. 

  • Information measure of correlation: This measures the linear dependency of gray levels on those of surrounding pixels, and was slightly higher in magnitude for normal images.

  • Inverse difference moment normalized: Capturing the local homogeneity of an image, this feature was very slightly higher for normal images.
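
A sketch of the GLCM computation with scikit-image; contrast and energy come directly from graycoprops, while cluster prominence (and, in the same spirit, the other Haralick-style features) must be computed by hand from the normalized matrix:

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(image_uint8):
    """A few GLCM texture features for an 8-bit grayscale image."""
    # One-pixel horizontal offset: element (i, j) counts how often intensity i
    # appears immediately to the left of intensity j.
    glcm = graycomatrix(image_uint8, distances=[1], angles=[0],
                        levels=256, normed=True)
    contrast = graycoprops(glcm, "contrast")[0, 0]
    energy = graycoprops(glcm, "energy")[0, 0]

    # Cluster prominence computed directly from the normalized GLCM.
    p = glcm[:, :, 0, 0]
    i, j = np.indices(p.shape)
    mu_i = (i * p).sum()
    mu_j = (j * p).sum()
    cluster_prominence = ((i + j - mu_i - mu_j) ** 4 * p).sum()
    return contrast, energy, cluster_prominence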


FEATURE EXTRACTION: FOURIER ANALYSIS

We first tried adding the DC term of each image as a feature. In this case, the ensemble classifier (with 1000 trees) correctly grouped 15/15 normal images but only 11/15 malignant images (without Gabor filtering - we incorporated that later), which reduced overall accuracy. We only looked at the ensemble classifier results because it consistently yielded the best results. With both the DC term and the average FFT coefficient over [200:500, 200:500] added as features, the same classifier correctly grouped 14/15 in both categories. This improved the average accuracy, and even though it gives one more false positive, it gives fewer false negatives, which matters in tumor diagnosis. Lastly, with just the average FFT coefficient as a feature, the ensemble classifier correctly grouped 12/15 malignant and 15/15 normal images.
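
A sketch of the two Fourier features (assuming images at least 500 pixels in each dimension; this version uses the unshifted spectrum, which is an assumption on our part):

import numpy as np

def fourier_features(image):
    """DC term and mean FFT magnitude over a fixed block of coefficients."""
    spectrum = np.abs(np.fft.fft2(image))
    dc_term = spectrum[0, 0]                    # equals the sum of all pixels
    block = spectrum[200:500, 200:500]          # the [200:500, 200:500] window
    return dc_term, block.mean()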


Overall, including arbitrary Fourier features yielded mixed results. Our final product includes both Fourier features. At first, we weren't sure about using Fourier analysis for features, because classifying tumors is more about spatial relationships, but including the average intensity in a central region of the spectrum and the DC term improved the ensemble classifier results. This might be because tumor images have a concentrated peak in intensity that translates into the Fourier features. The Fourier features also seem to follow the same malignant/healthy patterns as our other features, since adding them to those features improves accuracy. Most likely, the patterns the Fourier features follow simply supplement existing feature differences between malignant and healthy images.


CLASSIFICATION

Classifiers That Worked:

  • Tree Classifier

  • Ensemble Classifier

Classifiers That Weren't As Accurate:

  • K-Nearest Neighbors Classifier

  • SVM Classifier

  • Kernel Classifier

  • Naive Bayes Classifier

Why Did the Tree and Ensemble Classifiers Work the Best?

  • The ensemble model, which gives the highest accuracy, simply aggregates the results of multiple tree models, so these two classifiers are closely related. More trees generally corresponded to greater accuracy, although very high numbers of trees (~10,000) reduced accuracy.

  • Several of the less successful models rely on the geometric arrangement of points in the feature space. The fact that these didn't work as well might suggest that the boundary between our malignant and normal classes is highly nonlinear, meaning the patterns the two classes form are difficult to separate in the feature space.


CLASSIFICATION TREE

This classifier generates a “tree” of decision/test nodes and leaf nodes, structured like the image below, that classifies input data. Each test node (including the root node) in one of our project's decision trees compares the value of a single feature of an input image to a threshold, and the image is sent to the left or right child node depending on whether it falls below or above that threshold. This process repeats at each test node until a leaf node is reached. When our model generates a decision tree, candidate tests at the nodes are evaluated and all of the training data is sent through the tree; each leaf node then becomes a classifier based on which class made up the majority of the training images that reached it. The tree-growing procedure uses a metric that essentially measures how close these class proportions are to one, the Gini diversity index, to select tests so the tree classifies the training images well. The tree's hyperparameters are tuned with a Bayesian optimization algorithm.

[Figure: example decision tree with test nodes and leaf nodes]
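
For illustration, a comparable tree can be grown with scikit-learn, which likewise chooses splits by minimizing Gini impurity (the feature matrix and labels below are random placeholders):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((30, 10))              # placeholder (n_images x n_features) matrix
y = rng.integers(0, 2, size=30)       # placeholder labels: 0 = normal, 1 = malignant

# Each split is chosen to minimize the Gini impurity of the child nodes.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)
print(tree.predict(X[:5]))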

ENSEMBLE CLASSIFICATION

The ensemble classifier uses a process called bootstrap aggregation, or bagging, to achieve higher accuracy from tree-based classification. In this process, a number of random subsets of the training data set (data points drawn with replacement) are generated. These “bootstrap” samples are then used as the training data for individual decision trees, and the classification results of all of these trees are aggregated in the overall model. This method helps prevent overfitting by leaving various data points out of individual trees in the model.

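A sketch of the same bagging idea with scikit-learn, whose BaggingClassifier grows decision trees on bootstrap samples by default (the data are random placeholders; our final model used 1000 trees):

import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.random((30, 10))              # placeholder features
y = rng.integers(0, 2, size=30)       # placeholder labels

# Each tree is trained on a bootstrap sample (drawn with replacement);
# the trees' predictions are aggregated by majority vote.
bag = BaggingClassifier(n_estimators=1000, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))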

K-NEAREST NEIGHBORS CLASSIFIER

K-Nearest Neighbors is a very simple classification model. Given a value of K, the number of nearest neighbors, it finds the K closest training points in the feature space to a given test point and classifies the test point according to the class that makes up the majority of those neighbors. To break ties, it picks the neighbor with the smallest index and applies that neighbor's class.
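
An analogous classifier in scikit-learn (K = 5 is a placeholder, and the tie-breaking rule differs slightly between implementations):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((30, 10))              # placeholder features
y = rng.integers(0, 2, size=30)       # placeholder labels

# Classify each point by majority vote among its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:3]))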


SVM CLASSIFIER

Intended for one- or two-class problems in low- to moderate-dimensional feature spaces, the SVM classifier generates a line/plane/hyperplane as a boundary between regions of a feature space. It uses a score function f(x) = x’*B + b, where b represents the model bias, B is a vector orthogonal to the hyperplane, x is a data point, and x’ is x transposed. Since f(x) is proportional to the signed distance from point x to the hyperplane, the objective of the model generator is to minimize B’s magnitude, which maximizes the margin separating the two classes. If the classes are not separable, “slack variables” penalize the misclassified points and are factored into the minimization problem.
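
A sketch of a linear SVM in scikit-learn; decision_function returns the score f(x), and C weights the slack-variable penalty (the data are random placeholders):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((30, 10))              # placeholder features
y = rng.integers(0, 2, size=30)       # placeholder labels

# The sign of f(x) gives the predicted class; its magnitude scales with the
# distance from x to the separating hyperplane.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)
print(svm.decision_function(X[:3]))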


KERNEL CLASSIFIER

Our kernel classifier uses the Gaussian kernel function, K(x, xi) = e^(-||x - xi||^2 / (2σ^2)), which implicitly maps the input training data to a higher-dimensional vector space; a linear classification model is then used to classify the feature data in that higher-dimensional space. Here xi is a particular data point in the feature space. The model used in the higher-dimensional space is a binary SVM.
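
A sketch of the same idea using an explicit approximation of the Gaussian-kernel feature map followed by a linear SVM (RBFSampler's gamma plays the role of 1/(2σ^2); the data are random placeholders):

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((30, 10))              # placeholder features
y = rng.integers(0, 2, size=30)       # placeholder labels

# Map the features into a higher-dimensional space that approximates the
# Gaussian kernel, then fit a linear SVM in that space.
model = make_pipeline(RBFSampler(gamma=1.0, random_state=0), LinearSVC())
model.fit(X, y)
print(model.predict(X[:3]))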


NAIVE BAYES CLASSIFIER

Naive Bayes is a probabilistic model that relies on conditional probabilities to determine classification. Specifically, it relies on Bayes' theorem, which says that the probability that a given vector belongs to a certain class given the vector's values (the posterior) is equal to the probability that any vector is in that class (the prior) times the probability of observing that vector given that it is in that class (the likelihood), divided by the probability of observing that vector at all (the evidence). The evidence is ignored because it is a constant for each vector. The remaining term (prior × likelihood) is equal to the joint probability of the class and the vector itself, which, by the definition of conditional probability, can be expanded as shown below.
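
Written out in standard notation, with feature vector x = (x_1, ..., x_n) and class C_k, this is:

p(C_k \mid x_1, \dots, x_n)
  = \frac{p(C_k)\, p(x_1, \dots, x_n \mid C_k)}{p(x_1, \dots, x_n)}
  = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}},

p(C_k, x_1, \dots, x_n)
  = p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k)
    \cdots p(x_n \mid C_k)\, p(C_k).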

The “naive” assumption of the NB model is that each conditional probability in this expansion reduces to the probability of feature xi conditioned only on the class. Stated more plainly, the assumption is that the vector's features are conditionally independent given the class. Thus, the posterior probability is the prior probability times the product of the probabilities of each feature conditioned on the class, divided by the evidence p(x). The decision rule used by the classifier assigns the class with the highest posterior probability to the test vector.
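
Under that independence assumption, the posterior and the resulting decision rule become:

p(C_k \mid x_1, \dots, x_n)
  = \frac{p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)}{p(x)},
\qquad
\hat{y} = \arg\max_{k} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k).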


FEATURE SELECTION ALGORITHM

Because 17 of the GLCM features first described by Robert Haralick were still unused, we developed a basic algorithm that iteratively inserted each of the 17 features into the final feature matrix and checked whether accuracy improved. We found 2 features that, when added, did not change the accuracy, while the other 15 led to less accurate classification.
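
A sketch of this greedy check; evaluate_accuracy, base_features, and candidate_features stand in for our actual cross-validation routine and feature matrices:

import numpy as np

def select_features(base_features, candidate_features, labels, evaluate_accuracy):
    """Try adding each unused Haralick feature (one column at a time) and keep
    it only if classification accuracy does not decrease."""
    kept = []
    baseline = evaluate_accuracy(base_features, labels)
    for idx in range(candidate_features.shape[1]):
        trial = np.column_stack([base_features, candidate_features[:, idx]])
        if evaluate_accuracy(trial, labels) >= baseline:
            kept.append(idx)
    return kept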


ACCURACY

Final Features: 3 image histogram stats, 2 FFT features, 5 GLCM features

KNN(x) indicates a KNN model with x nearest neighbors. Also, the ensemble model was specified to use 1,000 decision trees (we tested 1, 10, 100, and 1,000).

[Table: classification accuracy for each classifier]