Handwritten Digit Classification

1 Data Description

MNIST (“Modified National Institute of Standards and Technology”) dataset

The dataset I used is the MNIST dataset, which I found at the Kaggle competition “Classify handwritten digits using the famous MNIST data”. Each digit is a 28 × 28 pixel image, which gives 784 feature values. Each feature is an integer from 0 to 255 (greyscale). The original dataset contains images of the digits 0 through 9, but for this assignment I performed binary classification using only the images of the digits 0 and 1.

Before diving into the analysis, I inspected the data and found many columns containing only the value 0. I removed these columns, leaving 590 columns (features). I used this reduced dataset for the model comparisons that follow.
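
As an illustration only, a minimal sketch of this preprocessing step, assuming the Kaggle train.csv layout with a 'label' column followed by the 784 pixel columns (the file name and column names here are illustrative assumptions, not my actual script):

import pandas as pd

# Load the Kaggle MNIST training file (assumed layout: 'label' + 784 pixel columns).
df = pd.read_csv("train.csv")

# Keep only the digits 0 and 1 for the binary classification task.
df = df[df["label"].isin([0, 1])]

# Drop pixel columns that are zero for every image, leaving roughly the
# 590 informative columns mentioned above.
pixels = df.drop(columns=["label"])
pixels = pixels.loc[:, (pixels != 0).any(axis=0)]

X, y = pixels.to_numpy(), df["label"].to_numpy()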

There are 4132 entries representing the digit 0 and 4684 entries representing the digit 1. Since these two counts are similar, the classes are roughly balanced and no extra data cleaning was needed.

Here’s a link to the Kaggle competition with the dataset: Digit-Recognizer

2 Learning Methods

I used the following three classification algorithms for the handwritten digit classification problem.

2.1 Learning Method 1: Gaussian Kernel Radial Basis Function Network

There are two meta-parameters for this method: ‘epochs’ and ‘bandwidth’. From earlier experiments with the Gaussian Kernel Radial Basis Function Network, 100 epochs (which, in this implementation, corresponds to the number of RBF neurons) worked well. I therefore fixed ‘epochs = 100’ and tried different values of ‘bandwidth’. The following are the parameter settings I tried in internal cross-validation.

parameters_RBF_gaussian = (
  {'bandwidth': 10},
  {'bandwidth': 50},
  {'bandwidth': 100}
)
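
For reference, a minimal sketch of a Gaussian RBF feature transformation and the role of the bandwidth; the exact kernel parameterization (here exp(-||x - c||^2 / (2 · bandwidth^2))) and the function name are my assumptions for illustration, not the original implementation:

import numpy as np

def gaussian_rbf_features(X, centers, bandwidth):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 * bandwidth^2));
    # a larger bandwidth gives wider, smoother basis functions.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))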

2.2 Learning Method 2: Sigmoid Kernel Radial Basis Function Network

The meta-parameters for this method are ‘stepsize’ and ‘regwgt’. The following are the combinations I tried in internal cross-validation.

parameters_RBF_sigmoid = (
  {'stepsize': 0.01, 'regwgt': 0.01},
  {'stepsize': 0.1, 'regwgt': 0.01},
  {'stepsize': 0.01, 'regwgt': 0.1},
  {'stepsize': 0.1, 'regwgt': 0.1}
)

I used k-means clustering to select the centers. After computing the centroids, I constructed the Φ (Phi) matrix, i.e. the transformed data matrix, using the sigmoid transfer function, and obtained the error term from it.

Using the sigmoid transformation, I transformed the data matrix X into the Φ matrix. I computed the weights as w = (Φ^TΦ)^(-1) Φ^T y. Then, for the test data, I transformed the data matrix into its Φ matrix again and predicted ŷ = Φw.
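
Putting these steps together, a rough sketch of the training and prediction procedure under my reading of the description above (scikit-learn's KMeans stands in for the center selection; the exact form of the sigmoid transfer applied to the center distances, the 0.5 threshold, and the function names are assumptions, and the ‘stepsize’/‘regwgt’ gradient-descent variant is not shown):

import numpy as np
from sklearn.cluster import KMeans

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbf_sigmoid_fit(X, y, n_centers=100):
    # Select the centers with k-means clustering.
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    # Build Phi by applying the sigmoid transfer to the distances from the centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = sigmoid(-dists)
    # Least-squares weights: w = (Phi^T Phi)^(-1) Phi^T y.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

def rbf_sigmoid_predict(Xtest, centers, w):
    dists = np.linalg.norm(Xtest[:, None, :] - centers[None, :, :], axis=2)
    Phi = sigmoid(-dists)
    # y_hat = Phi w, thresholded at 0.5 to give 0/1 labels.
    return (Phi @ w >= 0.5).astype(int)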

2.3 Learning Method 3: Logistic Regression with no regularizer
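
For illustration only, a minimal sketch of how unregularized logistic regression can be trained by gradient descent on the cross-entropy loss (the function names and settings below are illustrative assumptions, not necessarily the implementation I used):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_learn(X, y, stepsize=0.01, n_iter=1000):
    # Plain gradient descent on the cross-entropy loss; note the absence of
    # any regularization term in the gradient ("no regularizer").
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= stepsize * grad
    return w

def logistic_predict(X, w):
    return (sigmoid(X @ w) >= 0.5).astype(int)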

3 Meta parameter selection technique

3.1 Internal: k-fold Cross-Validation

I used 10-fold CV because it gives an almost unbiased estimate of accuracy. For each training set i, I evaluated the meta-parameters using internal k-fold CV. By comparing the mean errors across the different parameter sets, I selected the best meta-parameters.
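
A minimal sketch of what such an internal_CV routine can look like (the learner learn/predict interface and geterror follow the snippets in Section 4; the reset method and the percent-error definition of geterror are assumptions):

import numpy as np

def geterror(ytrue, ypred):
    # Assumed definition: misclassification rate in percent.
    return 100.0 * np.mean(ytrue != ypred)

def internal_CV(learner, parameters, k, trainset):
    # trainset = (Xtrain, ytrain); returns the parameter set with the lowest mean CV error.
    Xtrain, ytrain = trainset
    folds = np.array_split(np.arange(len(ytrain)), k)
    mean_errors = []
    for params in parameters:
        fold_errors = []
        for i in range(k):
            val_idx = folds[i]
            tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            learner.reset(params)                     # assumed learner API
            learner.learn(Xtrain[tr_idx], ytrain[tr_idx])
            preds = learner.predict(Xtrain[val_idx])
            fold_errors.append(geterror(ytrain[val_idx], preds))
        mean_errors.append(np.mean(fold_errors))
    return parameters[int(np.argmin(mean_errors))]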

3.2 External: Bootstrap (repeated sampling)

I used the bootstrap (repeated sampling) because it usually works better empirically than k-fold cross-validation. Using the best parameters found by the internal CV, I trained on the entire training set i and then computed the test error on test set i. This is the error estimate for train/test split i. I repeated this over 35 splits and obtained 35 error estimates.
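
A sketch of how one such train/test split can be drawn (the sizes come from Section 4; as a stand-in I use simple random subsampling without replacement, which is an assumption about the sampling details):

import numpy as np

def bootstrap_split(X, y, trainsize=1000, testsize=5000, rng=None):
    # Draw one random train/test split for a single external run.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    train_idx = idx[:trainsize]
    test_idx = idx[trainsize:trainsize + testsize]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])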

4 Modeling and Prediction

I set ‘trainsize = 1000’ and ‘testsize = 5000’ for the external cross-validation with ‘numruns = 35’. For each run, I used the internal_CV function that I implemented to find the best parameters for each algorithm using 10-fold cross-validation.

bestparams = internal_CV(learner, parameters_RBF_gaussian, 10, trainset)

The above line means: for the given learner and set of parameters, run 10-fold cross-validation on the training set. It returns the best parameters for the model, which I then used when training the model.

learner.learn(trainset[0], trainset[1])

Then, using the trained model, I made predictions with the following line.

predictions = learner.predict(testset[0])

After getting the error,

error = geterror(testset[1], predictions)

I saved it in the errors dictionary.
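
Putting the pieces together, the outer loop over the 35 runs looks roughly like the sketch below, which combines the snippets above with the helpers sketched in Section 3 (the classalgs dictionary, the learner.reset call, and the bootstrap_split helper are assumptions):

errors = {}
for run in range(numruns):                                   # numruns = 35
    trainset, testset = bootstrap_split(X, y, trainsize, testsize)
    for name, (learner, parameters) in classalgs.items():    # assumed {name: (learner, params)} dict
        bestparams = internal_CV(learner, parameters, 10, trainset)
        learner.reset(bestparams)                             # assumed learner API
        learner.learn(trainset[0], trainset[1])
        predictions = learner.predict(testset[0])
        errors.setdefault(name, []).append(geterror(testset[1], predictions))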

5 Comparison of algorithms

I computed confidence intervals and used t-tests to compare the algorithms. For each pair of methods, I tested a hypothesis about the mean difference of their errors, using one-sided hypothesis tests. For each comparison, I specified the experimental unit, the population, the number of measurements, and the parameter of interest for the problem, and stated the null and alternative hypotheses.

5.1 Comparison 1: Sigmoid Kernel Radial Basis Function Network vs. Gaussian Kernel Radial Basis Function Network

  • Experimental unit: error (each experimental unit corresponds to one external bootstrap split)
  • Populations: 1 population (the handwriting dataset)
  • 1-sample problem (each pair of errors is computed from the same train/test split, so I can reduce this to a one-sample problem by taking their difference)
  • 2 measurements taken on each experimental unit (the error computed by the Gaussian Kernel Radial Basis Function Network and the error computed by the Sigmoid Kernel Radial Basis Function Network)
  • Parameters of interest for this problem: let Gi denote the i-th error computed by the Gaussian Kernel Radial Basis Function Network and Si the i-th error computed by the Sigmoid Kernel Radial Basis Function Network. Let Xi = Gi – Si. I am interested in the mean of X1, X2, …, X35, which I will call µ.
  • H0: µ <= 0, H1: µ > 0 (the corresponding test statistic is written out just after this list)
  • Assumption 1, Normality: the sample size is moderately large (35), so by the Central Limit Theorem the sampling distribution of the mean of the differences X1, X2, …, X35 is approximately normal.
  • Assumption 2, Randomness of the samples: each training set is selected by the bootstrap resampling method, so each training set (size 1000) and test set (size 5000) is chosen randomly.
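
Concretely, this paired setup reduces to a one-sample t-test on the differences: under H0, the test statistic is t = X̄ / (s/√n), which follows a t-distribution with n − 1 = 34 degrees of freedom, where X̄ and s are the sample mean and standard deviation of X1, X2, …, X35 and n = 35.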

Confidence Interval (two-tailed confidence interval using R)

I am interested in a one-tailed problem, but when constructing a confidence interval, the two-tailed interval is the widely accepted convention.
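
The 99% two-tailed interval is X̄ ± t(0.995, 34) · s/√n. As a check against the R output below: X̄ ≈ 0.9126 and t ≈ 16.169, so s/√n ≈ 0.9126/16.169 ≈ 0.0564, and with t(0.995, 34) ≈ 2.728 the interval is roughly 0.9126 ± 0.154 ≈ (0.759, 1.067), matching the reported values.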

> RBF_Gaussian = c(0.9, 1.46, 1.02, 1.58, 1.2, 1.0, 0.96, 0.9,
1.5, 0.9, 1.32, 0.96, 1.86, 0.98, 1.16, 0.78, 1.58, 1.18, 1.04,
0.84, 1.02, 1.96, 0.92, 1.18, 1.04, 0.94, 0.86, 0.96, 0.58,
0.92, 0.92, 1.02, 1.34, 1.0, 0.9)

> RBF_Sigmoid = c(0.18, 0.24, 0.06, 0.12, 0.1, 0.06, 0.2, 0.24,
0.22, 0.18, 0.12, 0.18, 0.26, 0.14, 0.14, 0.08, 0.24, 0.12,
0.12, 0.26, 0.3, 0.14, 0.76, 0.34, 0.18, 0.22, 0.3, 0.24,
0.26, 0.12, 0.06, 0.16, 0.16, 0.06, 0.18)

> t.test(RBF_Gaussian - RBF_Sigmoid, conf.level = 0.99)$conf.int
[1] 0.7585858 1.0665570
attr(,"conf.level")
[1] 0.99

The confidence interval is (0.7585858, 1.0665570). With 99 percent confidence, the mean of (error computed from RBF_Gaussian) minus (error computed from RBF_Sigmoid) lies in this range. Since 0 is not in the interval, I can confidently conclude that the mean error of RBF_Gaussian is larger than the mean error of RBF_Sigmoid.

t-test (one-sample, one-tailed t-test using R)

> t.test(RBF_Gaussian - RBF_Sigmoid, conf.level = 0.99, alt = "greater")

One Sample t-test
data: RBF_Gaussian - RBF_Sigmoid
t = 16.169, df = 34, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
99 percent confidence interval:
0.7747974 Inf
sample estimates:
mean of x
0.9125714

The p-value is less than 2.2e-16, which is essentially 0, so I have strong evidence to reject H0. Therefore I conclude that the mean error of RBF_Gaussian is larger than the mean error of RBF_Sigmoid; in other words, RBF_Sigmoid predicts better than RBF_Gaussian.

5.2 Comparison 2: Gaussian Kernel Radial Basis Function Network vs. Logistic Regression

  • Experimental unit: error (each experimental unit corresponds to one external bootstrap split)
  • Populations: 1 population (the handwriting dataset)
  • 1-sample problem (each pair of errors is computed from the same train/test split, so I can reduce this to a one-sample problem by taking their difference)
  • 2 measurements taken on each experimental unit (the error computed by the Gaussian Kernel Radial Basis Function Network and the error computed by Logistic Regression)
  • Parameters of interest for this problem: let Li denote the i-th error computed by Logistic Regression and Gi the i-th error computed by the Gaussian Kernel Radial Basis Function Network. Let Xi = Li – Gi. I am interested in the mean of X1, X2, …, X35, which I will call µ.
  • H0: µ <= 0, H1: µ > 0
  • Assumption 1, Normality: the sample size is moderately large (35), so by the Central Limit Theorem the sampling distribution of the mean of the differences X1, X2, …, X35 is approximately normal.
  • Assumption 2, Randomness of the samples: each training set is selected by the bootstrap resampling method, so each training set (size 1000) and test set (size 5000) is chosen randomly.

Confidence Interval (two-tailed confidence interval using R)

As in Comparison 1, I report the conventional two-tailed confidence interval even though the hypothesis test is one-tailed.

> Logistic_Regression = c(3.36, 2.28, 2.42, 1.68, 3.1, 2.8,
2.44, 2.62, 2.24, 2.7, 2.88, 3.02, 2.46, 2.04, 1.98, 2.32,
2.34, 2.74, 2.6, 2.72, 4.08, 2.82, 2.42, 2.28, 2.74, 2.34,
3.2, 3.86, 3.74, 3.02, 2.36, 2.76, 2.06, 2.8, 3.36)

> RBF_Gaussian = c(0.9, 1.46, 1.02, 1.58, 1.2, 1.0, 0.96, 0.9,
1.5, 0.9, 1.32, 0.96, 1.86, 0.98, 1.16, 0.78, 1.58, 1.18, 1.04,
0.84, 1.02, 1.96, 0.92, 1.18, 1.04, 0.94, 0.86, 0.96, 0.58,
0.92, 0.92, 1.02, 1.34, 1.0, 0.9)

> t.test(Logistic_Regression - RBF_Gaussian, conf.level = 0.99)$conf.int
[1] 1.271136 1.923150
attr(,"conf.level")
[1] 0.99

The confidence interval is (1.271136, 1.923150). With 99 percent confidence, the mean of (error computed from Logistic_Regression) minus (error computed from RBF_Gaussian) lies in this range. Since 0 is not in the interval, I can confidently conclude that the mean error of Logistic_Regression is larger than the mean error of RBF_Gaussian.

t-test (one-tailed t-test using R)

> t.test(Logistic_Regression - RBF_Gaussian, conf.level = 0.99, alt = "greater")
One Sample t-test
data: Logistic_Regression - RBF_Gaussian
t = 13.367, df = 34, p-value = 2.146e-15
alternative hypothesis: true mean is greater than 0
99 percent confidence interval:
1.305458 Inf
sample estimates:
mean of x
1.597143

The p-value is 2.146e-15, which is essentially 0, so I have strong evidence to reject H0. Therefore I conclude that the mean error of Logistic_Regression is larger than the mean error of RBF_Gaussian; in other words, RBF_Gaussian predicts better than Logistic Regression.

From the above two t-tests, I can confidently conclude that RBF_Sigmoid predicts better than RBF_Gaussian, and that RBF_Gaussian predicts better than Logistic Regression. By transitivity, I conclude that RBF_Sigmoid predicts best, followed by RBF_Gaussian and then Logistic Regression.

6 Conclusion (Performance of each method)

For classifying the handwritten digits 0 and 1, the Sigmoid Kernel Radial Basis Function Network predicted best among the three methods, followed by the Gaussian Kernel Radial Basis Function Network and then Logistic Regression.

  • Sigmoid Kernel RBF Network: 0.193 (mean error over the 35 runs)
  • Gaussian Kernel RBF Network: 1.105
  • Logistic Regression: 2.702