## 1 Data Description

**MNIST (“Modified National Institute of Standards and Technology”) dataset**

The dataset I used is the MNIST dataset, which I found at the Kaggle competition “Classify handwritten digits using the famous MNIST data”. Each digit is displayed as a 28 × 28 pixel image, which leads to 784 feature values. Each feature contains an integer from 0 to 255 (a grey-scale intensity). The original dataset contains images of the digits 0 through 9, but for this assignment I performed binary classification using only the images of the digits 0 and 1.

Before diving into the analysis, I checked the data and found many columns that contain only the value 0. I removed these columns, and as a result 590 columns (features) were left. I used this reduced dataset for the following model comparisons.

There are 4132 entries representing the digit 0 and 4684 entries representing the digit 1. Since these two counts are similar, no extra rebalancing of the classes was needed.
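
To make the preprocessing concrete, here is a minimal sketch of these steps (the file path and the column layout of the Kaggle training file are assumptions):

```python
import pandas as pd

# Load the Kaggle training file (path and column names are assumptions).
data = pd.read_csv("train.csv")

# Keep only the images of the digits 0 and 1 for the binary task.
data = data[data["label"].isin([0, 1])]

# Separate the labels from the 784 pixel features.
y = data["label"].values
X = data.drop(columns=["label"])

# Drop pixel columns that are zero for every remaining image;
# roughly 590 informative columns are left.
X = X.loc[:, (X != 0).any(axis=0)]

print(X.shape, (y == 0).sum(), (y == 1).sum())
```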

Here’s a link to the Kaggle competition with the dataset: Digit-Recognizer

## 2 Learning Methods

I used the following three classification algorithms for the handwritten digit classification problem.

**2.1 Learning Method 1: Gaussian Kernel Radial Basis Function Network**

There are two meta-parameters for this method: ‘epochs’ and ‘bandwidth’. From earlier experiments with the Gaussian Kernel Radial Basis Function Network, 100 epochs (here, the number of RBF neurons) worked well. Therefore, I fixed ‘epochs = 100’ and tried different values of ‘bandwidth’. The following are the parameter sets I tried in internal cross-validation.

parameters_RBF_gaussian = ( {'bandwidth': 10}, {'bandwidth': 50}, {'bandwidth': 100} )
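
For illustration, here is a minimal sketch of the Gaussian kernel feature transformation this network relies on (the exact kernel scaling and the way the 100 centers are chosen below are assumptions, not necessarily what my implementation does):

```python
import numpy as np

def gaussian_phi(X, centers, bandwidth):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 * bandwidth^2));
    # this particular scaling of the bandwidth is an assumption.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Example usage: 100 RBF neurons whose centers are random training points.
# rng = np.random.default_rng(0)
# centers = X_train[rng.choice(len(X_train), size=100, replace=False)]
# Phi_train = gaussian_phi(X_train, centers, bandwidth=50)
```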

**2.2 Learning Method 2: Sigmoid Kernel Radial Basis Function Network**

There are two meta-parameters for this method: ‘stepsize’ and ‘regwgt’. The following are the parameter sets I tried in internal cross-validation.

parameters_RBF_sigmoid = ( {'stepsize': 0.01, 'regwgt': 0.01}, {'stepsize': 0.1, 'regwgt': 0.01}, {'stepsize': 0.01, 'regwgt': 0.1}, {'stepsize': 0.1, 'regwgt': 0.1} )

I used k-means clustering to select the centers. After calculating the centroids, I applied the sigmoid transfer function to transform the data matrix X into the Phi matrix Φ. I then computed the weights in closed form as w = (Φ^T Φ)^(-1) Φ^T y. For the test data, I transformed the data matrix into a Phi matrix in the same way, predicted ŷ = Φw, and computed the error from these predictions.
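
The following is a minimal sketch of these training and prediction steps (scikit-learn's KMeans and the exact form of the sigmoid transfer are assumptions used only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans  # any k-means implementation would do

def sigmoid_phi(X, centers, gamma=1e-4, c0=0.0):
    # Sigmoid (tanh) transfer between samples and centers; the exact
    # form and the values of gamma and c0 are assumptions in this sketch.
    return np.tanh(gamma * X @ centers.T + c0)

def fit_rbf_sigmoid(X, y, n_centers=100):
    # Select the centers with k-means clustering, then solve for the
    # weights in closed form: w = (Phi^T Phi)^(-1) Phi^T y.
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    Phi = sigmoid_phi(X, centers)
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    return centers, w

def predict_rbf_sigmoid(X, centers, w):
    # Predict y_hat = Phi w and threshold at 0.5 for the 0/1 labels.
    return (sigmoid_phi(X, centers) @ w >= 0.5).astype(int)
```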

**2.3 Learning Method 3: Logistic Regression with no regularizer**
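
This is plain logistic regression with no regularization term. For completeness, here is a generic sketch of how such a model can be trained with gradient descent (the optimizer, step size, and iteration count are assumptions, not necessarily what my implementation used):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_logistic(X, y, stepsize=0.01, iterations=1000):
    # Unregularized logistic regression: gradient descent on the
    # cross-entropy loss (these optimizer settings are assumptions).
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= stepsize * grad
    return w

def predict_logistic(X, w):
    # Probabilities of at least 0.5 are classified as the digit 1.
    return (sigmoid(X @ w) >= 0.5).astype(int)
```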

## 3 Meta parameter selection technique

**3.1 internal: k-fold Cross Validation**

I used 10-fold CV because it gives an almost unbiased estimate of accuracy. For each training set i, I evaluated the meta-parameters using internal k-fold CV. By looking at the mean errors for the different sets of parameters, I selected the best meta-parameters.
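
Here is a minimal sketch of this internal cross-validation logic (the learner interface with reset/learn/predict and the exact form of geterror are assumptions, based on the script described in Section 4):

```python
import numpy as np

def geterror(ytrue, ypred):
    # Percentage of misclassified samples (assumed error measure).
    return 100.0 * np.mean(ytrue != ypred)

def internal_CV(learner, parameter_sets, k, trainset):
    # Return the parameter set with the lowest mean validation error over k folds.
    X, y = trainset
    folds = np.array_split(np.arange(len(y)), k)
    mean_errors = []
    for params in parameter_sets:
        fold_errors = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.hstack([folds[j] for j in range(k) if j != i])
            learner.reset(params)  # assumed way of setting the meta-parameters
            learner.learn(X[train_idx], y[train_idx])
            predictions = learner.predict(X[val_idx])
            fold_errors.append(geterror(y[val_idx], predictions))
        mean_errors.append(np.mean(fold_errors))
    return parameter_sets[int(np.argmin(mean_errors))]
```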

**3.2 external: Bootstrap (repeated sampling)**

I used the bootstrap (repeated sampling) because it is usually empirically better than k-fold CV for estimating error. Using the best parameters found by internal CV, I trained on the entire training set i and then computed the test error on test set i. This is the estimate of error for train/test split i. I repeated this over 35 splits and obtained 35 estimates of the error.

## 4 Modeling and Prediction

I set ‘trainsize = 1000’ and ‘testsize = 5000’ for the external cross-validation, with ‘numruns = 35’. For each run I used the internal_CV function that I implemented to find the best parameters for the algorithm using 10-fold cross-validation.

bestparams = internal_CV(learner, parameters_RBF_gaussian, 10, trainset)

The above line means: for the given learner and set of parameters, run 10-fold cross-validation on the trainset. It returns the best parameters for the model, and I used these bestparams when training the model.

learner.learn(trainset[0], trainset[1])

Then, using the trained model, I made predictions on the test set with the following line.

predictions = learner.predict(testset[0])

Finally, I computed the error,

error = geterror(testset[1], predictions)

and saved it in the errors dictionary.
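
Putting the pieces together, here is a condensed sketch of one external run for a single learner (the sampling helper sample_train_test is hypothetical, and the learner object and its reset method follow the interface assumed above; the actual script loops over all three algorithms and their parameter grids):

```python
import numpy as np

trainsize, testsize, numruns = 1000, 5000, 35
errors = {"RBF_Gaussian": np.zeros(numruns)}  # one entry per algorithm in the real script

for r in range(numruns):
    # Draw a fresh random train/test split for this run (hypothetical helper).
    trainset, testset = sample_train_test(trainsize, testsize)
    # Internal 10-fold CV selects the meta-parameters on this training set.
    bestparams = internal_CV(learner, parameters_RBF_gaussian, 10, trainset)
    learner.reset(bestparams)
    learner.learn(trainset[0], trainset[1])
    predictions = learner.predict(testset[0])
    errors["RBF_Gaussian"][r] = geterror(testset[1], predictions)
```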

## 5 Comparison of algorithms

I computed confidence intervals and used t-tests to compare the algorithms. For each pair of methods I tested a hypothesis about the mean difference of their errors, using a one-sided test. For each comparison, I specified the experimental unit, the population, the number of measurements, and the parameter of interest for the problem, and stated the null and alternative hypotheses.

**5.1 Comparison1. Sigmoid Kernel Radial Basis Function Network vs. Gaussian Kernel Radial Basis Function Network**

- Experimental unit: error (each experimental unit corresponds to one train/test split of the external bootstrap)
- Populations : 1 population (Handwriting dataset)
- 1-sample problem (each pair of errors is computed from the same trainset, so I can reduce this to a 1-sample problem by taking the difference between them)
- 2 measurements taken on each experimental unit (error computed by Gaussian Kernel Radial Basis Function Network and error computed by Sigmoid Kernel Radial Basis Function Network)
- Parameters of interest for this problem : Let Gi denote the i th error computed by Gaussian Kernel Radial Basis Function Network and Si denote the i th error computed by Sigmoid Kernel Radial Basis Function Network. Let Xi = Gi – Si, the (error i computed from Gaussian Kernel Radial Basis Function Network) – (error i computed from Sigmoid Kernel Radial Basis Function Network). I am interested in the mean of X1, X2, … , X35 which I will set as µ.
- H0 : µ <= 0, H1: µ > 0
- Assumption 1, Normality: The sample size is fairly large (n = 35), so by the Central Limit Theorem the mean of the differences of errors X1, X2, …, X35 is approximately normally distributed, which is what the t-test requires.
- Assumption 2, Randomness of the samples: Each trainset is selected by the bootstrap (repeated sampling) method. Therefore each training set (size 1000) and test set (size 5000) is chosen randomly.

**Confidence Interval (two-tailed confidence interval using R)**

I am interested in a one-tailed problem, but when constructing a confidence interval the two-tailed interval is the widely accepted convention, so I report that here.
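
For n = 35 differences with sample mean X̄ and sample standard deviation s, the interval reported below is the standard one-sample t interval X̄ ± t_(0.995, 34) · s/√35.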

```r
> RBF_Gaussian = c(0.9, 1.46, 1.02, 1.58, 1.2, 1.0, 0.96, 0.9, 1.5, 0.9, 1.32, 0.96, 1.86, 0.98, 1.16, 0.78, 1.58, 1.18, 1.04, 0.84, 1.02, 1.96, 0.92, 1.18, 1.04, 0.94, 0.86, 0.96, 0.58, 0.92, 0.92, 1.02, 1.34, 1.0, 0.9)
> RBF_Sigmoid = c(0.18, 0.24, 0.06, 0.12, 0.1, 0.06, 0.2, 0.24, 0.22, 0.18, 0.12, 0.18, 0.26, 0.14, 0.14, 0.08, 0.24, 0.12, 0.12, 0.26, 0.3, 0.14, 0.76, 0.34, 0.18, 0.22, 0.3, 0.24, 0.26, 0.12, 0.06, 0.16, 0.16, 0.06, 0.18)
> t.test(RBF_Gaussian - RBF_Sigmoid, conf.level = 0.99)$conf.int
[1] 0.7585858 1.0665570
attr(,"conf.level")
[1] 0.99
```

The confidence interval is (0.7585858, 1.0665570). With 99 percent confidence, the mean of (error computed from RBF_Gaussian) minus (error computed from RBF_Sigmoid) lies in this range. Since 0 is not in the range, there is strong evidence that the mean error of RBF_Gaussian is larger than the mean error of RBF_Sigmoid.

**t-test (One-sample, One-tailed t-test using R)**

```r
> t.test(RBF_Gaussian - RBF_Sigmoid, conf.level = 0.99, alt = "greater")

	One Sample t-test

data:  RBF_Gaussian - RBF_Sigmoid
t = 16.169, df = 34, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
99 percent confidence interval:
 0.7747974       Inf
sample estimates:
mean of x
0.9125714
```

The p-value is smaller than 2.2e-16, which is essentially 0. This means there is strong evidence to reject H0. Therefore I conclude that the mean error computed from RBF_Gaussian is larger than the mean error computed from RBF_Sigmoid. In other words, RBF_Sigmoid predicts better than RBF_Gaussian.

**5.2 Comparison2. Gaussian Kernel Radial Basis Function Network vs. Logistic Regression**

- Experimental unit: error (each experimental unit corresponds to one train/test split of the external bootstrap)
- Populations : 1 population (Handwriting dataset)
- 1-sample problem (each pair of errors is computed from the same trainset, so I can reduce this to a 1-sample problem by taking the difference between them)
- 2 measurements taken on each experimental unit (error computed by Gaussian Kernel Radial Basis Function Network and error computed by Logistic Regression)
- Parameters of interest for this problem : Let Li denote the i th error computed by Logistic Regression and Gi denote the i th error computed by Gaussian Kernel Radial Basis Function Network. Let Xi = Li – Gi, the (error i computed from Logistic Regression) – (error i computed from Gaussian Kernel Radial Basis Function Network). I am interested in the mean of X1,X2, … , X35 which I will set as µ.
- H0 : µ <= 0 , H1: µ > 0
- Assumption 1, Normality: The sample size is fairly large (n = 35), so by the Central Limit Theorem the mean of the differences of errors X1, X2, …, X35 is approximately normally distributed, which is what the t-test requires.
- Assumption 2, Randomness of the samples: Each trainset is selected by the bootstrap (repeated sampling) method. Therefore each training set (size 1000) and test set (size 5000) is chosen randomly.

**Confidence Interval (two-tailed confidence interval using R)**

I am interested in a one-tailed problem, but when constructing a confidence interval the two-tailed interval is the widely accepted convention, so I report that here.

```r
> Logistic_Regression = c(3.36, 2.28, 2.42, 1.68, 3.1, 2.8, 2.44, 2.62, 2.24, 2.7, 2.88, 3.02, 2.46, 2.04, 1.98, 2.32, 2.34, 2.74, 2.6, 2.72, 4.08, 2.82, 2.42, 2.28, 2.74, 2.34, 3.2, 3.86, 3.74, 3.02, 2.36, 2.76, 2.06, 2.8, 3.36)
> RBF_Gaussian = c(0.9, 1.46, 1.02, 1.58, 1.2, 1.0, 0.96, 0.9, 1.5, 0.9, 1.32, 0.96, 1.86, 0.98, 1.16, 0.78, 1.58, 1.18, 1.04, 0.84, 1.02, 1.96, 0.92, 1.18, 1.04, 0.94, 0.86, 0.96, 0.58, 0.92, 0.92, 1.02, 1.34, 1.0, 0.9)
> t.test(Logistic_Regression - RBF_Gaussian, conf.level = 0.99)$conf.int
[1] 1.271136 1.923150
attr(,"conf.level")
[1] 0.99
```

The confidence interval is (1.271136, 1.923150). With 99 percent confidence, the mean of (error computed from Logistic_Regression) minus (error computed from RBF_Gaussian) lies in this range. Since 0 is not in the range, there is strong evidence that the mean error of Logistic_Regression is larger than the mean error of RBF_Gaussian.

**t-test (One-sample, One-tailed t-test using R)**

```r
> t.test(Logistic_Regression - RBF_Gaussian, conf.level = 0.99, alt = "greater")

	One Sample t-test

data:  Logistic_Regression - RBF_Gaussian
t = 13.367, df = 34, p-value = 2.146e-15
alternative hypothesis: true mean is greater than 0
99 percent confidence interval:
 1.305458      Inf
sample estimates:
mean of x
 1.597143
```

The p-value is 2.146e-15, which is essentially 0. This means there is strong evidence to reject H0. Therefore I conclude that the mean error computed from Logistic_Regression is larger than the mean error computed from RBF_Gaussian. In other words, RBF_Gaussian predicts better than Logistic Regression.

From the above two t-tests, I can argue with strong evidence that RBF_Sigmoid predicts better than RBF_Gaussian and that RBF_Gaussian predicts better than Logistic Regression. By transitivity, I conclude that RBF_Sigmoid predicts best, followed by RBF_Gaussian and then Logistic Regression.

## 6 Conclusion (Performance of each method)

For classifying the handwritten digits 0 and 1, the Sigmoid Kernel Radial Basis Function Network predicted best among the three methods, followed by the Gaussian Kernel Radial Basis Function Network and then Logistic Regression.

- Sigmoid Kernel RBF Network: 0.193 (mean error over the 35 runs)
- Gaussian Kernel RBF Network: 1.105
- Logistic Regression: 2.702