Wednesday, October 24, 2012

[Python] Generating Histograms

import matplotlib.pyplot as plt
from numpy.random import normal
a=normal(10,2,size=1000)
plt.hist(a,bins=10)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()



With the option 'normed=True', it returns the normalized histogram, i.e., a probability density plot.
plt.hist(a,bins=10,normed=True)
plt.show()
 

With the option 'cumulative=True', it returns the cumulative distribution plot.
plt.hist(a,bins=10,normed=True,cumulative=True)
plt.show()
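
plt.hist also returns the bin counts, the bin edges and the drawn patches, so the underlying numbers can be inspected directly; numpy.histogram computes the same counts without drawing anything. A minimal sketch (the variable names are just for illustration):

import numpy as np
counts, bin_edges, patches = plt.hist(a, bins=10)
print counts      # number of samples falling into each of the 10 bins
print bin_edges   # the 11 bin boundaries

counts, bin_edges = np.histogram(a, bins=10)   # same counts, no plot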



[Python] Data Analysis toolkit 'pandas'

pandas is well suited for many different kinds of data:
  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

One example:
Suppose there are two data sets and I'd like to do an inner join between these two tables; the generic coding logic might look like this:

data1 = [['id1', 'Kevin'], ['id2', 'John'], ['id3', 'Mike']]   # id, name
data2 = [['id1', 31], ['id2', 28], ['id3', 34]]                # id, age
join = []
for b in data2:
    for a in data1:
        if b[0] == a[0]:       # matching ids: join the two rows
            row = list(b)      # copy [id, age]
            row.append(a[1])   # append the name
            join.append(row)
for j in join:
    print j



The output is as follows:
['id1', 31, 'Kevin']
['id2', 28, 'John']
['id3', 34, 'Mike']


With pandas, the code will be much shorter:
from pandas import *
new_data1 = DataFrame(data1, columns=['id', 'name'])
new_data2 = DataFrame(data2, columns=['id', 'age'])
join2 = merge(new_data1, new_data2, on='id', how='inner')   # inner join on the 'id' column
print join2


The output is as follows:
    id   name  age
0  id1  Kevin   31
1  id2   John   28
2  id3   Mike   34
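
The how argument covers the other common join types as well; a quick sketch with the same DataFrames (the variable names on the left are just for illustration):

left_join = merge(new_data1, new_data2, on='id', how='left')     # keep every row of new_data1
outer_join = merge(new_data1, new_data2, on='id', how='outer')   # keep rows from both tables
print outer_join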


Tuesday, October 2, 2012

[Python] K-nearest Neighbor classification

 In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on closest training examples in the feature space.
  • The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
  • In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
  • The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example cross-validation (choose the value of k that minimizes the misclassification rate). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
A drawback of the basic "majority voting" classification is that classes with more frequent examples tend to dominate the prediction of the new vector, simply because they are more likely to appear among the k nearest neighbors. One way to overcome this problem is to weight the votes by the distance from the test point to each of its k nearest neighbors, as sketched below.
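
To make the idea concrete, here is a minimal distance-weighted k-NN vote written directly in NumPy. It is only an illustrative sketch: the function name knn_predict and the 1/distance weighting are one simple choice, not the scheme used by mlpy in the code further down.

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    # Euclidean distance from the query point to every training sample
    dist = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]            # indices of the k closest samples
    weights = 1.0 / (dist[nearest] + 1e-10)   # closer neighbors carry more weight
    votes = {}
    for i, w in zip(nearest, weights):
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + w
    return max(votes, key=votes.get)          # label with the largest weighted vote

With k = 1 this reduces to the nearest neighbor rule described above; with uniform weights it is the plain majority vote.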



Python code:
import numpy as np
import matplotlib.pyplot as plt
import mlpy
np.random.seed(0)
mean1, cov1, n1 = [1, 5], [[1,1],[1,2]], 200  # 200 samples of class 1
x1 = np.random.multivariate_normal(mean1, cov1, n1)
y1 = np.ones(n1, dtype=np.int)
mean2, cov2, n2 = [2.5, 2.5], [[1,0],[0,1]], 300 # 300 samples of class 2
x2 = np.random.multivariate_normal(mean2, cov2, n2)
y2 = 2 * np.ones(n2, dtype=np.int)
mean3, cov3, n3 = [5, 8], [[0.5,0],[0,0.5]], 200 # 200 samples of class 3
x3 = np.random.multivariate_normal(mean3, cov3, n3)
y3 = 3 * np.ones(n3, dtype=np.int)
x = np.concatenate((x1, x2, x3), axis=0) # concatenate the samples
y = np.concatenate((y1, y2, y3))   
knn = mlpy.KNN(k=3)
knn.learn(x, y)
print knn.nclasses()  # print the number of classes (3 here)
xmin, xmax = x[:,0].min()-1, x[:,0].max()+1
ymin, ymax = x[:,1].min()-1, x[:,1].max()+1
xx, yy = np.meshgrid(np.arange(xmin, xmax, 0.1), np.arange(ymin, ymax, 0.1))
xnew = np.c_[xx.ravel(), yy.ravel()]    # testing data
ynew = knn.pred(xnew).reshape(xx.shape)   # Predict KNN model on a test point
ynew[ynew == 0] = 1 # set the samples with no unique classification to 1
fig = plt.figure(1)
plt.set_cmap(plt.cm.Paired)   # use the 'Paired' colormap
plot1 = plt.pcolormesh(xx, yy, ynew)   # plot the separated regions with different color
plot2 = plt.scatter(x[:,0], x[:,1], c=y)  # plot the scatter plot of the data
plt.show()