import numpy as np
import matplotlib.pyplot as plt
import mlpy
np.random.seed(0)
mean1, cov1, n1 = [1, 5], [[1,1],[1,2]], 200 # 200 samples of class 1
x1 = np.random.multivariate_normal(mean1, cov1, n1)
y1 = np.ones(n1, dtype=np.int)
mean2, cov2, n2 = [2.5, 2.5], [[1,0],[0,1]], 300 # 300 samples of class -1
x2 = np.random.multivariate_normal(mean2, cov2, n2)
y2 = -np.ones(n2, dtype=np.int)
x = np.concatenate((x1, x2), axis=0) # concatenate the samples
y = np.concatenate((y1, y2))
en = mlpy.ElasticNetC(lmb=0.01, eps=0.001)
en.learn(x, y)
w = en.w() # coefficients
b = en.bias() # bias
en.iters() # number of iterations performed
xx = np.arange(np.min(x[:,0]), np.max(x[:,0]), 0.01)
yy = - (w[0] * xx + b) / w[1] # separator line
fig = plt.figure(1) # plot
plot1 = plt.plot(x1[:, 0], x1[:, 1], 'ob', x2[:, 0], x2[:, 1], 'or')
plot2 = plt.plot(xx, yy, '--k')
plt.show()
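A quick follow-up, not part of the original post: assuming mlpy's usual pred() method is available on ElasticNetC, the resubstitution (training) accuracy can be checked directly.
yhat = en.pred(x) # predicted +1/-1 labels for the training samples
np.mean(yhat == y) # fraction of correctly classified training points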
Wednesday, September 26, 2012
[Python] Linear Discriminant Analysis Classifier (LDAC)
Discriminant analysis works with continuous independent variables and a categorical dependent variable. LDA is closely related to PCA and factor analysis in that all three look for linear combinations of variables that best explain the data: LDA explicitly attempts to model the difference between the classes, whereas PCA does not take class membership into account at all, and factor analysis builds the feature combinations based on differences rather than similarities.
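Before the mlpy examples, here is a minimal numpy sketch, not from the original post, of the classical two-class LDA direction w ~ inv(pooled within-class covariance) * (mean1 - mean2), on the same kind of simulated data used below; it only illustrates the idea and is not claimed to reproduce mlpy.LDAC's exact coefficients.
import numpy as np
np.random.seed(0)
a = np.random.multivariate_normal([1, 5], [[1, 1], [1, 2]], 200) # class +1 samples
c = np.random.multivariate_normal([2.5, 2.5], [[1, 0], [0, 1]], 300) # class -1 samples
mu_a, mu_c = a.mean(axis=0), c.mean(axis=0)
pooled = (np.cov(a.T) * (len(a) - 1) + np.cov(c.T) * (len(c) - 1)) / (len(a) + len(c) - 2) # pooled within-class covariance
w_lda = np.linalg.solve(pooled, mu_a - mu_c) # direction along which the two class means are best separated
w_lda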
1. Binary classification
import numpy as np
import matplotlib.pyplot as plt
import mlpy
np.random.seed(0)
mean1, cov1, n1 = [1, 5], [[1,1],[1,2]], 200 # 200 samples of class 1
x1 = np.random.multivariate_normal(mean1, cov1, n1)
y1 = np.ones(n1, dtype=np.int)
mean2, cov2, n2 = [2.5, 2.5], [[1,0],[0,1]], 300 # 300 samples of class -1
x2 = np.random.multivariate_normal(mean2, cov2, n2)
y2 = -np.ones(n2, dtype=np.int)
x = np.concatenate((x1, x2), axis=0) # concatenate the samples
y = np.concatenate((y1, y2))
ldac = mlpy.LDAC()
ldac.learn(x, y)
w = ldac.w() # return the coefficient
b = ldac.bias() # return the bias
xx = np.arange(np.min(x[:,0]), np.max(x[:,0]), 0.01)
yy = - (w[0] * xx + b) / w[1] # separator line
fig = plt.figure(1) # plot
plot1 = plt.plot(x1[:, 0], x1[:, 1], 'ob', x2[:, 0], x2[:, 1], 'or') # 'o' means circle marker, 'b' means blue, and 'r' means red
plot2 = plt.plot(xx, yy, '--k')
plt.show()
2. Multi-class classification
mean1, cov1, n1 = [1, 25], [[1,1],[1,2]], 200 # 200 samples of class 0
x1 = np.random.multivariate_normal(mean1, cov1, n1)
y1 = np.zeros(n1, dtype=np.int)
mean2, cov2, n2 = [2.5, 22.5], [[1,0],[0,1]], 300 # 300 samples of class 1
x2 = np.random.multivariate_normal(mean2, cov2, n2)
y2 = np.ones(n2, dtype=np.int)
mean3, cov3, n3 = [5, 28], [[0.5,0],[0,0.5]], 200 # 200 samples of class 2
x3 = np.random.multivariate_normal(mean3, cov3, n3)
y3 = 2 * np.ones(n3, dtype=np.int)
x = np.concatenate((x1, x2, x3), axis=0) # concatenate the samples
y = np.concatenate((y1, y2, y3))
ldac = mlpy.LDAC()
ldac.learn(x, y)
w = ldac.w()
b = ldac.bias()
xx = np.arange(np.min(x[:,0]), np.max(x[:,0]), 0.01)
# pairwise decision boundaries: solve w[i].x + b[i] = w[j].x + b[j] for the second coordinate
yy1 = (xx * (w[1][0] - w[0][0]) + b[1] - b[0]) / (w[0][1] - w[1][1]) # boundary between class 0 and class 1
yy2 = (xx * (w[2][0] - w[0][0]) + b[2] - b[0]) / (w[0][1] - w[2][1]) # boundary between class 0 and class 2
yy3 = (xx * (w[2][0] - w[1][0]) + b[2] - b[1]) / (w[1][1] - w[2][1]) # boundary between class 1 and class 2
fig = plt.figure(1) # plot
plot1 = plt.plot(x1[:, 0], x1[:, 1], 'ob', x2[:, 0], x2[:, 1], 'or', x3[:, 0], x3[:, 1], 'og')
plot2 = plt.plot(xx, yy1, '--k')
plot3 = plt.plot(xx, yy2, '--k')
plot4 = plt.plot(xx, yy3, '--k')
plt.show()
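For completeness, and not in the original post: a new point can be classified either by taking the argmax of the three discriminant scores w[k].x + b[k] (the same scores whose pairwise equalities define the dashed boundaries above) or, assuming mlpy's usual pred() interface, directly with ldac.pred(); the two should agree.
xnew = np.array([2.5, 25.0]) # a hypothetical test point
scores = np.dot(w, xnew) + b # one linear discriminant score per class
np.argmax(scores) # index of the winning class (labels here are 0, 1, 2)
ldac.pred(xnew) # should return the same label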
Tuesday, September 25, 2012
[Python] Least Angle Regression (LARS)
Suppose we expect a response variable to be determined by a linear combination of a subset of potential covariates. Then the LARS algorithm provides a means of producing an estimate of which variables to include, as well as their coefficients.
Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.
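To make the "equiangular direction" concrete, here is the update in the notation of Efron, Hastie, Johnstone and Tibshirani (2004), given as a sketch rather than as mlpy's exact implementation:
$\hat{\mu}_{k+1} = \hat{\mu}_k + \gamma\, u_A, \qquad X_A^T u_A = A_A \mathbf{1},$
where $A$ is the active set, $u_A$ is the unit vector making equal angles with every active predictor, and $\gamma$ is the smallest step at which some inactive predictor's absolute correlation with the residual catches up with the active set and joins $A$.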
Advantages:
1. It is computationally just as fast as forward selection.
2. It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
3. If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
4. It is easily modified to produce solutions for other estimators, like the LASSO.
5. It is effective in contexts where p >> n (i.e., when the number of features is significantly greater than the number of samples).
Disadvantages:
1. With any amount of noise in the dependent variable and with high dimensional multicollinear independent variables, there is no reason to believe that the selected variables will have a high probability of being the actual underlying causal variables. This problem is not unique to LARS, as it is a general problem with variable selection approaches that seek to find underlying deterministic components. Yet, because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise.
2. Since almost all high dimensional data in the real world will just by chance exhibit some fair degree of collinearity across at least some variables, the problem that LARS has with correlated variables may limit its application to high dimensional data.
Python code:
import numpy as np
import mlpy
import matplotlib.pyplot as plt # required for plotting
diabetes = np.loadtxt("diabetes.data", skiprows=1) # http://www.stanford.edu/~hastie/Papers/LARS/diabetes.data
x = diabetes[:, :-1]
y = diabetes[:, -1]
x -= np.mean(x, axis=0) # center x
x /= np.sqrt(np.sum((x)**2, axis=0)) # normalize x
y -= np.mean(y) # center y
lars = mlpy.LARS()
lars.learn(x, y)
lars.steps() # number of steps performed
lars.beta() # regression coefficients
lars.beta0() # intercept (bias)
est = lars.est() # returns all LARS estimates
beta_sum = np.sum(np.abs(est), axis=1)
fig = plt.figure(1)
plot1 = plt.plot(beta_sum, est)
xl = plt.xlabel(r'$\sum{|\beta_j|}$')
yl = plt.ylabel(r'$\beta_j$')
plt.show()
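A small follow-up, not in the original post: judging from the plot above, est() returns one row of coefficients per LARS step, so fitted values at any step of the path can be recovered directly; the intercept is essentially zero here because y was centered.
beta_last = est[-1] # coefficients at the final LARS step
yhat = np.dot(x, beta_last) # fitted values on the centered data
np.sum((y - yhat)**2) # residual sum of squares at the end of the path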
[Python] Ordinary Linear Regression
import numpy as np
import mlpy
import matplotlib.pyplot as plt # required for plotting
np.random.seed(0)
mean, cov, n = [1, 5], [[1,1],[1,2]], 200 # specify the distribution parameters
d = np.random.multivariate_normal(mean, cov, n) # generate a sequence of vectors
x, y = d[:, 0].reshape(-1, 1), d[:, 1] # reshape() gives a new shape to an array without changing its data
ols = mlpy.OLS()
ols.learn(x, y)
xx = np.arange(np.min(x), np.max(x), 0.01).reshape(-1, 1)
yy = ols.pred(xx)
fig = plt.figure(1)
plot = plt.plot(x, y, 'o', xx, yy, '--k') # plot with fitted line added; '--' means dashed line and 'k' means black
plt.show()
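A short follow-up, not from the original post: the same pred() call used above can summarize the fit on the training points.
resid = y - ols.pred(x) # training residuals
np.mean(resid**2) # mean squared error
1 - np.sum(resid**2) / np.sum((y - np.mean(y))**2) # R^2 on the training data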
[Python] package mlpy and matplotlib
import numpy as np # scientific computing package
import mlpy
import matplotlib.pyplot as plt # required for plotting
iris = np.loadtxt('iris.csv', delimiter=',') # http://mlpy.sourceforge.net/docs/3.5/_downloads/iris.csv
x, y = iris[:, :4], iris[:, 4].astype(np.int)
x.shape, y.shape
# Principal Component Analysis
pca = mlpy.PCA() # new PCA instance
pca.learn(x) # learn from data
z = pca.transform(x, k=2) # choose 2 principal components
#plot the principal components
plt.set_cmap(plt.cm.Paired)
fig1 = plt.figure(1)
title = plt.title("PCA on iris dataset")
plot = plt.scatter(z[:, 0], z[:, 1], c=y)
labx = plt.xlabel("First component")
laby = plt.ylabel("Second component")
plt.show()
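A quick check, not in the original post and using only numpy: the share of total variance captured by the two retained components can be estimated from the transformed coordinates, assuming z is the plain (unwhitened) projection of the centered data onto the leading principal directions.
total_var = np.var(x - x.mean(axis=0), axis=0).sum() # total variance of the centered features
np.var(z, axis=0).sum() / total_var # fraction of variance explained by the first two components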
Thursday, September 13, 2012
[Python] Topic Model: Latent Semantic Indexing
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
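Conceptually, LSI is a truncated SVD of the term-document matrix. The toy numpy sketch below is not part of the gensim example; it keeps the two largest singular values, which is what num_topics=2 asks gensim to do further down (the count matrix A is made up for illustration).
import numpy as np
A = np.array([[1, 1, 0, 0], # rows = terms, columns = documents (a tiny made-up count matrix)
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2 # number of latent "topics" to keep
A_k = np.dot(U[:, :k] * s[:k], Vt[:k, :]) # best rank-2 approximation of A
# each column of Vt[:k, :] gives a document's coordinates (up to scaling by s) in the 2-dimensional LSI space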
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')
print dictionary.token2id
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split()) # doc2bow() counts the number of occurrences of each distinct word, converts each word to its integer word id and returns the result as a sparse (bag-of-words) vector
print new_vec
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use
print corpus
#Corpus Streaming – One Document at a Time
class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
#the corpus is now much more memory friendly, because at most one vector resides in RAM at a time
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
for vector in corpus_memory_friendly: # load one vector into memory at a time
    print vector
# topics and transformations
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
tfidf = models.TfidfModel(corpus) # step 1: initialize a model
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print doc
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(2) # print the two topics (num_topics=2 was chosen above)
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print doc
lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('/tmp/model.lsi')
# available transformations (bow_corpus and tfidf_corpus stand for a bag-of-words corpus and its tf-idf transform, e.g. corpus and corpus_tfidf above)
model = models.TfidfModel(bow_corpus, normalize=True) # tf-idf
model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300) # LSI
model = models.RpModel(tfidf_corpus, num_topics=500) # Random Projections
model = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=100) # LDA
model = models.HdpModel(bow_corpus, id2word=dictionary) # Hierarchical Dirichlet Process (HDP)
# similarity queries
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm')
vec_lsi = lsi[new_vec] # convert the query to LSI space
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
sims = index[vec_lsi] # perform a similarity query against the corpus
print list(enumerate(sims)) # print (document_number, document_similarity) 2-tuples
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print sims # print sorted (document number, similarity score) 2-tuples