Wednesday, October 24, 2012

[Python] Data Analysis toolkit 'pandas'

pandas is well suited for many different kinds of data:
  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
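To make the first two kinds of data concrete, here is a minimal sketch. The sample values and the variable names people and visits are made up for illustration; only DataFrame, Series and to_datetime come from pandas itself.

```python
import pandas as pd

# Tabular data: each column has its own type (hypothetical sample values).
people = pd.DataFrame({
    'name': ['Kevin', 'John', 'Mike'],
    'age': [31, 28, 34],
    'member': [True, False, True],
})

# Time series: the timestamps do not have to be evenly spaced.
visits = pd.Series(
    [3, 5, 2],
    index=pd.to_datetime(['2012-10-01', '2012-10-04', '2012-10-24']),
)

print(people.dtypes)  # one dtype per column
print(visits)
```

Printing people.dtypes shows that each column keeps its own type, which is what "heterogeneously-typed" means in the list above.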

One example:
Suppose there are two data sets that share an id field, and I'd like to do an inner join between them on that field. In plain Python, the code might look like this:

# Two small tables that share an id in the first field
data1 = [['id1', 'Kevin'], ['id2', 'John'], ['id3', 'Mike']]
data2 = [['id1', 31], ['id2', 28], ['id3', 34]]
join = []
for b in data2:
    for a in data1:
        if b[0] == a[0]:
            row = list(b)  # copy the row so data2 is not modified
            row.append(a[1])
            join.append(row)
for j in join:
    print(j)



The output is as follows:
['id1', 31, 'Kevin']
['id2', 28, 'John']
['id3', 34, 'Mike']
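The nested loops compare every row of one table with every row of the other. As an aside, the same inner join can be done in a single pass by indexing one table by id first. This is only a sketch: it assumes each id appears at most once in data1, and the names names_by_id and joined are mine.

```python
data1 = [['id1', 'Kevin'], ['id2', 'John'], ['id3', 'Mike']]
data2 = [['id1', 31], ['id2', 28], ['id3', 34]]

# Index data1 by id once, so each row of data2 needs only one lookup.
# Assumption: ids are unique within data1.
names_by_id = {row[0]: row[1] for row in data1}

joined = []
for row_id, age in data2:
    if row_id in names_by_id:  # inner join: skip ids missing from data1
        joined.append([row_id, age, names_by_id[row_id]])

for row in joined:
    print(row)
```

For the sample data this produces the same three rows as the nested-loop version.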


With pandas, the code will be much shorter:
import pandas as pd
new_data1 = pd.DataFrame(data1, columns=['id', 'name'])
new_data2 = pd.DataFrame(data2, columns=['id', 'age'])
join2 = pd.merge(new_data1, new_data2, on='id', how='inner')
print(join2)


The output is as follows:
    id   name  age
0  id1  Kevin   31
1  id2   John   28
2  id3   Mike   34
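In the example every id appears in both tables, so the effect of how='inner' is not visible. The sketch below uses made-up data in which the ids only partly overlap and compares how='inner' with how='left'. The frames names and ages are hypothetical and not part of the example above.

```python
import pandas as pd

# Hypothetical data: 'id3' has no age, and 'id4' has no name.
names = pd.DataFrame([['id1', 'Kevin'], ['id2', 'John'], ['id3', 'Mike']],
                     columns=['id', 'name'])
ages = pd.DataFrame([['id1', 31], ['id2', 28], ['id4', 40]],
                    columns=['id', 'age'])

inner = pd.merge(names, ages, on='id', how='inner')  # only ids found in both
left = pd.merge(names, ages, on='id', how='left')    # every row of names

print(inner)
print(left)
```

The inner join keeps only id1 and id2. The left join also keeps id3, with a missing value (NaN) in the age column.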
