# Create some sample data that we can work with for our analysis.
import pandas as pd
import numpy as np
# Create a dataframe from a random numpy array
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html
df = pd.DataFrame(20 + np.random.randn(10, 4) * 5, columns=['a', 'b', 'c', 'd'])
print(df)
df2 = pd.DataFrame(df)  # This creates a new dataframe from df
df3 = pd.DataFrame(df)
df4 = pd.DataFrame(20 + np.random.randn(10, 4) * 5, columns=['e', 'f', 'g', 'h'])
print(df2)
print(df3)
print(df4)
#Notice how it automatically generates an index 0-9.
#Here we can see a structure similar to R's: we select the desired columns by passing a list.
print(df.iloc[:, 0:2])             # All rows, columns 0-1 (integer slices exclude the endpoint)
print(df.iloc[:, [0, 2, 3]])       # All rows, columns 0, 2, 3
print(df.loc[:, 'a':'b'])          # All rows, columns a-b (label slices include the endpoint)
print(df.loc[:, ['a', 'c', 'd']])  # All rows, columns a, c, d
columns = ['a', 'c', 'd']
icolumns = [0, 2, 3]
print(df.loc[:, columns])          # All rows, columns a, c, d
print(df.iloc[:, icolumns])        # All rows, columns a, c, d
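The endpoint behavior above is the easiest thing to get wrong, so here is a small sketch on a deterministic frame (same column names as df, but with fixed values) showing that `iloc` excludes the slice endpoint while `loc` includes it.

```python
import numpy as np
import pandas as pd

# Small deterministic frame with the same column names as df above.
toy = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])

by_position = toy.iloc[:, 0:2]  # integer slice: columns 0 and 1 only
by_label = toy.loc[:, 'a':'c']  # label slice: columns a, b, AND c

print(list(by_position.columns))  # ['a', 'b']
print(list(by_label.columns))     # ['a', 'b', 'c']
```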
#Here, we can remove specific columns from a dataframe using the drop method.
#df.drop([column name or list], inplace=True, axis=1)
df2.drop(['a', 'c'], inplace=True, axis=1)
print(df2)
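Note that without `inplace=True`, drop leaves the original dataframe untouched and returns a new one instead; a quick sketch of that difference:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.ones((3, 4)), columns=['a', 'b', 'c', 'd'])

# Without inplace=True, drop returns a new dataframe and leaves the original intact.
smaller = demo.drop(['a', 'c'], axis=1)

print(list(smaller.columns))  # ['b', 'd']
print(list(demo.columns))     # unchanged: ['a', 'b', 'c', 'd']
```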
Similarly, we might also want to select rows, and we can use much the same syntax.
print(df2)
## Selecting rows
print(df.loc[0:5, :])      # Select rows with labels 0-5 (label slices include the endpoint)
print(df.loc[0:5, ])       # Same selection; the column specifier is optional
print(df.loc[0:5])         # Same selection again
print(df.loc[[1, 2, 4]])   # Select rows 1, 2, and 4
#We can now generate a boolean vector based on a criterion and then use it for selection.
select = df['a'] >= 20
print(select)
print(df.loc[select])  # Notice that by passing only a boolean vector we select rows and keep all columns.
select2 = (df['a'] > 20) & (df['c'] < 30)  # More complex criteria
print(select2)
print(df.loc[select2])
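Because df above is random, the rows selected will vary from run to run; a sketch on a fixed frame makes the mask behavior predictable, and also shows `|` (or) alongside the `&` (and) used above.

```python
import pandas as pd

# Deterministic frame so the selection result is predictable.
scores = pd.DataFrame({'a': [10, 25, 30, 15], 'c': [5, 28, 35, 40]})

mask = (scores['a'] > 20) & (scores['c'] < 30)    # rows where both conditions hold
either = (scores['a'] > 20) | (scores['c'] < 30)  # rows where at least one holds

print(scores.loc[mask])          # only the row with a=25, c=28
print(len(scores.loc[either]))   # 3
```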
These toy examples are fine, but for this exercise we will use the Titanic dataset. We want to split it out in a few different ways.
1P. Subset only the survived, sex, and age columns and create a new dataframe with the results.
2P. Drop the ticket and the cabin columns and create a new dataframe with the results.
3P. Split the dataset into a train dataframe having the first 80% of the records and a test dataframe having the last 20%.
4P. Randomly sample the dataset so that there are 2 samples.
5P. Create one dataframe with all of the males and a second with all of the females.
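The kinds of splits asked for above can be sketched on a stand-in frame (the real Titanic column names like survived and sex would replace the toy ones here):

```python
import numpy as np
import pandas as pd

# Stand-in for the Titanic dataframe; the real column names would differ.
toy = pd.DataFrame(20 + np.random.randn(10, 4) * 5, columns=['a', 'b', 'c', 'd'])

# 80/20 split by position, as in 3P.
cut = int(len(toy) * 0.8)
train, test = toy.iloc[:cut], toy.iloc[cut:]

# Random sample without replacement, as in 4P (a sample of 2 rows).
sampled = toy.sample(n=2, random_state=0)

# Boolean split into two frames, as in 5P (here on a numeric criterion).
high, low = toy[toy['a'] >= 20], toy[toy['a'] < 20]

print(len(train), len(test), len(sampled), len(high) + len(low))  # 8 2 2 10
```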
url = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv'
cr = pd.read_csv(url)  # read_csv can fetch a URL directly
cr
In R we used rbind for data that had the same structure but new rows. Here we can do the same with the concat method. You can read more about concat here. If a key is not specified, concat will use the default index of each dataframe; as the documentation shows, it is also possible to specify a key.
dfbyrow = pd.concat([df, df3])  # This is equivalent to a rowbind in R.
print(dfbyrow)
dfbyrow = pd.concat([dfbyrow, df3])  # DataFrame.append was removed in pandas 2.0; use concat and reassign.
print(dfbyrow)
# Note how ignore_index=True resets and renumbers the index, while in the earlier examples the index was kept.
addition = pd.concat([df, df], ignore_index=True)
print(addition)
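The note above mentioned that concat can also take a key for each piece; a sketch of what the keys argument produces (a MultiIndex that records which dataframe each row came from):

```python
import pandas as pd

first = pd.DataFrame({'a': [1, 2]})
second = pd.DataFrame({'a': [3, 4]})

# keys labels each piece, producing a MultiIndex on the rows.
stacked = pd.concat([first, second], keys=['first', 'second'])

print(stacked.loc['second'])  # just the rows that came from `second`
print(stacked.index.get_level_values(0).unique().tolist())  # ['first', 'second']
```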
#Merging additional columns also uses the concat function, with axis=1.
dfbycolumns = pd.concat([df, df4], axis=1, join='inner')
print(dfbycolumns)
#Here we generate a small dataframe to use in merging so you can see the difference between specifying inner and outer.
shortdf = df[0:5]
dfbycolumns = pd.concat([df, shortdf], axis=1, join='inner')
print(dfbycolumns)
#Here, outer keeps the union of the indexes (all of df's rows), filling shortdf's missing
#values with NaN, which for this dataset is the equivalent of a left outer join.
shortdf = df[0:5]
dfbycolumns = pd.concat([df, shortdf], axis=1, join='outer')
print(dfbycolumns)
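The inner/outer difference is easier to see with fixed row counts; a sketch where the short frame has 5 of the full frame's 10 rows:

```python
import numpy as np
import pandas as pd

full = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['a', 'b'])
short = full[0:5]

inner = pd.concat([full, short], axis=1, join='inner')  # intersection of the indexes
outer = pd.concat([full, short], axis=1, join='outer')  # union; missing cells become NaN

print(len(inner), len(outer))      # 5 10
print(outer.iloc[9].isna().sum())  # 2: the last row has NaN in both of short's columns
```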
url = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv'
titantic = pd.read_csv(url)  # read_csv can fetch a URL directly
titantic
6P. You will find another dataset on github that contains additional rows of individuals here. (In reality they are duplicates, but let's say we didn't know that.) Provide the code to first import this dataset and then combine it with the original titantic dataframe into a new one called titantic_
7P. You will find another dataset with additional variables that may help in the analysis here. Note: this is a little trickier, as there currently isn't a key in the original dataset. As a result, do the following substeps: (a) Provide code to verify the number of records in the original and new datasets. (b) Generate a key for the original dataset in the pattern of the new dataset. (c) Provide code to merge the two datasets into titanticcats.
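Since the extra-variables file isn't shown here, a sketch of the substeps on two hypothetical stand-in frames (the column names and 1-based id pattern are assumptions, not the real dataset's):

```python
import pandas as pd

# Hypothetical stand-ins: `orig` lacks a key, `extra` carries an id-style key.
orig = pd.DataFrame({'survived': [0, 1, 1], 'sex': ['male', 'female', 'female']})
extra = pd.DataFrame({'id': [1, 2, 3], 'fare_bucket': ['low', 'high', 'mid']})

# (a) Verify the record counts match before merging.
print(len(orig), len(extra))  # 3 3

# (b) Generate a key in orig following the pattern of the new dataset (1-based ids).
orig['id'] = range(1, len(orig) + 1)

# (c) Merge the two datasets on the shared key.
merged = orig.merge(extra, on='id')
print(len(merged), list(merged.columns))
```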