Technology Fundamentals for Business Analytics Lab 4

Overview

Subsetting data, merging data, and aggregating data in Python.

In [145]:
#This is going to create some sample data that we can work with for our analysis. 

import pandas as pd
import numpy as np
 
#Create a dataframe from a random numpy array 
#http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html
df = pd.DataFrame((20+np.random.randn(10, 4)*5),  columns=['a', 'b', 'c', 'd'] )
print df
df2 = pd.DataFrame(df) # This creates a new dataframe from df 
df3 = pd.DataFrame(df)
df4 = pd.DataFrame((20+np.random.randn(10, 4)*5),  columns=['e', 'f', 'g', 'h'] )
print df2
print df3
print df4

#Notice how it automatically generates an index 0-9. 
           a          b          c          d
0  20.966500  26.149632  18.416406   5.737357
1  18.108707  15.533413  31.860974  18.567691
2  26.906641  25.743868  22.561520  22.533756
3  25.783053  19.113882  17.197225  15.789697
4  24.028175   8.235366  21.432091  24.244823
5  20.022652  18.769690  13.452382  21.823068
6  20.806974  22.350539  20.195054  13.126667
7  20.459724  20.323686  22.118194  17.553785
8  13.578215  20.332386  26.693867  16.566077
9  24.931810  17.613739  21.605902  20.668964
           a          b          c          d
0  20.966500  26.149632  18.416406   5.737357
1  18.108707  15.533413  31.860974  18.567691
2  26.906641  25.743868  22.561520  22.533756
3  25.783053  19.113882  17.197225  15.789697
4  24.028175   8.235366  21.432091  24.244823
5  20.022652  18.769690  13.452382  21.823068
6  20.806974  22.350539  20.195054  13.126667
7  20.459724  20.323686  22.118194  17.553785
8  13.578215  20.332386  26.693867  16.566077
9  24.931810  17.613739  21.605902  20.668964
           a          b          c          d
0  20.966500  26.149632  18.416406   5.737357
1  18.108707  15.533413  31.860974  18.567691
2  26.906641  25.743868  22.561520  22.533756
3  25.783053  19.113882  17.197225  15.789697
4  24.028175   8.235366  21.432091  24.244823
5  20.022652  18.769690  13.452382  21.823068
6  20.806974  22.350539  20.195054  13.126667
7  20.459724  20.323686  22.118194  17.553785
8  13.578215  20.332386  26.693867  16.566077
9  24.931810  17.613739  21.605902  20.668964
           e          f          g          h
0  15.602570  13.959673  20.825387  22.639154
1  16.899620  27.812643  14.706201  19.191837
2  14.567181  21.955734  21.035244  14.765904
3  19.816062  18.350397  14.726465  18.150497
4  33.768035   9.751505  29.089492  25.721509
5  22.641559  26.843551  22.691911  19.679208
6  23.127548  17.296965  26.012906  25.210397
7  18.213551  13.677088  21.306722  21.756867
8  13.716335  18.647608  19.078827  29.794494
9  17.225318  16.816098  26.180554  29.732283

Selecting columns

The Panda's package includes with it functionality to do a wide variety of different methods. You can read more about the documentation for the ix method here.

In [45]:
#Here we can see that there is a similar structure to R, with selecting the desired columns by passing a list.

print df.ix[:,0:2]     #All rows, column 0-2
print df.ix[:,[0,2,3]]     #All rows, column 0-2
print df.ix[:,'a':'b'] #All rows, column a-b
print df.ix[:,['a','c','d']] #All rows, columns a, c, d
columns = ['a','c','d']
icolumns= [0,2,3]
print df.ix[:,columns ] #All rows, columns a, c, d
print df.ix[:,icolumns ] #All rows, columns a, c, d
           a          b
0   9.621240  18.132585
1  18.138869  20.814594
2  22.087314   5.119496
3  30.887575  19.966378
4  30.573405  16.890415
5  18.468208  21.384847
6  20.376342  17.546125
7  10.140928  25.543481
8  15.397550  17.977359
9  15.409230  16.876306
           a          c          d
0   9.621240  22.988847  18.548486
1  18.138869  17.016862  15.822263
2  22.087314  17.934707  20.088930
3  30.887575  17.157069  15.753578
4  30.573405  22.917900  25.146258
5  18.468208  23.592275  15.926295
6  20.376342  26.749858  20.687370
7  10.140928  20.881006  14.882369
8  15.397550  16.533816  25.428877
9  15.409230  12.364177  23.109493
           a          b
0   9.621240  18.132585
1  18.138869  20.814594
2  22.087314   5.119496
3  30.887575  19.966378
4  30.573405  16.890415
5  18.468208  21.384847
6  20.376342  17.546125
7  10.140928  25.543481
8  15.397550  17.977359
9  15.409230  16.876306
           a          c          d
0   9.621240  22.988847  18.548486
1  18.138869  17.016862  15.822263
2  22.087314  17.934707  20.088930
3  30.887575  17.157069  15.753578
4  30.573405  22.917900  25.146258
5  18.468208  23.592275  15.926295
6  20.376342  26.749858  20.687370
7  10.140928  20.881006  14.882369
8  15.397550  16.533816  25.428877
9  15.409230  12.364177  23.109493
           a          c          d
0   9.621240  22.988847  18.548486
1  18.138869  17.016862  15.822263
2  22.087314  17.934707  20.088930
3  30.887575  17.157069  15.753578
4  30.573405  22.917900  25.146258
5  18.468208  23.592275  15.926295
6  20.376342  26.749858  20.687370
7  10.140928  20.881006  14.882369
8  15.397550  16.533816  25.428877
9  15.409230  12.364177  23.109493
           a          c          d
0   9.621240  22.988847  18.548486
1  18.138869  17.016862  15.822263
2  22.087314  17.934707  20.088930
3  30.887575  17.157069  15.753578
4  30.573405  22.917900  25.146258
5  18.468208  23.592275  15.926295
6  20.376342  26.749858  20.687370
7  10.140928  20.881006  14.882369
8  15.397550  16.533816  25.428877
9  15.409230  12.364177  23.109493
In [117]:
#Here, we can remove columns specifically from a dataframe useing the drop method.
#df.drop([Column Name or list],inplace=True,axis=1)
df2.drop(['a','c'], inplace=True, axis=1)
print df2
           b          d
0  22.836678  20.393262
1   9.621118  21.057799
2  24.759717  15.643264
3  24.508768  18.295099
4  18.945111  22.177052
5  15.870739  17.891313
6  14.351874  24.533532
7  27.078430  16.055526
8  10.151625  26.425613
9  13.873434  13.262932

Selecting Rows

Similarly, we also might want to select out rows, and we can utilize the same syntax.

In [105]:
print df2
           a          b          c          d
0  25.527686  22.407352  22.317725  20.534511
1  16.093873  24.891329  19.787384  18.769327
2  19.956117  17.632472  16.891735  26.756932
3  26.076711  19.088475  28.210971  19.038794
4  14.086152  20.004384  21.171145  18.870483
5  28.200848  18.926232  13.378511  11.371261
6  26.140128  18.097206  25.776707  20.295516
7  20.780529  19.037016  16.275165  15.460639
8  23.273232  25.909209  15.742647  32.962132
9  21.805960  20.098517  16.789451  16.357269
In [46]:
## Selecting rows
print df.ix[0:5,:]     #Select rows 1-5
print df.ix[0:5,]      #Select rows 1-5
print df.ix[0:5]       #Select rows 1-5
print df.ix[[1,2,4]]   #Select rows 1, 2, and 4
           a          b          c          d
0   9.621240  18.132585  22.988847  18.548486
1  18.138869  20.814594  17.016862  15.822263
2  22.087314   5.119496  17.934707  20.088930
3  30.887575  19.966378  17.157069  15.753578
4  30.573405  16.890415  22.917900  25.146258
5  18.468208  21.384847  23.592275  15.926295
           a          b          c          d
0   9.621240  18.132585  22.988847  18.548486
1  18.138869  20.814594  17.016862  15.822263
2  22.087314   5.119496  17.934707  20.088930
3  30.887575  19.966378  17.157069  15.753578
4  30.573405  16.890415  22.917900  25.146258
5  18.468208  21.384847  23.592275  15.926295
           a          b          c          d
0   9.621240  18.132585  22.988847  18.548486
1  18.138869  20.814594  17.016862  15.822263
2  22.087314   5.119496  17.934707  20.088930
3  30.887575  19.966378  17.157069  15.753578
4  30.573405  16.890415  22.917900  25.146258
5  18.468208  21.384847  23.592275  15.926295
           a          b          c          d
1  18.138869  20.814594  17.016862  15.822263
2  22.087314   5.119496  17.934707  20.088930
4  30.573405  16.890415  22.917900  25.146258
In [55]:
#We can now generate a vector based on a critera and then use this for selection
select = df['a']>=20
print select
print df.ix[select] #Notice by including only one variable we are selecting rows and all columns.

select2 = (df['a']>20) & (df['c'] < 30)  #More complex criteria
print select2
print df.ix[select2]
0    False
1    False
2     True
3     True
4     True
5    False
6     True
7    False
8    False
9    False
Name: a, dtype: bool
           a          b          c          d
2  22.087314   5.119496  17.934707  20.088930
3  30.887575  19.966378  17.157069  15.753578
4  30.573405  16.890415  22.917900  25.146258
6  20.376342  17.546125  26.749858  20.687370
0    False
1    False
2     True
3     True
4     True
5    False
6     True
7    False
8    False
9    False
dtype: bool

Exercises

These toy examples are fine, but for this exercise we will use the Titaantic dataset. We want to split this out in a few different ways.

1P. Subset only the survived, sex, and age columns and create a new dataframe with the results.

2P. Drop the ticket and the cabin and create a new dataframe with the results.

3P. Split the dataset into a train dataframe having the first 80% of the records and a test dataframe having the last 20%.

4P. Randomely sample the dataset so that there are 2 samples.

5P. Create one dataframe with all of the males and a second with all of the females.

In [144]:
import csv
import urllib2

url = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv'
response = urllib2.urlopen(url)
cr = pd.read_csv(response)
cr
Out[144]:
survived pclass name sex age sibsp parch ticket fare cabin embarked
0 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
10 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S
11 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S
12 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 NaN S
13 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S
14 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 NaN S
15 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 NaN S
16 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 NaN Q
17 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 0 345763 18.0000 NaN S
19 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 NaN S
21 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
22 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 NaN Q
23 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
24 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 NaN S
25 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 5 347077 31.3875 NaN S
26 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
28 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
861 0 2 Giles, Mr. Frederick Edward male 21 1 0 28134 11.5000 NaN S
862 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 0 17466 25.9292 D17 S
863 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 0 2 Gill, Mr. John William male 24 0 0 233866 13.0000 NaN S
865 1 2 Bystrom, Mrs. (Karolina) female 42 0 0 236852 13.0000 NaN S
866 1 2 Duran y More, Miss. Asuncion female 27 1 0 SC/PARIS 2149 13.8583 NaN C
867 0 1 Roebling, Mr. Washington Augustus II male 31 0 0 PC 17590 50.4958 A24 S
868 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 1 3 Johnson, Master. Harold Theodor male 4 1 1 347742 11.1333 NaN S
870 0 3 Balkic, Mr. Cerin male 26 0 0 349248 7.8958 NaN S
871 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 1 11751 52.5542 D35 S
872 0 1 Carlsson, Mr. Frans Olof male 33 0 0 695 5.0000 B51 B53 B55 S
873 0 3 Vander Cruyssen, Mr. Victor male 47 0 0 345765 9.0000 NaN S
874 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 0 P/PP 3381 24.0000 NaN C
875 1 3 Najib, Miss. Adele Kiamie "Jane" female 15 0 0 2667 7.2250 NaN C
876 0 3 Gustafsson, Mr. Alfred Ossian male 20 0 0 7534 9.8458 NaN S
877 0 3 Petroff, Mr. Nedelio male 19 0 0 349212 7.8958 NaN S
878 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 1 11767 83.1583 C50 C
880 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26.0000 NaN S
881 0 3 Markun, Mr. Johann male 33 0 0 349257 7.8958 NaN S
882 0 3 Dahlberg, Miss. Gerda Ulrika female 22 0 0 7552 10.5167 NaN S
883 0 2 Banfield, Mr. Frederick James male 28 0 0 C.A./SOTON 34068 10.5000 NaN S
884 0 3 Sutehall, Mr. Henry Jr male 25 0 0 SOTON/OQ 392076 7.0500 NaN S
885 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.1250 NaN Q
886 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.0000 NaN S
887 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.0000 B42 S
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.0000 C148 C
890 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

Merging Data

In R we used rbind for data that had the same structure but with new rows. Here we can do the same with the concat method. You can read more about concat here. If a key is not specified, it will use the default index of the dataframe. As you can see in the documentation though, it is also possible to specify a key.

In [133]:
 

dfbyrow=pd.concat([df, df3])  #This is equivalent to a rowbind in R. 
print dfbyrow

dfbyrow.append(df3)
print dfbyrow
           a          b          c          d
0  25.289622  18.179422  14.356260  19.259194
1  18.116701  16.448435  21.732393  12.654283
2  12.682688  17.281966  25.048048  17.929218
3  21.582657  16.690173  27.146375  26.151657
4  12.767195  22.253991  24.193309  18.315352
5  27.359973  20.633188  16.102421  19.323444
6  18.104457  10.378091  26.475561  25.473413
7  22.144911  23.449593  13.033869  10.413575
8  13.486282  16.818658  25.171337  13.634255
9  24.588644  20.248421  11.477827  14.714172
           a          b          c          d
0  25.289622  18.179422  14.356260  19.259194
1  18.116701  16.448435  21.732393  12.654283
2  12.682688  17.281966  25.048048  17.929218
3  21.582657  16.690173  27.146375  26.151657
4  12.767195  22.253991  24.193309  18.315352
5  27.359973  20.633188  16.102421  19.323444
6  18.104457  10.378091  26.475561  25.473413
7  22.144911  23.449593  13.033869  10.413575
8  13.486282  16.818658  25.171337  13.634255
9  24.588644  20.248421  11.477827  14.714172
           a          b          c          d
0  25.289622  18.179422  14.356260  19.259194
1  18.116701  16.448435  21.732393  12.654283
2  12.682688  17.281966  25.048048  17.929218
3  21.582657  16.690173  27.146375  26.151657
4  12.767195  22.253991  24.193309  18.315352
5  27.359973  20.633188  16.102421  19.323444
6  18.104457  10.378091  26.475561  25.473413
7  22.144911  23.449593  13.033869  10.413575
8  13.486282  16.818658  25.171337  13.634255
9  24.588644  20.248421  11.477827  14.714172
0  25.289622  18.179422  14.356260  19.259194
1  18.116701  16.448435  21.732393  12.654283
2  12.682688  17.281966  25.048048  17.929218
3  21.582657  16.690173  27.146375  26.151657
4  12.767195  22.253991  24.193309  18.315352
5  27.359973  20.633188  16.102421  19.323444
6  18.104457  10.378091  26.475561  25.473413
7  22.144911  23.449593  13.033869  10.413575
8  13.486282  16.818658  25.171337  13.634255
9  24.588644  20.248421  11.477827  14.714172
           a          b          c          d
0  25.289622  18.179422  14.356260  19.259194
1  18.116701  16.448435  21.732393  12.654283
2  12.682688  17.281966  25.048048  17.929218
3  21.582657  16.690173  27.146375  26.151657
4  12.767195  22.253991  24.193309  18.315352
5  27.359973  20.633188  16.102421  19.323444
6  18.104457  10.378091  26.475561  25.473413
7  22.144911  23.449593  13.033869  10.413575
8  13.486282  16.818658  25.171337  13.634255
9  24.588644  20.248421  11.477827  14.714172
0  25.289622  18.179422  14.356260  19.259194
1  18.116701  16.448435  21.732393  12.654283
2  12.682688  17.281966  25.048048  17.929218
3  21.582657  16.690173  27.146375  26.151657
4  12.767195  22.253991  24.193309  18.315352
5  27.359973  20.633188  16.102421  19.323444
6  18.104457  10.378091  26.475561  25.473413
7  22.144911  23.449593  13.033869  10.413575
8  13.486282  16.818658  25.171337  13.634255
9  24.588644  20.248421  11.477827  14.714172
In [139]:
# View how the index here from df has been reset and incremented while in the earlier example the index was kept. 
addition = df.append(df, ignore_index=True)
print result
            a          b          c          d
0   25.289622  18.179422  14.356260  19.259194
1   18.116701  16.448435  21.732393  12.654283
2   12.682688  17.281966  25.048048  17.929218
3   21.582657  16.690173  27.146375  26.151657
4   12.767195  22.253991  24.193309  18.315352
5   27.359973  20.633188  16.102421  19.323444
6   18.104457  10.378091  26.475561  25.473413
7   22.144911  23.449593  13.033869  10.413575
8   13.486282  16.818658  25.171337  13.634255
9   24.588644  20.248421  11.477827  14.714172
10  25.289622  18.179422  14.356260  19.259194
11  18.116701  16.448435  21.732393  12.654283
12  12.682688  17.281966  25.048048  17.929218
13  21.582657  16.690173  27.146375  26.151657
14  12.767195  22.253991  24.193309  18.315352
15  27.359973  20.633188  16.102421  19.323444
16  18.104457  10.378091  26.475561  25.473413
17  22.144911  23.449593  13.033869  10.413575
18  13.486282  16.818658  25.171337  13.634255
19  24.588644  20.248421  11.477827  14.714172
In [127]:
#Merging additional columns also uses the concat function

dfbycolumns = pd.concat([df, df4], axis=1, join='inner')
print dfbycolumns
           a          b          c          d          e          f  \
0  25.289622  18.179422  14.356260  19.259194  14.261130  13.030898   
1  18.116701  16.448435  21.732393  12.654283  19.314614  14.275413   
2  12.682688  17.281966  25.048048  17.929218  27.116462  24.018898   
3  21.582657  16.690173  27.146375  26.151657  20.057643  21.835530   
4  12.767195  22.253991  24.193309  18.315352  27.831429  20.197306   
5  27.359973  20.633188  16.102421  19.323444  16.312991  17.699751   
6  18.104457  10.378091  26.475561  25.473413   8.668287  19.783580   
7  22.144911  23.449593  13.033869  10.413575  26.230189  21.859006   
8  13.486282  16.818658  25.171337  13.634255  23.176259  15.032589   
9  24.588644  20.248421  11.477827  14.714172  18.104855  17.333459   

           g          h  
0  12.785679  11.116347  
1   9.961737  28.673931  
2  15.457469  31.437990  
3  13.647138  25.170408  
4  21.066324  18.478853  
5   8.974080  26.271553  
6  14.592943  22.046170  
7  24.837221  26.182974  
8  17.706703  15.145252  
9  22.473041  21.062288  
In [131]:
#Here we are generating a small dataframe to be used in merging so you can see the differences in specifying inner & outer, 
shortdf=df[0:5]
dfbycolumns = pd.concat([df, shortdf], axis=1, join='inner')
print dfbycolumns
           a          b          c          d          a          b  \
0  25.289622  18.179422  14.356260  19.259194  25.289622  18.179422   
1  18.116701  16.448435  21.732393  12.654283  18.116701  16.448435   
2  12.682688  17.281966  25.048048  17.929218  12.682688  17.281966   
3  21.582657  16.690173  27.146375  26.151657  21.582657  16.690173   
4  12.767195  22.253991  24.193309  18.315352  12.767195  22.253991   

           c          d  
0  14.356260  19.259194  
1  21.732393  12.654283  
2  25.048048  17.929218  
3  27.146375  26.151657  
4  24.193309  18.315352  
In [132]:
#Here, the outer does the equivalent of a left outer join for this dataset. 
shortdf=df[0:5]
dfbycolumns = pd.concat([df, shortdf], axis=1, join='outer')
print dfbycolumns
           a          b          c          d          a          b  \
0  25.289622  18.179422  14.356260  19.259194  25.289622  18.179422   
1  18.116701  16.448435  21.732393  12.654283  18.116701  16.448435   
2  12.682688  17.281966  25.048048  17.929218  12.682688  17.281966   
3  21.582657  16.690173  27.146375  26.151657  21.582657  16.690173   
4  12.767195  22.253991  24.193309  18.315352  12.767195  22.253991   
5  27.359973  20.633188  16.102421  19.323444        NaN        NaN   
6  18.104457  10.378091  26.475561  25.473413        NaN        NaN   
7  22.144911  23.449593  13.033869  10.413575        NaN        NaN   
8  13.486282  16.818658  25.171337  13.634255        NaN        NaN   
9  24.588644  20.248421  11.477827  14.714172        NaN        NaN   

           c          d  
0  14.356260  19.259194  
1  21.732393  12.654283  
2  25.048048  17.929218  
3  27.146375  26.151657  
4  24.193309  18.315352  
5        NaN        NaN  
6        NaN        NaN  
7        NaN        NaN  
8        NaN        NaN  
9        NaN        NaN  
In [141]:
import csv
import urllib2

url = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv'
response = urllib2.urlopen(url)
titantic = pd.read_csv(response)
In [142]:
titantic
Out[142]:
survived pclass name sex age sibsp parch ticket fare cabin embarked
0 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
10 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S
11 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S
12 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 NaN S
13 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S
14 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 NaN S
15 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 NaN S
16 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 NaN Q
17 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 0 345763 18.0000 NaN S
19 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 NaN S
21 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
22 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 NaN Q
23 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
24 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 NaN S
25 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 5 347077 31.3875 NaN S
26 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
28 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
861 0 2 Giles, Mr. Frederick Edward male 21 1 0 28134 11.5000 NaN S
862 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 0 17466 25.9292 D17 S
863 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 0 2 Gill, Mr. John William male 24 0 0 233866 13.0000 NaN S
865 1 2 Bystrom, Mrs. (Karolina) female 42 0 0 236852 13.0000 NaN S
866 1 2 Duran y More, Miss. Asuncion female 27 1 0 SC/PARIS 2149 13.8583 NaN C
867 0 1 Roebling, Mr. Washington Augustus II male 31 0 0 PC 17590 50.4958 A24 S
868 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 1 3 Johnson, Master. Harold Theodor male 4 1 1 347742 11.1333 NaN S
870 0 3 Balkic, Mr. Cerin male 26 0 0 349248 7.8958 NaN S
871 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 1 11751 52.5542 D35 S
872 0 1 Carlsson, Mr. Frans Olof male 33 0 0 695 5.0000 B51 B53 B55 S
873 0 3 Vander Cruyssen, Mr. Victor male 47 0 0 345765 9.0000 NaN S
874 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 0 P/PP 3381 24.0000 NaN C
875 1 3 Najib, Miss. Adele Kiamie "Jane" female 15 0 0 2667 7.2250 NaN C
876 0 3 Gustafsson, Mr. Alfred Ossian male 20 0 0 7534 9.8458 NaN S
877 0 3 Petroff, Mr. Nedelio male 19 0 0 349212 7.8958 NaN S
878 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 1 11767 83.1583 C50 C
880 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26.0000 NaN S
881 0 3 Markun, Mr. Johann male 33 0 0 349257 7.8958 NaN S
882 0 3 Dahlberg, Miss. Gerda Ulrika female 22 0 0 7552 10.5167 NaN S
883 0 2 Banfield, Mr. Frederick James male 28 0 0 C.A./SOTON 34068 10.5000 NaN S
884 0 3 Sutehall, Mr. Henry Jr male 25 0 0 SOTON/OQ 392076 7.0500 NaN S
885 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.1250 NaN Q
886 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.0000 NaN S
887 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.0000 B42 S
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.0000 C148 C
890 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

6P. You will find another dataset on github that contains additional rows of individuals here. (In reality they are duplicates but let's say we didn't know that). Provide the code to first import thist dataset and then combine it with the origional titantic dataframe into a new one called titantic_.

7P. You will find another dataset with additional variables that may help in the analysis here. Note. This is a little bit more tricky, as there currently isn't a key in the origional dataset. As a result, do the following substeps: (a). Provide code to verify the number of records in the origional and new dataset. (b). Generate a key for the origional dataset in the pattern of the new dataset. (c). Provide code to merge the two datasets titanticcats.