We will provide here a number of different processes for subsetting, linking, and aggregating data, along with the reasons why you might want to use each of them.
We have already seen some level of selection in the “slicing” of vectors and matrices by location. This is an extremely important way of working with data, and we will build on it here.
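As a quick refresher, here is a minimal sketch of slicing by location (the values are purely illustrative):
#A minimal refresher sketch of slicing by location
x <- c(10, 20, 30, 40)
x[2:3]                    #elements 2 through 3: 20 30
mm <- matrix(1:6, nrow=2)
mm[1, ]                   #the first row of the matrix: 1 3 5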
#Let's start by creating some data.
#This creates a vector with 10 numbers
v=(1:10)
v
## [1] 1 2 3 4 5 6 7 8 9 10
#This creates a 10x4 matrix of random normal values
m<- matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4)
colnames(m)<-(c("a","b","c","d"))
#This creates a dataframe
df<- as.data.frame(m)
#We could also put the vector and the matrix together in one dataframe
df2<- data.frame(v, m)
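Before subsetting, it can help to confirm what we built; head() and str() are standard ways to inspect a dataframe:
#A quick inspection of what we created
head(df2)   #first few rows of the combined dataframe
str(df2)    #structure: the integer column v plus four numeric columns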
#Select Specific Columns
df[,1:2] #All rows, columns 1-2
## a b
## 1 1.093735 24.11081
## 2 12.326459 19.82742
## 3 14.846989 15.85290
## 4 18.654462 26.46828
## 5 10.511888 15.82481
## 6 19.104696 13.31706
## 7 16.067400 23.31764
## 8 17.601777 24.33212
## 9 28.328676 25.23951
## 10 15.135773 20.77190
df[1:2] #For a dataframe, single-bracket indexing without a comma also selects columns 1-2
## a b
## 1 1.093735 24.11081
## 2 12.326459 19.82742
## 3 14.846989 15.85290
## 4 18.654462 26.46828
## 5 10.511888 15.82481
## 6 19.104696 13.31706
## 7 16.067400 23.31764
## 8 17.601777 24.33212
## 9 28.328676 25.23951
## 10 15.135773 20.77190
df[,c(1,3)] #We can also pass the selection as a vector, created inline here.
## a c
## 1 1.093735 30.89777
## 2 12.326459 25.32011
## 3 14.846989 29.24470
## 4 18.654462 16.71059
## 5 10.511888 21.67423
## 6 19.104696 18.76297
## 7 16.067400 26.65222
## 8 17.601777 19.43131
## 9 28.328676 26.76862
## 10 15.135773 25.46423
columns.v<-c(1,3)
df[,columns.v] #Or store the selection vector first and index with it by name.
## a c
## 1 1.093735 30.89777
## 2 12.326459 25.32011
## 3 14.846989 29.24470
## 4 18.654462 16.71059
## 5 10.511888 21.67423
## 6 19.104696 18.76297
## 7 16.067400 26.65222
## 8 17.601777 19.43131
## 9 28.328676 26.76862
## 10 15.135773 25.46423
df[,c("a","c")] #We can also do it by column name.
## a c
## 1 1.093735 30.89777
## 2 12.326459 25.32011
## 3 14.846989 29.24470
## 4 18.654462 16.71059
## 5 10.511888 21.67423
## 6 19.104696 18.76297
## 7 16.067400 26.65222
## 8 17.601777 19.43131
## 9 28.328676 26.76862
## 10 15.135773 25.46423
columns.b <- names(df) %in% c("a", "b", "c") #%in% returns a logical vector marking which column names match
df[,columns.b]
## a b c
## 1 1.093735 24.11081 30.89777
## 2 12.326459 19.82742 25.32011
## 3 14.846989 15.85290 29.24470
## 4 18.654462 26.46828 16.71059
## 5 10.511888 15.82481 21.67423
## 6 19.104696 13.31706 18.76297
## 7 16.067400 23.31764 26.65222
## 8 17.601777 24.33212 19.43131
## 9 28.328676 25.23951 26.76862
## 10 15.135773 20.77190 25.46423
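To see what the %in% expression produced, you can print the logical vector itself; a minimal sketch:
#The logical mask that drove the selection above
names(df)                      #"a" "b" "c" "d"
names(df) %in% c("a","b","c")  #TRUE TRUE TRUE FALSE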
#We can also drop specific columns
df[,c(-1,-4)] #drop columns a and d
## b c
## 1 24.11081 30.89777
## 2 19.82742 25.32011
## 3 15.85290 29.24470
## 4 26.46828 16.71059
## 5 15.82481 21.67423
## 6 13.31706 18.76297
## 7 23.31764 26.65222
## 8 24.33212 19.43131
## 9 25.23951 26.76862
## 10 20.77190 25.46423
df[!columns.b] #negating the logical vector drops a, b, and c, keeping only d
## d
## 1 20.29334
## 2 27.35121
## 3 14.82092
## 4 12.65235
## 5 22.05243
## 6 21.25187
## 7 13.64450
## 8 19.69229
## 9 17.36006
## 10 21.20786
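Dropping by position is fragile if the column order changes; here is a sketch of dropping by name instead, using the same %in% idea negated:
#A sketch of dropping columns by name rather than by position
keep <- !(names(df) %in% c("a", "d"))
df[, keep]   #same result as df[,c(-1,-4)]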
#There is also a specific subset function
subset(df, select=c(a, b) )
## a b
## 1 1.093735 24.11081
## 2 12.326459 19.82742
## 3 14.846989 15.85290
## 4 18.654462 26.46828
## 5 10.511888 15.82481
## 6 19.104696 13.31706
## 7 16.067400 23.31764
## 8 17.601777 24.33212
## 9 28.328676 25.23951
## 10 15.135773 20.77190
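subset() can also drop columns by giving select a negated vector of names; a minimal sketch:
#Dropping columns with subset() instead of selecting them
subset(df, select=-c(c, d))   #keep everything except c and d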
At times we may want to split our sample for analysis. We may want to split it in order, or we may want the split to be random.
df[1:5,] #The first 5 rows and all columns.
## a b c d
## 1 1.093735 24.11081 30.89777 20.29334
## 2 12.326459 19.82742 25.32011 27.35121
## 3 14.846989 15.85290 29.24470 14.82092
## 4 18.654462 26.46828 16.71059 12.65235
## 5 10.511888 15.82481 21.67423 22.05243
df[c(1,3,8),] #Just a selection of a few rows
## a b c d
## 1 1.093735 24.11081 30.89777 20.29334
## 3 14.846989 15.85290 29.24470 14.82092
## 8 17.601777 24.33212 19.43131 19.69229
rows<-c(1,3,8)
df[rows,]
## a b c d
## 1 1.093735 24.11081 30.89777 20.29334
## 3 14.846989 15.85290 29.24470 14.82092
## 8 17.601777 24.33212 19.43131 19.69229
#For a random sample, let's first create a random permutation of the row numbers.
n<-nrow(df)
sample <- sample(n, n, replace=FALSE) #note: this name shadows the built-in sample() function
sample
## [1] 6 8 1 7 9 5 10 2 3 4
#notice how the permuted vector is used to select out the rows.
df[sample[1:5],]
## a b c d
## 6 19.104696 13.31706 18.76297 21.25187
## 8 17.601777 24.33212 19.43131 19.69229
## 1 1.093735 24.11081 30.89777 20.29334
## 7 16.067400 23.31764 26.65222 13.64450
## 9 28.328676 25.23951 26.76862 17.36006
df[sample[6:10],]
## a b c d
## 5 10.51189 15.82481 21.67423 22.05243
## 10 15.13577 20.77190 25.46423 21.20786
## 2 12.32646 19.82742 25.32011 27.35121
## 3 14.84699 15.85290 29.24470 14.82092
## 4 18.65446 26.46828 16.71059 12.65235
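The same shuffled index generalizes to uneven splits; here is a minimal sketch of a random 80/20 split (the 0.8 proportion is an assumption for illustration):
#A minimal sketch of a random 80/20 split using a permuted index
n <- nrow(df)
idx <- sample(n, n, replace=FALSE)  #random permutation of the row numbers
cut <- floor(0.8 * n)               #size of the first sample
first  <- df[idx[1:cut], ]          #80% of the rows
second <- df[idx[(cut+1):n], ]      #the remaining 20%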
#Remember we can select the a variable of a dataframe using df$a
df[ which(df$a>=20.0), ]
## a b c d
## 9 28.32868 25.23951 26.76862 17.36006
df[ which(df$a>=20 & df$b>=18.0), ] # & is the and operator
## a b c d
## 9 28.32868 25.23951 26.76862 17.36006
df[ which(df$a>=20 | df$b>=18.0), ] # | is the or operator
## a b c d
## 1 1.093735 24.11081 30.89777 20.29334
## 2 12.326459 19.82742 25.32011 27.35121
## 4 18.654462 26.46828 16.71059 12.65235
## 7 16.067400 23.31764 26.65222 13.64450
## 8 17.601777 24.33212 19.43131 19.69229
## 9 28.328676 25.23951 26.76862 17.36006
## 10 15.135773 20.77190 25.46423 21.20786
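A side note: the logical condition can index the rows directly, without which(), but the two forms differ when the column contains NA; a minimal sketch:
#The logical vector can be used directly, but the forms treat NA differently
df[df$a >= 20, ]         #an NA in a would produce an all-NA row in the result
df[which(df$a >= 20), ]  #which() silently drops rows where the condition is NA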
#We can also use the subset function to select rows
subset(df, a >= 20 )
## a b c d
## 9 28.32868 25.23951 26.76862 17.36006
subset(df, a >= 20 & b >= 18.0 ) #inside subset we can refer to columns directly by name
## a b c d
## 9 28.32868 25.23951 26.76862 17.36006
#We can also combine column and row selection as follows
subset(df, a >= 20 & b >= 18.0, select=c(a, b) )
## a b
## 9 28.32868 25.23951
These toy examples are fine, but for this exercise we will use the Titanic dataset. We want to split this out in a few different ways.
1R. Subset only the survived, sex, and age columns and create a new dataframe with the results.
2R. Drop the ticket and the cabin and create a new dataframe with the results.
3R. Split the dataset into a train dataframe having the first 80% of the records and a test dataframe having the last 20%.
4R. Randomly split the dataset into 2 samples.
5R. Create one dataframe with all of the males and a second with all of the females.
If you are having trouble getting the file into RStudio, here are some options.
If sharing is working between the guest virtual machine and the host, the easiest way is to sync your Git repository and then load the data from the shared drive. Look at the repository directory.
setwd('/vagrant/data')
list.files()
## [1] "titantic_morecolumns.csv" "titantic_morerows.csv"
## [3] "titantic_train.csv"
titantic <- read.csv('titantic_train.csv', header = TRUE )
You can also import a dataframe from GitHub directly using the GUI with Import Dataset -> From Web URL, pointing it at
https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv
Note: during the import process, on the top left you can specify the dataframe name. By default it will come through as titantic_train. You can create a new one called titantic using the code below.
titantic<-titantic_train
R doesn’t seem to have a robust method of dealing with files over https, which I have seen cause some problems. Instead, you can vagrant ssh
into the virtual machine; /home/vagrant is the default directory. Then enter
wget https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv
from the terminal. This will download the file directly to the Linux virtual machine.
#You only need to set the working directory if you changed it to something else; /home/vagrant is the default.
setwd('/home/vagrant')
list.files()
## [1] "1_twitter.ipynb"
## [2] "_Appendix B - OAuth Primer.ipynb"
## [3] "BeautifulSoup.ipynb"
## [4] "Chapter 0 - Preface.ipynb"
## [5] "Chapter 1 - Mining Twitter.ipynb"
## [6] "Chapter 4 - Mining Google+.ipynb"
## [7] "Chapter 9 - Twitter Cookbook.ipynb"
## [8] "Class 3 More Python Basics. .ipynb"
## [9] "data"
## [10] "downjason.ipynb.json"
## [11] "index.html"
## [12] "index.html.1"
## [13] "install.sh"
## [14] "Lab2.ipynb"
## [15] "lab2solution.ipynb"
## [16] "Lab2-webmining.ipynb"
## [17] "Lab 3 - Twitter-Copy1.ipynb"
## [18] "Lab 3 - Twitter.ipynb"
## [19] "Lab3_Twitter_solution.ipynb"
## [20] "lab4.html"
## [21] "Lab4.ipynb"
## [22] "lab4.Rmd"
## [23] "nestedforloop.R"
## [24] "R"
## [25] "spark_mooc_version"
## [26] "spark_notebook.py"
## [27] "titantic_train.csv"
## [28] "titantic_train.csv.1"
## [29] "Untitled1.ipynb"
## [30] "Untitled2.ipynb"
## [31] "Untitled3.ipynb"
## [32] "Untitled.ipynb"
titantic <- read.csv('titantic_train.csv', header = TRUE )
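After the import, a quick sanity check confirms the dataframe loaded as expected; a minimal sketch (the exact columns depend on the file):
#A quick sanity check on the imported dataframe
dim(titantic)    #number of rows and columns
head(titantic)   #first few rows
str(titantic)    #column names and types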
Merging datasets from different sources is a very common task for a data scientist. You might need to stack rows from multiple files, or link columns from different tables by a shared key.
key=(1:10)
#This creates a 10x4 matrix
#Here we are passing the row names and column names as a list.
m<- matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("a","b","c","d")))
m2<- matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h")))
m3<- matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h")))
#This creates a dataframe where the same key is across both
df<- data.frame(key,m)
df2<- data.frame(key,m2)
key=(11:20)
df3<- data.frame(key,m3)
#slightly alternate syntax, using column bind first.
df4<-data.frame(cbind(key,m3))
#First let's merge rows for the same columns by combining df2 and df3 with rbind. The key point is that they must have the same columns.
rbind(df2,df3)
## key e f g h
## 1 1 998.5404 995.5929 999.6785 996.9909
## 2 2 999.8719 999.9656 1002.2861 1002.3753
## 3 3 993.9311 1003.7309 1004.2870 1003.8519
## 4 4 1000.8596 992.3083 994.3755 1000.5741
## 5 5 997.6221 996.0246 996.4049 1000.3972
## 6 6 1006.3881 998.4654 1003.0841 1001.7868
## 7 7 1005.4714 1005.0491 1002.3106 996.8054
## 8 8 999.1908 1009.5933 992.5427 993.7408
## 9 9 1007.5241 1004.4034 1003.6821 1010.8393
## 10 10 993.4326 1001.4877 1003.6487 996.8948
## 11 11 1000.5011 1004.2833 1010.4017 998.0157
## 21 12 1001.8712 999.8168 1003.5189 992.4346
## 31 13 1006.2561 999.2939 992.9610 999.3590
## 41 14 1003.7306 1004.7246 995.9031 1000.6784
## 51 15 1003.3538 997.9591 1005.6200 1001.8227
## 61 16 1003.4176 1004.0613 991.4655 999.8996
## 71 17 1007.3612 999.7960 1012.6482 997.6021
## 81 18 998.5478 996.3340 989.7876 998.9176
## 91 19 993.1003 1004.0298 1007.9833 989.9835
## 101 20 995.3712 1005.8793 1005.0117 1006.3844
#let's say we have a dataframe that doesn't have the "e" column
df5<-df3[,-2] #Drop the e column
If we then enter the rbind(df2,df5)
command we will get the error: Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
So instead we have to add the column back, filled with NA.
df5$e=NA
df5
## key f g h e
## 1 11 1004.2833 1010.4017 998.0157 NA
## 2 12 999.8168 1003.5189 992.4346 NA
## 3 13 999.2939 992.9610 999.3590 NA
## 4 14 1004.7246 995.9031 1000.6784 NA
## 5 15 997.9591 1005.6200 1001.8227 NA
## 6 16 1004.0613 991.4655 999.8996 NA
## 7 17 999.7960 1012.6482 997.6021 NA
## 8 18 996.3340 989.7876 998.9176 NA
## 9 19 1004.0298 1007.9833 989.9835 NA
## 10 20 1005.8793 1005.0117 1006.3844 NA
#Notice that it doesn't matter that columns are not in the same order.
rbind(df2,df5)
## key e f g h
## 1 1 998.5404 995.5929 999.6785 996.9909
## 2 2 999.8719 999.9656 1002.2861 1002.3753
## 3 3 993.9311 1003.7309 1004.2870 1003.8519
## 4 4 1000.8596 992.3083 994.3755 1000.5741
## 5 5 997.6221 996.0246 996.4049 1000.3972
## 6 6 1006.3881 998.4654 1003.0841 1001.7868
## 7 7 1005.4714 1005.0491 1002.3106 996.8054
## 8 8 999.1908 1009.5933 992.5427 993.7408
## 9 9 1007.5241 1004.4034 1003.6821 1010.8393
## 10 10 993.4326 1001.4877 1003.6487 996.8948
## 11 11 NA 1004.2833 1010.4017 998.0157
## 21 12 NA 999.8168 1003.5189 992.4346
## 31 13 NA 999.2939 992.9610 999.3590
## 41 14 NA 1004.7246 995.9031 1000.6784
## 51 15 NA 997.9591 1005.6200 1001.8227
## 61 16 NA 1004.0613 991.4655 999.8996
## 71 17 NA 999.7960 1012.6482 997.6021
## 81 18 NA 996.3340 989.7876 998.9176
## 91 19 NA 1004.0298 1007.9833 989.9835
## 101 20 NA 1005.8793 1005.0117 1006.3844
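As an aside, if you have the plyr package installed, rbind.fill() pads the missing columns with NA for you; a minimal sketch, assuming plyr is available:
#An alternative sketch: plyr::rbind.fill() adds the missing columns as NA automatically
library(plyr)              #assumes the plyr package is installed
rbind.fill(df2, df3[,-2])  #df3 without the e column; e comes back as NA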
That merged some rows with the same structure. But let’s say we have to merge columns. For that we need a common key.
df6 <- merge(df,df2,by="key")
df6
## key a b c d e f g
## 1 1 23.22678 18.10999 13.00721 15.34173 998.5404 995.5929 999.6785
## 2 2 17.20738 16.08315 14.93269 16.33102 999.8719 999.9656 1002.2861
## 3 3 12.11747 17.19723 24.50265 16.99374 993.9311 1003.7309 1004.2870
## 4 4 21.07022 27.85910 19.68360 17.63353 1000.8596 992.3083 994.3755
## 5 5 20.58616 19.61494 16.60385 24.35142 997.6221 996.0246 996.4049
## 6 6 22.85258 24.49959 22.92062 12.03308 1006.3881 998.4654 1003.0841
## 7 7 21.77243 25.71287 14.68309 29.89312 1005.4714 1005.0491 1002.3106
## 8 8 22.27872 24.31091 20.87042 13.28731 999.1908 1009.5933 992.5427
## 9 9 15.59008 19.54362 22.09654 10.67335 1007.5241 1004.4034 1003.6821
## 10 10 16.86036 21.89450 20.94156 26.21374 993.4326 1001.4877 1003.6487
## h
## 1 996.9909
## 2 1002.3753
## 3 1003.8519
## 4 1000.5741
## 5 1000.3972
## 6 1001.7868
## 7 996.8054
## 8 993.7408
## 9 1010.8393
## 10 996.8948
#If we try to merge by a key where there is no match, it will result in 0 rows.
df7 <- merge(df,df3,by="key")
df7
## [1] key a b c d e f g h
## <0 rows> (or 0-length row.names)
But we can override this. Let’s say we want to include data from one or both dataframes even where there is no match. This type of situation is called an outer join when working with SQL.
df8 <- merge(df,df3,by="key",all.x=TRUE) #x is df; unmatched rows of df are kept, filled with NA (like a left outer join)
df9 <- merge(df,df3,by="key", all.y=TRUE) #y is df3; unmatched rows of df3 are kept (like a right outer join)
df10 <- merge(df,df3,by="key",all.x=TRUE, all.y=TRUE) #keep unmatched rows from both (like a full outer join)
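Since df uses keys 1-10 and df3 uses keys 11-20, there is no overlap, and checking the row counts makes the join behavior concrete:
#A quick check of the resulting join sizes
nrow(df8)    #10: all rows of df, with e-h filled as NA
nrow(df9)    #10: all rows of df3, with a-d filled as NA
nrow(df10)   #20: the union of both sets of keys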
6R. You will find another dataset on github that contains additional rows of individuals here. (In reality they are duplicates, but let’s say we didn’t know that.) Provide the code to first import this dataset and then combine it with the original titantic dataframe into a new one called titantic_
7R. You will find another dataset with additional variables that may help in the analysis here. Note: this is a little more tricky, as there currently isn’t a key in the original dataset. As a result, do the following substeps: (a) Provide code to verify the number of records in the original and new datasets. (b) Generate a key for the original dataset in the pattern of the new dataset. (c) Provide code to merge the two datasets titantic_cats_
Finally, we are going to aggregate data.
#This looks across all men and women on the titantic to compute the average survival rate.
#aggregate(titantic$survived, by=list(titantic$sex), FUN=mean, na.rm=FALSE)
#You can further recode a variable based on whether a condition is met
#titantic$child <- ifelse(titantic$age > 18, c("adult"), c("child"))
#aggregate(titantic$survived, by=list(titantic$sex, titantic$child), FUN=mean, na.rm=FALSE)
#This is a different recoding scheme where we change the value for the selected rows where the condition is met
#titantic$child[titantic$age > 18] <- "adult"
#titantic$child[(titantic$age > 2 & titantic$age <= 18)] <- "child"
#titantic$child[titantic$age <= 2] <- "infant"
#aggregate(titantic$survived, by=list(titantic$sex, titantic$child), FUN=mean, na.rm=FALSE)
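Since the titantic code above is commented out, here is a minimal runnable sketch of the same recode-then-aggregate pattern on the toy dataframe (the "high"/"low" grouping is an assumption for illustration):
#A runnable sketch of ifelse() recoding followed by aggregate() on the toy dataframe
df$group <- ifelse(df$a >= 20, "high", "low")             #recode a into two groups
aggregate(df$b, by=list(df$group), FUN=mean, na.rm=TRUE)  #mean of b within each group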