Overview

Here we cover a number of different processes for subsetting, linking, and aggregating data, along with the reasons you might want to use each of them.

Subsetting - Selecting Columns

We have already seen some level of selection in the “slicing” of vectors and matrices by location. Selecting columns is an extremely important way of narrowing a dataset down to just the variables you need.

#Let's start by creating some data.
#This creates a vector with 10 numbers
v <- 1:10
v
##  [1]  1  2  3  4  5  6  7  8  9 10
#This creates a 10 x 4 matrix of random normal values
m<- matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4)
colnames(m)<-(c("a","b","c","d"))

#This creates a dataframe
df<- as.data.frame(m)

#We could also put the vector and the matrix together in one dataframe
df2<- data.frame(v, m)


#Select Specific Columns
df[,1:2] #All rows, columns 1-2
##            a        b
## 1   1.093735 24.11081
## 2  12.326459 19.82742
## 3  14.846989 15.85290
## 4  18.654462 26.46828
## 5  10.511888 15.82481
## 6  19.104696 13.31706
## 7  16.067400 23.31764
## 8  17.601777 24.33212
## 9  28.328676 25.23951
## 10 15.135773 20.77190
df[1:2]  #Without a comma, a dataframe is indexed like a list of columns, giving the same result.
##            a        b
## 1   1.093735 24.11081
## 2  12.326459 19.82742
## 3  14.846989 15.85290
## 4  18.654462 26.46828
## 5  10.511888 15.82481
## 6  19.104696 13.31706
## 7  16.067400 23.31764
## 8  17.601777 24.33212
## 9  28.328676 25.23951
## 10 15.135773 20.77190
df[,c(1,3)] #The selection can also be a vector, created inline here.
##            a        c
## 1   1.093735 30.89777
## 2  12.326459 25.32011
## 3  14.846989 29.24470
## 4  18.654462 16.71059
## 5  10.511888 21.67423
## 6  19.104696 18.76297
## 7  16.067400 26.65222
## 8  17.601777 19.43131
## 9  28.328676 26.76862
## 10 15.135773 25.46423
columns.v <- c(1,3)
df[,columns.v] #Or the selection vector can be stored in a variable first.
##            a        c
## 1   1.093735 30.89777
## 2  12.326459 25.32011
## 3  14.846989 29.24470
## 4  18.654462 16.71059
## 5  10.511888 21.67423
## 6  19.104696 18.76297
## 7  16.067400 26.65222
## 8  17.601777 19.43131
## 9  28.328676 26.76862
## 10 15.135773 25.46423
df[,c("a","c")] #We can also do it by column name.
##            a        c
## 1   1.093735 30.89777
## 2  12.326459 25.32011
## 3  14.846989 29.24470
## 4  18.654462 16.71059
## 5  10.511888 21.67423
## 6  19.104696 18.76297
## 7  16.067400 26.65222
## 8  17.601777 19.43131
## 9  28.328676 26.76862
## 10 15.135773 25.46423
#A logical vector built with %in% also works as a column selector.
columns.b <- names(df) %in% c("a", "b", "c")
df[,columns.b]
##            a        b        c
## 1   1.093735 24.11081 30.89777
## 2  12.326459 19.82742 25.32011
## 3  14.846989 15.85290 29.24470
## 4  18.654462 26.46828 16.71059
## 5  10.511888 15.82481 21.67423
## 6  19.104696 13.31706 18.76297
## 7  16.067400 23.31764 26.65222
## 8  17.601777 24.33212 19.43131
## 9  28.328676 25.23951 26.76862
## 10 15.135773 20.77190 25.46423
#We can also drop specific columns
df[,c(-1,-4)] #drop columns a and d
##           b        c
## 1  24.11081 30.89777
## 2  19.82742 25.32011
## 3  15.85290 29.24470
## 4  26.46828 16.71059
## 5  15.82481 21.67423
## 6  13.31706 18.76297
## 7  23.31764 26.65222
## 8  24.33212 19.43131
## 9  25.23951 26.76862
## 10 20.77190 25.46423
df[!columns.b]  #Negating the logical vector drops a, b, and c, leaving only d.
##           d
## 1  20.29334
## 2  27.35121
## 3  14.82092
## 4  12.65235
## 5  22.05243
## 6  21.25187
## 7  13.64450
## 8  19.69229
## 9  17.36006
## 10 21.20786
#There is also a dedicated subset() function; select takes unquoted column names.
subset(df, select=c(a, b) )
##            a        b
## 1   1.093735 24.11081
## 2  12.326459 19.82742
## 3  14.846989 15.85290
## 4  18.654462 26.46828
## 5  10.511888 15.82481
## 6  19.104696 13.31706
## 7  16.067400 23.31764
## 8  17.601777 24.33212
## 9  28.328676 25.23951
## 10 15.135773 20.77190
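subset() can also drop columns, mirroring the negative-index form shown earlier. A minimal sketch on a toy data frame with the same column names as above:

```r
# Toy data frame with the same column names used above
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# subset() drops columns when select is negated
subset(df, select = -c(a, d))   # keeps b and c

# Equivalent base-R negative numeric indexing
df[, -c(1, 4)]
```

Both forms return the same two-column data frame.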

Selecting Rows

At times we may want to split our sample for analysis, either by taking rows in order or by sampling them at random.

df[1:5,]  #The first 5 rows and all columns.
##           a        b        c        d
## 1  1.093735 24.11081 30.89777 20.29334
## 2 12.326459 19.82742 25.32011 27.35121
## 3 14.846989 15.85290 29.24470 14.82092
## 4 18.654462 26.46828 16.71059 12.65235
## 5 10.511888 15.82481 21.67423 22.05243
df[c(1,3,8),]  #Just a selection of a few rows
##           a        b        c        d
## 1  1.093735 24.11081 30.89777 20.29334
## 3 14.846989 15.85290 29.24470 14.82092
## 8 17.601777 24.33212 19.43131 19.69229
rows<-c(1,3,8)
df[rows,]
##           a        b        c        d
## 1  1.093735 24.11081 30.89777 20.29334
## 3 14.846989 15.85290 29.24470 14.82092
## 8 17.601777 24.33212 19.43131 19.69229
#For a random sample, let's first create a random permutation of the row numbers.
n <- nrow(df)
sample <- sample(n, n, replace=FALSE)  #Note: this shadows the built-in sample() function; a different name would be safer.
sample
##  [1]  6  8  1  7  9  5 10  2  3  4
#notice how the shuffled vector is used to pick out the rows: the first 5 for one half, the last 5 for the other.
df[sample[1:5],]
##           a        b        c        d
## 6 19.104696 13.31706 18.76297 21.25187
## 8 17.601777 24.33212 19.43131 19.69229
## 1  1.093735 24.11081 30.89777 20.29334
## 7 16.067400 23.31764 26.65222 13.64450
## 9 28.328676 25.23951 26.76862 17.36006
df[sample[6:10],]
##           a        b        c        d
## 5  10.51189 15.82481 21.67423 22.05243
## 10 15.13577 20.77190 25.46423 21.20786
## 2  12.32646 19.82742 25.32011 27.35121
## 3  14.84699 15.85290 29.24470 14.82092
## 4  18.65446 26.46828 16.71059 12.65235
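One thing the chunk above skips: without set.seed() the random split changes on every run. A reproducible sketch (the 5/5 split sizes match the example above):

```r
set.seed(42)                  # fixes the random number stream so the split is repeatable
n <- 10
shuffled <- sample(n)         # sample(n) is shorthand for a random permutation of 1..n
train.rows <- shuffled[1:5]
test.rows  <- shuffled[6:10]

# The two halves never overlap and together cover every row
intersect(train.rows, test.rows)    # integer(0)
sort(union(train.rows, test.rows))  # 1 2 3 4 5 6 7 8 9 10
```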
#Remember we can select the a column of a dataframe using df$a
df[ which(df$a>=20.0), ] 
##          a        b        c        d
## 9 28.32868 25.23951 26.76862 17.36006
df[ which(df$a>=20 & df$b>=18.0), ]  # & is the AND operator
##          a        b        c        d
## 9 28.32868 25.23951 26.76862 17.36006
df[ which(df$a>=20 | df$b>=18.0), ]  # | is the OR operator
##            a        b        c        d
## 1   1.093735 24.11081 30.89777 20.29334
## 2  12.326459 19.82742 25.32011 27.35121
## 4  18.654462 26.46828 16.71059 12.65235
## 7  16.067400 23.31764 26.65222 13.64450
## 8  17.601777 24.33212 19.43131 19.69229
## 9  28.328676 25.23951 26.76862 17.36006
## 10 15.135773 20.77190 25.46423 21.20786
#We can also use the subset function to select rows
subset(df, a >= 20 )
##          a        b        c        d
## 9 28.32868 25.23951 26.76862 17.36006
subset(df, a >= 20 & df$b>=18.0 )
##          a        b        c        d
## 9 28.32868 25.23951 26.76862 17.36006
#We can also combine column and row selection as follows
subset(df, a >= 20 & df$b>=18.0, select=c(a, b) )
##          a        b
## 9 28.32868 25.23951
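A detail worth knowing before the exercises: a logical vector can index rows directly, without which(), but the two behave differently when the condition contains NA. A small sketch:

```r
df <- data.frame(a = c(25, NA, 10), b = c(1, 2, 3))

df[which(df$a >= 20), ]  # which() silently drops the NA row: 1 row
df[df$a >= 20, ]         # a plain logical index keeps it as an all-NA row: 2 rows
```

With real data like the titantic set, where age has missing values, this difference matters.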

Exercises

These toy examples are fine, but for this exercise we will use the titantic dataset. We want to split this out in a few different ways.

1R. Subset only the survived, sex, and age columns and create a new dataframe with the results.

2R. Drop the ticket and the cabin and create a new dataframe with the results.

3R. Split the dataset into a train dataframe having the first 80% of the records and a test dataframe having the last 20%.

4R. Randomly split the dataset into 2 samples.

5R. Create one dataframe with all of the males and a second with all of the females.

Getting CSV Files into RStudio

If you are having trouble getting files into RStudio, here are some options.

OPTION 1

If sharing is working between the guest virtual machine and the host, the easiest way is to sync your Git repository and then load the data from the shared drive. Look at the repository directory.

setwd('/vagrant/data')
list.files()
## [1] "titantic_morecolumns.csv" "titantic_morerows.csv"
## [3] "titantic_train.csv"
titantic <- read.csv('titantic_train.csv', header = TRUE )

OPTION 2

You can import a dataframe from GitHub directly using the GUI with Import Dataset -> From Web URL:
https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv

Note: during the import process you can specify the dataframe name in the top left. By default it will come through as titantic_train. You can create a new dataframe called titantic with the code titantic <- titantic_train.

OPTION 3

R doesn’t seem to have a very robust method of dealing with files over https, which I have seen cause some problems. Instead, you can vagrant ssh into the virtual machine, which places you in the /home/vagrant directory by default. Then enter wget https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv from the terminal. This will download the file directly to the Linux virtual machine.

#You only need to set the working directory if you changed it to something else; /home/vagrant is the default.
setwd('/home/vagrant')
list.files()
##  [1] "1_twitter.ipynb"                   
##  [2] "_Appendix B - OAuth Primer.ipynb"  
##  [3] "BeautifulSoup.ipynb"               
##  [4] "Chapter 0 - Preface.ipynb"         
##  [5] "Chapter 1 - Mining Twitter.ipynb"  
##  [6] "Chapter 4 - Mining Google+.ipynb"  
##  [7] "Chapter 9 - Twitter Cookbook.ipynb"
##  [8] "Class 3 More Python Basics. .ipynb"
##  [9] "data"                              
## [10] "downjason.ipynb.json"              
## [11] "index.html"                        
## [12] "index.html.1"                      
## [13] "install.sh"                        
## [14] "Lab2.ipynb"                        
## [15] "lab2solution.ipynb"                
## [16] "Lab2-webmining.ipynb"              
## [17] "Lab 3 - Twitter-Copy1.ipynb"       
## [18] "Lab 3 - Twitter.ipynb"             
## [19] "Lab3_Twitter_solution.ipynb"       
## [20] "lab4.html"                         
## [21] "Lab4.ipynb"                        
## [22] "lab4.Rmd"                          
## [23] "nestedforloop.R"                   
## [24] "R"                                 
## [25] "spark_mooc_version"                
## [26] "spark_notebook.py"                 
## [27] "titantic_train.csv"                
## [28] "titantic_train.csv.1"              
## [29] "Untitled1.ipynb"                   
## [30] "Untitled2.ipynb"                   
## [31] "Untitled3.ipynb"                   
## [32] "Untitled.ipynb"
titantic <- read.csv('titantic_train.csv', header = TRUE )

Merging data

Merging datasets from different sources is a very common task for a data scientist. You might need to stack new rows onto an existing dataset, or join new columns using a shared key.

key <- 1:10
 
#This creates 10 x 4 matrices of random normal values.
#Here we are passing the row names and column names as a list. 
m<- matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("a","b","c","d")))
m2<- matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h")))
m3<- matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h")))

#These create dataframes where the same key is shared across both
df<-  data.frame(key,m)
df2<- data.frame(key,m2)
key=(11:20)
df3<- data.frame(key,m3)


#A slightly alternate syntax, using column bind (cbind) first.
df4<-data.frame(cbind(key,m3))

#First let's merge rows by stacking df2 and df3 with rbind.  The key requirement is that they have the same columns.
rbind(df2,df3)
##     key         e         f         g         h
## 1     1  998.5404  995.5929  999.6785  996.9909
## 2     2  999.8719  999.9656 1002.2861 1002.3753
## 3     3  993.9311 1003.7309 1004.2870 1003.8519
## 4     4 1000.8596  992.3083  994.3755 1000.5741
## 5     5  997.6221  996.0246  996.4049 1000.3972
## 6     6 1006.3881  998.4654 1003.0841 1001.7868
## 7     7 1005.4714 1005.0491 1002.3106  996.8054
## 8     8  999.1908 1009.5933  992.5427  993.7408
## 9     9 1007.5241 1004.4034 1003.6821 1010.8393
## 10   10  993.4326 1001.4877 1003.6487  996.8948
## 11   11 1000.5011 1004.2833 1010.4017  998.0157
## 21   12 1001.8712  999.8168 1003.5189  992.4346
## 31   13 1006.2561  999.2939  992.9610  999.3590
## 41   14 1003.7306 1004.7246  995.9031 1000.6784
## 51   15 1003.3538  997.9591 1005.6200 1001.8227
## 61   16 1003.4176 1004.0613  991.4655  999.8996
## 71   17 1007.3612  999.7960 1012.6482  997.6021
## 81   18  998.5478  996.3340  989.7876  998.9176
## 91   19  993.1003 1004.0298 1007.9833  989.9835
## 101  20  995.3712 1005.8793 1005.0117 1006.3844
#let's say we have a dataframe that doesn't have the "e" column
df5<-df3[,-2] #Drop the e column

If we then enter the rbind(df2,df5) command we will get the error: Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match

So instead we have to re-add the column, filled with NA.

df5$e=NA
df5
##    key         f         g         h  e
## 1   11 1004.2833 1010.4017  998.0157 NA
## 2   12  999.8168 1003.5189  992.4346 NA
## 3   13  999.2939  992.9610  999.3590 NA
## 4   14 1004.7246  995.9031 1000.6784 NA
## 5   15  997.9591 1005.6200 1001.8227 NA
## 6   16 1004.0613  991.4655  999.8996 NA
## 7   17  999.7960 1012.6482  997.6021 NA
## 8   18  996.3340  989.7876  998.9176 NA
## 9   19 1004.0298 1007.9833  989.9835 NA
## 10  20 1005.8793 1005.0117 1006.3844 NA
#Notice that it doesn't matter that columns are not in the same order.
rbind(df2,df5)
##     key         e         f         g         h
## 1     1  998.5404  995.5929  999.6785  996.9909
## 2     2  999.8719  999.9656 1002.2861 1002.3753
## 3     3  993.9311 1003.7309 1004.2870 1003.8519
## 4     4 1000.8596  992.3083  994.3755 1000.5741
## 5     5  997.6221  996.0246  996.4049 1000.3972
## 6     6 1006.3881  998.4654 1003.0841 1001.7868
## 7     7 1005.4714 1005.0491 1002.3106  996.8054
## 8     8  999.1908 1009.5933  992.5427  993.7408
## 9     9 1007.5241 1004.4034 1003.6821 1010.8393
## 10   10  993.4326 1001.4877 1003.6487  996.8948
## 11   11        NA 1004.2833 1010.4017  998.0157
## 21   12        NA  999.8168 1003.5189  992.4346
## 31   13        NA  999.2939  992.9610  999.3590
## 41   14        NA 1004.7246  995.9031 1000.6784
## 51   15        NA  997.9591 1005.6200 1001.8227
## 61   16        NA 1004.0613  991.4655  999.8996
## 71   17        NA  999.7960 1012.6482  997.6021
## 81   18        NA  996.3340  989.7876  998.9176
## 91   19        NA 1004.0298 1007.9833  989.9835
## 101  20        NA 1005.8793 1005.0117 1006.3844
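The manual NA-padding above can be wrapped in a small helper. rbind_pad is a hypothetical name, not a base-R function, and it assumes the first frame holds the superset of columns:

```r
# Hypothetical helper: pad y with NA for any columns it lacks, then stack
rbind_pad <- function(x, y) {
  missing.cols <- setdiff(names(x), names(y))
  y[missing.cols] <- NA
  rbind(x, y[names(x)])  # reorder y's columns to match x before stacking
}

top    <- data.frame(key = 1:2, e = c(1, 2), f = c(3, 4))
bottom <- data.frame(key = 3:4, f = c(5, 6))  # no "e" column
rbind_pad(top, bottom)
```

The result has four rows, with NA in the e column for the rows that came from bottom.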

That merged some rows with the same structure. But let’s say we have to merge columns. For that we need a common key.

df6 <- merge(df,df2,by="key")
df6
##    key        a        b        c        d         e         f         g
## 1    1 23.22678 18.10999 13.00721 15.34173  998.5404  995.5929  999.6785
## 2    2 17.20738 16.08315 14.93269 16.33102  999.8719  999.9656 1002.2861
## 3    3 12.11747 17.19723 24.50265 16.99374  993.9311 1003.7309 1004.2870
## 4    4 21.07022 27.85910 19.68360 17.63353 1000.8596  992.3083  994.3755
## 5    5 20.58616 19.61494 16.60385 24.35142  997.6221  996.0246  996.4049
## 6    6 22.85258 24.49959 22.92062 12.03308 1006.3881  998.4654 1003.0841
## 7    7 21.77243 25.71287 14.68309 29.89312 1005.4714 1005.0491 1002.3106
## 8    8 22.27872 24.31091 20.87042 13.28731  999.1908 1009.5933  992.5427
## 9    9 15.59008 19.54362 22.09654 10.67335 1007.5241 1004.4034 1003.6821
## 10  10 16.86036 21.89450 20.94156 26.21374  993.4326 1001.4877 1003.6487
##            h
## 1   996.9909
## 2  1002.3753
## 3  1003.8519
## 4  1000.5741
## 5  1000.3972
## 6  1001.7868
## 7   996.8054
## 8   993.7408
## 9  1010.8393
## 10  996.8948
#If we try to merge by a key where there is no match, it will result in 0 rows.
df7 <- merge(df,df3,by="key")
df7
## [1] key a   b   c   d   e   f   g   h  
## <0 rows> (or 0-length row.names)

But we can override this. Let’s say we want to include data from one or more dataframe even where there is no match. This type of situation is called an outer join when working with SQL.

df8 <- merge(df,df3,by="key",all.x=TRUE)   #x is df; NA is filled in where df3 has no match (like a left outer join)
df9 <- merge(df,df3,by="key", all.y=TRUE)  #y is df3 (like a right outer join)
df10 <- merge(df,df3,by="key",all.x=TRUE, all.y=TRUE)  #keeps all keys from both (like a full outer join)
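On small frames with partially overlapping keys, the row counts make the join types easy to see. Note that all = TRUE is base R's shorthand for setting both all.x and all.y:

```r
left  <- data.frame(key = 1:3, a = c(10, 20, 30))
right <- data.frame(key = 2:4, b = c(200, 300, 400))

nrow(merge(left, right, by = "key"))               # 2 (inner join: keys 2,3)
nrow(merge(left, right, by = "key", all.x = TRUE)) # 3 (left outer: keys 1,2,3)
nrow(merge(left, right, by = "key", all.y = TRUE)) # 3 (right outer: keys 2,3,4)
nrow(merge(left, right, by = "key", all = TRUE))   # 4 (full outer: keys 1-4)
```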

6R. You will find another dataset on github that contains additional rows of individuals here. (In reality they are duplicates, but let’s say we didn’t know that.) Provide the code to first import this dataset and then combine it with the original titantic dataframe into a new one called titantic_.

7R. You will find another dataset with additional variables that may help in the analysis here. Note: this is a little bit more tricky, as there currently isn’t a key in the original dataset. As a result, do the following substeps: (a). Provide code to verify the number of records in the original and new datasets. (b). Generate a key for the original dataset in the pattern of the new dataset. (c). Provide code to merge the two datasets into titantic_cats_.

Aggregate Data

Finally, we are going to aggregate data.

#This looks across all men and women on the titantic and averages who survived.
#aggregate(titantic$survived, by=list(titantic$sex), FUN=mean, na.rm=FALSE)

#You can further recode a variable based on whether a condition is met, here whether age indicates an adult.
#titantic$child <- ifelse(titantic$age > 18, c("adult"), c("child")) 
#aggregate(titantic$survived, by=list(titantic$sex, titantic$child), FUN=mean, na.rm=FALSE)

#This is a different recoding scheme where we change the value for the selected rows where the condition is met
#titantic$child[titantic$age > 18] <- "adult"
#titantic$child[(titantic$age > 2 & titantic$age <= 18)] <- "child"
#titantic$child[titantic$age <= 2] <- "infant"
#aggregate(titantic$survived, by=list(titantic$sex, titantic$child), FUN=mean, na.rm=FALSE)
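Since the titantic calls above are commented out, here is the same aggregate() pattern run on a toy stand-in (the values are made up for illustration):

```r
# Toy stand-in for the titantic data: survived is coded 0/1
toy <- data.frame(
  sex      = c("male", "male", "female", "female", "female"),
  survived = c(0, 1, 1, 1, 0)
)

# Mean survival rate by sex; by= takes a list of grouping vectors,
# and naming the list element gives the output column a readable name
aggregate(toy$survived, by = list(sex = toy$sex), FUN = mean)
```

The result has one row per group, with the aggregated value in a column named x.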