Review

Getting External Data Into RStudio Server

Getting CSV Files into RStudio

If you are having trouble getting a file into RStudio, here are some options.

OPTION 1

If sharing is working between the guest virtual machine and the host, the easiest way is to sync your Git repository and then load the data from the shared drive. Start by listing the files in the repository's data directory.

setwd('/vagrant/data')
list.files()

## [1] "df_none.csv"              "~$df_NONE.xlsx"          
## [3] "df_NONE.xlsx"             "test.csv"                
## [5] "titantic_morecolumns.csv" "titantic_morerows.csv"   
## [7] "titantic_train.csv"       "train.csv"               
## [9] "train.xlsx"

titanic <- read.csv('titantic_train.csv', header = TRUE )

OPTION 2

You can import a dataframe from GitHub directly using the GUI with Import Dataset -> From Web URL and this address:
https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv
Note: during the import process you can specify the dataframe name at the top left. By default it will come through as titantic_train. You can create a copy called titanic using the code below.

titanic<-titantic_train

OPTION 3

R doesn’t seem to have a very robust method of dealing with files over HTTPS, which I have seen cause some problems. Instead, you can use vagrant ssh to log in to the virtual machine. You will land in /home/vagrant, which is the default directory. Then enter the following from the terminal:

wget https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv

This will download the file directly to the Linux virtual machine.

##You only need to set the working directory if you changed it to something else. /home/vagrant is the default.
setwd('/home/vagrant') 
list.files()

##  [1] "1_twitter.ipynb"                   
##  [2] "_Appendix B - OAuth Primer.ipynb"  
##  [3] "BeautifulSoup.ipynb"               
##  [4] "Chapter 0 - Preface.ipynb"         
##  [5] "Chapter 1 - Mining Twitter.ipynb"  
##  [6] "Chapter 4 - Mining Google+.ipynb"  
##  [7] "Chapter 9 - Twitter Cookbook.ipynb"
##  [8] "Class 3 More Python Basics. .ipynb"
##  [9] "data"                              
## [10] "downjason.ipynb.json"              
## [11] "example.html"                      
## [12] "example.Rmd"                       
## [13] "index.html"                        
## [14] "index.html.1"                      
## [15] "install.sh"                        
## [16] "Lab2.ipynb"                        
## [17] "lab2solution.ipynb"                
## [18] "Lab2-webmining.ipynb"              
## [19] "Lab 3 - Twitter-Copy1.ipynb"       
## [20] "Lab 3 - Twitter.ipynb"             
## [21] "Lab3_Twitter_solution.ipynb"       
## [22] "lab4.html"                         
## [23] "Lab4.ipynb"                        
## [24] "lab4.Rmd"                          
## [25] "Lab4-Solution.ipynb"               
## [26] "Lab6.Rmd"                          
## [27] "model-figure"                      
## [28] "model.md"                          
## [29] "model.Rpres"                       
## [30] "nestedforloop.R"                   
## [31] "R"                                 
## [32] "spark_mooc_version"                
## [33] "spark_notebook.py"                 
## [34] "Titanic.ipynb"                     
## [35] "titantic_train.csv"                
## [36] "titantic_train.csv.1"              
## [37] "Untitled1.ipynb"                   
## [38] "Untitled2.ipynb"                   
## [39] "Untitled3.ipynb"                   
## [40] "Untitled.ipynb"

titanic <- read.csv('titantic_train.csv', header = TRUE )

Three Ways to Subset Data

# subset() is a function that returns the rows of a data frame that match a condition.
males<-subset(titanic, sex=='male' )
females<-subset(titanic, sex=='female' )

#Males/females via an integer vector listing the desired row numbers.
malesarray<-which(titanic$sex=='male')
malesarray

##   [1]   1   5   6   7   8  13  14  17  18  21  22  24  27  28  30  31  34
##  [18]  35  36  37  38  43  46  47  49  51  52  55  56  58  60  61  63  64
##  [35]  65  66  68  70  71  73  74  75  76  77  78  79  81  82  84  87  88
##  [52]  90  91  92  93  94  95  96  97  98 100 102 103 104 105 106 108 109
##  [69] 111 113 116 117 118 119 121 122 123 125 126 127 128 130 131 132 135
##  [86] 136 138 139 140 144 145 146 147 149 150 151 153 154 155 156 158 159
## [103] 160 161 163 164 165 166 169 170 171 172 174 175 176 177 179 180 182
## [120] 183 184 186 188 189 190 192 194 197 198 201 202 203 204 205 207 208
## [137] 210 211 213 214 215 218 220 221 222 223 224 225 226 227 228 229 232
## [154] 233 235 237 239 240 243 244 245 246 249 250 251 253 254 261 262 263
## [171] 264 266 267 268 271 272 274 278 279 281 282 283 284 285 286 287 288
## [188] 289 293 295 296 297 299 302 303 305 306 309 314 315 318 321 322 325
## [205] 327 332 333 334 336 337 339 340 341 343 344 345 349 350 351 352 353
## [222] 354 355 356 361 362 364 365 366 371 372 373 374 378 379 380 383 385
## [239] 386 387 389 391 392 393 396 398 399 401 402 404 406 407 408 409 411
## [256] 412 414 415 419 421 422 423 425 426 429 430 431 434 435 439 440 442
## [273] 443 445 446 448 450 451 452 453 454 455 456 457 460 461 462 463 464
## [290] 465 466 467 468 469 471 472 476 477 478 479 481 482 483 485 488 489
## [307] 490 491 492 493 494 495 496 498 500 501 506 508 509 510 511 512 513
## [324] 515 516 518 520 522 523 525 526 528 529 530 532 533 537 539 544 545
## [341] 546 548 549 550 551 552 553 554 556 558 561 562 563 564 566 567 569
## [358] 570 571 573 575 576 580 583 584 585 587 588 589 590 591 593 595 596
## [375] 598 599 600 602 603 604 605 606 607 608 612 614 615 617 620 621 622
## [392] 623 624 625 626 627 629 630 631 632 633 634 637 638 640 641 644 646
## [409] 647 648 649 651 653 656 657 659 660 661 662 663 664 665 666 667 668
## [426] 669 672 673 674 675 676 677 680 682 683 684 685 686 687 688 689 691
## [443] 693 694 695 696 697 699 700 702 704 705 706 708 710 712 713 714 715
## [460] 716 719 720 722 723 724 725 726 729 732 733 734 735 736 738 739 740
## [477] 741 742 744 745 746 747 749 750 752 753 754 756 757 758 759 761 762
## [494] 763 765 767 769 770 771 772 774 776 777 779 783 784 785 786 788 789
## [511] 790 791 792 794 795 796 799 801 803 804 805 806 807 809 811 812 813
## [528] 815 816 818 819 820 822 823 825 826 827 828 829 832 833 834 835 837
## [545] 838 839 840 841 842 844 845 846 847 848 849 851 852 858 860 861 862
## [562] 865 868 869 870 871 873 874 877 878 879 882 884 885 887 890 891

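femalesarray<-which(titanic$sex=='female')
males2<-titanic[ malesarray, ] 
#A negative index drops the listed rows, leaving the females.
females2<-titanic[ -malesarray, ] 
#Equivalent: select the female rows directly.
females2<-titanic[ femalesarray, ] 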

#Males/females via a boolean vector indicating the appropriate rows.
#The comparison already returns TRUE/FALSE, so wrapping it in ifelse() is not needed.
malesarray2<-titanic$sex=='male'
males3<-titanic[ malesarray2, ] 
females3<-titanic[ !malesarray2, ]
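As a quick check that the three approaches agree, here is a minimal sketch on a small made-up data frame. The name toy and its columns are illustrative assumptions, not part of the Titanic data.

```r
# Hypothetical toy data, used only to compare the three subsetting styles.
toy <- data.frame(id = 1:6,
                  sex = c("male", "female", "female", "male", "female", "male"),
                  stringsAsFactors = FALSE)

by_subset  <- subset(toy, sex == "male")       # 1. subset() function
by_index   <- toy[which(toy$sex == "male"), ]  # 2. integer row numbers
by_logical <- toy[toy$sex == "male", ]         # 3. boolean vector

# All three select the same rows.
by_subset$id
by_index$id
by_logical$id
```

One practical difference: if the column contains NA, the boolean form returns extra all-NA rows, whereas subset() and which() drop them. Also, indexing with -which(...) returns zero rows when nothing matches, so the negated boolean form is usually the safer way to ask for "everything except".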

Aggregation

Aggregation is useful for many different aspects of analysis. Let’s take a look at a few with the titanic dataset.

#This will give us a count of the frequency at each level.  
table(titanic$survived)

## 
##   0   1 
## 549 342

table(titanic$sex)

## 
## female   male 
##    314    577

table(titanic$sibsp)

## 
##   0   1   2   3   4   5   8 
## 608 209  28  16  18   5   7

#Because survived is coded 0/1, sum() gives the same counts as table() above.
sum(titanic$survived)

## [1] 342

sum(!titanic$survived)

## [1] 549

#We can also generate tables based on proportions (percentages). This gives the proportion in each category.
prop.table(table(titanic$survived))

## 
##         0         1 
## 0.6161616 0.3838384

prop.table(table(titanic$sex))

## 
##   female     male 
## 0.352413 0.647587

prop.table(table(titanic$sibsp))

## 
##           0           1           2           3           4           5 
## 0.682379349 0.234567901 0.031425365 0.017957351 0.020202020 0.005611672 
##           8 
## 0.007856341

#We can also combine variables to create cross-tabs to get an initial idea of the role of different variables. 
table(titanic$sex, titanic$survived)

##         
##            0   1
##   female  81 233
##   male   468 109

table(titanic$sibsp, titanic$survived)

##    
##       0   1
##   0 398 210
##   1  97 112
##   2  15  13
##   3  12   4
##   4  15   3
##   5   5   0
##   8   7   0

#This gives the proportion of all passengers in each sex/survival combination.
prop.table(table(titanic$sex, titanic$survived))

##         
##                   0          1
##   female 0.09090909 0.26150393
##   male   0.52525253 0.12233446
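The proportions above are shares of the whole table. If the question is instead "what share of each sex survived", prop.table() takes a margin argument. The sketch below uses a small made-up data frame (toy is an assumption for illustration) to show the three variants side by side.

```r
# Hypothetical toy data: 1 = survived, 0 = did not survive.
toy <- data.frame(sex = c("female", "female", "female",
                          "male", "male", "male", "male"),
                  survived = c(1, 1, 0, 0, 0, 1, 0))
counts <- table(toy$sex, toy$survived)

prop.table(counts)              # each cell as a share of all rows
prop.table(counts, margin = 1)  # each cell as a share of its row (within sex)
prop.table(counts, margin = 2)  # each cell as a share of its column (within outcome)
```

With margin = 1 every row sums to 1, so the "1" column can be read directly as the survival rate for that sex.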

summary(titanic$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

summary(titanic)

##     survived          pclass     
##  Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.0000   Median :3.000  
##  Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :3.000  
##                                  
##                                     name         sex           age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      sibsp           parch             ticket         fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          cabin     embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

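#Flag children. Passengers with a missing age keep the default of 0.
titanic$child <- 0
titanic$child[titanic$age < 18] <- 1
#Here sum adds up survived (coded 0/1), giving the number of survivors in each group.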
aggregate(survived ~ child + sex, data=titanic, FUN=sum)

##   child    sex survived
## 1     0 female      195
## 2     1 female       38
## 3     0   male       86
## 4     1   male       23

#Length gives the number of each category
aggregate(survived ~ child + sex, data=titanic, FUN=length)

##   child    sex survived
## 1     0 female      259
## 2     1 female       55
## 3     0   male      519
## 4     1   male       58

#This gives the proportion that survived in each category
aggregate(survived ~ child + sex, data=titanic, FUN=function(x) {sum(x)/length(x)})

##   child    sex  survived
## 1     0 female 0.7528958
## 2     1 female 0.6909091
## 3     0   male 0.1657033
## 4     1   male 0.3965517

aggregate(survived ~ sex, data=titanic, FUN=function(x) {sum(x)/length(x)})

##      sex  survived
## 1 female 0.7420382
## 2   male 0.1889081
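Because survived is coded 0/1, the sum divided by the length is simply the mean, so mean can be passed as FUN directly. A minimal sketch on made-up data (toy is an illustrative assumption):

```r
# Hypothetical toy data: 1 = survived, 0 = did not survive.
toy <- data.frame(sex = c("female", "female", "female",
                          "male", "male", "male", "male"),
                  survived = c(1, 1, 0, 0, 0, 1, 0))

# The explicit version used above.
rate_manual <- aggregate(survived ~ sex, data = toy,
                         FUN = function(x) sum(x) / length(x))

# The same calculation using mean().
rate_mean <- aggregate(survived ~ sex, data = toy, FUN = mean)

rate_manual
rate_mean
```

Note that the formula interface of aggregate() drops rows with NA in any variable named in the formula by default, so the group sizes can be smaller than the raw data when those variables have missing values.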

Misc. Functions

#View(titanic) #show data browser
names(titanic) #show the names

##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"   
##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked" "child"

dim(titanic) #show the dimensions of the data frame

## [1] 891  12

head(titanic, 2) #show the first 2 records

##   survived pclass                                                name
## 1        0      3                             Braund, Mr. Owen Harris
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
##      sex age sibsp parch    ticket    fare cabin embarked child
## 1   male  22     1     0 A/5 21171  7.2500              S     0
## 2 female  38     1     0  PC 17599 71.2833   C85        C     0

tail(titanic, 4) #show the final 4 records

##     survived pclass                                     name    sex age
## 888        1      1             Graham, Miss. Margaret Edith female  19
## 889        0      3 Johnston, Miss. Catherine Helen "Carrie" female  NA
## 890        1      1                    Behr, Mr. Karl Howell   male  26
## 891        0      3                      Dooley, Mr. Patrick   male  32
##     sibsp parch     ticket  fare cabin embarked child
## 888     0     0     112053 30.00   B42        S     0
## 889     1     2 W./C. 6607 23.45              S     0
## 890     0     0     111369 30.00  C148        C     0
## 891     0     0     370376  7.75              Q     0

summary(titanic) #summarize all variables

##     survived          pclass     
##  Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.0000   Median :3.000  
##  Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :3.000  
##                                  
##                                     name         sex           age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      sibsp           parch             ticket         fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          cabin     embarked     child       
##             :687    :  2    Min.   :0.0000  
##  B96 B98    :  4   C:168    1st Qu.:0.0000  
##  C23 C25 C27:  4   Q: 77    Median :0.0000  
##  G6         :  4   S:644    Mean   :0.1268  
##  C22 C26    :  3            3rd Qu.:0.0000  
##  D          :  3            Max.   :1.0000  
##  (Other)    :186

str(titanic) #shows the structure of an R Object

## 'data.frame':    891 obs. of  12 variables:
##  $ survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ name    : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ sibsp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ ticket  : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
##  $ fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ cabin   : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ child   : num  0 0 0 0 0 0 0 1 0 1 ...

Missing Values

First let’s generate our sample data that includes a few missing values.

m<- matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4)
m[c(1,2,8),c(1,3,4)]=NA
colnames(m)<-(c("a","b","c","d"))
df<- as.data.frame(m)
df[c(1,2,8),c(1,3,4)]="NONE"
df

##                   a        b                c                d
## 1              NONE 18.76547             NONE             NONE
## 2              NONE 20.38273             NONE             NONE
## 3  18.7781413259624 24.16863 18.2638090861375 18.1507025907872
## 4  21.6317470548665 33.40690 20.7867788645043 23.6950518075289
## 5  22.0920175894281 29.83371 3.59303768026858 17.4436655705339
## 6  19.1894804780637 14.62938 28.0236678716222 16.0870810066057
## 7    19.39183425176 26.72012 18.8822330677097 24.5084318488144
## 8              NONE 25.47682             NONE             NONE
## 9  15.1956664059143 23.35283 18.0640378633102 17.5176205178589
## 10 23.0847458306809 23.48502 13.4906782376007 14.4604658534033

In our sample dataset, we can see that the data was coded as “NONE” where it was missing. One of the first things you should check is that the data has been coded into the appropriate type. Here, if we check, we can see that the “NONE” values have caused some numeric variables to be coded as strings.

# We can see the dataframe structure here.
str(df)

## 'data.frame':    10 obs. of  4 variables:
##  $ a: chr  "NONE" "NONE" "18.7781413259624" "21.6317470548665" ...
##  $ b: num  18.8 20.4 24.2 33.4 29.8 ...
##  $ c: chr  "NONE" "NONE" "18.2638090861375" "20.7867788645043" ...
##  $ d: chr  "NONE" "NONE" "18.1507025907872" "23.6950518075289" ...

summary(df)

##       a                   b              c                  d            
##  Length:10          Min.   :14.63   Length:10          Length:10         
##  Class :character   1st Qu.:21.13   Class :character   Class :character  
##  Mode  :character   Median :23.83   Mode  :character   Mode  :character  
##                     Mean   :24.02                                        
##                     3rd Qu.:26.41                                        
##                     Max.   :33.41

We can deal with this issue by recoding each of the columns so that “NONE” is recoded to NA. Here, the first df$a selects the column. The subsequent df$a=="NONE" inside the brackets selects the rows of that column that contain "NONE". Then the <- NA assigns NA to the selected rows.

df$a[df$a=="NONE"] <- NA
df$c[df$c=="NONE"] <- NA
df$d[df$d=="NONE"] <- NA
df

##                   a        b                c                d
## 1              <NA> 18.76547             <NA>             <NA>
## 2              <NA> 20.38273             <NA>             <NA>
## 3  18.7781413259624 24.16863 18.2638090861375 18.1507025907872
## 4  21.6317470548665 33.40690 20.7867788645043 23.6950518075289
## 5  22.0920175894281 29.83371 3.59303768026858 17.4436655705339
## 6  19.1894804780637 14.62938 28.0236678716222 16.0870810066057
## 7    19.39183425176 26.72012 18.8822330677097 24.5084318488144
## 8              <NA> 25.47682             <NA>             <NA>
## 9  15.1956664059143 23.35283 18.0640378633102 17.5176205178589
## 10 23.0847458306809 23.48502 13.4906782376007 14.4604658534033

str(df)

## 'data.frame':    10 obs. of  4 variables:
##  $ a: chr  NA NA "18.7781413259624" "21.6317470548665" ...
##  $ b: num  18.8 20.4 24.2 33.4 29.8 ...
##  $ c: chr  NA NA "18.2638090861375" "20.7867788645043" ...
##  $ d: chr  NA NA "18.1507025907872" "23.6950518075289" ...

summary(df)

##       a                   b              c                  d            
##  Length:10          Min.   :14.63   Length:10          Length:10         
##  Class :character   1st Qu.:21.13   Class :character   Class :character  
##  Mode  :character   Median :23.83   Mode  :character   Mode  :character  
##                     Mean   :24.02                                        
##                     3rd Qu.:26.41                                        
##                     Max.   :33.41

After we have removed the string values, we then need to go through and transform the dataframe to have the appropriate structure (with numeric columns).

df<-transform(df, a = as.numeric(a), c=as.numeric(c), d=as.numeric(d))
str(df)

## 'data.frame':    10 obs. of  4 variables:
##  $ a: num  NA NA 18.8 21.6 22.1 ...
##  $ b: num  18.8 20.4 24.2 33.4 29.8 ...
##  $ c: num  NA NA 18.26 20.79 3.59 ...
##  $ d: num  NA NA 18.2 23.7 17.4 ...

sum(is.na(df$a)) # Do this to count the NA in a

## [1] 3
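When many columns are affected, the recode-then-convert steps above can be wrapped in a small helper. This is only a sketch under stated assumptions: recode_sentinel is a hypothetical name, and it assumes every character column is meant to be numeric once the sentinel is removed.

```r
# Hypothetical helper: replace a sentinel string with NA in every character
# column, then convert those columns to numeric.
recode_sentinel <- function(data, sentinel = "NONE") {
  for (col in names(data)) {
    if (is.character(data[[col]])) {
      values <- data[[col]]
      values[values == sentinel] <- NA
      data[[col]] <- as.numeric(values)
    }
  }
  data
}

# Toy data with the same kind of coding problem as df above.
toy <- data.frame(a = c("NONE", "1.5", "2.5"),
                  b = c(1, 2, 3),
                  stringsAsFactors = FALSE)
clean <- recode_sentinel(toy)
str(clean)
```

If a character column holds text that is not numeric, as.numeric() will turn it into NA with a warning, so check the result with str() or summary() before relying on it.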

In cases where you are importing a CSV file of data with missing values, it is possible to fix the coded values by just telling R how they are coded.

#This is in the repository under /data. 
setwd('/vagrant/data')
df3 <- read.csv('df_none.csv', header = TRUE, na.strings = "NONE" )
df3

##     X        a        b        c        d
## 1   1       NA 12.68119       NA       NA
## 2   2       NA 17.34767       NA       NA
## 3   3 27.36288 20.53239 23.43477 18.29824
## 4   4 25.62721 13.31117 31.89819 23.32476
## 5   5 21.90477 20.44727 25.09769 15.92030
## 6   6 27.32332 22.98480 30.96029 21.03089
## 7   7 13.41282 16.29667 12.22052 20.31968
## 8   8       NA 19.78516       NA       NA
## 9   9 20.79425 23.65655 26.60762 21.89112
## 10 10 25.21863 21.80347 21.62041 24.25770
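To try na.strings without the course repository, the sketch below writes a tiny CSV to a temporary file and reads it back. The file contents and the name toy are made up for illustration.

```r
# Write a small hypothetical CSV in which missing values are coded as "NONE".
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b",
             "NONE,1.5",
             "2.5,NONE",
             "3.5,4.5"), csv_path)

# Tell read.csv() which string marks a missing value.
toy <- read.csv(csv_path, header = TRUE, na.strings = "NONE")
str(toy)         # both columns come in as numeric
sum(is.na(toy))  # total number of missing values
```

na.strings accepts a vector, so several codings can be handled at once, for example na.strings = c("NONE", "", "NA").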

Next we want to be able to diagnose missing values in our dataset.

#The summary function will give us the number of NA's in each column. 
summary(df)

##        a               b               c                d        
##  Min.   :15.20   Min.   :14.63   Min.   : 3.593   Min.   :14.46  
##  1st Qu.:18.98   1st Qu.:21.13   1st Qu.:15.777   1st Qu.:16.77  
##  Median :19.39   Median :23.83   Median :18.264   Median :17.52  
##  Mean   :19.91   Mean   :24.02   Mean   :17.301   Mean   :18.84  
##  3rd Qu.:21.86   3rd Qu.:26.41   3rd Qu.:19.835   3rd Qu.:20.92  
##  Max.   :23.08   Max.   :33.41   Max.   :28.024   Max.   :24.51  
##  NA's   :3                       NA's   :3        NA's   :3

#str() confirms that every column is numeric.
str(df)

## 'data.frame':    10 obs. of  4 variables:
##  $ a: num  NA NA 18.8 21.6 22.1 ...
##  $ b: num  18.8 20.4 24.2 33.4 29.8 ...
##  $ c: num  NA NA 18.26 20.79 3.59 ...
##  $ d: num  NA NA 18.2 23.7 17.4 ...

#The complete.cases function provides a boolean vector with TRUE if the row has no missing values and FALSE if the row has missing values.  
complete.cases(df)

##  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

df.complete<-df[complete.cases(df),]
df.missing<-df[!complete.cases(df),]

#Alternate syntax to remove NA via a function.
df.complete2 <- na.omit(df)

#Let's find total. 
c('Complete:', nrow(df.complete), 'Missing:', nrow(df.missing))

## [1] "Complete:" "7"         "Missing:"  "3"
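Another way to diagnose missing values is to count them per column and per row with is.na(). A minimal sketch on made-up data (toy is an illustrative assumption):

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(a = c(NA, 2, 3, NA),
                  b = c(1, 2, 3, 4),
                  c = c(NA, 5, 6, 7))

colSums(is.na(toy))         # number of NAs in each column
rowSums(is.na(toy))         # number of NAs in each row
sum(!complete.cases(toy))   # number of rows with at least one NA
```

is.na() on a data frame returns a TRUE/FALSE matrix of the same shape, which is why colSums() and rowSums() work on it directly.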

Now we can fix the missing values with a very simple model: take the mean of each column and substitute it for the missing values.

# Taking the mean of column 3 of the dataframe doesn't work because of NA's
mean(df[,3])

## [1] NA

# Returns the mean of column 3 of the dataframe
df.c1=mean(df[,3], na.rm=TRUE)

#Or we can specify the complete dataframe. 
df.c1=mean(df.complete[,3]) # returns the mean of column 3 of the dataframe

#This will substitute the mean for all missing values. 
df$c[is.na(df$c)] <- df.c1

#We could do the whole thing in one step.
df$d[is.na(df$d)] <- mean(df$d, na.rm=TRUE)
df$a[is.na(df$a)] <- mean(df$a, na.rm=TRUE)
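The same substitution can be applied to every column at once. The sketch below assumes all columns are numeric. impute_mean is a hypothetical helper name and toy is made-up data.

```r
# Hypothetical helper: replace the NAs in a numeric vector with its mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy <- data.frame(a = c(NA, 2, 4),
                  b = c(1, NA, 3))

# toy[] keeps the data frame structure while replacing each column.
toy[] <- lapply(toy, impute_mean)
toy
```

Mean substitution keeps the column mean unchanged but shrinks its variance, so treat it as a quick first pass rather than a final answer.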

1R. Go ahead and provide the code to fix column a [from the df dataframe just above this] and print out your final dataframe with all missing values (NAs) removed.

2R. Now let’s start with the titanic dataset. How many missing values are there for the age field?

3R. Continuing with the Titanic dataset, fill in the age value with the median.

4R. Count the number of NA values in the titanic$embarked and titanic$fare. Then look at the data. What is going on? Recode the data so that missing values are coded as NA.

5R. Determine what is the most common value for titanic$embarked. Recode NA’s to the most common value.

While we haven’t looked at any “models” yet, here you can see a simple regression analysis used to predict age from the fare, sex, and sibsp fields. This creates a simple function where age=f(fare,sex,sibsp).

names(titanic)

##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"   
##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked" "child"

titanic.complete<-titanic[complete.cases(titanic),]
titanic.missing<-titanic[!complete.cases(titanic),]

# Impute Age for missing values using regression analysis with age as the DV
m.age <- lm(age ~ fare + sex + sibsp, data = titanic.complete)
summary(m.age)

## 
## Call:
## lm(formula = age ~ fare + sex + sibsp, data = titanic.complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.349  -9.943  -1.928   8.045  46.955 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.072519   1.010099  28.782  < 2e-16 ***
## fare         0.043065   0.009895   4.352 1.55e-05 ***
## sexmale      2.680880   1.081988   2.478   0.0135 *  
## sibsp       -5.010511   0.556481  -9.004  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.64 on 710 degrees of freedom
## Multiple R-squared:  0.1222, Adjusted R-squared:  0.1185 
## F-statistic: 32.95 on 3 and 710 DF,  p-value: < 2.2e-16

m.age2 <- lm(age ~ fare + sex + sibsp, data = titanic)
summary(m.age2)

## 
## Call:
## lm(formula = age ~ fare + sex + sibsp, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.349  -9.943  -1.928   8.045  46.955 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.072519   1.010099  28.782  < 2e-16 ***
## fare         0.043065   0.009895   4.352 1.55e-05 ***
## sexmale      2.680880   1.081988   2.478   0.0135 *  
## sibsp       -5.010511   0.556481  -9.004  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.64 on 710 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.1222, Adjusted R-squared:  0.1185 
## F-statistic: 32.95 on 3 and 710 DF,  p-value: < 2.2e-16

#Lots of nested vectors being used to select the appropriate rows.
titanic$age[is.na(titanic$age)]<-predict(m.age, newdata= titanic[is.na(titanic$age),])
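The nested indexing on the line above is easier to follow on a tiny example. This sketch uses made-up data in which y is exactly 2*x + 1, so the filled-in values are easy to verify by hand. The names toy, missing_y, and fit are assumptions for illustration.

```r
# Hypothetical toy data: y = 2 * x + 1, with two values missing.
toy <- data.frame(x = 1:6,
                  y = c(3, 5, NA, 9, NA, 13))

# Boolean vector marking the rows that need to be filled in.
missing_y <- is.na(toy$y)

# Fit the model on the rows where y is known.
fit <- lm(y ~ x, data = toy[!missing_y, ])

# Predict y for the rows where it is missing and write the values back.
toy$y[missing_y] <- predict(fit, newdata = toy[missing_y, ])
toy
```

The left side selects the missing entries of y, and the right side produces one prediction per selected row, so the two line up in order.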

6R. Is there any difference between m.age and m.age2? Why or Why not?

Introduction to Feature Extraction

Here we are going to go through the process of creating new features from the Titanic dataset.

7R. Explain the process of new feature creation in general and provide 3 examples from Titanic.

Let’s start with the process of recoding a continuous variable into a categorical variable.

#RECODING CONTINUOUS VARIABLES
#This is the simple child coding into a variable which is 0 or 1. 
titanic$child <- 0
titanic$child[titanic$age < 18] <- 1
titanic$adult <- 0
titanic$adult[titanic$age >= 18] <- 1

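#This recodes to a character vector.
#Note that the test is > 18, so an age of exactly 18 is coded "child" here, while the child flag above uses < 18.
titanic$childcat <- ifelse(titanic$age > 18, "adult", "child") 

#Note that this is a character vector by default and we have to change to factor. 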
str(titanic$childcat)

##  chr [1:891] "adult" "adult" "adult" "adult" "adult" ...

titanic<-transform(titanic, childcat = as.factor(childcat))
str(titanic$childcat)

##  Factor w/ 2 levels "adult","child": 1 1 1 1 1 1 1 2 1 2 ...
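For more than two categories, base R's cut() bins a continuous variable into a factor in one step. The sketch below uses made-up exam scores. The break points and labels are arbitrary assumptions chosen for illustration.

```r
# Hypothetical scores to be binned into three bands.
scores <- c(5, 10, 15, 25, 80)

# Intervals are closed on the right by default: (-Inf,10], (10,20], (20,Inf).
score_band <- cut(scores,
                  breaks = c(-Inf, 10, 20, Inf),
                  labels = c("low", "medium", "high"))

score_band
table(score_band)
```

cut() returns a factor directly, so no separate as.factor() step is needed. Pass right = FALSE if the intervals should be closed on the left instead.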

8R. Create a variable childcat2 that is a factor variable with 4 levels (infant[<2], child[2-12], teen[13-18], adult[>18]).

titanic$section<-titanic$cabin
names(titanic)

##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"   
##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked" "child"   
## [13] "adult"    "childcat" "section"

#View(titanic)
#This removes the digits using regular expressions, leaving only the section letters.
titanic$section <- gsub("[0-9]", "", titanic$section)
#This removes whitespace using regular expressions.
titanic$section <-gsub(" ", "", titanic$section)

#This returns an integer with the number of cabins associated with a name. 
titanic$cabins<-nchar(titanic$section)
titanic$cabins[titanic$cabins==0]<-1
#This creates a new variable using only the first cabin.
titanic$section2 <-substr(titanic$section, 1, 1)

#Recode as M (multiple) if the passenger has multiple cabins.
titanic$section2[titanic$cabins>=2]<-"M" 

#Now that we are done recoding, we can change to a factor.
titanic<-transform(titanic, section2 = as.factor(section2))

summary(titanic$section2)

##       A   B   C   D   E   F   G   M   T 
## 687  15  36  51  32  32   9   4  24   1

#Survival rate by section and sex.
aggregate(survived ~ section2 + sex, data=titanic, FUN=function(x) {sum(x)/length(x)})

##    section2    sex  survived
## 1           female 0.6543779
## 2         A female 1.0000000
## 3         B female 1.0000000
## 4         C female 0.9545455
## 5         D female 1.0000000
## 6         E female 0.9333333
## 7         F female 1.0000000
## 8         G female 0.5000000
## 9         M female 0.8181818
## 10            male 0.1361702
## 11        A   male 0.4285714
## 12        B   male 0.3571429
## 13        C   male 0.3448276
## 14        D   male 0.4285714
## 15        E   male 0.5882353
## 16        F   male 0.6000000
## 17        M   male 0.3846154
## 18        T   male 0.0000000
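To see what each step of the cabin recoding does, here is the same sequence applied to a short made-up vector. The values in cabin are illustrative assumptions in the same format as the Titanic cabin field.

```r
# Hypothetical cabin strings: one cabin, none listed, two cabins, one cabin.
cabin <- c("C85", "", "B96 B98", "G6")

section <- gsub("[0-9]", "", cabin)   # drop digits:      "C" ""  "B B" "G"
section <- gsub(" ", "", section)     # drop whitespace:  "C" ""  "BB"  "G"

cabins <- nchar(section)              # one letter per cabin
cabins[cabins == 0] <- 1              # no cabin listed is counted as 1

section2 <- substr(section, 1, 1)     # keep the first letter only
section2[cabins >= 2] <- "M"          # flag passengers with multiple cabins

data.frame(cabin, section, cabins, section2)
```

Counting letters only approximates the number of cabins. It assumes every cabin is written as a single letter followed by digits.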

9R. Recode the data by fare into a factor variable called farecat. The categories should be: under 10, 10-20, 30+. If the value is NA, make it the most frequent category.

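#grepl() searches each string for a pattern, returning TRUE if it is found.
#The pattern is a regular expression, so '.' matches any character and 'Mr.' also matches 'Mrs'.
#The 'Mrs.' check therefore has to come after the 'Mr.' check so that it overwrites 'Mr'.

for(i in seq_along(titanic$name)){
  titanic$title[i] <- 'None'
  if(grepl('Mr.', titanic$name[i])){titanic$title[i] <- 'Mr'}
  if(grepl('Miss.', titanic$name[i])){titanic$title[i] <- 'Miss'}
  if(grepl('Mrs.', titanic$name[i])){titanic$title[i] <- 'Mrs'}
}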
titanic$title <- as.factor(titanic$title)

#As mentioned though, we like to avoid for loops if possible. 
titanic$title2 <- "None"
grepresult<-grepl('Mr.', titanic$name)

titanic$title2[grepl('Mr.', titanic$name)]<-"Mr."
titanic$title2[grepl('Miss.', titanic$name)]<-"Miss."
titanic$title2[grepl('Mrs.', titanic$name)]<-"Mrs."
titanic$title2 <- as.factor(titanic$title2)

summary(titanic$title)

## Miss   Mr  Mrs None 
##  180  518  129   64

summary(titanic$title2)

## Miss.   Mr.  Mrs.  None 
##   180   518   129    64
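Both versions above treat the pattern as a regular expression, where a dot matches any single character. The sketch below, on three made-up names in the same format, shows how that differs from a literal match using grepl()'s fixed = TRUE argument.

```r
# Hypothetical names in the same "Surname, Title. Given" format.
passengers <- c("Braund, Mr. Owen",
                "Cumings, Mrs. John",
                "Heikkinen, Miss. Laina")

# Regular expression: the dot matches any character, so "Mrs" also matches.
regex_match <- grepl("Mr.", passengers)

# Literal match: the dot must be an actual dot.
literal_match <- grepl("Mr.", passengers, fixed = TRUE)

data.frame(passengers, regex_match, literal_match)
```

With fixed = TRUE the order of the title checks no longer matters. The counts may differ from the regular-expression version above, so compare the two summaries before switching.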

10R. For both the for loop and other method, add titles for “Master”, “Doctor”, and “Major”.

11R. Create a feature indicating whether someone is of Irish descent (Mc).

Cross Validation

In the past lab we learned how to manually create a random array and select a training and test set using that array. This is going to do the same thing but with a slightly different approach.

train.length=round(nrow(df)/2, 0)
train=sample(nrow(df),train.length, replace=FALSE)

We can then select out our sample.

dfa=df[train,]
dfb=df[-train,]
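The split can be checked on a small made-up data frame. In this sketch toy and the seed value are arbitrary assumptions. set.seed() is added only so that the same rows are drawn on every run.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

# Hypothetical toy data with an id column so rows can be traced.
toy <- data.frame(id = 1:10,
                  value = seq(10, 100, by = 10))

# Draw half of the row numbers at random, without replacement.
train_length <- round(nrow(toy) / 2, 0)
train <- sample(nrow(toy), train_length, replace = FALSE)

toy_train <- toy[train, ]    # the sampled rows
toy_test  <- toy[-train, ]   # everything else

nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))  # rows shared by both sets
```

Because the test set is built with a negative index on the same vector, every row lands in exactly one of the two sets.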

Lab5 Missing Values and Introduction to Feature Creation