If you are having trouble getting a file into RStudio, here are some options.
If sharing is working between the guest virtual machine and the host, the easiest way is to sync your Git repository and then load the data from the shared drive. Look at the repository directory.
setwd('/vagrant/data')
list.files()
## [1] "df_none.csv" "~$df_NONE.xlsx"
## [3] "df_NONE.xlsx" "test.csv"
## [5] "titantic_morecolumns.csv" "titantic_morerows.csv"
## [7] "titantic_train.csv" "train.csv"
## [9] "train.xlsx"
titanic <- read.csv('titantic_train.csv', header = TRUE )
You can import a dataframe from GitHub directly using the GUI with Import Dataset -> From Web URL.
https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv
Note: during the import process, you can specify the dataframe name at the top left. By default it will come through as titantic_train. You can create a new one called titanic using the code below.
titanic<-titantic_train
R doesn’t seem to have a very robust method of dealing with files over HTTPS, which I have seen cause some problems. Instead, you can run vagrant ssh, which by default puts you in the /home/vagrant directory of the virtual machine. Then enter
wget https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titantic_train.csv
from the terminal. This will download the file directly to the Linux virtual machine.
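If you would rather stay inside R, the same fallback idea can be sketched as below. This is a minimal sketch and not part of the original lab: read_csv_with_fallback is a hypothetical helper name, the example.com URL is a placeholder that is never contacted, and whether download.file can fetch an HTTPS URL depends on how your R installation was built.

```r
# Hypothetical helper (not from the lab): use the local copy of a CSV if it
# exists; otherwise try to download it first with base R's download.file().
read_csv_with_fallback <- function(path, url) {
  if (!file.exists(path)) {
    # Assumption: this R build can handle the URL scheme (HTTPS support varies).
    download.file(url, destfile = path, mode = "wb")
  }
  read.csv(path, header = TRUE)
}

# Demo with a local temporary file, so no network access is needed.
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, sex = c("male", "female", "male")),
          demo_path, row.names = FALSE)
demo <- read_csv_with_fallback(demo_path, url = "https://example.com/unused.csv")
nrow(demo)
```

Because the file already exists, the download branch is skipped; on the virtual machine you would pass the path and the raw GitHub URL given above.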
# You only need to set the working directory if you changed it to something else. /home/vagrant is the default.
setwd('/home/vagrant')
list.files()
## [1] "1_twitter.ipynb"
## [2] "_Appendix B - OAuth Primer.ipynb"
## [3] "BeautifulSoup.ipynb"
## [4] "Chapter 0 - Preface.ipynb"
## [5] "Chapter 1 - Mining Twitter.ipynb"
## [6] "Chapter 4 - Mining Google+.ipynb"
## [7] "Chapter 9 - Twitter Cookbook.ipynb"
## [8] "Class 3 More Python Basics. .ipynb"
## [9] "data"
## [10] "downjason.ipynb.json"
## [11] "example.html"
## [12] "example.Rmd"
## [13] "index.html"
## [14] "index.html.1"
## [15] "install.sh"
## [16] "Lab2.ipynb"
## [17] "lab2solution.ipynb"
## [18] "Lab2-webmining.ipynb"
## [19] "Lab 3 - Twitter-Copy1.ipynb"
## [20] "Lab 3 - Twitter.ipynb"
## [21] "Lab3_Twitter_solution.ipynb"
## [22] "lab4.html"
## [23] "Lab4.ipynb"
## [24] "lab4.Rmd"
## [25] "Lab4-Solution.ipynb"
## [26] "Lab6.Rmd"
## [27] "model-figure"
## [28] "model.md"
## [29] "model.Rpres"
## [30] "nestedforloop.R"
## [31] "R"
## [32] "spark_mooc_version"
## [33] "spark_notebook.py"
## [34] "Titanic.ipynb"
## [35] "titantic_train.csv"
## [36] "titantic_train.csv.1"
## [37] "Untitled1.ipynb"
## [38] "Untitled2.ipynb"
## [39] "Untitled3.ipynb"
## [40] "Untitled.ipynb"
titanic <- read.csv('titantic_train.csv', header = TRUE )
# subset() returns the rows of a data frame that match a condition.
males<-subset(titanic, sex=='male' )
females<-subset(titanic, sex=='female' )
#Males/Females via an integer vector listing the desired row numbers.
malesarray<-which(titanic$sex=='male')
malesarray
## [1] 1 5 6 7 8 13 14 17 18 21 22 24 27 28 30 31 34
## [18] 35 36 37 38 43 46 47 49 51 52 55 56 58 60 61 63 64
## [35] 65 66 68 70 71 73 74 75 76 77 78 79 81 82 84 87 88
## [52] 90 91 92 93 94 95 96 97 98 100 102 103 104 105 106 108 109
## [69] 111 113 116 117 118 119 121 122 123 125 126 127 128 130 131 132 135
## [86] 136 138 139 140 144 145 146 147 149 150 151 153 154 155 156 158 159
## [103] 160 161 163 164 165 166 169 170 171 172 174 175 176 177 179 180 182
## [120] 183 184 186 188 189 190 192 194 197 198 201 202 203 204 205 207 208
## [137] 210 211 213 214 215 218 220 221 222 223 224 225 226 227 228 229 232
## [154] 233 235 237 239 240 243 244 245 246 249 250 251 253 254 261 262 263
## [171] 264 266 267 268 271 272 274 278 279 281 282 283 284 285 286 287 288
## [188] 289 293 295 296 297 299 302 303 305 306 309 314 315 318 321 322 325
## [205] 327 332 333 334 336 337 339 340 341 343 344 345 349 350 351 352 353
## [222] 354 355 356 361 362 364 365 366 371 372 373 374 378 379 380 383 385
## [239] 386 387 389 391 392 393 396 398 399 401 402 404 406 407 408 409 411
## [256] 412 414 415 419 421 422 423 425 426 429 430 431 434 435 439 440 442
## [273] 443 445 446 448 450 451 452 453 454 455 456 457 460 461 462 463 464
## [290] 465 466 467 468 469 471 472 476 477 478 479 481 482 483 485 488 489
## [307] 490 491 492 493 494 495 496 498 500 501 506 508 509 510 511 512 513
## [324] 515 516 518 520 522 523 525 526 528 529 530 532 533 537 539 544 545
## [341] 546 548 549 550 551 552 553 554 556 558 561 562 563 564 566 567 569
## [358] 570 571 573 575 576 580 583 584 585 587 588 589 590 591 593 595 596
## [375] 598 599 600 602 603 604 605 606 607 608 612 614 615 617 620 621 622
## [392] 623 624 625 626 627 629 630 631 632 633 634 637 638 640 641 644 646
## [409] 647 648 649 651 653 656 657 659 660 661 662 663 664 665 666 667 668
## [426] 669 672 673 674 675 676 677 680 682 683 684 685 686 687 688 689 691
## [443] 693 694 695 696 697 699 700 702 704 705 706 708 710 712 713 714 715
## [460] 716 719 720 722 723 724 725 726 729 732 733 734 735 736 738 739 740
## [477] 741 742 744 745 746 747 749 750 752 753 754 756 757 758 759 761 762
## [494] 763 765 767 769 770 771 772 774 776 777 779 783 784 785 786 788 789
## [511] 790 791 792 794 795 796 799 801 803 804 805 806 807 809 811 812 813
## [528] 815 816 818 819 820 822 823 825 826 827 828 829 832 833 834 835 837
## [545] 838 839 840 841 842 844 845 846 847 848 849 851 852 858 860 861 862
## [562] 865 868 869 870 871 873 874 877 878 879 882 884 885 887 890 891
femalesarray<-which(titanic$sex=='female')
males2<-titanic[ malesarray, ]
females2<-titanic[ -malesarray, ]
females2<-titanic[ femalesarray, ]
#Males/Females via a boolean vector indicating the appropriate rows.
#The comparison already returns TRUE/FALSE, so wrapping it in ifelse() is not needed.
malesarray2<-titanic$sex=='male'
males3<-titanic[ malesarray2, ]
females3<-titanic[ !malesarray2, ]
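The three selection styles above can be compared side by side on a small example. This is a minimal sketch using a made-up data frame called toy (not the Titanic data); it only illustrates that subset(), integer indices from which(), and a logical vector pick out the same rows.

```r
# Toy data standing in for the Titanic data frame (values are made up).
toy <- data.frame(sex = c("male", "female", "female", "male", "female"),
                  age = c(22, 38, 26, 35, 27))

by_subset  <- subset(toy, sex == "male")       # condition evaluated inside the data frame
by_index   <- toy[which(toy$sex == "male"), ]  # integer row positions
by_logical <- toy[toy$sex == "male", ]         # logical (TRUE/FALSE) vector

# Compare the results of the three approaches.
all.equal(by_subset, by_index)
all.equal(by_index, by_logical)
```

One practical difference: if the condition contains NA, logical indexing returns an all-NA row for it, while which() and subset() drop it.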
Aggregation is useful for many different aspects of analysis. Let’s take a look at a few with the titanic dataset.
#This will give us a count of the frequency at each level.
table(titanic$survived)
##
## 0 1
## 549 342
table(titanic$sex)
##
## female male
## 314 577
table(titanic$sibsp)
##
## 0 1 2 3 4 5 8
## 608 209 28 16 18 5 7
#For a 0/1 variable, the sums below give the same counts as table().
sum(titanic$survived)
## [1] 342
sum(!titanic$survived)
## [1] 549
#We can also generate proportions (percentages). This gives the proportion in each category.
prop.table(table(titanic$survived))
##
## 0 1
## 0.6161616 0.3838384
prop.table(table(titanic$sex))
##
## female male
## 0.352413 0.647587
prop.table(table(titanic$sibsp))
##
## 0 1 2 3 4 5
## 0.682379349 0.234567901 0.031425365 0.017957351 0.020202020 0.005611672
## 8
## 0.007856341
#We can also combine variables to create cross-tabs to get an initial idea of the role of different variables.
table(titanic$sex, titanic$survived)
##
## 0 1
## female 81 233
## male 468 109
table(titanic$sibsp, titanic$survived)
##
## 0 1
## 0 398 210
## 1 97 112
## 2 15 13
## 3 12 4
## 4 15 3
## 5 5 0
## 8 7 0
#This gives each cell as a proportion of the whole table.
prop.table(table(titanic$sex, titanic$survived))
##
## 0 1
## female 0.09090909 0.26150393
## male 0.52525253 0.12233446
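Proportions of the whole table are often less useful than proportions within a row or column. The sketch below rebuilds the sex by survived counts shown above by hand (so it runs without the CSV file) and uses the margin argument of prop.table; the object name tab is just an illustrative choice.

```r
# Counts copied from the table(titanic$sex, titanic$survived) output above.
tab <- as.table(matrix(c(81, 468, 233, 109), nrow = 2,
                       dimnames = list(sex = c("female", "male"),
                                       survived = c("0", "1"))))

prop.table(tab)              # each cell divided by the grand total
row_props <- prop.table(tab, margin = 1)  # each row sums to 1
col_props <- prop.table(tab, margin = 2)  # each column sums to 1
row_props
```

With margin = 1 the "1" column is the survival rate within each sex, which is usually the number you want when comparing groups.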
summary(titanic$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
summary(titanic)
## survived pclass
## Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.0000 Median :3.000
## Mean :0.3838 Mean :2.309
## 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :3.000
##
## name sex age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## sibsp parch ticket fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## cabin embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
titanic$child <- 0
titanic$child[titanic$age < 18] <- 1
#Here sum adds up survived (coded 0/1), giving the number of survivors in each group.
aggregate(survived ~ child + sex, data=titanic, FUN=sum)
## child sex survived
## 1 0 female 195
## 2 1 female 38
## 3 0 male 86
## 4 1 male 23
#Length gives the number of each category
aggregate(survived ~ child + sex, data=titanic, FUN=length)
## child sex survived
## 1 0 female 259
## 2 1 female 55
## 3 0 male 519
## 4 1 male 58
#This gives the proportion that survived in each group.
aggregate(survived ~ child + sex, data=titanic, FUN=function(x) {sum(x)/length(x)})
## child sex survived
## 1 0 female 0.7528958
## 2 1 female 0.6909091
## 3 0 male 0.1657033
## 4 1 male 0.3965517
aggregate(survived ~ sex, data=titanic, FUN=function(x) {sum(x)/length(x)})
## sex survived
## 1 female 0.7420382
## 2 male 0.1889081
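Since survived is coded 0/1, the proportion sum(x)/length(x) is simply the mean. A minimal sketch on made-up data (the toy data frame is an assumption, not the lab data) shows the two forms agree:

```r
# Made-up 0/1 outcome for two groups.
toy <- data.frame(survived = c(1, 0, 1, 1, 0, 0),
                  sex = c("female", "female", "female", "male", "male", "male"))

by_hand <- aggregate(survived ~ sex, data = toy,
                     FUN = function(x) sum(x) / length(x))
by_mean <- aggregate(survived ~ sex, data = toy, FUN = mean)

by_hand
by_mean
```

Both calls return one row per group with the same proportion; FUN = mean is just shorter to write.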
#View(titanic) #show data browser
names(titanic) #show the names
## [1] "survived" "pclass" "name" "sex" "age" "sibsp"
## [7] "parch" "ticket" "fare" "cabin" "embarked" "child"
dim(titanic) #show the dimensions of the data frame
## [1] 891 12
head(titanic, 2) #show the first 2 records
## survived pclass name
## 1 0 3 Braund, Mr. Owen Harris
## 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## sex age sibsp parch ticket fare cabin embarked child
## 1 male 22 1 0 A/5 21171 7.2500 S 0
## 2 female 38 1 0 PC 17599 71.2833 C85 C 0
tail(titanic, 4) #show the final 4 records
## survived pclass name sex age
## 888 1 1 Graham, Miss. Margaret Edith female 19
## 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NA
## 890 1 1 Behr, Mr. Karl Howell male 26
## 891 0 3 Dooley, Mr. Patrick male 32
## sibsp parch ticket fare cabin embarked child
## 888 0 0 112053 30.00 B42 S 0
## 889 1 2 W./C. 6607 23.45 S 0
## 890 0 0 111369 30.00 C148 C 0
## 891 0 0 370376 7.75 Q 0
summary(titanic) #summarize all variables
## survived pclass
## Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.0000 Median :3.000
## Mean :0.3838 Mean :2.309
## 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :3.000
##
## name sex age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## sibsp parch ticket fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## cabin embarked child
## :687 : 2 Min. :0.0000
## B96 B98 : 4 C:168 1st Qu.:0.0000
## C23 C25 C27: 4 Q: 77 Median :0.0000
## G6 : 4 S:644 Mean :0.1268
## C22 C26 : 3 3rd Qu.:0.0000
## D : 3 Max. :1.0000
## (Other) :186
str(titanic) #shows the structure of an R Object
## 'data.frame': 891 obs. of 12 variables:
## $ survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ ticket : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
## $ fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ child : num 0 0 0 0 0 0 0 1 0 1 ...
First let’s generate our sample data that includes a few missing values.
m<- matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4)
m[c(1,2,8),c(1,3,4)]=NA
colnames(m)<-(c("a","b","c","d"))
df<- as.data.frame(m)
df[c(1,2,8),c(1,3,4)]="NONE"
df
## a b c d
## 1 NONE 18.76547 NONE NONE
## 2 NONE 20.38273 NONE NONE
## 3 18.7781413259624 24.16863 18.2638090861375 18.1507025907872
## 4 21.6317470548665 33.40690 20.7867788645043 23.6950518075289
## 5 22.0920175894281 29.83371 3.59303768026858 17.4436655705339
## 6 19.1894804780637 14.62938 28.0236678716222 16.0870810066057
## 7 19.39183425176 26.72012 18.8822330677097 24.5084318488144
## 8 NONE 25.47682 NONE NONE
## 9 15.1956664059143 23.35283 18.0640378633102 17.5176205178589
## 10 23.0847458306809 23.48502 13.4906782376007 14.4604658534033
In our sample dataset, we can see that the data was coded as “NONE” where it was missing. One of the first things you should check is that the data has been coded into the appropriate type. Here, if we check it we can see that the “NONE” values have caused some numeric variables to be coded as strings.
# We can see the dataframe structure here.
str(df)
## 'data.frame': 10 obs. of 4 variables:
## $ a: chr "NONE" "NONE" "18.7781413259624" "21.6317470548665" ...
## $ b: num 18.8 20.4 24.2 33.4 29.8 ...
## $ c: chr "NONE" "NONE" "18.2638090861375" "20.7867788645043" ...
## $ d: chr "NONE" "NONE" "18.1507025907872" "23.6950518075289" ...
summary(df)
## a b c d
## Length:10 Min. :14.63 Length:10 Length:10
## Class :character 1st Qu.:21.13 Class :character Class :character
## Mode :character Median :23.83 Mode :character Mode :character
## Mean :24.02
## 3rd Qu.:26.41
## Max. :33.41
We can deal with this issue by recoding each of the columns so that “NONE” is recoded to NA. Here, the first df$a refers to the column. The subsequent df$a=="NONE" inside the brackets selects the rows that have NONE. Then the <- NA assigns NA to the rows that were selected.
df$a[df$a=="NONE"] <- NA
df$c[df$c=="NONE"] <- NA
df$d[df$d=="NONE"] <- NA
df
## a b c d
## 1 <NA> 18.76547 <NA> <NA>
## 2 <NA> 20.38273 <NA> <NA>
## 3 18.7781413259624 24.16863 18.2638090861375 18.1507025907872
## 4 21.6317470548665 33.40690 20.7867788645043 23.6950518075289
## 5 22.0920175894281 29.83371 3.59303768026858 17.4436655705339
## 6 19.1894804780637 14.62938 28.0236678716222 16.0870810066057
## 7 19.39183425176 26.72012 18.8822330677097 24.5084318488144
## 8 <NA> 25.47682 <NA> <NA>
## 9 15.1956664059143 23.35283 18.0640378633102 17.5176205178589
## 10 23.0847458306809 23.48502 13.4906782376007 14.4604658534033
str(df)
## 'data.frame': 10 obs. of 4 variables:
## $ a: chr NA NA "18.7781413259624" "21.6317470548665" ...
## $ b: num 18.8 20.4 24.2 33.4 29.8 ...
## $ c: chr NA NA "18.2638090861375" "20.7867788645043" ...
## $ d: chr NA NA "18.1507025907872" "23.6950518075289" ...
summary(df)
## a b c d
## Length:10 Min. :14.63 Length:10 Length:10
## Class :character 1st Qu.:21.13 Class :character Class :character
## Mode :character Median :23.83 Mode :character Mode :character
## Mean :24.02
## 3rd Qu.:26.41
## Max. :33.41
After we have removed the string values, we then need to use transform to convert the columns of the dataframe to the appropriate type (numeric).
df<-transform(df, a = as.numeric(a), c=as.numeric(c), d=as.numeric(d))
str(df)
## 'data.frame': 10 obs. of 4 variables:
## $ a: num NA NA 18.8 21.6 22.1 ...
## $ b: num 18.8 20.4 24.2 33.4 29.8 ...
## $ c: num NA NA 18.26 20.79 3.59 ...
## $ d: num NA NA 18.2 23.7 17.4 ...
sum(is.na(df$a)) # Do this to count the NA in a
## [1] 3
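As a side note, as.numeric already turns any string it cannot parse into NA (with a warning), so the two steps can be collapsed. The sketch below uses a made-up character vector raw and is an alternative under that assumption, not the lab's method.

```r
# Made-up character column with "NONE" marking missing values.
raw <- c("NONE", "18.78", "21.63", "NONE")

# as.numeric() returns NA for anything it cannot parse and emits a warning;
# suppressWarnings() silences it here because the NAs are expected.
converted <- suppressWarnings(as.numeric(raw))

converted
sum(is.na(converted))  # number of values that became NA
```

The explicit recode shown above is safer on real data, because this shortcut would also silently turn a typo such as "12,5" into NA.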
In cases where you are importing a CSV file of data with missing values, it is possible to fix the coded variables by just telling R how they are coded.
#This is in the repository under /data.
setwd('/vagrant/data')
df3 <- read.csv('df_none.csv', header = TRUE, na.strings = "NONE" )
df3
## X a b c d
## 1 1 NA 12.68119 NA NA
## 2 2 NA 17.34767 NA NA
## 3 3 27.36288 20.53239 23.43477 18.29824
## 4 4 25.62721 13.31117 31.89819 23.32476
## 5 5 21.90477 20.44727 25.09769 15.92030
## 6 6 27.32332 22.98480 30.96029 21.03089
## 7 7 13.41282 16.29667 12.22052 20.31968
## 8 8 NA 19.78516 NA NA
## 9 9 20.79425 23.65655 26.60762 21.89112
## 10 10 25.21863 21.80347 21.62041 24.25770
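The na.strings argument also accepts several codes at once. The following sketch reads a small inline CSV through the text argument of read.csv so that it runs without any file; the contents of csv_text and the second code "N/A" are made up for illustration.

```r
# Inline CSV with two different made-up codes for missing values.
csv_text <- "a,b,c
NONE,12.7,NONE
27.4,20.5,N/A
25.6,13.3,31.9"

df_demo <- read.csv(text = csv_text, header = TRUE,
                    na.strings = c("NONE", "N/A"))

str(df_demo)             # all three columns come through as numeric
colSums(is.na(df_demo))  # NA count per column
```

Because the codes are converted to NA during the import, the columns never become character vectors and no later as.numeric step is needed.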
Next we want to be able to diagnose missing values in our dataset.
#The summary function will give us the number of NA's in each column.
summary(df)
## a b c d
## Min. :15.20 Min. :14.63 Min. : 3.593 Min. :14.46
## 1st Qu.:18.98 1st Qu.:21.13 1st Qu.:15.777 1st Qu.:16.77
## Median :19.39 Median :23.83 Median :18.264 Median :17.52
## Mean :19.91 Mean :24.02 Mean :17.301 Mean :18.84
## 3rd Qu.:21.86 3rd Qu.:26.41 3rd Qu.:19.835 3rd Qu.:20.92
## Max. :23.08 Max. :33.41 Max. :28.024 Max. :24.51
## NA's :3 NA's :3 NA's :3
#
str(df)
## 'data.frame': 10 obs. of 4 variables:
## $ a: num NA NA 18.8 21.6 22.1 ...
## $ b: num 18.8 20.4 24.2 33.4 29.8 ...
## $ c: num NA NA 18.26 20.79 3.59 ...
## $ d: num NA NA 18.2 23.7 17.4 ...
#The complete.cases function provides a boolean vector with TRUE if the row has no missing values and FALSE if the row has missing values.
complete.cases(df)
## [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
df.complete<-df[complete.cases(df),]
df.missing<-df[!complete.cases(df),]
#Alternate syntax to remove NA via a function.
df.complete2 <- na.omit(df)
#Let's find total.
c('Complete:', nrow(df.complete), 'Missing:', nrow(df.missing))
## [1] "Complete:" "7" "Missing:" "3"
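For a quick count of missing values by column or by row, is.na can be combined with colSums and rowSums. This is a minimal sketch on a made-up data frame (df_demo is not the df used above):

```r
# Made-up data frame with a few missing values.
df_demo <- data.frame(a = c(NA, 18.8, 21.6, NA),
                      b = c(18.8, 20.4, 24.2, 33.4),
                      c = c(NA, 18.3, NA, 20.8))

na_by_col <- colSums(is.na(df_demo))   # number of NAs in each column
na_by_row <- rowSums(is.na(df_demo))   # number of NAs in each row
share_complete <- mean(complete.cases(df_demo))  # fraction of rows with no NA

na_by_col
na_by_row
share_complete
```

is.na(df_demo) is a TRUE/FALSE matrix, and summing it treats TRUE as 1, which is why the sums are counts.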
# Taking the mean of column 3 of the dataframe doesn't work because of NA's
mean(df[,3])
## [1] NA
# Returns the mean of column 3 of the dataframe
df.c1=mean(df[,3], na.rm=TRUE)
#Or we can specify the complete dataframe.
df.c1=mean(df.complete[,3]) # returns the mean of column 3 of the dataframe
#This will substitute the mean for all missing values.
df$c[is.na(df$c)] <- df.c1
#We could do the whole thing in one step.
df$d[is.na(df$d)] <- mean(df$d, na.rm=TRUE)
df$a[is.na(df$a)] <- mean(df$a, na.rm=TRUE)
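When many columns need the same treatment, the one-line pattern can be wrapped in a function and applied to every column. The sketch below assumes all columns are numeric; impute_mean and df_demo are hypothetical names introduced only for this example.

```r
# Made-up numeric data frame with missing values.
df_demo <- data.frame(a = c(NA, 10, 20), b = c(1, 2, 3), c = c(4, NA, 8))

# Hypothetical helper: replace NAs in a numeric vector with the vector's mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# df_demo[] keeps the data frame structure while replacing every column.
df_demo[] <- lapply(df_demo, impute_mean)
df_demo
```

Keep in mind that mean imputation shrinks the variance of the column, so it is a convenience rather than a neutral choice.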
1R. Go ahead and provide the code to fix column a [from the df dataframe just above this] and print out your final dataframe with all missing values (NAs) removed.
2R. Now let’s start with the titanic dataset. How many missing values are there for the age field?
3R. Continuing with the Titanic dataset, fill in the age value with the median.
4R. Count the number of NA values in titanic$embarked and titanic$fare. Then look at the data. What is going on? Recode the data so that missing values are coded as NA.
5R. Determine what is the most common value for titanic$embarked. Recode NA’s to the most common value.
While we haven’t looked at any “models” yet, here you can see a simple regression analysis used to predict age from the fare, sex, and sibsp fields. This creates a simple function where age=f(fare,sex,sibsp).
names(titanic)
## [1] "survived" "pclass" "name" "sex" "age" "sibsp"
## [7] "parch" "ticket" "fare" "cabin" "embarked" "child"
titanic.complete<-titanic[complete.cases(titanic),]
titanic.missing<-titanic[!complete.cases(titanic),]
# Impute Age for missing values using regression analysis with age as the DV
m.age <- lm(age ~ fare + sex + sibsp, data = titanic.complete)
summary(m.age)
##
## Call:
## lm(formula = age ~ fare + sex + sibsp, data = titanic.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.349 -9.943 -1.928 8.045 46.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.072519 1.010099 28.782 < 2e-16 ***
## fare 0.043065 0.009895 4.352 1.55e-05 ***
## sexmale 2.680880 1.081988 2.478 0.0135 *
## sibsp -5.010511 0.556481 -9.004 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.64 on 710 degrees of freedom
## Multiple R-squared: 0.1222, Adjusted R-squared: 0.1185
## F-statistic: 32.95 on 3 and 710 DF, p-value: < 2.2e-16
m.age2 <- lm(age ~ fare + sex + sibsp, data = titanic)
summary(m.age2)
##
## Call:
## lm(formula = age ~ fare + sex + sibsp, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.349 -9.943 -1.928 8.045 46.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.072519 1.010099 28.782 < 2e-16 ***
## fare 0.043065 0.009895 4.352 1.55e-05 ***
## sexmale 2.680880 1.081988 2.478 0.0135 *
## sibsp -5.010511 0.556481 -9.004 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.64 on 710 degrees of freedom
## (177 observations deleted due to missingness)
## Multiple R-squared: 0.1222, Adjusted R-squared: 0.1185
## F-statistic: 32.95 on 3 and 710 DF, p-value: < 2.2e-16
#Lots of nested vectors being used to select the appropriate rows.
titanic$age[is.na(titanic$age)]<-predict(m.age, newdata= titanic[is.na(titanic$age),])
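The imputation line above packs three steps into one: find the rows where age is missing, predict age for just those rows, and write the predictions back. A minimal sketch on simulated data spells the steps out; the toy data frame, its coefficients, and the single predictor fare are assumptions chosen only to make the example self-contained.

```r
set.seed(1)  # fixed seed so the simulated data are repeatable

# Simulated data: age depends loosely on fare (made-up relationship).
toy <- data.frame(fare = runif(50, 5, 100))
toy$age <- 20 + 0.2 * toy$fare + rnorm(50, sd = 3)
toy$age[c(3, 17, 42)] <- NA   # knock out three ages

fit <- lm(age ~ fare, data = toy)                 # step 0: fit the model
missing_rows <- is.na(toy$age)                    # step 1: logical vector of rows to fill
predicted <- predict(fit, newdata = toy[missing_rows, ])  # step 2: predict those rows
toy$age[missing_rows] <- predicted                # step 3: write predictions back

sum(is.na(toy$age))
```

The predictors used in the model must themselves be non-missing in the rows being filled, otherwise predict returns NA for those rows.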
6R. Is there any difference between m.age and m.age2? Why or why not?
Here we are going to go through the process of creating new features from the Titanic dataset.
7R. Explain the process of new feature creation in general and provide 3 examples from Titanic.
Let’s start with the process of recoding a continuous variable into a categorical variable
#RECODING CONTINUOUS
#This is the simple child coding into a variable which is 0 or 1.
titanic$child <- 0
titanic$child[titanic$age < 18] <- 1
titanic$adult <- 0
titanic$adult[titanic$age >= 18] <- 1
#This recodes to a character vector
titanic$childcat <- ifelse(titanic$age > 18, "adult", "child")
#Note that this is a character vector by default and we have to change it to a factor.
str(titanic$childcat)
## chr [1:891] "adult" "adult" "adult" "adult" "adult" ...
titanic<-transform(titanic, childcat = as.factor(childcat))
str(titanic$childcat)
## Factor w/ 2 levels "adult","child": 1 1 1 1 1 1 1 2 1 2 ...
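When there are more than two categories, base R's cut function bins a numeric vector into a factor in one call. The sketch below uses a made-up vector score with arbitrary break points; the names and thresholds are illustrative assumptions.

```r
score <- c(5, 18, 42, 67, 90)   # made-up values

band <- cut(score,
            breaks = c(-Inf, 20, 60, Inf),        # bin edges
            labels = c("low", "medium", "high"),  # one label per bin
            right = FALSE)                        # bins are [lower, upper)

band
table(band)
```

The result is already a factor, so no separate as.factor step is needed, and right = FALSE controls which bin a value exactly on an edge falls into.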
8R. Create a variable childcat2 that is a factor variable with 4 levels (infant[<2], child[2-12], teen[13-18], adult[>18]).
titanic$section<-titanic$cabin
names(titanic)
## [1] "survived" "pclass" "name" "sex" "age" "sibsp"
## [7] "parch" "ticket" "fare" "cabin" "embarked" "child"
## [13] "adult" "childcat" "section"
#View(titanic)
#This removes the digits using a regular expression, leaving only the section letters.
titanic$section <- gsub("[0-9]", "", titanic$section)
#This removes the whitespace.
titanic$section <-gsub(" ", "", titanic$section)
#This returns an integer with the number of cabins associated with a name.
titanic$cabins<-nchar(titanic$section)
titanic$cabins[titanic$cabins==0]<-1
#This creates a new variable using only the first cabin.
titanic$section2 <-substr(titanic$section, 1, 1)
#Recode as "M" if the passenger has multiple cabins.
titanic$section2[titanic$cabins>=2]<-"M"
#Now that we are done recoding, we can change to a factor.
titanic<-transform(titanic, section2 = as.factor(section2))
summary(titanic$section2)
## A B C D E F G M T
## 687 15 36 51 32 32 9 4 24 1
#
aggregate(survived ~ section2 + sex, data=titanic, FUN=function(x) {sum(x)/length(x)})
## section2 sex survived
## 1 female 0.6543779
## 2 A female 1.0000000
## 3 B female 1.0000000
## 4 C female 0.9545455
## 5 D female 1.0000000
## 6 E female 0.9333333
## 7 F female 1.0000000
## 8 G female 0.5000000
## 9 M female 0.8181818
## 10 male 0.1361702
## 11 A male 0.4285714
## 12 B male 0.3571429
## 13 C male 0.3448276
## 14 D male 0.4285714
## 15 E male 0.5882353
## 16 F male 0.6000000
## 17 M male 0.3846154
## 18 T male 0.0000000
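To see what each step of the cabin recoding does, it helps to trace a few values by hand. The sketch below applies the same gsub, nchar, and substr calls to four cabin strings taken from the output earlier in this lab; the standalone vector names are just for illustration.

```r
cabin <- c("", "C85", "B96 B98", "C23 C25 C27")

section <- gsub("[0-9]", "", cabin)   # drop digits:  "", "C", "B B", "C C C"
section <- gsub(" ", "", section)     # drop spaces:  "", "C", "BB",  "CCC"
cabins <- nchar(section)              # letters left = number of cabins listed
cabins[cabins == 0] <- 1              # blank cabin is counted as 1, as above

section2 <- substr(section, 1, 1)     # first letter only
section2[cabins >= 2] <- "M"          # "M" marks multiple cabins

data.frame(cabin, section, cabins, section2)
```

Note that the blank cabin keeps an empty section2 value, which is why the summary above has an unlabeled first level.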
9R. Recode the data by fare into a factor variable called farecat. The categories should be: under 10, 10-20, 30+. If the value is NA, assign it the most frequent category.
#grepl searches for a pattern in a string, returning TRUE if it is found.
for(i in 1:length(titanic$name)){
titanic$title[i] <- 'None'
if(grepl('Mr.', titanic$name[i])){titanic$title[i] <- 'Mr'}
if(grepl('Miss.', titanic$name[i])){titanic$title[i] <- 'Miss'}
if(grepl('Mrs.', titanic$name[i])){titanic$title[i] <- 'Mrs'}
}
titanic$title <- as.factor(titanic$title)
#As mentioned, though, we prefer to avoid for loops where possible.
titanic$title2 <- "None"
grepresult<-grepl('Mr.', titanic$name)
titanic$title2[grepl('Mr.', titanic$name)]<-"Mr."
titanic$title2[grepl('Miss.', titanic$name)]<-"Miss."
titanic$title2[grepl('Mrs.', titanic$name)]<-"Mrs."
titanic$title2 <- as.factor(titanic$title2)
summary(titanic$title)
## Miss Mr Mrs None
## 180 518 129 64
summary(titanic$title2)
## Miss. Mr. Mrs. None
## 180 518 129 64
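One detail worth knowing about the patterns above: grepl treats its pattern as a regular expression, where "." matches any character. That means 'Mr.' also matches "Mrs", which is why the code assigns Mrs last so it overwrites the earlier Mr match. A minimal sketch with three shortened, illustrative names:

```r
name <- c("Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina")

as_regex   <- grepl("Mr.", name)                # "." is a wildcard, so "Mrs" matches too
as_literal <- grepl("Mr.", name, fixed = TRUE)  # match the literal text "Mr."
as_escaped <- grepl("Mr\\.", name)              # regex with the dot escaped

as_regex
as_literal
as_escaped
```

Using fixed = TRUE or an escaped dot makes the result independent of the order in which the titles are assigned.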
10R. For both the for loop and other method, add titles for “Master”, “Doctor”, and “Major”.
11R. Create a feature indicating whether someone is of Irish descent (Mc).
In the past lab we learned how to manually create a random array and select a training and test set using that array. This is going to do the same thing but with a slightly different approach.
train.length=round(nrow(df)/2, 0)
train=sample(nrow(df),train.length, replace=FALSE)
We can then select out our sample.
dfa=df[train,]
dfb=df[-train,]
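Because sample draws different rows on every run, it is common to fix the random seed so the split can be reproduced. The sketch below repeats the same split on a made-up data frame and checks that the two halves do not overlap; df_demo, the seed value, and the variable names are assumptions for illustration.

```r
set.seed(42)  # any fixed seed makes the split repeatable

df_demo <- data.frame(id = 1:10, x = rnorm(10))   # made-up data

train_n <- round(nrow(df_demo) / 2)               # half of the rows
train_idx <- sample(nrow(df_demo), train_n, replace = FALSE)

train_set <- df_demo[train_idx, ]    # rows drawn by sample()
test_set <- df_demo[-train_idx, ]    # negative index keeps all the other rows

c(nrow(train_set), nrow(test_set))
length(intersect(train_set$id, test_set$id))  # rows shared by both sets
```

Sampling without replacement guarantees that each row lands in exactly one of the two sets.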