Apply Family Notes
Purpose
In this notebook session, I’ll be going over the “Apply” family in base R. The “apply functions” are a group of functions that ship with base R and let you repeat an action across the elements of different objects (e.g. data frames, lists, and vectors).
The functions I’ll go over will be:
- apply()
- lapply()
- mapply()
- rapply()
- sapply()
- tapply()
- vapply()
Packages Used & Loaded:
## [1] "tidyverse" "knitr" "kableExtra"
Apply Functions, Inputs, & Outputs
A quick overview:
kable(applychart) %>%
kable_minimal()
Name | What it does | Input | Output |
---|---|---|---|
apply() | Applies a function to the rows or columns of the object | Data Frame or Matrix | Vector, Matrix/Array, or List |
lapply() | Applies a function to all elements within the input | Data Frame, List, Vector | List |
mapply() | Applies a function to multiple lists or vectors - can be considered a multivariate version of ‘sapply’ | Multiple Lists or Vectors (i.e. a Data Frame) | List or Vector |
rapply() | Applies a function recursively through a list - nested lists | Nested Lists | Nested List or Vector depending on arguments passed |
sapply() | A wrapper around ‘lapply’ that simplifies the result where possible; works on lists, data frames, and vectors | Data Frame, List, Vector | Vector or Matrix (List when the results cannot be simplified) |
tapply() | Applies a function over a ragged/jagged array (an array that has more than one dimension with varying lengths) | Data Frame or Vector that can be split (divided into groups/factors) | Array |
vapply() | Similar to ‘sapply’, but you pre-specify the type and length of each result, making it safer and sometimes a bit faster | Data Frame, List, Vector | Vector, Matrix, or Array matching the template |
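To make the Output column more concrete, here is a minimal sketch that runs the same function through three members of the family. The `scores` list is a made-up input for illustration only.

```r
# Hypothetical input: a small named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20))

as_list   <- lapply(scores, mean)                          # always a list
as_vector <- sapply(scores, mean)                          # simplified to a named numeric vector
checked   <- vapply(scores, mean, FUN.VALUE = numeric(1))  # same values, but the type is enforced

class(as_list)    # "list"
class(as_vector)  # "numeric"
```

The values are the same in all three cases; only the container they come back in differs.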
Apply Function: apply()
The apply() function applies a function to every row or column of an object. Consequently, it only works on objects with more than one dimension, such as a data frame or matrix.
apply(X, MARGIN, FUN)
Where:
Argument | Description |
---|---|
X | Data Frame or Matrix |
MARGIN | ‘1’ or ‘2’ or ‘c(1,2)’ where 1 = Rows and 2 = Columns |
FUN | The function you want to be applied to the data frame or matrix in question |
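As a quick illustration of the three MARGIN settings, here is a small sketch on a made-up matrix (the matrix `m` and its dimnames are assumptions for the example, not part of the notebook's data).

```r
# Hypothetical 2 x 3 matrix used only for illustration
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_totals <- apply(m, 1, sum)                            # one value per row
col_totals <- apply(m, 2, sum)                            # one value per column
doubled    <- apply(m, c(1, 2), function(x) x * 2)        # applied to every cell

row_totals  # r1 = 9, r2 = 12
col_totals  # a = 3, b = 7, c = 11
```

With `c(1, 2)` the function sees one cell at a time, so the result keeps the shape of the input.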
Data Frame Example
#Creating a mock data frame
City <- c("Buffalo","NYC","Seattle","Austin","Orlando","Minneapolis")
Cases <- c(500,2012,1876,635,4512,823)
Controls <- c(3426,5210,6753,5633,2013,1890)
records <- data.frame(City,Cases,Controls, row.names = NULL)
records
## City Cases Controls
## 1 Buffalo 500 3426
## 2 NYC 2012 5210
## 3 Seattle 1876 6753
## 4 Austin 635 5633
## 5 Orlando 4512 2013
## 6 Minneapolis 823 1890
We can use the apply function to calculate column sums…
#Calculating the column sum of all applicable columns
apply(records[,2:3], 2,sum)
## Cases Controls
## 10358 24925
…Or row sums. (Note that both calls produce vectors, and that we subset the data frame with [,2:3] to keep the first column, which holds character strings, out of the calculation. R can’t perform a mathematical function on character strings and would throw an error.)
#Calculating row sums
apply(records[,2:3], 1,sum)
## [1] 3926 7222 8629 6268 6525 2713
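The reason for subsetting can be seen in a short sketch. apply() converts a data frame to a matrix first, and a matrix can hold only one type. The two-row `demo` frame below is a cut-down copy of the mock data so the example stands on its own, and selecting columns with `is.numeric` is just one possible alternative to `[,2:3]`.

```r
# Rebuilding a two-row version of the mock data so this sketch is self-contained
demo <- data.frame(City = c("Buffalo", "NYC"),
                   Cases = c(500, 2012),
                   Controls = c(3426, 5210))

# apply() converts its input with as.matrix(); one character column
# turns every cell into a character string
class(as.matrix(demo)[1, "Cases"])   # "character"

# Selecting the numeric columns programmatically instead of by position
numeric_cols <- vapply(demo, is.numeric, FUN.VALUE = logical(1))
row_sums <- apply(demo[, numeric_cols], 1, sum)
row_sums
```

Picking columns by type rather than by position keeps the call working if the column order ever changes.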
We can name the elements of the vector in one line of code with the names<- function:
#Calculating row sums, but applying names from the "City" column
`City Totals` <- `names<-`(apply(records[,2:3], 1,sum), records$City)
`City Totals`
## Buffalo NYC Seattle Austin Orlando Minneapolis
## 3926 7222 8629 6268 6525 2713
Applying Statistical (More Complex) Functions
We can also run statistical procedures, as long as the data meets each test’s requirements. Let’s do a t-test:
#Making a mock data set for t-test
Ex1_grades <- c(67,53)
Ex2_grades <- c(90,89)
Ex3_grades <- c(89,95)
Ex4_grades <- c(95,87)
Ex5_grades <- c(100,99)
Student <- c("Student1","Student2")
Grades <- tibble(Student,Ex1_grades,Ex2_grades,Ex3_grades,Ex4_grades,Ex5_grades)
Grades
## # A tibble: 2 x 6
## Student Ex1_grades Ex2_grades Ex3_grades Ex4_grades Ex5_grades
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Student1 67 90 89 95 100
## 2 Student2 53 89 95 87 99
Let’s say we want the p-value of a one-sample t-test for each student, and we want to place it in this data set as a new column. What’s important to note is that arguments you would normally pass to your function go at the end of the apply() call as separate arguments, after you declare which function you want used.
#Turning off scientific notation formatting
options(scipen = 999)
#Getting index of columns that end with the word "grades"
Gradeindexes <- grep("grades$", names(Grades))
#Using apply to run t.test on each row. The arguments after t.test are passed on to it
testresults <- apply(Grades[,Gradeindexes],1,t.test, alternative = "two.sided", conf.level = 0.95)
#Using do.call to bind p values to the data set. Because results are in a list, we can use lapply and wrap it in "as.vector" for clean transfer into the dataframe
Grades$`P Values` <- as.vector(format(do.call(rbind, lapply(testresults, function(x){x$p.value})), digits = 2))
Grades
## # A tibble: 2 x 7
## Student Ex1_grades Ex2_grades Ex3_grades Ex4_grades Ex5_grades `P Values`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Student1 67 90 89 95 100 0.000098
## 2 Student2 53 89 95 87 99 0.000494
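The way extra arguments are forwarded is easier to see in a smaller case. This is a minimal sketch on a made-up matrix with a missing value; `na.rm` belongs to mean(), not to apply().

```r
# Hypothetical matrix with a missing value
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

# Without the extra argument, the row containing NA yields NA
with_na <- apply(m, 1, mean)

# na.rm = TRUE is not an argument of apply(); it is handed on to mean()
without_na <- apply(m, 1, mean, na.rm = TRUE)

with_na      # 3 NA
without_na   # 3 5
```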
Lapply Function: lapply()
The lapply() function applies a function to every element of a list (or of a vector or data frame, which it treats like a list) and always returns a list.
lapply(X, FUN)
Where:
Argument | Description |
---|---|
X | Data Frame, List, Vector |
FUN | The function you want to be applied to the data frame, list, or vector in question |
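Before the examples, a minimal sketch of the two points above: a data frame is walked column by column, and the result is always a list. The data frame `df` is made up for illustration.

```r
# Hypothetical data frame; lapply() treats it as a list of columns
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

col_ranges <- lapply(df, range)
col_ranges$x          # 1 3
is.list(col_ranges)   # TRUE

# Extra arguments after FUN are passed through, as with apply()
rounded <- lapply(list(pi, exp(1)), round, digits = 2)
```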
Data Frame Example
We can use lapply to make changes to a data frame.
# Changing column names in the "records" data frame to be all CAPS
names(records) <- lapply(names(records),str_to_upper)
records
## CITY CASES CONTROLS
## 1 Buffalo 500 3426
## 2 NYC 2012 5210
## 3 Seattle 1876 6753
## 4 Austin 635 5633
## 5 Orlando 4512 2013
## 6 Minneapolis 823 1890
List Example
We can use lapply to make changes to a list. First we need to create a mock list.
#Creating list from randomly sampled numbers, then adding names from the "fruit" vector that ships with the stringr package (loaded with the tidyverse)
#Setting a seed for reproducibility
set.seed(555)
#Generating the random sample of numbers
stock <- sample(1:50,5)
#Pulling the first five strings of the fruit vector from stringr
fruits <- fruit[1:5]
#Coercing the sampled numbers into a list
Inventory <- as.list(as.numeric(stock))
#Setting the names of each randomly sampled number to each string in our fruits vector
names(Inventory) <- fruits
Inventory
## $apple
## [1] 42
##
## $apricot
## [1] 49
##
## $avocado
## [1] 24
##
## $banana
## [1] 16
##
## $`bell pepper`
## [1] 29
We can alter the list by adding 100 to each fruit’s count and assigning the result back to Inventory:
Inventory <- lapply(Inventory, function(x) (x+100))
Inventory
## $apple
## [1] 142
##
## $apricot
## [1] 149
##
## $avocado
## [1] 124
##
## $banana
## [1] 116
##
## $`bell pepper`
## [1] 129
Mapply Function: mapply()
The mapply() function applies a function to multiple lists or vectors. This can be considered a multivariate version of ‘sapply.’
mapply(FUN, ..., MoreArgs)
Where:
Argument | Description |
---|---|
FUN | The function you want to be applied to the lists or vectors in question |
... | The lists or vectors you want the function applied to, passed as separate arguments; FUN receives one element from each of them per call |
MoreArgs | A list (wrapped in the ‘list()’ function) of additional arguments to pass to the function |
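To show the “multivariate” part, here is a small sketch where the function takes one element from each of two vectors at a time. The vectors `first_names` and `last_names` are made up for illustration.

```r
# Hypothetical vectors of equal length
first_names <- c("John", "Mary", "Harold")
last_names  <- c("Smith", "Jones", "Lee")

# FUN receives one element from each vector per call
full_names <- mapply(function(first, last) paste(first, last),
                     first_names, last_names)

# MoreArgs holds arguments that stay the same for every call
tagged <- mapply(paste, first_names, last_names,
                 MoreArgs = list(sep = "_"), USE.NAMES = FALSE)

tagged   # "John_Smith" "Mary_Jones" "Harold_Lee"
```

`USE.NAMES = FALSE` is only there to keep the second result unnamed; by default mapply() labels the result with the first vector’s values when it is a character vector.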
Multiple List Example
We can use mapply() to alter elements across multiple lists, as opposed to lapply(), which only works within one list. Let’s create multiple lists to test mapply() out.
#Want to take these separate lists, add the last name "Smith" to all the names, then get the final result in one place (a vector)
names1 <- list("John", "Abigail", "Sam","Judy")
names2 <- list("Mary", "Lauri", "Gus")
names3 <- list("Harold", "Peter", "Natalie","Scott","Fatima")
#c() first combines the three lists into one list of 12 names, which mapply() then walks through
`Names List` <- mapply(function(x) paste(x,"Smith"), c(names1,names2,names3))
`Names List`
## [1] "John Smith" "Abigail Smith" "Sam Smith" "Judy Smith" "Mary Smith" "Lauri Smith" "Gus Smith" "Harold Smith" "Peter Smith" "Natalie Smith" "Scott Smith" "Fatima Smith"
Multiple Vector Example
mapply() can be used to vectorize function results from multiple vectors.
Let’s say we have vectors of numbers and we want to know the mean of each of them separately:
#Making mock vectors, setting a seed for reproducibility.
set.seed(321)
#Assigning the vectors
vector1 <- sample(1:100,12)
vector2 <- sample(1:100,5)
vector3 <- sample(1:100,9)
vector1
## [1] 54 77 88 80 58 17 47 11 25 31 82 79
vector2
## [1] 98 75 31 82 36
vector3
## [1] 78 87 34 84 4 48 51 80 13
Because we want summaries (the mean) of each vector, we can use the list() function instead of the c() function. We can use the MoreArgs argument to pass the trim argument to the mean() function. trim already defaults to zero, so passing it here is purely for demonstration.
#Calculating the mean of each vector
vectormeans <- mapply(mean, list(vector1,vector2,vector3), MoreArgs = list(trim = 0))
#Setting names to the results
names(vectormeans) <- c("vector1","vector2","vector3")
vectormeans
## vector1 vector2 vector3
## 54.08333 64.40000 53.22222
Rapply Function: rapply()
The rapply() function is used to apply a function recursively to all elements in a nested list.
rapply(object, f, classes, how)
Where:
Argument | Description |
---|---|
object | Nested list |
f | The function you want to be applied to the nested list in question |
classes | Classes of elements to match on ex: ‘numeric’ , ‘character’ |
how | Controls the form in which the result is returned. Standard options are: ‘unlist’ (the default), ‘replace’, ‘list’ |
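The classes and how arguments are easiest to understand side by side. Below is a minimal sketch on a made-up nested list that mixes numbers and strings; the list `mixed` is an assumption for the example.

```r
# Hypothetical nested list mixing numbers and strings
mixed <- list(a = list(1, "x"), b = list(2.5, "y"))

# Only numeric leaves are multiplied; how = "replace" keeps the others as they are
scaled <- rapply(mixed, function(x) x * 10, classes = "numeric", how = "replace")

# how = "unlist" drops the non-matching leaves and flattens what is left
flat <- rapply(mixed, function(x) x * 10, classes = "numeric", how = "unlist")

scaled$a[[2]]   # "x" is untouched
unname(flat)    # 10 25
```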
Nested list example
Let’s say we have a list of cities that have a list of restaurant types embedded in them:
Restaurantdata <- list("Buffalo" = list("italian","mexican","japanese","puerto rican"),
                       "Seattle" = list("japanese","chinese","southern","steakhouse"),
                       "Miami" = list("seafood","cuban","italian","polish"))
Restaurantdata
## $Buffalo
## $Buffalo[[1]]
## [1] "italian"
##
## $Buffalo[[2]]
## [1] "mexican"
##
## $Buffalo[[3]]
## [1] "japanese"
##
## $Buffalo[[4]]
## [1] "puerto rican"
##
##
## $Seattle
## $Seattle[[1]]
## [1] "japanese"
##
## $Seattle[[2]]
## [1] "chinese"
##
## $Seattle[[3]]
## [1] "southern"
##
## $Seattle[[4]]
## [1] "steakhouse"
##
##
## $Miami
## $Miami[[1]]
## [1] "seafood"
##
## $Miami[[2]]
## [1] "cuban"
##
## $Miami[[3]]
## [1] "italian"
##
## $Miami[[4]]
## [1] "polish"
We want to change all of the elements so that each restaurant type is capitalized. We can do this with either the tools or stringr packages. I’ll use the stringr package for this example. Note that the "replace" option in the how argument returns a copy of the list with the same nested structure and each element replaced by the function’s result. It does not change Restaurantdata in place, so to keep the change we have to assign the result back to the Restaurantdata object.
Restaurantdata <- rapply(Restaurantdata,stringr::str_to_title,how = "replace")
Restaurantdata
## $Buffalo
## $Buffalo[[1]]
## [1] "Italian"
##
## $Buffalo[[2]]
## [1] "Mexican"
##
## $Buffalo[[3]]
## [1] "Japanese"
##
## $Buffalo[[4]]
## [1] "Puerto Rican"
##
##
## $Seattle
## $Seattle[[1]]
## [1] "Japanese"
##
## $Seattle[[2]]
## [1] "Chinese"
##
## $Seattle[[3]]
## [1] "Southern"
##
## $Seattle[[4]]
## [1] "Steakhouse"
##
##
## $Miami
## $Miami[[1]]
## [1] "Seafood"
##
## $Miami[[2]]
## [1] "Cuban"
##
## $Miami[[3]]
## [1] "Italian"
##
## $Miami[[4]]
## [1] "Polish"
We can also get a vector of our results by using the unlist option in the how argument instead. Let’s add the word “restaurants” to each of these elements, then unlist the object to place it in a vector.
Restaurantvector <- rapply(Restaurantdata,function(x) paste(x,"restaurants"),how = "unlist")
Restaurantvector
## Buffalo1 Buffalo2 Buffalo3 Buffalo4 Seattle1 Seattle2 Seattle3 Seattle4 Miami1 Miami2 Miami3 Miami4
## "Italian restaurants" "Mexican restaurants" "Japanese restaurants" "Puerto Rican restaurants" "Japanese restaurants" "Chinese restaurants" "Southern restaurants" "Steakhouse restaurants" "Seafood restaurants" "Cuban restaurants" "Italian restaurants" "Polish restaurants"
Sapply Function: sapply()
The sapply() function is a simpler version of ‘lapply’ that works to apply functions across all elements of lists, data frames, and vectors.
sapply(X, FUN, simplify, USE.NAMES)
Where:
Argument | Description |
---|---|
X | Data Frame, List, Vector |
FUN | The function you want to be applied to the data frame, list, or vector in question |
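simplify | Determines if the result should be simplified to a vector, matrix, or array |
USE.NAMES | If TRUE and X is a character vector without names, the values of X are used as the names of the result |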
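What simplification produces depends on what the function returns for each element. Here is a small sketch with a made-up list (`samples`) showing the three common outcomes, plus USE.NAMES on a character vector.

```r
# Hypothetical named list of numeric vectors
samples <- list(a = c(1, 2, 3), b = c(4, 5, 6))

one_each <- sapply(samples, mean)                    # length-1 results -> named vector
two_each <- sapply(samples, range)                   # length-2 results -> 2 x 2 matrix
ragged   <- sapply(samples, function(x) x[x > 2])    # unequal lengths -> stays a list

# USE.NAMES = TRUE (the default) labels results with a character input's own values
lengths_by_word <- sapply(c("apple", "fig"), nchar)
lengths_by_word   # apple = 5, fig = 3
```

The third case is the one to watch for: the call succeeds, but the result is a list rather than a vector, which is the situation vapply() is designed to catch.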
For this example, let’s work with the records data set from earlier. To keep the arithmetic error-free, we first pull the city names into their own vector and drop the character column:
#Pulling cities names and placing it into a vector
cities <- records$CITY
#Removing the CITY variable from the frame to isolate the numeric values
records <- records[,2:3]
records
## CASES CONTROLS
## 1 500 3426
## 2 2012 5210
## 3 1876 6753
## 4 635 5633
## 5 4512 2013
## 6 823 1890
Let’s divide the numbers in the dataset for each city by 10. We can store the results in a list by setting simplify = FALSE:
#Dividing each number by 10 and storing the results in a list
recordslist <- sapply(records, function(x) x/10, simplify = FALSE)
#Setting the names for each element in the list. We use a for loop to step through the two objects in the list (CASES and CONTROLS) while using the "names" function to copy the city names over
for (i in seq_along(names(recordslist))){
  names(recordslist[[i]]) <- cities
}
recordslist
## $CASES
## Buffalo NYC Seattle Austin Orlando Minneapolis
## 50.0 201.2 187.6 63.5 451.2 82.3
##
## $CONTROLS
## Buffalo NYC Seattle Austin Orlando Minneapolis
## 342.6 521.0 675.3 563.3 201.3 189.0
If we set simplify = TRUE we get a matrix (a two-dimensional array) instead:
#Dividing each number by 10 and simplifying the result into a matrix
recordsarray <- sapply(records, function(x) x/10, simplify = TRUE)
#Setting the row names to the city names
row.names(recordsarray) <- cities
recordsarray
## CASES CONTROLS
## Buffalo 50.0 342.6
## NYC 201.2 521.0
## Seattle 187.6 675.3
## Austin 63.5 563.3
## Orlando 451.2 201.3
## Minneapolis 82.3 189.0
Tapply Function: tapply()
tapply() can be used when we want to apply a function over a ragged/jagged array, that is, a vector whose values fall into groups of varying sizes. It is best used to apply a function across a vector (or a column in a data frame) and produce one result per factor level (category).
tapply(X, INDEX, FUN)
Where:
Argument | Description |
---|---|
X | Data Frame or Vector that can be `split` (has factors/categories to split on) |
INDEX | A list or vector of one or more factors or groupings that is the SAME length as X |
FUN | The function you want to be applied to the data frame or vector in question |
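INDEX can be a single grouping vector or a list of several. The sketch below uses made-up race data (`seconds`, `team`, and `heat` are assumptions for the example) to show both forms.

```r
# Hypothetical race data with two grouping variables
seconds <- c(60, 70, 80, 90, 100, 110)
team    <- c("UK", "UK", "USA", "USA", "UK", "USA")
heat    <- c("A", "B", "A", "B", "A", "B")

# One grouping vector -> named one-dimensional array of means
by_team <- tapply(seconds, team, mean)

# A list of grouping vectors -> matrix with one cell per combination
by_team_heat <- tapply(seconds, list(team, heat), mean)

by_team_heat["UK", "A"]    # 80
by_team_heat["USA", "B"]   # 100
```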
Data Frame Example
We can use tapply() to apply a function across a column in a data frame. Let’s make a simple data frame:
# Setting a seed for reproducibility
set.seed(789)
Teams <- c("UK","USA","Egypt","Ireland","UK","USA","USA")
Seconds <- runif(length(Teams), min=30, max = 240)
Runners <- data.frame(Teams,Seconds)
Runners
## Teams Seconds
## 1 UK 176.97782
## 2 USA 49.63476
## 3 Egypt 32.49623
## 4 Ireland 154.23733
## 5 UK 133.35138
## 6 USA 34.23435
## 7 USA 150.25773
Let’s say we want to calculate the average time (seconds) for each Team. We can use tapply() for this.
Runner_means <- tapply(Runners$Seconds, Runners$Teams, mean)
Runner_means
## Egypt Ireland UK USA
## 32.49623 154.23733 155.16460 78.04228
The results are stored in a one-dimensional array. If a table is preferred, the array can be converted into a data frame:
Runners_summary <- data.frame(Team = names(Runner_means), Mean = Runner_means, row.names = NULL)
Runners_summary
## Team Mean
## 1 Egypt 32.49623
## 2 Ireland 154.23733
## 3 UK 155.16460
## 4 USA 78.04228
Jagged Vectors Example
We can also use tapply() to perform tasks across multiple vectors of different lengths, as long as the number of grouping labels matches the total number of elements across the vectors.
#Create a vector of grouping labels (factors)
Names <- c("Meghan","Gus","Jennifer","Gus","Jennifer","Natalie","Meghan","Jennifer","Gus","Natalie")
Scores1 <- c(90,67,88,99,100)
Scores2 <- c(99,99,78)
Scores3 <- c(100,78)
#Placing them together gives us a vector that has a length of 10. Same as the "Names" vector
Test_scores <- c(Scores1,Scores2,Scores3)
#We can now apply the same function across this vector and get summarized results
Test_averages <- tapply(Test_scores,Names,mean)
Test_averages
## Gus Jennifer Meghan Natalie
## 88.66667 88.66667 94.50000 88.50000
Vapply Function: vapply()
vapply() is similar to ‘sapply’, but you pre-specify the type and length of the value each call must return, which makes it safer and can make it a bit faster.
vapply(X, FUN, FUN.VALUE)
Where:
Argument | Description |
---|---|
X | Data Frame, List, Vector |
FUN | The function you want to be applied to the data frame or vector in question |
FUN.VALUE | A template for the value you want returned |
So let’s look at our records data set again. We can use vapply() to compute the sum of each group (Cases and Controls). Because we set FUN.VALUE, we are guaranteed a numeric result. vapply() is said to be a “safer” alternative to sapply() because it ensures you’re getting the results you are expecting. If you don’t set the FUN.VALUE argument in the function, it will throw an error, whereas sapply() would automatically produce a result:
records
## CASES CONTROLS
## 1 500 3426
## 2 2012 5210
## 3 1876 6753
## 4 635 5633
## 5 4512 2013
## 6 823 1890
#Using vapply to get the sums. numeric(1) is a template telling R that each call must return a numeric vector of length 1
vapply(records, sum, FUN.VALUE = numeric(1))
## CASES CONTROLS
## 10358 24925
#Without the FUN.VALUE argument, vapply will throw an error
vapply(records, sum)
## Error in vapply(records, sum): argument "FUN.VALUE" is missing, with no default
#In comparison, sapply will give us the result without specifying the output type
sapply(records, sum)
## CASES CONTROLS
## 10358 24925
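The “safer” claim shows up most clearly when one element returns something unexpected. This is a minimal sketch with a made-up list (`values`); wrapping the call in tryCatch() is only there so the example runs to completion.

```r
# Hypothetical list where one element yields a different type
values <- list(a = c(1, 2, 3), b = c("x", "y"))

# sapply() quietly returns whatever comes back (here everything becomes character)
loose <- sapply(values, function(x) x[1])
class(loose)   # "character"

# vapply() stops because "x" does not match the numeric(1) template
strict <- tryCatch(
  vapply(values, function(x) x[1], FUN.VALUE = numeric(1)),
  error = function(e) "vapply raised an error"
)
strict   # "vapply raised an error"
```

With sapply() the type change would only surface later, wherever the result is first used as a number.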