#> country year fertility
#> 1 Germany 1960 2.41
#> 2 South Korea 1960 6.16
#> 3 Germany 1961 2.44
#> 4 South Korea 1961 5.99
#> 5 Germany 1962 2.47
#> 6 South Korea 1962 5.79
This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate.
#> country 1960 1961 1962
#> 1 Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79
The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables, year, is stored in the header.
For the tidyverse packages to be used optimally, data need to be reshaped into tidy format.
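As a rough illustration of such reshaping, the wide fertility table shown above can be converted to the tidy layout. This is only a sketch: it assumes the tidyr package is installed and uses its pivot_longer function, and the object names wide and tidy_fertility are made up for this example.

```r
library(tidyr)

# Wide table from the text: one row per country, one column per year.
# check.names = FALSE keeps the numeric column names as they are.
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE
)

# Move the year headers into a column so that each row is one observation
tidy_fertility <- pivot_longer(wide, cols = -country,
                               names_to = "year", values_to = "fertility")

# The years come out as character strings; convert them to integers
tidy_fertility$year <- as.integer(tidy_fertility$year)
tidy_fertility
```

The result has six rows and the three columns country, year, and fertility, matching the first table.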
Examine the built-in dataset co2, which is not tidy: to be tidy, it would have to be wrangled into three columns (year, month, and value), so that each co2 observation has its own row.
Examine the built-in dataset ChickWeight, which is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
Examine the built-in dataset BOD, which is tidy: each row is an observation with two values (time and demand).
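For the co2 case, one possible way to do the wrangling in base R is sketched below. The name co2_tidy is hypothetical, and the year calculation relies on the series starting in January, which holds for co2.

```r
# co2 is a monthly time series (class "ts"), not a data frame
n <- length(co2)

co2_tidy <- data.frame(
  # start(co2)[1] is the first year; every 12 observations advance it by one
  year  = start(co2)[1] + (seq_len(n) - 1) %/% frequency(co2),
  # cycle() gives the position within the year, i.e. the month number
  month = as.integer(cycle(co2)),
  value = as.numeric(co2)
)

head(co2_tidy)
```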
Which of the following built-in datasets is tidy (you can pick more than one):
a. BJsales
b. EuStockMarkets
c. DNase
d. Formaldehyde
e. Orange
f. UCBAdmissions
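Answer: c, d, and e. DNase, Formaldehyde, and Orange are data frames with one observation per row. BJsales is a time series, EuStockMarkets is a multivariate time series with one column per stock index, and UCBAdmissions is a three-dimensional table, so none of these three is tidy.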
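A quick way to start on this exercise is to look at how each candidate is stored. The sketch below uses only base R; the list name candidates is made up here. Being a data frame does not by itself make a dataset tidy, so the rows still need to be inspected, but objects stored as time series or tables are not in tidy format.

```r
candidates <- list(
  BJsales = BJsales,
  EuStockMarkets = EuStockMarkets,
  DNase = DNase,
  Formaldehyde = Formaldehyde,
  Orange = Orange,
  UCBAdmissions = UCBAdmissions
)

# First class of each object (time series, table, data frame subclass, ...)
sapply(candidates, function(d) class(d)[1])

# Which candidates are stored as data frames?
is_df <- sapply(candidates, is.data.frame)
names(is_df)[is_df]
#> [1] "DNase"        "Formaldehyde" "Orange"
```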
For instance, to change the data table by adding a new column, we use mutate. To filter the data table to a subset of rows, we use filter. Finally, to subset the data by selecting specific columns, we use select.
mutate
The function mutate takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values.
library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)
filter
To keep only the rows that satisfy a condition, here a murder rate of at most 0.71, we use the filter function, which takes the data table as the first argument and the conditional statement as the second.
filter(murders, rate <= 0.71)
select
If we want to view just a few columns, we can use the dplyr select function.
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
Unlike select, which is for columns, filter is for rows.
filter(murders, population < 5000000 & region == "Northeast")
Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both conditions: the state is in the Northeast or West, and the murder rate is less than 1. Use select to show only the state name, the rate, and the rank.
%>%
original data → select → filter
In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. So we can define the other arguments as if the first argument were already defined.
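The behavior of the pipe can be checked on a couple of small cases. This is a minimal sketch that assumes dplyr is loaded (it provides %>%); the table toy and its values are invented for illustration and are not part of the murders data.

```r
library(dplyr)

# The left-hand side becomes the first argument of the function on the right
a <- 16 %>% sqrt()                    # same as sqrt(16)
b <- 16 %>% sqrt() %>% log(base = 2)  # same as log(sqrt(16), base = 2)

# original data -> select -> filter, on a small made-up table
toy <- data.frame(
  state  = c("A", "B", "C"),
  region = c("West", "South", "West"),
  rate   = c(0.5, 3.2, 0.9),
  stringsAsFactors = FALSE
)

low_rate <- toy %>%
  select(state, rate) %>%   # toy is passed as the first argument of select
  filter(rate <= 0.71)      # the selected table is passed on to filter
low_rate
```

Only state A has a rate of at most 0.71, so low_rate has a single row.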
summarize
The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code.
library(dplyr)
library(dslabs)
data(heights)
s <- heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#> average standard_deviation
#> 1 64.9 3.76
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
#> rate
#> 1 3.03
pull
Like most dplyr functions, summarize always returns a data frame. When a data frame is piped, its columns can be extracted as plain vectors using the pull function.
us_murder_rate %>% pull(rate)
#> [1] 3.03
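The difference between the data frame returned by summarize and the vector returned by pull can be seen with class. The sketch below uses the built-in mtcars dataset instead of murders so that it runs without dslabs; the names avg_table and avg_value are made up.

```r
library(dplyr)

# summarize returns a data frame, even when it holds a single number
avg_table <- mtcars %>% summarize(avg_mpg = mean(mpg))
class(avg_table)
#> [1] "data.frame"

# pull extracts the column as a plain numeric vector
avg_value <- avg_table %>% pull(avg_mpg)
class(avg_value)
#> [1] "numeric"
```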
group_by
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#> sex average standard_deviation
#> <fct> <dbl> <dbl>
#> 1 Female 64.9 3.76
#> 2 Male 69.3 3.61
We know about the order and sort functions, but for ordering entire tables, the dplyr function arrange is useful. For example, here we order the states by population size:
murders %>% arrange(population) %>% head()
#> state abb region population total rate
#> 1 Wyoming WY West 563626 5 0.887
#> 2 District of Columbia DC South 601723 99 16.453
#> 3 Vermont VT Northeast 625741 2 0.320
#> 4 North Dakota ND North Central 672591 4 0.595
#> 5 Alaska AK West 710231 19 2.675
#> 6 South Dakota SD North Central 814180 8 0.983
murders %>% arrange(region, rate) %>% head()
#> state abb region population total rate
#> 1 Vermont VT Northeast 625741 2 0.320
#> 2 New Hampshire NH Northeast 1316470 5 0.380
#> 3 Maine ME Northeast 1328361 11 0.828
#> 4 Rhode Island RI Northeast 1052567 16 1.520
#> 5 Massachusetts MA Northeast 6547629 118 1.802
#> 6 New York NY Northeast 19378102 517 2.668
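If we want to see the rows with the largest values of a variable rather than just the first few rows, we can use the top_n function. This function takes a data frame as its first argument, the number of rows to show as the second, and the variable to filter by as the third. Here is an example of how to see the 5 rows with the highest rates (the rows are filtered by rate, not sorted by it):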
murders %>% top_n(5, rate)
#> state abb region population total rate
#> 1 District of Columbia DC South 601723 99 16.45
#> 2 Louisiana LA South 4533372 351 7.74
#> 3 Maryland MD South 5773552 293 5.07
#> 4 Missouri MO North Central 5988927 321 5.36
#> 5 South Carolina SC South 4625364 207 4.48
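Because top_n keeps the rows in their original order, a sorted result needs an explicit arrange. The sketch below shows this on a small invented table (scores is not from the source). It also shows slice_max, which the dplyr documentation lists as the replacement for the superseded top_n; this call assumes dplyr 1.0.0 or later.

```r
library(dplyr)

scores <- data.frame(
  id    = c("a", "b", "c", "d", "e", "f"),
  score = c(3, 9, 1, 7, 5, 8),
  stringsAsFactors = FALSE
)

# The three largest scores, still in the original row order: b, d, f
top3 <- scores %>% top_n(3, score)

# Same rows, sorted from largest to smallest: b, f, d
top3_sorted <- scores %>% top_n(3, score) %>% arrange(desc(score))

# Alternative in recent dplyr versions
top3_slice <- scores %>% slice_max(score, n = 3)
```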
murders %>% group_by(region) %>% class()
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
The tbl, pronounced tibble, is a special kind of data frame. The functions group_by and summarize always return this type of data frame. The group_by function returns a special kind of tbl, the grouped_df.
For consistency, the dplyr manipulation verbs (select, filter, mutate, and arrange) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble.
But tibbles are the preferred format in the tidyverse and as a result tidyverse functions that produce a data frame from scratch return a tibble.
Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are three important differences which we describe next.
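The print method for tibbles is more readable than that of a data frame.
Subsets of tibbles are also tibbles. If we subset a single column of a data frame with brackets, we get back a vector, whereas the same operation on a tibble returns a tibble. To extract the column itself from a tibble, we use the $ accessor: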
class(murders[,4])
#> [1] "numeric"
class(as_tibble(murders)[,4])
#> [1] "tbl_df" "tbl" "data.frame"
class(as_tibble(murders)$population)
#> [1] "numeric"
While data frame columns are typically vectors of numbers, strings, or logical values, tibbles can hold more complex objects, such as lists or functions. For example, we can create a tibble with a column of functions:
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
#> # A tibble: 3 x 2
#> id func
#> <dbl> <list>
#> 1 1 <fn>
#> 2 2 <fn>
#> 3 3 <fn>
The function group_by returns a special kind of tibble: a grouped tibble. This class stores information that lets you know which rows are in which groups. The tidyverse functions, in particular the summarize function, are aware of the group information.
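The group information stored in a grouped tibble can be inspected directly. This is a small sketch on the built-in mtcars dataset, using the dplyr helpers group_vars and n_groups; the object names by_cyl and res are made up.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

class(by_cyl)       # includes "grouped_df"
group_vars(by_cyl)  # the grouping variable: "cyl"
n_groups(by_cyl)    # number of groups: 3

# summarize uses the stored groups and returns one row per group
res <- by_cyl %>% summarize(avg_mpg = mean(mpg))
res
```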
tibble instead of data.frame
grades <- tibble(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90))
Note that base R (without packages loaded) has a function with a very similar name, data.frame, that can be used to create a regular data frame rather than a tibble. One other important difference is that, in versions of R before 4.0.0, data.frame coerced characters into factors by default without providing a warning or message. Since R 4.0.0 the default is stringsAsFactors = FALSE, which is why the output below shows a character vector:
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90))
class(grades$names)
#> [1] "character"
In older versions of R, we avoid the coercion with the rather cumbersome argument stringsAsFactors:
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90), stringsAsFactors = FALSE)
class(grades$names)
#> [1] "character"
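Setting stringsAsFactors explicitly makes the behavior independent of the R version. The following base R sketch shows both settings side by side; the object names are made up for this example.

```r
names_vec <- c("John", "Juan", "Jean", "Yao")

# Explicitly request factors (the default before R 4.0.0)
as_factor_df <- data.frame(names = names_vec, stringsAsFactors = TRUE)
class(as_factor_df$names)
#> [1] "factor"

# Explicitly keep character strings (the default since R 4.0.0)
as_char_df <- data.frame(names = names_vec, stringsAsFactors = FALSE)
class(as_char_df$names)
#> [1] "character"
```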
To convert a regular data frame to a tibble, you can use the as_tibble function.
If we want to access a component of the data frame inside a pipeline, we can use the dot operator, which stands for the object being piped.
rates <- filter(murders, region == "South") %>% mutate(rate = total / population * 10^5) %>% .$rate
median(rates)
#> [1] 3.4
do
The do function serves as a bridge between R functions such as quantile and the tidyverse. The do function understands grouped tibbles and always returns a data frame.
First we have to write a function that fits into the tidyverse approach: that is, it receives a data frame and returns a data frame.
my_summary <- function(dat){
  x <- quantile(dat$height, c(0, 0.5, 1))
  tibble(min = x[1], median = x[2], max = x[3])
}
We can now apply the function to the heights dataset to obtain the summaries:
heights %>% group_by(sex) %>% my_summary
#> # A tibble: 1 x 3
#> min median max
#> <dbl> <dbl> <dbl>
#> 1 50 68.5 82.7
But this is not what we want. We want a summary for each sex and the code returned just one summary. This is because my_summary is not part of the tidyverse and does not know how to handle grouped tibbles. do makes this connection:
heights %>% group_by(sex) %>% do(my_summary(.))
#> # A tibble: 2 x 4
#> # Groups: sex [2]
#> sex min median max
#> <fct> <dbl> <dbl> <dbl>
#> 1 Female 51 65.0 79
#> 2 Male 50 69 82.7
Note that here we need to use the dot operator. The tibble created by group_by is piped to do. Within the call to do, the name of this tibble is . and we want to send it to my_summary. If you do not use the dot, then my_summary has no argument and returns an error telling us that argument "dat" is missing. You can see the error by typing:
heights %>% group_by(sex) %>% do(my_summary())
Although for-loops are an important concept to understand, in R we rarely use them. As you learn more R, you will realize that vectorization is preferred over for-loops since it results in shorter and clearer code. A vectorized function is a function that applies the same operation to each element of a vector.
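As a small illustration of this point, the following base R sketch computes the squares of 1 through 10 twice, once with a for-loop and once with vectorized arithmetic. The variable names are chosen for this example only.

```r
x <- 1:10

# For-loop version: allocate the result, then fill it one entry at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: ^ is applied to every element of x at once
squares_vec <- x^2

identical(squares_loop, squares_vec)
#> [1] TRUE
```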
Functionals are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here we cover the functional that operates on numeric, logical, and character vectors: sapply.
The function sapply permits us to perform element-wise operations with any function. Here is how it works:
x <- 1:10
sapply(x, sqrt)
#> [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16
Each element of x is passed on to the function sqrt and the result is returned. These results are concatenated. In this case, the result is a vector of the same length as the original x.
Other functionals are apply, lapply, tapply, mapply, vapply, and replicate. We mostly use sapply, apply, and replicate in this book, but we recommend familiarizing yourselves with the others as they can be very useful.
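To give a feel for a few of these, here is a short base R sketch. The inputs are made up for illustration, and set.seed is used only to make the random draws reproducible.

```r
# apply: run a function over the rows (margin 1) of a matrix
m <- matrix(1:6, nrow = 2)      # rows are (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)    # 9 and 12

# lapply: like sapply, but always returns a list
sq_list <- lapply(1:3, function(i) i^2)

# vapply: like sapply, but the type and length of each result are declared
sq_vec <- vapply(1:3, function(i) i^2, numeric(1))

# replicate: repeat an expression several times, e.g. for simulations
set.seed(1)
draws <- replicate(5, mean(rnorm(10)))
```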
The purrr package includes functions similar to sapply but that better interact with other tidyverse functions. The main advantage is that we can better control the output type of functions. In contrast, sapply can return several different object types; for example, we might expect a numeric result from a line of code, but sapply might convert our result to character under some circumstances. purrr functions will never do this: they will return objects of a specified type or return an error if this is not possible.
The first purrr function we will learn is map, which works very similarly to sapply but always returns a list:
library(purrr)
s_n <- map(n, compute_s_n)
class(s_n)
#> [1] "list"
If we want a numeric vector, we can instead use map_dbl, which always returns a vector of numeric values.
s_n <- map_dbl(n, compute_s_n)
class(s_n)
#> [1] "numeric"
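The calls above rely on n and compute_s_n, which are not defined in these notes. A self-contained version might look like the sketch below, under the assumption that compute_s_n returns the sum 1 + 2 + ... + n and that n is the vector 1:25. Both definitions are assumptions made for this example, as are the names s_list and s_dbl.

```r
library(purrr)

# Assumed definition: sum of the integers from 1 to n
compute_s_n <- function(n) {
  x <- 1:n
  sum(x)
}

# Assumed input: the values of n to evaluate
n <- 1:25

s_list <- map(n, compute_s_n)      # always a list, one element per n
s_dbl  <- map_dbl(n, compute_s_n)  # always a numeric (double) vector

class(s_list)
#> [1] "list"
class(s_dbl)
#> [1] "numeric"
```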
A particularly useful purrr function for interacting with the rest of the tidyverse is map_df, which always returns a tibble data frame. However, the function being called needs to return a vector or a list with names. For this reason, the following code would result in an Argument 1 must have names error:
s_n <- map_df(n, compute_s_n)
We need to change the function to make this work:
compute_s_n <- function(n){
  x <- 1:n
  tibble(sum = sum(x))
}
s_n <- map_df(n, compute_s_n)
case_when
The case_when function is useful for vectorizing conditional statements. It is similar to ifelse but can handle any number of conditions and output values, as opposed to a single TRUE or FALSE test. Here is an example splitting numbers into negative, positive, and 0:
x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative", x > 0 ~ "Positive", TRUE ~ "Zero")
#> [1] "Negative" "Negative" "Zero" "Positive" "Positive"
A common use for this function is to define categorical variables based on existing variables. For example, suppose we want to compare the murder rates in four groups of states: New England, West Coast, South, and other. For each state, we need to ask if it is in New England; if it is not, we ask if it is in the West Coast; if not, we ask if it is in the South; and if not, we assign other. Here is how we use case_when to do this:
murders %>%
  mutate(group = case_when(
    abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
    abb %in% c("WA", "OR", "CA") ~ "West Coast",
    region == "South" ~ "South",
    TRUE ~ "Other")) %>%
  group_by(group) %>%
  summarize(rate = sum(total) / sum(population) * 10^5)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 2
#> group rate
#> <chr> <dbl>
#> 1 New England 1.72
#> 2 Other 2.71
#> 3 South 3.63
#> 4 West Coast 2.90
between
To check whether the elements of a vector x fall between a and b, we can combine two conditions:
x >= a & x <= b
However, this can become cumbersome, especially within the tidyverse approach. The between function performs the same operation.
between(x, a, b)
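The equivalence of the two forms can be checked on a few values. In this sketch x, a, and b are given arbitrary illustrative values; note that between includes both endpoints.

```r
library(dplyr)

x <- c(1, 2, 5, 8, 10)
a <- 2
b <- 8

manual <- x >= a & x <= b         # explicit pair of comparisons
with_between <- between(x, a, b)  # same test, endpoints included

with_between
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
identical(manual, with_between)
#> [1] TRUE
```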
Published: 2024-02-04 20:13:52
Article link: https://www.4u4v.net/it/170715595459247.html