i2ds: tidyverse notes

Table of contents

Tidy data
- Exercises
Manipulating data frames
- Adding a column with `mutate`
- Subsetting with `filter`
- Selecting columns with `select`
- Exercises
The pipe: `%>%`
Summarizing data
- `summarize`
- `pull`
- Group then summarize with `group_by`
Sorting data frames
- Nested sorting
- The top n
Tibbles
- Tibbles display better
- Subsets of tibbles are tibbles
- Tibbles can have complex entries
- Tibbles can be grouped
- Create a tibble using `tibble` instead of `data.frame`
The dot operator
`do`
Vectorization and functionals
The purrr package
Tidyverse condition
- `case_when`
- `between`

Tidy data

#>       country year fertility
#> 1     Germany 1960      2.41
#> 2 South Korea 1960      6.16
#> 3     Germany 1961      2.44
#> 4 South Korea 1961      5.99
#> 5     Germany 1962      2.47
#> 6 South Korea 1962      5.79

This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate.

#>       country 1960 1961 1962
#> 1     Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79

The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables, year, is stored in the header.

For the tidyverse packages to be optimally used, data need to be reshaped into tidy format.
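As an illustration of this reshaping, the sketch below rebuilds the wide table shown above and converts it to tidy format. It assumes the tidyr package and its pivot_longer function, which these notes do not otherwise cover; the object names wide and tidy are placeholders.

```r
library(tidyr)

# Wide table copied from the example above: one column per year.
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE  # keep the column names as "1960", "1961", "1962"
)

# Gather the year columns into two variables: year and fertility.
tidy <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "fertility")
tidy$year <- as.integer(tidy$year)  # column names arrive as character strings
tidy
```

Each row of the result holds one observation (a country, a year, and a fertility rate), which matches the tidy layout of the first table.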

Exercises

Examine the built-in dataset co2, which is not tidy: to be tidy we would have to wrangle it to have three columns (year, month and value), then each co2 observation would have a row.
Examine the built-in dataset ChickWeight, which is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
Examine the built-in dataset BOD, which is tidy: each row is an observation with two values (time and demand).
Which of the following built-in datasets is tidy (you can pick more than one):

a. BJsales
b. EuStockMarkets
c. DNase
d. Formaldehyde
e. Orange
f. UCBAdmissions

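Answer: c, d, and e. BJsales is a univariate time series, EuStockMarkets stores one column per stock index (a wide layout), and UCBAdmissions is a three-dimensional table, so none of these three is tidy as stored.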

Manipulating data frames

For instance, to change the data table by adding a new column, we use mutate. To filter the data table to a subset of rows, we use filter. Finally, to subset the data by selecting specific columns, we use select.

Adding a column with `mutate`

The function mutate takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values.

library(dplyr)   # provides mutate, filter, and select
library(dslabs)  # provides the murders dataset
data("murders")
murders <- mutate(murders, rate = total / population * 100000)

Subsetting with `filter`

To keep only the rows that satisfy a condition, we use the filter function, which takes the data table as the first argument and the conditional statement as the second.

filter(murders, rate <= 0.71)

Selecting columns with `select`

If we want to view just a few columns, we can use the dplyr select function.

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)

Unlike select which is for columns, filter is for rows.

Exercises

Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter. Here is an example in which we filter to keep only small states in the Northeast region.

filter(murders, population < 5000000 & region == "Northeast")

Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both the conditions: it is in the Northeast or West and the murder rate is less than 1. Use select to show only the state name, the rate, and the rank.
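One possible solution is sketched below. To stay self-contained it runs on a small made-up stand-in table, toy_murders (hypothetical values, same column names as murders); the same three calls should apply unchanged to the real murders table. Defining rank as rank(-rate), so that the highest rate gets rank 1, is an assumption about what the exercise intends.

```r
library(dplyr)

# Made-up stand-in for murders; the numbers are illustrative only.
toy_murders <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("Northeast", "West", "South", "West"),
  population = c(1e6, 2e6, 4e6, 5e5),
  total = c(5, 40, 10, 3)
)

# Add the rate per 100,000 and a rank in which the highest rate is 1.
toy_murders <- mutate(toy_murders,
                      rate = total / population * 100000,
                      rank = rank(-rate))

# Keep rows in the Northeast or West whose rate is below 1.
my_states <- filter(toy_murders, region %in% c("Northeast", "West") & rate < 1)

# Show only the requested columns.
select(my_states, state, rate, rank)
```

Using %in% avoids writing two separate region comparisons joined by the | operator.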

The pipe: `%>%`

original data → select → filter
In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. So we can define other arguments as if the first argument is already defined.
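To make the first-argument rule concrete, here is a minimal sketch that writes the same computation twice, once with nested calls and once with the pipe. The table toy and its values are made up for illustration.

```r
library(dplyr)

# Made-up table; the numbers are illustrative only.
toy <- data.frame(
  state = c("A", "B", "C"),
  population = c(1e6, 2e6, 5e5),
  total = c(2, 50, 10)
)

# Without the pipe, the calls are nested and read inside-out.
nested <- filter(
  select(mutate(toy, rate = total / population * 1e5), state, rate),
  rate <= 1
)

# With the pipe, each step receives the previous result as its first argument.
piped <- toy %>%
  mutate(rate = total / population * 1e5) %>%
  select(state, rate) %>%
  filter(rate <= 1)

identical(nested, piped)
```

Both versions produce the same data frame; the piped version simply lists the steps in the order they are carried out.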

Summarizing data

`summarize`

The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code.

library(dplyr)
library(dslabs)
data(heights)

s <- heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#>   average standard_deviation
#> 1    64.9               3.76

us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
#>   rate
#> 1 3.03

`pull`

Like most dplyr functions, summarize always returns a data frame, even when the result is a single number. To extract a column of a piped data frame as a plain vector, we can use the pull function.

us_murder_rate %>% pull(rate)
#> [1] 3.03
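The difference between the two return types can be checked with class. The sketch below uses a made-up table, toy, so that it runs without the murders data.

```r
library(dplyr)

# Made-up table; the numbers are illustrative only.
toy <- data.frame(total = c(2, 50, 10), population = c(1e6, 2e6, 5e5))

# summarize returns a data frame with one row and one column.
rate_df <- toy %>% summarize(rate = sum(total) / sum(population) * 1e5)
class(rate_df)

# pull extracts that column as a plain numeric vector.
rate_num <- rate_df %>% pull(rate)
class(rate_num)
```

A numeric vector is what functions outside the tidyverse usually expect, which is when pull is most useful.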

Group then summarize with `group_by`

heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#>   sex    average standard_deviation
#>   <fct>    <dbl>              <dbl>
#> 1 Female    64.9               3.76
#> 2 Male      69.3               3.61

Sorting data frames

We know about the order and sort functions, but for ordering entire tables, the dplyr function arrange is useful. For example, here we order the states by population size:

murders %>% arrange(population) %>% head()
#>                  state abb        region population total   rate
#> 1              Wyoming  WY          West     563626     5  0.887
#> 2 District of Columbia  DC         South     601723    99 16.453
#> 3              Vermont  VT     Northeast     625741     2  0.320
#> 4         North Dakota  ND North Central     672591     4  0.595
#> 5               Alaska  AK          West     710231    19  2.675
#> 6         South Dakota  SD North Central     814180     8  0.983

Nested sorting

murders %>% arrange(region, rate) %>% head()
#>           state abb    region population total  rate
#> 1       Vermont  VT Northeast     625741     2 0.320
#> 2 New Hampshire  NH Northeast    1316470     5 0.380
#> 3         Maine  ME Northeast    1328361    11 0.828
#> 4  Rhode Island  RI Northeast    1052567    16 1.520
#> 5 Massachusetts  MA Northeast    6547629   118 1.802
#> 6      New York  NY Northeast   19378102   517 2.668
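arrange sorts in ascending order by default. The sketch below combines nested sorting with the dplyr helper desc, which is not mentioned in these notes, to put the highest rate first within each region. The table toy is made up for illustration.

```r
library(dplyr)

# Made-up table; the numbers are illustrative only.
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("West", "South", "West", "South"),
  rate = c(2.5, 0.3, 0.9, 4.1)
)

# Sort by region first; within each region, put the highest rate first.
sorted <- toy %>% arrange(region, desc(rate))
sorted
```

The second sorting variable is only consulted to break ties in the first, so the rows stay grouped by region.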

The top n

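Instead of head, which shows only the first six rows, we can use the top_n function to show the rows with the largest values of a variable. This function takes a data frame as its first argument, the number of rows to show as the second, and the variable to filter by as the third. Note that the rows are filtered but not sorted by that variable. Here is an example of how to see the top 5 rows: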

murders %>% top_n(5, rate)
#>                  state abb        region population total  rate
#> 1 District of Columbia  DC         South     601723    99 16.45
#> 2            Louisiana  LA         South    4533372   351  7.74
#> 3             Maryland  MD         South    5773552   293  5.07
#> 4             Missouri  MO North Central    5988927   321  5.36
#> 5       South Carolina  SC         South    4625364   207  4.48
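In the output above the rows are not in order of rate, because top_n only filters. The sketch below shows two ways to get a sorted result on a made-up table, toy. The second uses slice_max, which is available in dplyr 1.0.0 and later and is not part of these notes.

```r
library(dplyr)

# Made-up table; the numbers are illustrative only.
toy <- data.frame(state = c("A", "B", "C", "D"), rate = c(2.5, 0.3, 4.1, 0.9))

# top_n keeps the rows with the two largest rates, in their original order.
kept <- toy %>% top_n(2, rate)

# Adding arrange(desc(rate)) sorts the kept rows from highest to lowest rate.
kept_sorted <- kept %>% arrange(desc(rate))

# slice_max selects the rows and returns them sorted in one step.
via_slice <- toy %>% slice_max(rate, n = 2)
```

Here kept_sorted and via_slice contain the same rows in the same order, while kept preserves the order of the input table.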

Tibbles

murders %>% group_by(region) %>% class()
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

The tbl, pronounced tibble, is a special kind of data frame. The functions group_by and summarize always return this type of data frame. The group_by function returns a special kind of tbl, the grouped_df.
For consistency, the dplyr manipulation verbs (select, filter, mutate, and arrange) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble.
But tibbles are the preferred format in the tidyverse and as a result tidyverse functions that produce a data frame from scratch return a tibble.
Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are three important differences which we describe next.

Tibbles display better

The print method for tibbles is more readable than that of a data frame.

Subsets of tibbles are tibbles

class(murders[,4])
#> [1] "numeric"

class(as_tibble(murders)[,4])
#> [1] "tbl_df"     "tbl"        "data.frame"

class(as_tibble(murders)$population)
#> [1] "numeric"

Tibbles can have complex entries

While data frame columns need to be vectors of numbers, strings, or logical values, tibbles can have more complex objects, such as lists or functions. Also, we can create tibbles with functions:

tibble(id = c(1, 2, 3), func = c(mean, median, sd))
#> # A tibble: 3 x 2
#>      id func  
#>   <dbl> <list>
#> 1     1 <fn>  
#> 2     2 <fn>  
#> 3     3 <fn>

Tibbles can be grouped

The function group_by returns a special kind of tibble: a grouped tibble. This class stores information that lets you know which rows are in which groups. The tidyverse functions, in particular the summarize function, are aware of the group information.

Create a tibble using `tibble` instead of `data.frame`

grades <- tibble(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90))

Note that base R (without packages loaded) has a function with a very similar name, data.frame, that creates a regular data frame rather than a tibble. One other important difference is that in versions of R before 4.0.0, data.frame by default coerced characters into factors without providing a warning or message. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the column stays a character vector, as the output below shows:

grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90))
class(grades$names)
#> [1] "character"

In versions of R before 4.0.0, we avoid this coercion with the rather cumbersome argument stringsAsFactors:

grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90), stringsAsFactors = FALSE)
class(grades$names)
#> [1] "character"

To convert a regular data frame to a tibble, you can use the as_tibble function.

The dot operator

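Within a pipe, the dot (.) is a placeholder for the object being piped. If we want to access a component of the piped data frame, such as a single column, we can use the dot operator.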

rates <- filter(murders, region == "South") %>%
  mutate(rate = total / population * 10^5) %>%
  .$rate
median(rates)
#> [1] 3.4
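As a small check of what the dot stands for, the sketch below extracts a column once with .$rate and once with pull, on a made-up table named toy, and compares the results.

```r
library(dplyr)

# Made-up table; the numbers are illustrative only.
toy <- data.frame(
  region = c("South", "South", "West"),
  total = c(10, 30, 5),
  population = c(1e6, 2e6, 1e6)
)

# The dot stands for the data frame arriving from the left of the pipe.
rates_dot <- toy %>%
  filter(region == "South") %>%
  mutate(rate = total / population * 10^5) %>%
  .$rate

# pull extracts the same column.
rates_pull <- toy %>%
  filter(region == "South") %>%
  mutate(rate = total / population * 10^5) %>%
  pull(rate)

identical(rates_dot, rates_pull)
```

Both approaches return a numeric vector, so either can be passed to functions such as median.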

`do`

The do function serves as a bridge between R functions such as quantile and the tidyverse. The do function understands grouped tibbles and always returns a data frame.
First we have to write a function that fits into the tidyverse approach: that is, it receives a data frame and returns a data frame.

my_summary <- function(dat){
  x <- quantile(dat$height, c(0, 0.5, 1))  # minimum, median, and maximum
  tibble(min = x[1], median = x[2], max = x[3])
}

We can now apply the function to the heights dataset to obtain the summaries:

heights %>% group_by(sex) %>% my_summary
#> # A tibble: 1 x 3
#>     min median   max
#>   <dbl>  <dbl> <dbl>
#> 1    50   68.5  82.7

But this is not what we want. We want a summary for each sex and the code returned just one summary. This is because my_summary is not part of the tidyverse and does not know how to handle grouped tibbles. do makes this connection:

heights %>% group_by(sex) %>% do(my_summary(.))
#> # A tibble: 2 x 4
#> # Groups:   sex [2]
#>   sex      min median   max
#>   <fct>  <dbl>  <dbl> <dbl>
#> 1 Female    51   65.0  79  
#> 2 Male      50   69    82.7

Note that here we need to use the dot operator. The tibble created by group_by is piped to do. Within the call to do, the name of this tibble is . and we want to send it to my_summary. If you do not use the dot, then my_summary has no argument and returns an error telling us that argument "dat" is missing. You can see the error by typing:

heights %>% group_by(sex) %>% do(my_summary())

Vectorization and functionals

Although for-loops are an important concept to understand, in R we rarely use them. As you learn more R, you will realize that vectorization is preferred over for-loops since it results in shorter and clearer code. A vectorized function is a function that applies the same operation to each element of a vector.
Functionals are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here we cover the functional that operates on numeric, logical, and character vectors: sapply.
The function sapply lets us apply any function element-wise. Here is how it works:

x <- 1:10
sapply(x, sqrt)
#>  [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16

Each element of x is passed on to the function sqrt and the result is returned. These results are concatenated. In this case, the result is a vector of the same length as the original x.
Other functionals are apply, lapply, tapply, mapply, vapply, and replicate. We mostly use sapply, apply, and replicate in this book, but we recommend familiarizing yourselves with the others as they can be very useful.
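sapply also works with functions we write ourselves. The sketch below defines a plain version of compute_s_n, a function that returns the sum of the first n positive integers, and applies it to each element of a vector. The choice n <- 1:25 is an illustrative assumption.

```r
# Sum of the first n positive integers.
compute_s_n <- function(n) {
  x <- 1:n
  sum(x)
}

n <- 1:25  # illustrative choice of inputs
s_n <- sapply(n, compute_s_n)

# One result per element of n, matching the closed form n(n + 1)/2.
length(s_n) == length(n)
all(s_n == n * (n + 1) / 2)
```

Because compute_s_n expects a single number, calling compute_s_n(n) directly on the whole vector would not give one sum per element; sapply handles the element-wise application.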

The purrr package

The purrr package includes functions similar to sapply but that better interact with other tidyverse functions. The main advantage is that we can better control the output type of functions. In contrast, sapply can return several different object types; for example, we might expect a numeric result from a line of code, but sapply might convert our result to character under some circumstances. purrr functions will never do this: they will return objects of a specified type or return an error if this is not possible.
The first purrr function we will learn is map, which works very similarly to sapply but always, without exception, returns a list:

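library(purrr)
compute_s_n <- function(n) sum(1:n)  # sum of the first n positive integers
n <- 1:25  # n is not defined in these notes; 1:25 is assumed here
s_n <- map(n, compute_s_n)
class(s_n)
#> [1] "list"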

If we want a numeric vector, we can instead use map_dbl which always returns a vector of numeric values.

s_n <- map_dbl(n, compute_s_n)
class(s_n)
#> [1] "numeric"
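The type guarantee described above can be checked directly. In the sketch below, double_it is a hypothetical helper; the empty-input case is one situation in which sapply changes its return type while map_dbl does not.

```r
library(purrr)

double_it <- function(v) v * 2  # hypothetical helper

# On a non-empty input, sapply simplifies the result to a numeric vector...
class(sapply(c(1, 2), double_it))
# ...but on an empty input it falls back to a list.
class(sapply(numeric(0), double_it))

# map_dbl returns a numeric vector in both cases.
class(map_dbl(c(1, 2), double_it))
class(map_dbl(numeric(0), double_it))

# If the function does not return a number, map_dbl stops with an error
# instead of silently changing the output type.
result <- tryCatch(
  map_dbl(c(1, 2), function(v) "text"),
  error = function(e) "error"
)
result
```

Code that consumes the result can therefore rely on the type without checking it first.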

A particularly useful purrr function for interacting with the rest of the tidyverse is map_df, which always returns a tibble data frame. However, the function being called needs to return a vector or a list with names. For this reason, the following code would result in an Argument 1 must have names error:

s_n <- map_df(n, compute_s_n)

We need to change the function to make this work:

compute_s_n <- function(n){
  x <- 1:n
  tibble(sum = sum(x))  # a one-row tibble, so the result has names
}
s_n <- map_df(n, compute_s_n)

Tidyverse condition

`case_when`

The case_when function is useful for vectorizing conditional statements. It is similar to ifelse but can output any number of values, as opposed to just two (one for TRUE and one for FALSE). Here is an example splitting numbers into negative, positive, and 0:

x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive",
          TRUE ~ "Zero")
#> [1] "Negative" "Negative" "Zero"     "Positive" "Positive"

A common use for this function is to define categorical variables based on existing variables. For example, suppose we want to compare the murder rates in four groups of states: New England, West Coast, South, and other. For each state, we need to ask if it is in New England, if it is not we ask if it is in the West Coast, if not we ask if it is in the South, and if not we assign other. Here is how we use case_when to do this:

murders %>%
  mutate(group = case_when(
    abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
    abb %in% c("WA", "OR", "CA") ~ "West Coast",
    region == "South" ~ "South",
    TRUE ~ "Other")) %>%
  group_by(group) %>%
  summarize(rate = sum(total) / sum(population) * 10^5)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 2
#>   group        rate
#>   <chr>       <dbl>
#> 1 New England  1.72
#> 2 Other        2.71
#> 3 South        3.63
#> 4 West Coast   2.90
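The "if not, then ask the next question" logic means that the order of the conditions matters: each value takes the label of the first condition that is TRUE. The sketch below illustrates this with made-up thresholds and labels.

```r
library(dplyr)

x <- c(5, 15, 25)

# Conditions are checked in order; each value takes the first one that is TRUE.
sizes <- case_when(
  x > 20 ~ "high",
  x > 10 ~ "medium",  # reached only when x > 20 is FALSE
  TRUE   ~ "low"
)
sizes

# With the first two conditions swapped, 25 is labeled "medium",
# because x > 10 is now the first condition it satisfies.
reordered <- case_when(
  x > 10 ~ "medium",
  x > 20 ~ "high",
  TRUE   ~ "low"
)
reordered
```

Putting the most specific condition first, as in the state grouping above, avoids this kind of shadowing.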

`between`

To check whether the elements of a vector x fall between a and b, we can type:

x >= a & x <= b

However, this can become cumbersome, especially within the tidyverse approach. The between function performs the same operation.

between(x, a, b)
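A quick check that the two forms agree, using made-up values for x, a, and b:

```r
library(dplyr)

x <- c(1, 5, 7, 10, 11)
a <- 5
b <- 10

manual <- x >= a & x <= b
with_between <- between(x, a, b)  # both endpoints are included

identical(manual, with_between)
```

Note that the values equal to a and b count as inside the interval in both versions.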

Published: 2024-02-04 20:13:52

Article link: https://www.4u4v.net/it/170715595459247.html
