Overview
- R with dplyr and tidyr cheat sheet. Whenever I used R for my data analyses, I had to write a lot of code to manipulate my data, and that code was sometimes hard to maintain. Thanks to the dplyr and tidyr packages I no longer need to write long and redundant code. This blog is where I write up some tricks for using dplyr and tidyr.
- With dplyr and tidyr Cheat Sheet. Converting data to the tbl class makes it easier to examine than a plain data frame, because R displays only the data that fits onscreen. dplyr::glimpse(iris) gives an information-dense summary of tbl data. utils::View(iris) opens the data set in a spreadsheet-like display (note the capital V).
How can I manipulate dataframes without repeating myself?
- To be able to use the six main dataframe manipulation ‘verbs’ with pipes in dplyr.
- To understand how group_by() and summarize() can be combined to summarize datasets.
- Be able to analyze a subset of data using logical filtering.
Manipulation of dataframes means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:
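As a minimal sketch of the base R approach, the snippet below computes the mean gdpPercap for two continents. The toy_gapminder data frame is a made-up stand-in for the gapminder data (its values are illustrative, not real), and the names africa_mean and europe_mean are hypothetical.

```r
# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# The same subsetting expression has to be written once per continent
africa_mean <- mean(toy_gapminder[toy_gapminder$continent == "Africa", "gdpPercap"])
europe_mean <- mean(toy_gapminder[toy_gapminder$continent == "Europe", "gdpPercap"])
```

Each extra continent would need one more near-identical line.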
But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
The dplyr package
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.
Here we’re going to cover some of the most commonly used functions as well as using pipes (%>%) to combine them.
select()
filter()
group_by()
summarize()
mutate()
If you have not installed this package earlier, please do so:
Now let’s load the package:
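A minimal sketch of both steps, assuming the package is installed from CRAN. The requireNamespace() guard is an addition here so that the install is skipped when dplyr is already available.

```r
# Install once per machine; skipped if dplyr is already available
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package in every new R session
library(dplyr)
```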
Using select()
If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.
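A minimal sketch of select() called in the ‘normal’ way, with the dataframe as the first argument. The toy_gapminder data frame is a made-up stand-in for the gapminder data (illustrative values only).

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# Keep only the three named columns, in the order given
year_country_gdp <- select(toy_gapminder, year, country, gdpPercap)
```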
If we open up year_country_gdp we’ll see that it only contains the year, country and gdpPercap. Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.
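The same selection written with a pipe might look like the sketch below. As before, toy_gapminder is a made-up stand-in for the gapminder data.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# The dataframe on the left of %>% becomes the first argument of select()
year_country_gdp <- toy_gapminder %>%
  select(year, country, gdpPercap)
```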
To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!
Using filter()
If we now wanted to move forward with the above, but only with European countries, we can combine select and filter.
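A minimal sketch of that combination, using a made-up stand-in for the gapminder data. The result name year_country_gdp_euro is hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# Keep the European rows first, then keep three of the columns
year_country_gdp_euro <- toy_gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)
```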
Challenge 1
Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the African values for lifeExp, country and year, but not for other Continents. How many rows does your dataframe have and why?
Solution to Challenge 1
As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder dataframe to the select() function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.
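The solution described above could be sketched as follows. The data frame is a made-up stand-in for gapminder, so the row count it produces says nothing about the real dataset, and the result name is hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# filter() must come first: select() drops the continent column
year_country_lifeExp_Africa <- toy_gapminder %>%
  filter(continent == "Africa") %>%
  select(year, country, lifeExp)

# One row per African country per year present in the data
nrow(year_country_lifeExp_Africa)
```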
Using group_by() and summarize()
Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent == "Europe"), we can use group_by(), which will essentially use every unique criterion that you could have used in filter.
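A minimal sketch of grouping by continent and inspecting the result, using a made-up stand-in for the gapminder data. The name by_continent is hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

by_continent <- toy_gapminder %>% group_by(continent)

class(toy_gapminder)     # a plain data.frame
class(by_continent)      # now includes "grouped_df"
n_groups(by_continent)   # one group per unique continent value
```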
You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).
Using summarize()
The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().
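A minimal sketch of this split-then-summarize pattern, computing the mean gdpPercap per continent on a made-up stand-in for the gapminder data. The names gdp_bycontinents and mean_gdpPercap are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# mean() runs once per continent group; the result has one row per group
gdp_bycontinents <- toy_gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
```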
That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.
Challenge 2
Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?
Solution to Challenge 2
Another way to do this is to use the dplyr function arrange(), which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has similar syntax to other functions from the dplyr package. You can use desc() inside arrange() to sort in descending order.
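A sketch of the arrange() approach to Challenge 2, run on a made-up stand-in for the gapminder data, so the countries it returns are not the answer for the real dataset. The names lifeExp_bycountry, shortest and longest are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

lifeExp_bycountry <- toy_gapminder %>%
  group_by(country) %>%
  summarize(mean_lifeExp = mean(lifeExp))

# Ascending sort puts the shortest average first
shortest <- lifeExp_bycountry %>% arrange(mean_lifeExp) %>% head(1)

# desc() reverses the order, so the longest average comes first
longest <- lifeExp_bycountry %>% arrange(desc(mean_lifeExp)) %>% head(1)
```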
The function group_by() allows us to group by multiple variables. Let’s group by year and continent.
That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().
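A minimal sketch that groups by two variables and defines two summary columns at once, using a made-up stand-in for the gapminder data. The result and column names are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# One output row per continent/year combination, two new columns
gdp_bycontinents_byyear <- toy_gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap   = sd(gdpPercap))
```

Recent dplyr versions may print a message about how the result is grouped; it is informational and does not change the values.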
count() and n()
A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE:
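A minimal sketch of count() on a made-up stand-in for the gapminder data. The name countries_2002 is hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# One row per continent; the count is stored in a column named n
countries_2002 <- toy_gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)
```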
If we need to use the number of observations in calculations, the n() function is useful. For instance, if we wanted to get the standard error of the life expectancy per continent:
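A minimal sketch, taking the standard error as the standard deviation divided by the square root of the group size. The data frame is a made-up stand-in for gapminder, and the names se_bycontinent and se_le are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# n() returns the number of rows in the current group
se_bycontinent <- toy_gapminder %>%
  group_by(continent) %>%
  summarize(se_le = sd(lifeExp) / sqrt(n()))
```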
You can also chain together several summary operations; in this case calculating the minimum, maximum, mean and se of each continent’s per-country life-expectancy:
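A minimal sketch with four summaries in a single summarize() call, on a made-up stand-in for the gapminder data. The result and column names are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# Each argument of summarize() becomes one column of the result
lifeExp_summary <- toy_gapminder %>%
  group_by(continent) %>%
  summarize(mean_le = mean(lifeExp),
            min_le  = min(lifeExp),
            max_le  = max(lifeExp),
            se_le   = sd(lifeExp) / sqrt(n()))
```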
Using mutate()
We can also create new variables prior to (or even after) summarizing information using mutate().
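A minimal sketch that creates a new column with mutate() before summarizing. The data frame is a made-up stand-in for gapminder; the pop values and the names gdp_billion, gdp_bycontinents and mean_gdp_billion are assumptions for illustration.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  pop       = c(31e6, 33e6, 10e6, 12e6, 8e6, 8e6, 40e6, 44e6),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

gdp_bycontinents <- toy_gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%  # new column, one value per row
  group_by(continent) %>%
  summarize(mean_gdp_billion = mean(gdp_billion))   # then summarize the new column
```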
Connect mutate with logical filtering: ifelse
When creating new variables, we can tie this to a logical condition. A simple combination of mutate() and ifelse() facilitates filtering right where it is needed: in the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the data frame will not change) or for updating values depending on this given condition.
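A minimal sketch of mutate() combined with ifelse(), on a made-up stand-in for the gapminder data. The threshold of 50 and the column name gdpPercap_le_over_50 are arbitrary choices for illustration.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# Rows failing the condition get NA instead of being dropped,
# so the number of rows stays the same
with_condition <- toy_gapminder %>%
  mutate(gdpPercap_le_over_50 = ifelse(lifeExp > 50, gdpPercap, NA))
```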
Combining dplyr and ggplot2
In the plotting lesson we looked at how to make a multi-panel figure by adding a layer of facet panels using ggplot2. Here is the code we used (with some extra comments):
This code makes the right plot but it also creates some variables (starts.with and az.countries) that we might not have any other uses for. Just as we used %>% to pipe data along a chain of dplyr functions we can use it to pass data to ggplot(). Because %>% replaces the first argument in a function we don’t need to specify the data = argument in the ggplot() function. By combining dplyr and ggplot2 functions we can make the same figure without creating any new variables or modifying the data.
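A sketch of what such a pipeline might look like, assuming the figure shows life expectancy over time for countries whose names start with "A" or "Z", one panel per country. That reading, the made-up toy_gapminder data and the object name p are assumptions, not the lesson’s original code.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

# The filtered rows flow straight into ggplot(); no data = argument,
# and no intermediate variables are left behind
p <- toy_gapminder %>%
  filter(substr(country, start = 1, stop = 1) %in% c("A", "Z")) %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  geom_line() +
  facet_wrap(~ country)

print(p)
```

Note that dplyr steps are chained with %>% while ggplot2 layers are still added with +.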
Using dplyr functions also helps us simplify things, for example we could combine the first two steps:
Advanced Challenge
Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions.
Solution to Advanced Challenge
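One possible solution is sketched below on a made-up stand-in for the gapminder data. The toy data has only two countries per continent in 2002, so sample_n(2) always picks both; on the real data the result would vary between runs unless a seed is set. The result and column names are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (all values are illustrative)
toy_gapminder <- data.frame(
  country   = rep(c("Algeria", "Zambia", "Austria", "Spain"), each = 2),
  continent = rep(c("Africa", "Europe"), each = 4),
  year      = rep(c(2002, 2007), times = 4),
  lifeExp   = c(71, 72, 39, 42, 79, 80, 80, 81),
  gdpPercap = c(5000, 6000, 1000, 1200, 32000, 36000, 24000, 28000)
)

set.seed(1)  # makes the random draw repeatable

lifeExp_2countries_bycontinents <- toy_gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  sample_n(2) %>%                               # 2 random rows per continent
  summarize(mean_lifeExp = mean(lifeExp)) %>%
  arrange(desc(continent))                      # reverse alphabetical order
```

Newer dplyr releases mark sample_n() as superseded by slice_sample(), although sample_n() still works.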
Other great resources
Key Points
- Use the dplyr package to manipulate dataframes.
- Use select() to choose variables from a dataframe.
- Use filter() to choose data based on values.
- Use group_by() and summarize() to work with subsets of data.
- Use mutate() to create new variables.