tidyr
, dplyr
, and
ggplot2
in actionif (require(plotly) == FALSE){
install.packages(plotly)
}
if (require(r02pro) == FALSE){
install.packages(r02pro)
}
if (require(forcats) == FALSE) {
install.packages(forcats)
}
library(dplyr)
library(tidyr)
library(readr)
library(gapminder)
library(r02pro)
library(ggplot2)
library(forcats)
This draws on students’ self-study of the DataCamp course Reshaping
Data with tidyr
. (https://campus.datacamp.com/courses/reshaping-data-with-tidyr)
and explains what you might have found difficult in this course. Some
issues do not relate to tidyr
but to plotting or other
aspects of the homework.
We will rename the data frame in each step to keep track of the changes, although normally you would probably do it all in one pipe and in a last step you would assign the whole pipe to a variable name.
forcats
.This is inspired by https://campus.datacamp.com/courses/reshaping-data-with-tidyr/from-wide-to-long-and-back?ex=3
The obesity data used in this task come from the r02pro
library.
Install this library if you don’t have it installed (you probably don’t):
if (require(r02pro) == FALSE) {
install.packages("r02pro")
}
library(r02pro)
You need the gm
dataset (snapshot of one year, but
dichotomized sex).
glimpse(gm)
Rows: 65,531
Columns: 33
$ country <chr> "Albania", "Albania", "Albania", "…
$ year <dbl> 1999, 2000, 2001, 2002, 2003, 2004…
$ smoking_female <dbl> 4.0, NA, 4.0, NA, NA, 4.0, 29.2, N…
$ smoking_male <dbl> NA, NA, 40.5, NA, NA, 40.5, NA, NA…
$ lungcancer_newcases_female <dbl> 11.1, 10.8, 11.2, 11.5, 11.2, 11.0…
$ lungcancer_newcases_male <dbl> 46.1, 43.8, 44.1, 45.2, 44.8, 42.8…
$ owid_edu_idx <dbl> 58.7, 58.0, 60.0, 60.0, 60.7, 60.7…
$ food_supply <dbl> 2730, 2800, 2860, 2770, 2790, 2870…
$ average_daily_income <dbl> 5.96, 6.30, 6.81, 7.22, 7.39, 7.70…
$ sanitation <dbl> 89.5, 90.0, 90.6, 91.2, 91.7, 92.3…
$ child_mortality <dbl> 25.90, 24.50, 23.10, 21.80, 20.40,…
$ income_per_person <dbl> 5890, 6440, 6750, 7150, 7580, 8040…
$ HDI <dbl> 0.671, 0.678, 0.684, 0.691, 0.696,…
$ alcohol_male <dbl> 10.600000, 11.499156, 11.510882, 1…
$ alcohol_female <dbl> 2.5000000, 2.7400844, 2.7389118, 2…
$ livercancer_newcases_male <dbl> 17.80, 17.10, 16.40, 15.70, 15.10,…
$ livercancer_newcases_female <dbl> 8.28, 7.66, 7.10, 6.58, 6.10, 5.66…
$ mortality_male <dbl> 23.5, 19.6, 20.2, 21.2, 20.6, 25.2…
$ mortality_female <dbl> 12.9, 11.0, 12.3, 12.6, 12.0, 14.5…
$ cholesterol_fat_in_blood_male <dbl> 4.90, 4.90, 4.91, 4.91, 4.91, 4.90…
$ cholesterol_fat_in_blood_female <dbl> 4.92, 4.92, 4.92, 4.92, 4.91, 4.90…
$ continent <chr> "Europe", "Europe", "Europe", "Eur…
$ region <chr> "Southern Europe", "Southern Europ…
$ population <dbl> 3130.0, 3130.0, 3130.0, 3120.0, 31…
$ life_expectancy <dbl> 75.4, 76.0, 75.9, 75.6, 75.8, 76.2…
$ sugar <dbl> 63.4, 66.6, 70.7, 67.0, 55.0, 58.6…
$ BMI_female <dbl> 25.1, 25.2, 25.3, 25.3, 25.4, 25.5…
$ BMI_female_group <chr> "pre_obesity", "pre_obesity", "pre…
$ BMI_male <dbl> 25.6, 25.7, 25.8, 25.9, 26.0, 26.1…
$ BMI_male_group <chr> "pre_obesity", "pre_obesity", "pre…
$ health_spending <dbl> 6.35, 6.04, 6.28, 6.16, 6.88, 6.84…
$ GDP_per_capita <dbl> 1.96, 2.14, 2.25, 2.38, 2.52, 2.68…
$ HDI_category <chr> "medium", "medium", "medium", "med…
dplyr
Create the obesity_df
data frame similar to that one in
DataCamp’s exercise by selecting the columns
BMI_female_group
, BMI_male
, year
,
country
, and region
. Remove rows with
NA
s in the BMI_female
, BMI_male
columns.
The BMI_...
columns show the average BMI of the population
in the given country. (The DataCamp data was showing something else,
namely the percentage of obese people in the country) Impute the value
“Other” for NA
in region
.
obesity_df <- gm %>%
select(c(BMI_female, BMI_male, year, country, region)) %>% drop_na(starts_with("BMI")) %>%
replace_na(list(region = "Other"))
Check the regions (sort distinct values alphabetically, just print it)
obesity_df %>%
distinct(region) %>%
arrange(region)
If there are ties, pick the most recent year. You can do it in
several ways. One of them is to create a new aggregated data frame, find
out which year takes most rows and then use this number to filter the
original data frame. You can do it without retyping the year value by
pulling the value out of the data frame in the last step. Another way
would be to mutate this information on the current data frame, filter it
and remove the mutated column again. Either way, call the resulting data
frame obesity_snapshot
.
(maxdf <- obesity_df %>%
group_by(year) %>%
count() %>%
ungroup() %>%
slice_max(n = 1,
order_by = n,
with_ties = TRUE) %>%
slice_max(n = 1, order_by = year,
with_ties = TRUE) %>% # or together like right below
# slice_max(n = 1, order_by = tibble(n,year)) #%>%
pull(year)
)
[1] 2007
obesity_snapshot <- obesity_df %>%
filter(year == maxdf)
obesity_snapshot <- obesity_df %>%
group_by(year) %>%
add_count() %>%
ungroup() %>%
slice_max(n = 1,
order_by = tibble(n, year),
with_ties = TRUE) %>%
select(-n)
Rowwise operations are cumbersome with some functions and
mean
is one of them. This is how you work around that
problem without any additional skill.
obesity_snapshot <- obesity_snapshot %>%
mutate(BMI_both = (BMI_female + BMI_male)/2)
obesity_snapshot %>% slice_head(n = 5)
This worked well because +
is a vectorized operation,
unlike sum
.
When you have to use a non-vectorized function on individual rows,
you can help yourself out with rowwise()
. This would work
wonderfully with sum
, for instance:
obesity_snapshot %>%
rowwise() %>%
mutate(sum_BMI = sum(BMI_male, BMI_female)) %>%
ungroup() %>%
select(matches("BMI")) %>%
slice_head(n = 5)
But mean
takes only one argument. So you have to combine
the two columns into one vector, and this construct then operates for
each row.
obesity_snapshot %>%
rowwise() %>%
mutate(BMI_both = mean(c(BMI_male, BMI_female))) %>%
ungroup() %>% slice_head(n = 5)
NA
ggplot2
By default, ggplot2
orders categorical values
alphabetically. You often want to sort them according to the values in a
numeric column. ggplot2
treats all columns with categorical
variables as factors. Technically, you need to reorder the factor
levels, while plotting (that is, inside ggplot2
or
other plotting code).
For this, you need the forcats
library (part of
tidyverse
). This library generally enables you to rename or
reorder factor levels. Among many other functions, forcats
has a dedicated function for ggplot2
that reorders values
of a data frame column according to another data frame column. It is
called fct_reorder
and is used like this:
obesity_snapshot %>%
ggplot() +
geom_point(aes(x = BMI_both,
y = fct_reorder(country, BMI_both)
))
This is obviously very illegible; nevertheless you can see that the
countries are not sorted alphabetically but according to the
BMI_both
variable.
We want just those countries that are close to the quartiles
selected_1q <- obesity_snapshot %>%
slice_min(BMI_male, prop = 0.25) %>%
slice_max(BMI_male, n = 1, with_ties = FALSE) %>% pull(country)
selected_2q <- obesity_snapshot %>%
slice_min(BMI_male, prop = 0.5) %>%
slice_max(BMI_male, n = 1, with_ties = FALSE) %>% pull(country)
selected <-
obesity_snapshot %>%
ggplot() +
geom_point(aes(x = BMI_both,
y = fct_reorder(country, BMI_both)
)) +
scale_y_discrete(breaks = c(selected_1q, selected_2q))
selected
tidyr
Use tidyr::pivot_longer
with the
names_prefix
argument to directly transform the column
names “BMI_male” and “BMI_female” to values “male” and “female”. Call
this column gender
. Also, call the column where the numeric
BMI values appear, the BMI
column. Relocate the columns so
that the Store this data frame to a variable called
obesity_long
.
obesity_long <- obesity_snapshot %>%
pivot_longer(c("BMI_female", "BMI_male"),
names_prefix = "BMI_",
names_to = "gender",
values_to = "BMI")
obesity_long %>% slice_head(n = 6) %>% select(c(country, gender, matches("BMI")))