NOTE: web update Nov 28, 2024. Next year, check the links again. They happened to update it in the middle of this lecture series… :-O
This is a continuation of work with UK government data on road accidents. Here comes a brief summary. For details, please refer to the file data_wrangling_task.Rmd in the folder 2024-11-15and22.
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(readr, warn.conflicts = FALSE, quietly = TRUE)
library(readxl, warn.conflicts = FALSE, quietly = TRUE)
This is the website that contains the casualty datasets.
https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data
This is a cleaned version of the source data. Each row corresponds to one person who was killed or injured in a traffic accident on British roads between 2019 and 2023.
casualties <- read_tsv("all_casualties_labeled.tsv")
Rows: 665408 Columns: 10
── Column specification ─────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): ID, accident_reference, casualty_class, casualty_type, sex_of_casualty, casualty_severity
dbl (4): rowid, accident_year, age_of_casualty, vehicle_reference
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(casualties)
Rows: 665,408
Columns: 10
$ ID <chr> "rc_2019", "rc_2019", "rc_2019", "rc_2019", "rc_2019", "rc_2019", "rc_2019",…
$ rowid <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 2…
$ accident_reference <chr> "010128300", "010128300", "010128300", "010152270", "010155191", "010155192"…
$ accident_year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019…
$ age_of_casualty <dbl> 58, -1, -1, 24, 21, 68, 47, 16, 20, 41, 25, 40, 24, 20, 25, 24, 28, 74, 34, …
$ vehicle_reference <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ casualty_class <chr> "Driver or rider", "Passenger", "Passenger", "Driver or rider", "Passenger",…
$ casualty_type <chr> "Car occupant", "Car occupant", "Car occupant", "Car occupant", "Cyclist", "…
$ sex_of_casualty <chr> "Male", "Female", "Female", "Female", "Female", "Male", "Female", "Female", …
$ casualty_severity <chr> "Slight", "Slight", "Slight", "Slight", "Slight", "Serious", "Slight", "Slig…
Which labels do we have there and how are the casualties distributed within these labels?
casualties %>% group_by(....) %>% count()
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `....` is not found.
Backtrace:
1. casualties %>% group_by(....) %>% count()
4. dplyr:::group_by.data.frame(., ....)
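One possible way to fill in the blank is to count the rows per label of one character column. The sketch below uses a tiny made-up stand-in for casualties (the name toy_casualties and its values are assumptions, not the real data) and picks casualty_severity only as an example; any of the label columns could be used instead.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for `casualties`; the values are invented for illustration.
toy_casualties <- tibble(
  casualty_class = c("Driver or rider", "Passenger", "Passenger",
                     "Driver or rider", "Pedestrian"),
  casualty_severity = c("Slight", "Slight", "Serious", "Slight", "Serious")
)

# count() groups by the named column and returns one row per label,
# with the number of rows for that label in column `n`.
severity_counts <- toy_casualties %>%
  count(casualty_severity, sort = TRUE)

severity_counts
```

Here count(casualty_severity) is shorthand for group_by(casualty_severity) followed by count(), so the pipeline from the exercise would work equally well.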
Bar plot of casualty counts by year
casualties %>%
ggplot(mapping = aes(.......)) +
.....() +
scale_y_continuous(breaks = seq(0, 160000, 20000),
name = "Casualties count") +
scale_x_continuous(name = "Accident year") +
ggtitle(label = "Casualties on British roads broken by year")
Error in .....() : could not find function "....."
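A minimal sketch of how the blanks in such a plot could be filled, assuming that one row is one casualty so that geom_bar() can count the rows per year by itself. The data frame toy_casualties is made up; on the real data the y axis breaks from the template above would be kept.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Toy stand-in for `casualties`: one row per casualty (invented values).
toy_casualties <- tibble(
  accident_year = c(2019, 2019, 2019, 2020, 2020, 2021)
)

# geom_bar() uses stat = "count", so only the x aesthetic is needed.
p_year <- toy_casualties %>%
  ggplot(mapping = aes(x = accident_year)) +
  geom_bar() +
  scale_y_continuous(name = "Casualties count") +
  scale_x_continuous(name = "Accident year", breaks = 2019:2021) +
  ggtitle(label = "Casualties by year (toy data)")

p_year
```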
Hint: Casualties are individual people. There could be more than one casualty in an accident. You need to de-duplicate rows with identical accident ids or aggregate the data frame accordingly.
Computing accidents (not casualties) broken down by year
Plotting accidents broken down by year
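The de-duplication from the hint could be sketched as follows. The toy data are invented, and they deliberately reuse the reference "A1" in two years; whether references repeat across years in the real data is an assumption here, so the sketch de-duplicates on the pair of year and reference to stay on the safe side.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_reference = c("A1", "A1", "A2", "A1", "A3", "A3"),
  accident_year      = c(2019, 2019, 2019, 2020, 2020, 2020)
)

# Keep one row per accident, then count accidents per year.
accidents_by_year <- toy_casualties %>%
  distinct(accident_year, accident_reference) %>%
  count(accident_year, name = "accidents")

# The counts are already computed, so geom_col() is used instead of geom_bar().
p_accidents <- accidents_by_year %>%
  ggplot(mapping = aes(x = accident_year, y = accidents)) +
  geom_col() +
  scale_x_continuous(name = "Accident year", breaks = 2019:2020) +
  scale_y_continuous(name = "Accidents count")

accidents_by_year
```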
Casualty class says whether the casualty was a driver or rider, a passenger, or a pedestrian.
Compute casualty_class counts broken down by year
casualties %>% ..... %>% .....
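One way the two blanks could be filled is group_by() followed by count(). A sketch on invented toy data (toy_casualties is a hypothetical stand-in for casualties):

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_year  = c(2019, 2019, 2019, 2020, 2020, 2020),
  casualty_class = c("Driver or rider", "Driver or rider", "Passenger",
                     "Driver or rider", "Pedestrian", "Pedestrian")
)

# One row per combination of year and class, with the row count in `n`.
class_by_year <- toy_casualties %>%
  group_by(accident_year, casualty_class) %>%
  count() %>%
  ungroup()

class_by_year
```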
Draw a bar plot. Play around with the position and fill of the bars.
Count the casualties by years
casualties %>% ..... %>% .....
Draw a bar plot with dodged bars, with the fill color representing casualty_class and the x axis showing years
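A hedged sketch of such a dodged bar plot on invented toy data. Converting the year to a factor is a choice made here so that each year gets its own discrete slot on the x axis; position = "stack" or position = "fill" can be swapped in to compare the alternatives.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_year  = c(2019, 2019, 2019, 2020, 2020, 2020),
  casualty_class = c("Driver or rider", "Driver or rider", "Passenger",
                     "Driver or rider", "Pedestrian", "Pedestrian")
)

# Bars are placed side by side within each year and colored by class.
p_dodge <- toy_casualties %>%
  ggplot(mapping = aes(x = factor(accident_year), fill = casualty_class)) +
  geom_bar(position = "dodge") +
  labs(x = "Accident year", y = "Casualties count", fill = "Casualty class")

p_dodge
```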
Are the casualty age and class distributed in the same way every year? Facet the bar plot by year.
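A possible sketch of the faceted plot, on invented toy data. Two things here are assumptions rather than part of the task: the age bands created with cut() are arbitrary, and negative ages are dropped because -1 in age_of_casualty looks like a code for a missing value.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_year   = c(2019, 2019, 2019, 2019, 2020, 2020),
  age_of_casualty = c(10, 30, 70, -1, 25, 50),
  casualty_class  = c("Pedestrian", "Driver or rider", "Passenger",
                      "Passenger", "Driver or rider", "Driver or rider")
)

# Drop the presumed missing-value code and bin ages into arbitrary bands.
toy_banded <- toy_casualties %>%
  filter(age_of_casualty >= 0) %>%
  mutate(age_band = cut(age_of_casualty,
                        breaks = c(0, 18, 40, 65, Inf),
                        labels = c("0-17", "18-39", "40-64", "65+"),
                        right = FALSE))

# One panel per year; bars show age bands, fill shows the class.
p_facet <- toy_banded %>%
  ggplot(mapping = aes(x = age_band, fill = casualty_class)) +
  geom_bar() +
  facet_wrap(~ accident_year) +
  labs(x = "Age band", y = "Casualties count", fill = "Casualty class")

p_facet
```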
This below would be a heatmap.
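Which variables the heatmap showed is not stated here, so the following is only a sketch of the general technique: aggregate to counts first, then map the count to the fill of geom_tile(). Year and casualty_class are chosen as the two axes purely for illustration, and the toy data are invented.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_year  = c(2019, 2019, 2019, 2020, 2020, 2020),
  casualty_class = c("Driver or rider", "Driver or rider", "Passenger",
                     "Driver or rider", "Pedestrian", "Pedestrian")
)

# One row per tile: the count of casualties per year and class.
tile_counts <- toy_casualties %>%
  count(accident_year, casualty_class)

# The fill color of each tile encodes the count.
p_heat <- tile_counts %>%
  ggplot(mapping = aes(x = factor(accident_year), y = casualty_class, fill = n)) +
  geom_tile() +
  labs(x = "Accident year", y = "Casualty class", fill = "Casualties")

p_heat
```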
Do the age distributions of the casualties differ by year? Create an aggregated table containing for each year the mean, median, 25th percentile, and 75th percentile of casualties’ ages.
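A minimal sketch of such an aggregated table on invented toy data. It assumes that -1 in age_of_casualty stands for a missing age and removes those rows first; if that assumption is wrong for the real data, the filter should be dropped.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_year   = c(2019, 2019, 2019, 2019, 2020, 2020),
  age_of_casualty = c(10, 20, 30, -1, 40, 60)
)

# Per-year summary statistics of age, ignoring the presumed missing code -1.
age_summary <- toy_casualties %>%
  filter(age_of_casualty >= 0) %>%
  group_by(accident_year) %>%
  summarise(
    mean_age   = mean(age_of_casualty),
    median_age = median(age_of_casualty),
    q25_age    = quantile(age_of_casualty, probs = 0.25),
    q75_age    = quantile(age_of_casualty, probs = 0.75),
    .groups = "drop"
  )

age_summary
```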
Draw boxplots or violin plots for the individual years
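A sketch of the boxplot variant on invented toy data, again assuming that negative ages are missing-value codes. Replacing geom_boxplot() with geom_violin() would give the violin version, which needs more observations per year than this toy example has to look meaningful.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  accident_year   = c(2019, 2019, 2019, 2019, 2020, 2020, 2020),
  age_of_casualty = c(10, 20, 30, -1, 40, 50, 60)
)

# The year is converted to a factor so that each year gets its own box.
p_box <- toy_casualties %>%
  filter(age_of_casualty >= 0) %>%
  ggplot(mapping = aes(x = factor(accident_year), y = age_of_casualty)) +
  geom_boxplot() +
  labs(x = "Accident year", y = "Age of casualty")

p_box
```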
Are casualty_class and casualty_type associated? Just tabulate it and interpret by eyeballing.
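Two simple ways to tabulate the pair of columns, sketched on invented toy data (the casualty_type labels other than "Car occupant" and "Cyclist" are made up here). The long format from count() is convenient for further piping, while the wide contingency table from table() is easier to eyeball.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for `casualties` (invented values).
toy_casualties <- tibble(
  casualty_class = c("Driver or rider", "Driver or rider", "Driver or rider",
                     "Passenger", "Pedestrian", "Pedestrian"),
  casualty_type  = c("Car occupant", "Car occupant", "Cyclist",
                     "Car occupant", "Pedestrian", "Pedestrian")
)

# Long format: one row per observed combination.
class_type_counts <- toy_casualties %>%
  count(casualty_class, casualty_type)

# Wide format: classes in rows, types in columns, zeros for unobserved pairs.
class_type_table <- with(toy_casualties, table(casualty_class, casualty_type))

class_type_counts
class_type_table
```

If most of the counts sit in a few cells and the rest are zero, as in this toy table, the two variables are strongly associated.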
The worst accident - who were the casualties?
Find the worst accident (max count of casualties)
aggreg <- casualties %>%
group_by(accident_reference) %>%
count() %>%
ungroup()
glimpse(aggreg)
Rows: 519,549
Columns: 2
$ accident_reference <chr> "010128300", "010152270", "010155191", "010155192", "010155194", "010155195"…
$ n <int> 3, 1, 1, 1, 2, 3, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1…
max_reference <- aggreg %>%
slice_max(order_by = n, n = 1) %>%
pull(accident_reference)
Create a separate data frame of casualties from this accident and present as many relevant details about them as you can find.
worst_accident <- casualties %>% filter(accident_reference == max_reference)
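A caveat worth keeping in mind: slice_max() keeps ties by default (with_ties = TRUE), so the pulled reference can have more than one element, and a comparison with == would then recycle the vector instead of matching each value. Whether the real data contain such a tie is not checked here; the sketch below only demonstrates the behaviour on invented toy data and shows %in% as a filter that stays correct in both cases.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for `casualties`: A1 and A2 tie for the maximum (invented values).
toy_casualties <- tibble(
  accident_reference = c("A1", "A1", "A2", "A2", "A3"),
  age_of_casualty    = c(30, 41, 19, 22, 57)
)

toy_aggreg <- toy_casualties %>%
  count(accident_reference)

# Default with_ties = TRUE: both tied references are returned.
tied_refs <- toy_aggreg %>%
  slice_max(order_by = n, n = 1) %>%
  pull(accident_reference)

# %in% works for one reference as well as for several.
toy_worst <- toy_casualties %>%
  filter(accident_reference %in% tied_refs)

# Alternative: force exactly one reference, breaking the tie by row order.
single_ref <- toy_aggreg %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  pull(accident_reference)
```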
Model the age and sex of casualties
What was the age and sex of the unfortunate motorbike rider? We know from looking at the filtered data that the person was on Vehicle 1.
worst_accident %>% ....
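One way the blank could be filled is to filter on the vehicle and on the casualty class and then select the two columns of interest. The sketch runs on an invented stand-in called toy_worst, and it assumes that the rider is the only "Driver or rider" on vehicle 1; the exact casualty_type label for motorcyclists is not shown above, so it is not used as a filter.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for `worst_accident` (invented values).
toy_worst <- tibble(
  vehicle_reference = c(1, 2, 2, 2),
  casualty_class    = c("Driver or rider", "Driver or rider",
                        "Passenger", "Passenger"),
  age_of_casualty   = c(34, 51, 12, 9),
  sex_of_casualty   = c("Male", "Female", "Female", "Male")
)

# Keep the driver or rider of vehicle 1 and show age and sex only.
rider <- toy_worst %>%
  filter(vehicle_reference == 1, casualty_class == "Driver or rider") %>%
  select(age_of_casualty, sex_of_casualty)

rider
```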