NOTE: web update Nov 28, 2024. Next year, check the links again. They happened to update it in the middle of this lecture series… :-O

Introduction

This is a continuation of work with UK government data on Road Accidents. Here comes a brief summary. For details, please refer to the file data_wrangling_task.Rmd in the folder 2024-11-15and22.

library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(readr, warn.conflicts = FALSE, quietly = TRUE)
library(readxl, warn.conflicts = FALSE, quietly = TRUE)

Road Safety Data (Department of Transport)

This is the website that contains the casualty datasets.

https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data

This is a cleaned version of the source data. Each row corresponds to one person who was killed or injured in a traffic accident on British roads between 2019 and 2023.

casualties <- read_tsv("all_casualties_labeled.tsv")
Rows: 665408 Columns: 10
── Column specification ─────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): ID, accident_reference, casualty_class, casualty_type, sex_of_casualty, casualty_severity
dbl (4): rowid, accident_year, age_of_casualty, vehicle_reference

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(casualties)
Rows: 665,408
Columns: 10
$ ID                 <chr> "rc_2019", "rc_2019", "rc_2019", "rc_2019", "rc_2019", "rc_2019", "rc_2019",…
$ rowid              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 2…
$ accident_reference <chr> "010128300", "010128300", "010128300", "010152270", "010155191", "010155192"…
$ accident_year      <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019…
$ age_of_casualty    <dbl> 58, -1, -1, 24, 21, 68, 47, 16, 20, 41, 25, 40, 24, 20, 25, 24, 28, 74, 34, …
$ vehicle_reference  <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ casualty_class     <chr> "Driver or rider", "Passenger", "Passenger", "Driver or rider", "Passenger",…
$ casualty_type      <chr> "Car occupant", "Car occupant", "Car occupant", "Car occupant", "Cyclist", "…
$ sex_of_casualty    <chr> "Male", "Female", "Female", "Female", "Female", "Male", "Female", "Female", …
$ casualty_severity  <chr> "Slight", "Slight", "Slight", "Slight", "Slight", "Serious", "Slight", "Slig…

Which labels do we have there and how are the casualties distributed within these labels?

Casualties by year

casualties %>% group_by(....) %>% count()
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `....` is not found.
Backtrace:
 1. casualties %>% group_by(....) %>% count()
 4. dplyr:::group_by.data.frame(., ....)

Bar plot of casualties counts by year

casualties %>% 
  ggplot(mapping = aes(.......)) + 
  .....() + 
  scale_y_continuous(breaks = seq(0, 160000, 20000), 
                     name = "Casualties count") +
  scale_x_continuous(name = "Accident year") +
  ggtitle(label = "Casualties on British roads broken by year")
Error in .....() : could not find function "....."

Accidents including a casualty broken by years

Hint: Casualties are individual people. There could be more than one casualty in an accident. You need to de-duplicate rows with identical accident ids or aggregate the data frame accordingly.

Computing accidents (not casualties) broken by years

Plotting accidents broken by years

Casualties broken by years and casualty class

Casualty class says whether the casualty was the driver, passenger or pedestrian

Compute casualty_class broken by year

casualties %>% ..... %>% .....

Plot casualties broken by years and casualty classes

Draw a bar plot. Play around with the position and fill of the bars.

Casualty severity

Count the casualties by years

casualties %>% ..... %>% .....

Draw a bar plot with dodged bars with fill color representing the casualty_class and the x axis showing years

Are the casualty age and class distributed in the same way every year? Facet the bar plot by year.

This below would be a heatmap.

Casualty’s age

Do the age distributions of the casualties differ by year? Create an aggregated table containing for each year the mean, median, 25th percentile, and 75th percentile of casualties’ ages.

Draw boxplots or violin plots for the individual years

A question about how the data is labeled: how are casualty_class and casualty_type associated?

Just tabulate it and interpret by eyeballing.

The worst accident - who were the casualties?

Find the worst accident (max count of casualties)

aggreg <- casualties %>% 
  group_by(accident_reference) %>% 
  count() %>% 
  ungroup()
glimpse(aggreg)
Rows: 519,549
Columns: 2
$ accident_reference <chr> "010128300", "010152270", "010155191", "010155192", "010155194", "010155195"…
$ n                  <int> 3, 1, 1, 1, 2, 3, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1…
max_reference <- aggreg %>% 
  slice_max(order_by = n, n = 1) %>% 
  pull(accident_reference)

Create a separate data frame of casualties from this accident and present as many relevant details about them as you can find.

worst_accident <- casualties %>% filter(accident_reference == max_reference)

Model the age and sex of casualties

What was the age and sex of the unfortunate motorbike rider? We know from looking at the filtered data that the person was on Vehicle 1.

worst_accident %>% ....
LS0tDQp0aXRsZTogIkJyaXRpc2ggQ2FzdWFsdGllcyBpbiB0cmFmZmljIGFjY2lkZW50cyINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQo+IE5PVEU6IHdlYiB1cGRhdGUgTm92IDI4LCAyMDI0LiBOZXh0IHllYXIsIGNoZWNrIHRoZSBsaW5rcyBhZ2Fpbi4gVGhleSBoYXBwZW5lZCB0byB1cGRhdGUgIGl0IGluIHRoZSBtaWRkbGUgb2YgdGhpcyBsZWN0dXJlIHNlcmllcy4uLiA6LU8gDQoNCiMgSW50cm9kdWN0aW9uDQpUaGlzIGlzIGEgY29udGludWF0aW9uIG9mIHdvcmsgd2l0aCBVSyBnb3Zlcm5tZW50IGRhdGEgb24gUm9hZCBBY2NpZGVudHMuIA0KSGVyZSBjb21lcyBhIGJyaWVmIHN1bW1hcnkuIEZvciBkZXRhaWxzLCBwbGVhc2UgcmVmZXIgdG8gdGhlIGZpbGUgYGRhdGFfd3JhbmdsaW5nX3Rhc2suUm1kYCBpbiB0aGUgZm9sZGVyIGAyMDI0LTExLTE1YW5kMjJgLiANCg0KYGBge3J9DQpsaWJyYXJ5KGRwbHlyLCB3YXJuLmNvbmZsaWN0cyA9IEZBTFNFLCBxdWlldGx5ID0gVFJVRSkNCmxpYnJhcnkoZ2dwbG90Miwgd2Fybi5jb25mbGljdHMgPSBGQUxTRSwgcXVpZXRseSA9IFRSVUUpDQpsaWJyYXJ5KHJlYWRyLCB3YXJuLmNvbmZsaWN0cyA9IEZBTFNFLCBxdWlldGx5ID0gVFJVRSkNCmxpYnJhcnkocmVhZHhsLCB3YXJuLmNvbmZsaWN0cyA9IEZBTFNFLCBxdWlldGx5ID0gVFJVRSkNCmBgYA0KDQoNCiMjIFJvYWQgU2FmZXR5IERhdGEgKERlcGFydG1lbnQgb2YgVHJhbnNwb3J0KQ0KDQpUaGlzIGlzIHRoZSB3ZWJzaXRlIHRoYXQgY29udGFpbnMgdGhlIGNhc3VhbHR5IGRhdGFzZXRzLiANCg0KPGh0dHBzOi8vd3d3LmRhdGEuZ292LnVrL2RhdGFzZXQvY2I3YWU2ZjAtNGJlNi00OTM1LTkyNzctNDdlNWNlMjRhMTFmL3JvYWQtc2FmZXR5LWRhdGE+DQoNClRoaXMgaXMgYSBjbGVhbmVkIHZlcnNpb24gb2YgdGhlIHNvdXJjZSBkYXRhLiBFYWNoIHJvdyBjb3JyZXNwb25kcyB0byBvbmUgcGVyc29uIHdobyB3YXMga2lsbGVkIG9yIGluanVyZWQgaW4gYSB0cmFmZmljIGFjY2lkZW50IG9uIEJyaXRpc2ggcm9hZHMgYmV0d2VlbiAyMDE5IGFuZCAyMDIzLiANCg0KYGBge3J9DQpjYXN1YWx0aWVzIDwtIHJlYWRfdHN2KCJhbGxfY2FzdWFsdGllc19sYWJlbGVkLnRzdiIpDQpnbGltcHNlKGNhc3VhbHRpZXMpDQpgYGANCg0KV2hpY2ggbGFiZWxzIGRvIHdlIGhhdmUgdGhlcmUgYW5kIGhvdyBhcmUgdGhlIGNhc3VhbHRpZXMgZGlzdHJpYnV0ZWQgd2l0aGluIHRoZXNlIGxhYmVscz8NCg0KIyBDYXN1YWx0aWVzIGJ5IHllYXINCg0KYGBge3J9DQpjYXN1YWx0aWVzICU+JSBncm91cF9ieSguLi4uKSAlPiUgY291bnQoKQ0KYGBgDQpCYXIgcGxvdCBvZiBjYXN1YWx0aWVzIGNvdW50cyBieSB5ZWFyDQoNCmBgYHtyfQ0KY2FzdWFsdGllcyAlPiUgDQogIGdncGxvdChtYXBwaW5nID0gYWVzKC4uLi4uLi4pKSArIA0KICAuLi4uLigpICsgDQogIHNjYWxlX3lfY29udGludW91cyhicmVha3MgPSBzZXEoMCwgMTYwMDAwLCAyMDAwMCksIA0KICAgICAgICAgICAgICAgICAgICAgbmFtZSA9ICJDYXN1YWx0aWVzIGNvdW50IikgKw0KICBzY2FsZV94X2NvbnRpbnVvdXMobmFtZSA9ICJBY2NpZGVudCB5ZWFyIikgKw0KICBnZ3RpdGxlKGxhYmVsID0gIkNhc3VhbHRpZXMgb24gQnJpdGlzaCByb2FkcyBicm9rZW4gYnkgeWVhciIpDQpgYGANCg0KIyBBY2NpZGVudHMgaW5jbHVkaW5nIGEgY2FzdWFsdHkgYnJva2VuIGJ5IHllYXJzDQpIaW50OiBDYXN1YWx0aWVzIGFyZSBpbmRpdmlkdWFsIHBlb3BsZS4gVGhlcmUgY291bGQgYmUgbW9yZSB0aGFuIG9uZSBjYXN1YWx0eSBpbiBhbiBhY2NpZGVudC4gDQpZb3UgbmVlZCB0byBkZS1kdXBsaWNhdGUgcm93cyB3aXRoIGlkZW50aWNhbCBhY2NpZGVudCBpZHMgb3IgYWdncmVnYXRlIHRoZSBkYXRhIGZyYW1lIGFjY29yZGluZ2x5LiANCg0KQ29tcHV0aW5nICphY2NpZGVudHMqIChub3QgY2FzdWFsdGllcykgYnJva2VuIGJ5IHllYXJzDQoNCmBgYHtyfQ0KDQpgYGANCg0KUGxvdHRpbmcgYWNjaWRlbnRzIGJyb2tlbiBieSB5ZWFycw0KDQoNCg0KIyBDYXN1YWx0aWVzIGJyb2tlbiBieSB5ZWFycyBhbmQgY2FzdWFsdHkgY2xhc3MgDQpDYXN1YWx0eSBjbGFzcyBzYXlzIHdoZXRoZXIgdGhlIGNhc3VhbHR5IHdhcyB0aGUgZHJpdmVyLCBwYXNzZW5nZXIgb3IgcGVkZXN0cmlhbg0KDQpDb21wdXRlIGBjYXN1YWx0eV9jbGFzc2AgYnJva2VuIGJ5IHllYXINCmBgYHtyfQ0KY2FzdWFsdGllcyAlPiUgLi4uLi4gJT4lIC4uLi4uDQpgYGANCg0KIyBQbG90IGNhc3VhbHRpZXMgYnJva2VuIGJ5IHllYXJzIGFuZCBjYXN1YWx0eSBjbGFzc2VzDQoNCkRyYXcgYSBiYXIgcGxvdC4gUGxheSBhcm91bmQgd2l0aCB0aGUgcG9zaXRpb24gYW5kIGZpbGwgb2YgdGhlIGJhcnMuICAgDQoNCmBgYHtyfQ0KDQpgYGANCg0KDQojIENhc3VhbHR5IHNldmVyaXR5DQpDb3VudCB0aGUgY2FzdWFsdGllcyBieSB5ZWFycw0KDQpgYGB7cn0NCmNhc3VhbHRpZXMgJT4lIC4uLi4uICU+JSAuLi4uLg0KYGBgDQoNCkRyYXcgYSBiYXIgcGxvdCB3aXRoIGRvZGdlZCBiYXJzIHdpdGggZmlsbCBjb2xvciByZXByZXNlbnRpbmcgdGhlIGBjYXN1YWx0eV9jbGFzc2AgYW5kIHRoZSB4IGF4aXMgc2hvd2luZyB5ZWFycyANCg0KYGBge3J9DQpjYXN1YWx0aWVzICU+JSANCiAgLi4uLi4gKyANCiAgZ2VvbV9iYXIocG9zaXRpb24gPSAiZG9kZ2UiKQ0KDQpgYGANCg0KQXJlIHRoZSBjYXN1YWx0eSBhZ2UgYW5kIGNsYXNzIGRpc3RyaWJ1dGVkIGluIHRoZSBzYW1lIHdheSBldmVyeSB5ZWFyPyBGYWNldCB0aGUgYmFyIHBsb3QgYnkgeWVhci4NCg0KYGBge3J9DQpjYXN1YWx0aWVzICU+JSANCiAgLi4uLiArDQogIGdlb21fYmFyKHBvc2l0aW9uID0gImRvZGdlIikgKw0KICAuLi4uDQpgYGANCg0KVGhpcyBiZWxvdyB3b3VsZCBiZSBhIGhlYXRtYXAuIA0KDQpgYGB7cn0NCmNhc3VhbHRpZXMgJT4lIGdncGxvdChhZXMoeCA9IGNhc3VhbHR5X3NldmVyaXR5LCB5ID0gY2FzdWFsdHlfY2xhc3MpKSArIA0KICBnZW9tX2JpbjJkKCkgDQpgYGANCg0KIyBDYXN1YWx0eSdzIGFnZSANCkRvIHRoZSBhZ2UgZGlzdHJpYnV0aW9ucyBvZiB0aGUgY2FzdWFsdGllcyBkaWZmZXIgYnkgeWVhcj8gDQpDcmVhdGUgYW4gYWdncmVnYXRlZCB0YWJsZSBjb250YWluaW5nIGZvciBlYWNoIHllYXIgdGhlIG1lYW4sIG1lZGlhbiwgMjV0aCBwZXJjZW50aWxlLCBhbmQgNzV0aCBwZXJjZW50aWxlIG9mIGNhc3VhbHRpZXMnIGFnZXMuIA0KDQpgYGB7cn0NCmNhc3VhbHRpZXMgJT4lIGdyb3VwX2J5KGFjY2lkZW50X3llYXIpICU+JSANCiAgc3VtbWFyaXplKG1lYW5fYWdlID0gbWVhbihhZ2Vfb2ZfY2FzdWFsdHkpLCANCiAgICAgICAgICAgIG1lZGlhbl9hZ2UgPSBtZWRpYW4oYWdlX29mX2Nhc3VhbHR5KSwNCiAgICAgICAgICAgIFExX2FnZSA9IHF1YW50aWxlKGFnZV9vZl9jYXN1YWx0eSwgcHJvYnMgPSAwLjI1KSwNCiAgICAgICAgICAgIFEzX2FnZSA9IHF1YW50aWxlKGFnZV9vZl9jYXN1YWx0eSwgcHJvYnMgPSAwLjc1KSkNCmBgYA0KDQpEcmF3IGJveHBsb3RzIG9yIHZpb2xpbiBwbG90cyBmb3IgdGhlIGluZGl2aWR1YWwgeWVhcnMNCg0KYGBge3J9DQoNCmBgYA0KDQoNCiMgQSBxdWVzdGlvbiBhYm91dCBob3cgdGhlIGRhdGEgaXMgbGFiZWxlZDogaG93IGFyZSBgY2FzdWFsdHlfY2xhc3NgIGFuZCBgY2FzdWFsdHlfdHlwZWAgYXNzb2NpYXRlZD8gDQoNCkp1c3QgdGFidWxhdGUgaXQgYW5kIGludGVycHJldCBieSBleWViYWxsaW5nLiANCg0KYGBge3J9DQpjYXN1YWx0aWVzICU+JSBncm91cF9ieShjYXN1YWx0eV9jbGFzcywgY2FzdWFsdHlfdHlwZSkgJT4lIGNvdW50KCkNCmBgYA0KDQpUaGUgd29yc3QgYWNjaWRlbnQgLSB3aG8gd2VyZSB0aGUgY2FzdWFsdGllcz8NCg0KRmluZCB0aGUgd29yc3QgYWNjaWRlbnQgKG1heCBjb3VudCBvZiBjYXN1YWx0aWVzKQ0KDQpgYGB7cn0NCmFnZ3JlZyA8LSBjYXN1YWx0aWVzICU+JSANCiAgZ3JvdXBfYnkoYWNjaWRlbnRfcmVmZXJlbmNlKSAlPiUgDQogIGNvdW50KCkgJT4lIA0KICB1bmdyb3VwKCkNCmdsaW1wc2UoYWdncmVnKQ0KYGBgDQoNCmBgYHtyfQ0KbWF4X3JlZmVyZW5jZSA8LSBhZ2dyZWcgJT4lIA0KICAuLi4uICU+JQ0KICBwdWxsKGFjY2lkZW50X3JlZmVyZW5jZSkNCmBgYA0KDQoNCkNyZWF0ZSBhIHNlcGFyYXRlIGRhdGEgZnJhbWUgb2YgY2FzdWFsdGllcyBmcm9tIHRoaXMgYWNjaWRlbnQgYW5kIHByZXNlbnQgYXMgbWFueSByZWxldmFudCBkZXRhaWxzIGFib3V0IHRoZW0gYXMgeW91IGNhbiBmaW5kLiANCg0KDQpgYGB7cn0NCndvcnN0X2FjY2lkZW50IDwtIGNhc3VhbHRpZXMgJT4lIC4uLi4NCmBgYA0KDQpgYGB7cn0NCndvcnN0X2FjY2lkZW50ICU+JSAuLi4uICU+JSBjb3VudCgpDQpgYGANCmBgYHtyfQ0Kd29yc3RfYWNjaWRlbnQgJT4lIGdyb3VwX2J5KGNhc3VhbHR5X2NsYXNzKSAlPiUgY291bnQoKQ0KYGBgDQoNCmBgYHtyfQ0Kd29yc3RfYWNjaWRlbnQgJT4lIA0KICBncm91cF9ieSh2ZWhpY2xlX3JlZmVyZW5jZSwgY2FzdWFsdHlfY2xhc3MpICU+JSANCiAgY291bnQoKQ0KYGBgDQoNCk1vZGVsIHRoZSBhZ2UgYW5kIHNleCBvZiBjYXN1YWx0aWVzDQoNCmBgYHtyfQ0Kd29yc3RfYWNjaWRlbnQgJT4lIA0KICANCiAgc2NhbGVfeF9jb250aW51b3VzKGJyZWFrcyA9IHNlcSgwLCBtYXgod29yc3RfYWNjaWRlbnQkYWdlX29mX2Nhc3VhbHR5KSwgMikpKw0KICBzY2FsZV95X2NvbnRpbnVvdXMoYnJlYWtzID0gc2VxKDAsIDYsIDEpKQ0KYGBgDQoNCg0KV2hhdCB3YXMgdGhlIGFnZSBhbmQgc2V4IG9mIHRoZSB1bmZvcnR1bmF0ZSBtb3RvcmJpa2UgcmlkZXI/IFdlIGtub3cgZnJvbSBsb29raW5nIGF0IHRoZSBmaWx0ZXJlZCBkYXRhIHRoYXQgdGhlIHBlcnNvbiB3YXMgb24gVmVoaWNsZSAxLiANCg0KYGBge3J9DQp3b3JzdF9hY2NpZGVudCAlPiUgLi4uLg0KYGBgDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg==