Sharla Gelfand

Tidying the TTC

Hello and welcome to another episode of tidy that data!

I enjoyed my last data tidying post so much and I’ve tweeted a couple of times about the immense satisfaction I get from tidying data sets – so this is long overdue.

I’m returning to the City of Toronto Open Data catalogue (FYI – they just launched their new open data portal, which includes a data blog, better catalogue navigation, a developer API, in-portal data exploration, and more! congratulations to the team 🎉), this time looking at Toronto Transit Commission (TTC), aka public transit, data.

This specific data set is called “TTC Ridership Analysis”. As the source says, the “TTC Ridership Analysis data set measures the first point of payment when boarding at the start of a journey using the TTC. Data includes ticket types as well as the type of vehicle.” Since it is an analysis data set, some aggregation and analysis has already been done to get it in its current format.

Unfortunately, this is not the format I want 😈

This data set actually contains three-in-one – who, where, and when.

Within the “who” data set, there are main headings (adult, senior/student, children) and subheadings (tokens, tickets, monthly pass, etc) as well as a additional section that doesn’t contain a main heading at all (day/vist./other, blind/war amps, gta pass, etc). These headings are all in the same column. There are sub-totals for each “who” section and a “system total” at the end. The “where” and “when” sections follow a similar pattern, with main and sub-headings and totals.

There is also some preamble and “postamble” discussing the data set. Because of the preamble, I expect that we won’t get nice column headings when importing the data set.

That’s okay, this is what R was made for 💪

(this too)

Let’s load the data in.

library(readxl)
library(knitr)

ttc <- read_excel("data/tidying-ttc/1985-2017 Analysis of ridership.xlsx")
TORONTO TRANSIT COMMISSION X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9
ANALYSIS OF RIDERSHIP NA NA NA NA NA NA NA NA NA
1985 TO 2017 ACTUALS (000’S) NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA
NA FARE MEDIA 2017 2016 2015 * 2014 2013 2012 2011 2010
WHO ADULT NA NA NA NA NA NA NA NA
NA TOKENS 76106 102073 110945 111157 112360 117962 124748 120366
NA TICKETS N/A N/A N/A N/A N/A N/A N/A 1298
NA TWO-FARE N/A N/A N/A N/A N/A N/A N/A N/A
NA PRESTO - SINGLE RIDE 67829 27397 13323 9862 8194 4399 1139 0
NA PRESTO - SRVM TOKEN RIDE 1271 1157 N/A N/A N/A N/A N/A N/A

As expected, the first few rows are the preamble, and the first column gives the who/where/when specification. The second column of data includes both the main headings (adult) as well as sub headings (tokens, tickets, etc). All of the columns are characters, and the missing values (“N/A”) read in as strings instead of missing values.

We’ll get started by replacing those “N/A” strings with literal NA (missing values), and converting everything to lowercase so it doesn’t feel like the data set is yelling at us 📣. For the NA replacement, I’m using Nicholas Tierney’s naniar package, designed to deal with missing data. This vignette specifically is a great walk through of how to replace values with NA.

library(dplyr)
library(naniar)

ttc <- ttc %>%
  replace_with_na_all(condition = ~.x == "N/A") %>%
  mutate_all(.funs = tolower)
TORONTO TRANSIT COMMISSION X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9
analysis of ridership NA NA NA NA NA NA NA NA NA
1985 to 2017 actuals (000’s) NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA
NA fare media 2017 2016 2015 * 2014 2013 2012 2011 2010
who adult NA NA NA NA NA NA NA NA
NA tokens 76106 102073 110945 111157 112360 117962 124748 120366
NA tickets NA NA NA NA NA NA NA 1298
NA two-fare NA NA NA NA NA NA NA NA
NA presto - single ride 67829 27397 13323 9862 8194 4399 1139 0
NA presto - srvm token ride 1271 1157 NA NA NA NA NA NA

Next, I’m going to remove those first few rows of description and any row that describes a total (it’s just aggregating the other rows). I’m also cleaning up the column names using Sam Firke’s janitor package, since it feels like they’re yelling too.

library(stringr)
library(janitor)

ttc <- ttc[-c(1:3),] %>%
  clean_names() %>%
  filter(!str_detect(x_1, "total")) %>%
  mutate(id = row_number())
id toronto_transit_commission x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8
1 NA fare media 2017 2016 2015 * 2014 2013 2012 2011
2 who adult NA NA NA NA NA NA NA
3 NA tokens 76106 102073 110945 111157 112360 117962 124748
4 NA tickets NA NA NA NA NA NA NA
5 NA two-fare NA NA NA NA NA NA NA
6 NA presto - single ride 67829 27397 13323 9862 8194 4399 1139

At this point, I’ve also added in an id column that’s just the row number. I’ll use this to figure out how to cutoff and split the various who/where/when data sets. To do this, I’m getting the id (i.e., row) where the “who”, “where”, and “when” sections start by seeing, for example, where toronto_transit_commission == "who".

who_start <- ttc %>%
  filter(toronto_transit_commission == "who") %>%
  pull(id)

where_start <- ttc %>%
  filter(toronto_transit_commission == "where") %>%
  pull(id)

when_start <- ttc %>%
  filter(toronto_transit_commission == "when") %>%
  pull(id)

From this, we can see that row 2 is where the “who” data set begins, and rows 34 and 41 for the “where” and “when” data sets, respectively.

ttc_who <- ttc[who_start:(where_start - 1), -1]
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
adult NA NA NA NA NA NA NA NA NA
tokens 76106 102073 110945 111157 112360 117962 124748 120366 114686
tickets NA NA NA NA NA NA NA 1298 8807
two-fare NA NA NA NA NA NA NA NA NA
presto - single ride 67829 27397 13323 9862 8194 4399 1139 0 0
presto - srvm token ride 1271 1157 NA NA NA NA NA NA NA

Since this contains a few different sections (adult, senior/student, etc), I’m repeating a similar process to understand where each of those sections starts. If you know of a more clever way to do this, let me know!

ttc_who <- ttc_who %>%
  mutate(id = row_number())

adult_start <- ttc_who %>%
  filter(x_1 == "adult") %>%
  pull(id)

senior_student_start <- ttc_who %>%
  filter(x_1 == "senior/student") %>%
  pull(id)

children_start <- ttc_who %>%
  filter(x_1 == "children") %>%
  pull(id)

remaining_start <- ttc_who %>%
  filter(x_1 == "day/vist./other") %>%
  pull(id)

I’ve recomputed the id field so that it uses the new row numbers from ttc_who, and not from the original ttc sedate. Again, we can see that the adult, senior/student, children, and remaining sections start at rows 1, 14, 23, and 28. Now let’s split them into individual data sets, adding a “who” column that describes the main heading so that data’s not lost.

adult <- ttc_who[(adult_start + 1):(senior_student_start - 1), ] %>%
  mutate(who = "adult")

senior_student <- ttc_who[(senior_student_start + 1):(children_start - 1), ] %>%
  mutate(who = "senior/student")

children <- ttc_who[(children_start + 1):(remaining_start - 1), ] %>%
  mutate(who = "children")
who x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9
adult tokens 76106 102073 110945 111157 112360 117962 124748 120366
adult tickets NA NA NA NA NA NA NA 1298
adult two-fare NA NA NA NA NA NA NA NA
adult presto - single ride 67829 27397 13323 9862 8194 4399 1139 0
adult presto - srvm token ride 1271 1157 NA NA NA NA NA NA
adult presto - srvm cash ride 821 582 NA NA NA NA NA NA

Starting to look a lot better! The “remaining” section is a bit of an anomaly in that its first column actually describes the “who” and not the fare type, so it needs some additional massaging. I’m doing this weird renaming and rearranging so that we can easily stick this together with the other sections.

remaining <- ttc_who[-c(1:(remaining_start -1)), ] %>%
  mutate(who = x_1,
         x_1 = NA)
who x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9
day/vist./other NA 6728 9130 8561 10033 11428 11929 10642 10605
blind/war amps NA 1086 1088 1086 1119 1109 1086 1060 1073
premium express NA 448 474 490 451 401 372 344 322
postal carriers NA NA NA NA NA NA NA NA NA
gta pass NA 4283 4855 5471 6087 5784 5388 5642 5667

Now, we can put the “who” back together and start rearranging, first by giving appropriate names to the columns.

the_who <- adult %>%
  bind_rows(senior_student) %>%
  bind_rows(children) %>%
  bind_rows(remaining) %>%
  select(who, everything(), -id)

colnames(the_who) <- c("who", "fare_type", 2017:1985)
who fare_type 2017 2016 2015 2014 2013 2012 2011 2010
adult tokens 76106 102073 110945 111157 112360 117962 124748 120366
adult tickets NA NA NA NA NA NA NA 1298
adult two-fare NA NA NA NA NA NA NA NA
adult presto - single ride 67829 27397 13323 9862 8194 4399 1139 0
adult presto - srvm token ride 1271 1157 NA NA NA NA NA NA
adult presto - srvm cash ride 821 582 NA NA NA NA NA NA

We’re getting close 🙏. This data is in a good place, but it’s in a wide format – each year’s data is in a different column. This will make it difficult to, for example, plot rides paid with tokens by year. The last step is to convert the data to a long format, so that we have one row for every observation (i.e., one row for every who/fare_type/year combination).

library(tidyr)

the_who <- the_who %>%
  gather(key = year, value = n, -who, -fare_type) %>%
  mutate(year = as.numeric(year),
         n = as.numeric(n)) %>%
  select(year, who, fare_type, n)
year who fare_type n
2017 adult tokens 76106
2017 adult tickets NA
2017 adult two-fare NA
2017 adult presto - single ride 67829
2017 adult presto - srvm token ride 1271
2017 adult presto - srvm cash ride 821

Now we’re done with the “who”! The “where” and “when” follow very similar patterns, so I’ll go through them quickly.

ttc_where <- ttc[(where_start + 1):(when_start - 1), -1] %>%
  filter(x_1 != "rail") %>%
  mutate(where = if_else(x_1 == "bus", "bus", "rail")) %>%
  select(where, x_1, everything())

colnames(ttc_where) <- c("where", "method", 2017:1985)

ttc_where <- ttc_where %>%
  gather(key = year, value = n, -where, -method) %>%
  mutate(year = as.numeric(year),
         n = as.numeric(n)) %>%
  select(year, where, method, n)
year where method n
2017 bus bus 261113
2017 rail subway 213012
2017 rail s.r.t. 3177
2017 rail trolley coach 0
2017 rail streetcar 55914
2016 bus bus 252899
ttc_when <- ttc[when_start:nrow(ttc), -1] %>%
  select(-id)

colnames(ttc_when) = c("when", 2017:1985)

ttc_when <- ttc_when %>%
  gather(key = year, value = n, -when) %>%
  mutate(year = as.numeric(year),
         n = as.numeric(n)) %>%
  select(year, when, n)
year when n
2017 weekday 424155
2017 weekend/holiday 109061
2016 weekday 424117
2016 weekend/holiday 113962
2015 weekday 423808
2015 weekend/holiday 110197

Now that we have all three datasets in a tidy, analyzable format, we can explore.

Let’s start by looking at TTC trips over time, just for adults.

library(plotly)

the_who %>%
  filter(who == "adult") %>%
  group_by(year) %>%
  summarise(trips = sum(n/1000, na.rm = TRUE)) %>%
  plot_ly(x = ~year, y = ~trips) %>%
  add_lines() %>%
  layout(title = "Annual adult TTC trips", 
         yaxis = list(title = "TTC trips (millions)", 
                      rangemode = "tozero"))

We can see immediately that there is a huge drop in adult TTC trips in 1990, dropping from a high of 367 million annual trips to 280 million in 1995. It wasn’t until 2007 that the adult trips reached what they’d been in 1990. I’m not from Toronto, so I did some digging.

I found the TTC’s 1991 annual report which explained that the decline was due to the recession at the time. This may seem obvious if you were around and cognizant then, but I was, ahem, born during this decline. Former premier Mike Harris was also elected in ’95 and cut provincial funding to the TTC. The increase in ridership during the late 90s seems mostly due to the controversial formation of the “Megacity”, the amalgamation of six former municipalities into one “City of Toronto”.

We can also look at what percentage of TTC trips start on the bus, subway, etc. The dataset measures the first point of payment, so we don’t actually see, for example, how many trips were on the bus since you can make a direct transfer a streetcar or subway to the bus.

ttc_where %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ungroup() %>%
  plot_ly(x = ~year, y = ~prop) %>%
  add_lines(color = ~method) %>%
  layout(title = "Percent of trips started by transit type", 
         yaxis = list(title = "", tickformat = ".1%"),
         legend = list(orientation = "h"))

The S.R.T is an extension of the subway line in the east (in Scarborough) – it is a light metro system, and not counted in with the rest of the subway. This line is quite short (6 instead of the 69 stations on the main subway) and contains the least used station, which has no direct bus connection.

I’d be especially interested to see how this data changes for 2018 – in late 2017 we saw the subway extension of Line 1 into Vaughan, with 6 new stations including a stop at York Univesity, which is expected to carry 27,000 passengers a day by 2020 – previously, the primary route to the university was an express bus. The city is also experimenting with the King Street Pilot, intended to make travelling by streetcar less painful (i.e., less stuck in traffic with cars).

There’s a lot more exploration that can be done with this data; this is not intended to be exhaustive, or to shed new light on the operations of the TTC or the history of Toronto. I did learn a fair bit about both in the process (including that the TTC had trolley coaches from 1922 until 1993!) but this is perhaps an indication that I have a lot to learn about my new city and the transit that serves it!

This is also not intended to roast the way this data set has been put together by the TTC – the original data set is in a tabular format that is verbose, quite detailed, and intended for human reading. It is a great overview and analysis of the TTC over the last 32 years. My goal is to demonstrate how we can take that tabular data, and though it looks daunting, transform it into a machine readable format to allow for additional exploration.

👋