Hourly Subway Station Flows


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers].

Pie charts are bad, as any fule kno. We’re not as good at judging relative differences between angles and areas as we are at judging relative differences in lengths on a common baseline. This is especially true when we have more than two things to compare at the same time. So, as a rule, you shouldn’t use them. You should figure out some other way of viewing your data instead. On the other hand, I just made 424 animated pie charts because if you’re going to break a rule you should break it good and hard.

A view of the New York City Subway System (excluding the SIR). We’ll animate this in just a minute.

The New York City Subway system is very large and carries a lot of passengers every day. The MTA makes quite a bit of data available about the subway, including data on hourly flow through the system. Now, the MTA can’t track individual pathways people take through the subway. If you use an OMNY card (or before that, a Metrocard) to enter the system, this signals the start of a trip from some specific station or station complex. But unlike some systems, you don’t need to “tag out” of the subway, you just exit through a turnstile. So the system doesn’t know where you exit it. In addition, while many stations are just on a single line, some (like 34 St/Penn Station, or Fulton Street) are station complexes that serve many lines and allow transfers between them.

However, the MTA does publish hourly Origin-Destination estimates for all pairs of stations. These are their best guess about the flow of traffic from any particular station to any other. Because there are so many combinations, visualizing that sort of data is quite tricky. Even then, you don’t get information about routes through the system, just start and end points.
Transit analysts and planners can go further by introducing additional assumptions about subway users. For example, we might assume that commuters take the most efficient route between any given pair of entry and exit stations, and build from there to a picture of flow through the system.

I do something rather simpler here. I use the MTA’s hourly origin-destination estimates and aggregate them on a station-by-station basis to calculate in-and-out flows across 424 subway stations or station complexes. These specific numbers are averaged over all Mondays in 2025. For each hour of the day we calculate the total passenger volume at the station, and the shares of that volume that are estimated arrivals and departures. Then we draw a pie chart for each station, coloring it yellow for departures and purple for arrivals. The circle size reflects total volume and the pie slice proportions show the flow balance.

The flow data is pretty bulky. The original dataset has about 121 million rows. But working with it is pretty straightforward, thanks to the magic of parquet files, duckdb, and duckplyr. Having patiently downloaded the data via its API, I put it in a parquet file. The CSV is about 17GB but the parquet file boils it down to 1.5GB. Then I made a small R package that bundled that data with a few convenience functions. This lets me use the data without copying it into any single project.
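As a rough illustration of the station-by-station aggregation described above, here is a minimal sketch in base R. It is not the author's code: the column names are taken from the data frame printed in this post, but the toy ridership values and the `flows` table are made up for the example, and it skips the step of averaging across all Mondays in the year.

```r
# Toy origin-destination rows. Column names follow the post's data frame;
# the ridership values are invented for illustration only.
od <- data.frame(
  hour_of_day = c(8, 8, 8, 17),
  origin_station_complex_id = c(189, 313, 189, 611),
  destination_station_complex_id = c(611, 611, 313, 189),
  estimated_average_ridership = c(10, 30, 5, 20)
)

# Departures: estimated riders starting a trip at each station, per hour.
dep <- aggregate(
  estimated_average_ridership ~ hour_of_day + origin_station_complex_id,
  data = od, FUN = sum
)
names(dep) <- c("hour_of_day", "station_id", "departures")

# Arrivals: estimated riders ending a trip at each station, per hour.
arr <- aggregate(
  estimated_average_ridership ~ hour_of_day + destination_station_complex_id,
  data = od, FUN = sum
)
names(arr) <- c("hour_of_day", "station_id", "arrivals")

# Full join, so stations with only arrivals or only departures are kept.
flows <- merge(dep, arr, by = c("hour_of_day", "station_id"), all = TRUE)
flows$departures[is.na(flows$departures)] <- 0
flows$arrivals[is.na(flows$arrivals)] <- 0

# Total volume would drive circle size; the share would drive the pie slices.
flows$total <- flows$departures + flows$arrivals
flows$share_departures <- flows$departures / flows$total
flows <- flows[order(flows$hour_of_day, flows$station_id), ]

print(flows)
```

Each row of `flows` is one station in one hour, which is the unit a single pie chart would display at a single tick of the animation.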
So I can write, e.g.,

```r
nycsubwayodr::nyc_subway_odr()
#> # A duckplyr data frame: 15 variables
#>     year month day_of_week hour_of_day timestamp           day_of_month origin_station_complex_id
#>  1  2025     1 Monday                1 2025-01-06 01:00:00            6                       189
#>  2  2025     1 Monday                1 2025-01-06 01:00:00            6                       313
#>  3  2025     1 Monday                1 2025-01-06 01:00:00            6                       611
#>  4  2025     1 Monday                1 2025-01-06 01:00:00            6                       125
#>  5  2025     1 Monday                1 2025-01-06 01:00:00            6                       313
#>  6  2025     1 Monday                1 2025-01-06 01:00:00            6                       154
#>  7  2025     1 Monday                1 2025-01-06 01:00:00            6                       167
#>  8  2025     1 Monday                1 2025-01-06 01:00:00            6                       612
#>  9  2025     1 Monday                1 2025-01-06 01:00:00            6                       272
#> 10  2025     1 Monday                1 2025-01-06 01:00:00            6                       167
#> # ℹ more rows
#> # ℹ 8 more variables: origin_station_complex_name, origin_latitude, origin_longitude,
#> #   destination_station_complex_id, destination_station_complex_name,
#> #   destination_latitude, destination_longitude, estimated_average_ridership
```

From there, we lazily query the data and duckdb does the work of doing the calculations. The whole table is never loaded into your R session, and duckdb is very fast. Then we take our hourly flow summaries, join them to a tibble of station and line data, and export the result to some JSON files that D3js animates for us.

Here’s the result. There are three views. Initially, you see just the schematic subway map. If you click the “Map” button in the top left, it will switch to the ticking pie-chart view, which puts a pie on every station complex, with each tick being an hour of the day. The pies pile up on one another in the geographic view (in a not wholly uninformative way), but click again to have them expand to a somewhat more abstracted, force-directed network view of the system. Then click again to go back to the map. You can hover over or tap on nodes to get information about the bit of data they’re currently showing.
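To make the "duckdb does the work" idea concrete, here is a small sketch that queries a parquet file directly with SQL, so only the aggregated result comes back into R. It assumes the `DBI` and `duckdb` packages are installed. The toy table, the temporary file, and the query are illustrative assumptions; this is not how the `nycsubwayodr` package is implemented, and the post's own workflow uses duckplyr verbs rather than raw SQL.

```r
library(DBI)

# Toy data standing in for the real origin-destination table (values invented).
od <- data.frame(
  hour_of_day = c(8, 8, 8, 17),
  origin_station_complex_id = c(189, 313, 189, 611),
  destination_station_complex_id = c(611, 611, 313, 189),
  estimated_average_ridership = c(10, 30, 5, 20)
)

# In-memory DuckDB database; write the toy table out as a parquet file.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "od", od)
path <- tempfile(fileext = ".parquet")
dbExecute(con, sprintf("COPY od TO '%s' (FORMAT PARQUET)", path))

# DuckDB scans the parquet file and aggregates it itself.
# Only the small summary table is returned to the R session.
res <- dbGetQuery(con, sprintf(
  "SELECT hour_of_day,
          origin_station_complex_id AS station_id,
          SUM(estimated_average_ridership) AS departures
   FROM read_parquet('%s')
   GROUP BY hour_of_day, origin_station_complex_id
   ORDER BY hour_of_day, station_id",
  path
))

dbDisconnect(con, shutdown = TRUE)
print(res)
```

The same pattern scales to a file far larger than memory, because the full table is never materialized as an R data frame.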
[Interactive chart: animated hourly station flows, with a view toggle button, an hour slider with play/pause, and a legend for Departures and Arrivals.]

Now, you might reasonably say, Kieran, that’s a lot of data to show that people go to work in the morning and come home in the evening. I’m not saying there’s nothing to that criticism. But there are quite a few interesting details in there as the data pick up traffic to different parts of town. The big interchanges naturally dominate the view, but even here there are things of interest about the balance of flow, as e.g. Penn Station has people coming in on New Jersey Transit during morning rush hour and then entering the subway, which does a lot to balance its net flow during rush hour and even tip it towards net departures. But more importantly, who doesn’t want to sit back and contemplate more than 400 pie charts, each one pulsing with life as another hour ticks by?