Last year I played around a bit with the New York Citibike data, and looked a little at the different use patterns between the sexes, and between subscribers and ad hoc users of the service.
Being an Englishman, I was also wondering if there were differences between the patterns of bike usage on different sides of the Atlantic, so I recently got hold of the 20 million odd records of Barclay Bike data from the Transport for London open data portal.
There are some differences between what has been included in the TfL and Citibike data. Most notably we get less information about who is using the London bikes - the records are not split by user type, so we cannot tell regular users from occasional ones as we can in the NY data. This is a shame, because there are some very interesting patterns of usage that emerge from the NY data when you split by subscribers and occasional users: who seem essentially to be commuters and tourists.
Weather data
There are lots of other datasets available from New York and London which we can combine with these data; one of the most obvious is weather data. Data from weather stations all over the world are available from the NCDC.
I’ve worked with weather data from NCDC before, so already I have some scripts for tidying up this data. A full description of the data is available here, but for simplicity I’m just going to look at daily mean temperature.
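As a small aside on units: as far as I know, GSOD reports daily mean temperature in degrees Fahrenheit and uses 9999.9 as the missing-value code, whereas the temperatures discussed in this post are in degrees Celsius. The post's own code is in R; the following is only a minimal Python sketch of that conversion, and the function and constant names are made up for illustration.

```python
from typing import Optional

# Assumption: GSOD flags a missing mean temperature with 9999.9.
GSOD_MISSING_TEMP = 9999.9


def gsod_temp_to_celsius(temp_f: float) -> Optional[float]:
    """Convert a GSOD daily mean temperature (deg F) to deg C.

    Returns None when the value is the missing-data sentinel.
    """
    if temp_f >= GSOD_MISSING_TEMP:
        return None
    return round((temp_f - 32.0) * 5.0 / 9.0, 1)


print(gsod_temp_to_celsius(68.0))    # 20.0
print(gsod_temp_to_celsius(9999.9))  # None
```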
The code is pretty straightforward, so I’ll skip to some analysis, but as ever, all the code used to render this (and all other posts) can be found on GitHub [1].
Here’s how I’m going to go about it:
- Find active weather stations close to the bike stations.
- Check the coherency of the data from these stations.
- Aggregate bike journeys into number of daily journeys.
- Combine daily bike journeys with ‘global surface summary of the day’ (GSOD) data.
- Make some interesting insights (or something…)
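The aggregation and combination steps in the list above can be sketched roughly as follows. This is not the code behind the post (that is in R); it is a minimal Python illustration using only the standard library, with made-up journey timestamps and temperatures, and hypothetical function names.

```python
from collections import Counter
from datetime import date, datetime

# Made-up journey start times, one entry per bike journey.
journey_starts = [
    datetime(2014, 7, 1, 8, 15),
    datetime(2014, 7, 1, 17, 40),
    datetime(2014, 7, 2, 9, 5),
]

# Made-up daily mean temperatures in deg C, keyed by date.
daily_temp_c = {
    date(2014, 7, 1): 21.3,
    date(2014, 7, 2): 19.8,
    date(2014, 7, 3): 22.0,
}


def daily_counts(starts):
    """Count the number of journeys starting on each calendar day."""
    return Counter(ts.date() for ts in starts)


def join_with_weather(counts, temps):
    """Inner join: keep only days present in both datasets.

    Returns a list of (day, journeys, temp_c) tuples sorted by day.
    """
    return [(day, counts[day], temps[day]) for day in sorted(counts) if day in temps]


combined = join_with_weather(daily_counts(journey_starts), daily_temp_c)
for row in combined:
    print(row)
```

An inner join is used here so that days with no weather record (or no journeys) simply drop out; whether that is the right choice depends on how gappy the station record is.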
Find weather stations
It can be a little hard to find out exactly which stations are operating, and exactly where they are. I’ve found that stations metadata can be a bit out of date, so I tend to go for a brute force approach. For the New York weather data, I pulled all of the data for the corresponding time period from all the stations in New York State.
For the UK you can’t easily limit the data down to the greater London region, so I just pulled all the UK data for the appropriate time period.
New York
So I’ll start by seeing which weather stations are available in the local area. And because I selected a date range when I selected the data to download from the NCDC, I already know that these stations were active for at least part of the time period that we are interested in.
The lines in the plot below originate at the centroid of all the bike hire stations.
So it turns out that there are four weather stations in close proximity to the bike hire stations in New York. One (THE_BATTERY) appears to be at the end of a pier, so its measurements may not be completely representative of conditions within the city.
That said, the other two likely candidates are within Central Park, and the final one, at La Guardia airport, is around 7 km from the middle point of the Citibike stations, so again, this may not be particularly representative.
In the following table I have calculated the linear distance (in metres) from the centroid of the bike stations to the individual weather stations, using code from an excellent post on the subject here. Begin and end refer to the start and end of weather records.
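For readers who want to reproduce this kind of distance calculation, here is a minimal Python sketch. It assumes the haversine (great-circle) formula, which is one common choice and not necessarily the method used in the linked post, and it takes the centroid as a simple mean of latitudes and longitudes, which should be a reasonable approximation at city scale. The coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def centroid(points):
    """Simple mean of (lat, lon) pairs; adequate over a city-sized area."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


# Made-up bike station and weather station coordinates (lat, lon).
bike_stations = [(51.50, -0.12), (51.52, -0.10)]
weather_station = (51.505, -0.13)

centre = centroid(bike_stations)
print(round(haversine_m(*centre, *weather_station)), "m from centroid")
```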
London
So what about London? There are two stations in the locality, one at St. James’s Park, the other slightly further out at London City airport. The St. James’s Park station is less than 800 m from the centroid of the Barclay Bike stations, so it should do nicely.
And London City airport…
Checking the data
Next thing is to have a look at the integrity of the data coming from these stations.
It’s clear that we can discard one of the NY met stations pretty quickly - the Nantucket memorial station has a very incomplete record indeed. And whilst the pattern is very similar, it looks as if the records from the riverside Battery station are a few degrees cooler than the Central Park measurements, so it may not be wise to include measurements from it either.
And what about London? Records from the St. James’s Park and London City airport stations look pretty similar, but the latter is a few km away from the bike stations. Since the St. James’s Park station is so close to the centroid, it makes sense just to use measurements from this station.
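The two checks made by eye above (how complete each record is, and whether one station runs systematically warmer or cooler than another) can also be put into numbers. The following is a minimal Python sketch of one way to do that, not the post's own R code; the station records are made up and the function names are hypothetical.

```python
from datetime import date


def completeness(records, start, end):
    """Fraction of days in [start, end] that have a temperature record."""
    n_days = (end - start).days + 1
    present = sum(1 for day in records if start <= day <= end)
    return present / n_days


def mean_difference(a, b):
    """Mean of (a - b) over the days both stations reported; None if no overlap."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return None
    return sum(a[day] - b[day] for day in shared) / len(shared)


# Made-up daily mean temperatures (deg C) for two stations.
station_a = {
    date(2014, 1, 1): 10.0,
    date(2014, 1, 2): 12.0,
    date(2014, 1, 3): 11.0,
    date(2014, 1, 4): 9.0,
}
station_b = {
    date(2014, 1, 1): 8.0,
    date(2014, 1, 2): 9.0,
}

start, end = date(2014, 1, 1), date(2014, 1, 4)
print(completeness(station_a, start, end))    # 1.0
print(completeness(station_b, start, end))    # 0.5
print(mean_difference(station_a, station_b))  # 2.5
```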
Making some insights
So having combined the weather data with daily journey counts, this is what comes out:
The first thing we can say is that New York has much greater extremes of weather than London - no surprise here (note that the x axes are deliberately not on the same scale).
The response of the bike users to these extremes is much more interesting. In New York, the number of Citibike users drops off after about $20\,^{\circ}\mathrm{C}$, whilst in London we just get the hint of a drop off closer to $25\,^{\circ}\mathrm{C}$, which is more or less the maximum temperature recorded.
Since the NY data also records some information about the type of user, we can drill down a little further.
The male and female subscribers both show a pretty similar pattern, which is obviously driving the shape of the curve we saw in the plot before. The NAs, which are made up of occasional users without subscriptions (whom we might assume to be tourists), show a very different pattern.
Up to the maximum temperature of around $30\,^{\circ}\mathrm{C}$, the number of daily journeys slowly creeps up, and shows none of the steep drop off around $20\,^{\circ}\mathrm{C}$ evident among subscribers. In fact, the pattern is a lot more like that shown in the first plot for London users.
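One simple way to put numbers on a drop off like this is to bin days by temperature and take the mean number of journeys in each bin. The sketch below is a Python illustration under that assumption (the post itself reads the pattern off plots made in R); the data and the 5 degree bin width are made up.

```python
import math
from collections import defaultdict


def mean_journeys_by_temp_bin(days, bin_width=5.0):
    """Mean daily journeys per temperature bin.

    `days` is an iterable of (temp_c, journeys) pairs. Each bin is keyed
    by its lower edge, e.g. 20.0 covers 20 <= temp < 25 for a width of 5.
    """
    sums = defaultdict(lambda: [0, 0])  # lower edge -> [total journeys, n days]
    for temp_c, journeys in days:
        lower = math.floor(temp_c / bin_width) * bin_width
        sums[lower][0] += journeys
        sums[lower][1] += 1
    return {lower: total / n for lower, (total, n) in sorted(sums.items())}


# Made-up (temperature in deg C, daily journeys) pairs.
example_days = [(12.0, 100), (14.0, 200), (21.0, 400), (26.0, 300)]

binned = mean_journeys_by_temp_bin(example_days)
print(binned)  # {10.0: 150.0, 20.0: 400.0, 25.0: 300.0}
```

Running this separately for each user type would give a rough numerical counterpart to the curves described above.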
A whole load of caveats
So this is probably an oversimplistic way of looking at things. Some of the other questions we might consider are: what about other environmental variables such as rainfall, snow depth and wind speed (which are all likely to be correlated with temperature)? And what about the day of the week: maybe there will be different patterns for weekends and working days? I’ll pick these interactions apart in a future post.
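As a pointer for the weekday question, splitting daily counts into working days and weekends takes only a few lines. This is a minimal Python sketch with made-up journey counts and hypothetical function names, not an analysis of the real data; it also ignores public holidays.

```python
from collections import defaultdict
from datetime import date


def is_weekend(day):
    """True for Saturday or Sunday (date.weekday() counts Monday as 0)."""
    return day.weekday() >= 5


def mean_journeys_by_day_type(daily):
    """Mean daily journeys for weekdays and weekends, from a {date: count} dict."""
    groups = defaultdict(list)
    for day, journeys in daily.items():
        groups["weekend" if is_weekend(day) else "weekday"].append(journeys)
    return {kind: sum(vals) / len(vals) for kind, vals in groups.items()}


# Made-up daily journey counts for four consecutive days (Friday to Monday).
example_daily = {
    date(2012, 8, 17): 100,
    date(2012, 8, 18): 60,
    date(2012, 8, 19): 20,
    date(2012, 8, 20): 120,
}

by_type = mean_journeys_by_day_type(example_daily)
print(by_type["weekday"], by_type["weekend"])  # 110.0 40.0
```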
And what about 19 August 2012?
Trying to work out what is behind some of these outliers is an interesting question - like 19 August 2012 in London:
A cursory Google search reveals nothing of note, but there were only a third as many users on bikes compared to the two subsequent years.
It was the hottest day of 2012, and did happen to fall on a Sunday…
but usage was still markedly lower than other warm weekend days in 2012…
2013…
and 2014…
Maybe it was just a nice day for a BBQ?
[1] I cheat a little bit in the source code here because the data is reasonably large and takes time to process, so I have already done some of the processing and saved out to .Rdata or .Rds objects.