Available online at www.sciencedirect.com
Procedia Computer Science 00 (2017) 000–000
www.elsevier.com/locate/procedia

The 2nd edition of the International Workshop on Data Mining on IoT Systems (DaMIS)

Prediction of bicycle counter data using regression

Johan Holmgren (a,b,*), Sebastian Aspegren (a), Jonas Dahlström (a)

(a) Department of Computer Science and Media Technology, Malmö University, Malmö 205 06, Sweden
(b) K2 (The Swedish Knowledge Centre for Public Transport)

Abstract

We present a study in which we used regression to predict the number of bicycles registered by a bicycle counter located in Malmö, Sweden. In particular, we compared two regression problems that differ only in their target variables: one uses the absolute number of bicycles as target variable, and the other uses the deviation from a long-term trend estimate of the expected number of bicycles. Our results show that using the trend curve deviation as target variable has the potential to improve the prediction accuracy compared to using the absolute number of bicycles. The results also show that support vector regression (using 2nd and 3rd degree polynomial kernels) and regression trees perform best for our problem.

© 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the Conference Program Chairs.

Keywords: Bicycle counter, regression, trend curve, regression algorithm comparison

1. Introduction

The bicycle has become an important part of urban transport due to its ability to contribute to fast, sustainable, and cost-efficient transport. It also contributes to a healthy, active lifestyle, and its popularity is reflected in the increase in bicycling that can be observed around the world. Due to the positive effects of bicycling, public authorities are increasingly interested in promoting the use of the bicycle.
However, in order to achieve a modal shift towards bicycling (from motorized transport), it is important to increase the attractiveness of the bicycle. This can be achieved by implementing various types of policy measures, including the construction and improvement of biking infrastructure, such as bicycle lanes and safe parking facilities. Other initiatives include bicycle sharing systems, which are currently being implemented in cities around the world [1,2]. Bicycle sharing systems enable, for example, fast multimodal passenger transport, where public transport and the bicycle can be combined in an efficient way [1]. The recent introduction of electric bicycles is another factor that increases the attractiveness of the bicycle [3].

However, in order to build a transport system that encourages bicycling, it is important to fully understand the current bicycle flows, and which factors influence the travelers' choices whether to travel by bicycle, to use some other mode of transport (e.g., car or bus), or to not travel at all. Hence, it is important to collect various types of traffic, transport, and bicycle-related data, which can be done using Internet of Things (IoT) connected devices, such as bicycle counters and mobile phone applications that enable registering the movement of travelers. Bicycle counters, which are the focus of the current paper, make it possible to continuously register the bicycles that pass a particular point in a transport network. Due to the possibility to register a large share of the passing bicycles, bicycle counters (typically built using inductive loop detectors) are commonly used to collect bicycle flow data.

* Corresponding author. Tel.: +46-40-6657688; fax: +46-40-665 76 46.
In the presented study, we analyzed data collected by a bicycle counter located in Malmö, Sweden. An important purpose of our work was to quantify how various factors, such as day of week, time of year, and weather (temperature and precipitation), are expected to influence the amount of bicycle traffic at a particular point in the traffic network. We studied how to appropriately formulate a regression problem that can be used to estimate the number of bicycles registered by a bicycle counter. In particular, we investigated whether the use of a long-term trend estimate of the number of registered bicycles has the potential to improve the regression accuracy. We also compared different regression approaches in order to identify which approach is most suitable for the considered problem.

The current study builds on the Bachelor's thesis of Aspegren and Dahlström [4], who compared a set of regression algorithms regarding their ability to estimate the number of bicycles registered by our bicycle counter. Aspegren and Dahlström limited their analysis to working days, whereas we include all days in the regression problem, explicitly considering day of week, school breaks, national holidays, and bridge days as input features.

Our work aims to provide input for passenger transport analysis models used by city and transport planners, e.g., for assessing the impact of transport policy measures. The relevance in this direction is emphasized by the fact that bicycling is currently being incorporated in passenger transport analysis models.

The current paper is organized as follows. In the next section, we give an account of previous research related to our work. In Section 3, we describe the data processing that we conducted at the beginning of our study. In Section 4, we present our regression modeling, which is followed in Section 5 by our computational results. We conclude the paper in Section 6 with some conclusions and pointers to future work.

2. Related work

Research related to bicycle data analysis has been quite intensive in recent years. Romanillos et al. [5] provide an overview of big data approaches applied in the bicycling context. A large amount of research concerns bicycle sharing systems, where the studied problems include bicycle repositioning [6] and location of base (or docking) stations [7]. Data mining has been applied in the bicycle sharing context, for example, in order to estimate usage patterns [8,9].

Data mining also plays an important role in travel demand estimation (including bicycle demand analysis), which is an integral part of traffic and transport analysis models (both in urban and in regional contexts). Traditionally, travel demand is estimated using travel survey data, often combined with GPS trajectories [10]. Bicycle demand can be further estimated using different types of discrete choice models, which have been used, for example, for bicycle route and destination choice estimations [11]. In addition, there exists research on how various factors, including weather, calendar events, and work-related factors, influence the choice whether or not to use the bicycle [12,13].

The current paper focuses on regression analysis using bicycle counter data in order to quantify how factors such as weather are expected to influence the amount of bicycling. To the best of our knowledge, no such previous study exists, except for the work by Aspegren and Dahlström [4].

3. Data pre-processing

In our study, we considered the time period September 13, 2006 to March 31, 2014, for which we used bicycle volume data from a bicycle counter located in the city center of Malmö, weather data (i.e., temperature and precipitation), and information about national holidays and school breaks.
We obtained information about school breaks from the web pages of the public schools in Malmö; however, as complete information about school breaks was not publicly available for the considered time period, we made a few assumptions concerning school breaks. In particular, we assumed that the longer school breaks occur during the same weeks each year, which was partially confirmed by the municipality of Malmö.

The bicycle counter and weather data sets, which we received from the municipality of Malmö, specify values hourly. However, in the regression problem, where we considered each day as a data point, we aggregated the bicycle counter data for each day, and we used the averages of the temperature and precipitation values for each day.

In addition, the bicycle counter and weather data sets had some missing values, which we estimated using interpolation. Weather data values were missing for up to a few hours here and there, and bicycle counter values were missing for periods of up to a couple of weeks in either or both of the directions. We estimated a missing weather factor value $x_{h_k}$ for some hour $h_k$, between two hours $h_i$ and $h_j$ ($h_i < h_k < h_j$) with known weather factor values, as

\[
x_{h_k} = x_{h_i} + \frac{x_{h_j} - x_{h_i}}{h_j - h_i}\,(h_k - h_i), \tag{1}
\]

where $x_{h_i}$ and $x_{h_j}$ denote the (known) weather factor values for hours $h_i$ and $h_j$, respectively. Similarly, we estimated a missing bicycle counter value by taking the average of the values for the corresponding hour one year immediately before and one year immediately after the missing value. For example, we estimated a missing bicycle counter value $x_{h,d,w,y}$ for hour $h$ on weekday $d$ in week $w$ and year $y$ as

\[
x_{h,d,w,y} = \frac{x_{h,d,w,y-1} + x_{h,d,w,y+1}}{2}. \tag{2}
\]

4. Regression problem formulation

We formulated our regression problem, as an extension of the model by Aspegren and Dahlström [4], using the input features provided in Table 1.
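The two gap-filling rules of Section 3 (Eqs. 1 and 2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the function names `interpolate_weather` and `fill_counter_gap` are hypothetical, hours are represented as plain numbers, and the lookup of the values one year before and after is left to the caller.

```python
def interpolate_weather(h_i, x_i, h_j, x_j, h_k):
    """Eq. (1): linear interpolation of a missing weather value at hour h_k.

    h_i and h_j are the nearest hours with known values x_i and x_j,
    and h_i < h_k < h_j must hold.
    """
    if not h_i < h_k < h_j:
        raise ValueError("h_k must lie strictly between h_i and h_j")
    slope = (x_j - x_i) / (h_j - h_i)
    return x_i + slope * (h_k - h_i)


def fill_counter_gap(x_prev_year, x_next_year):
    """Eq. (2): mean of the counts for the same hour, weekday, and week
    in the previous year and in the following year."""
    return (x_prev_year + x_next_year) / 2


# Temperature known at hour 10 (4.0) and hour 14 (8.0), missing at hour 11.
temperature_11 = interpolate_weather(10, 4.0, 14, 8.0, 11)

# Counter value missing in year y; 100 bicycles in y-1 and 140 in y+1.
count_estimate = fill_counter_gap(100, 140)
```

Eq. (2) implicitly assumes that both neighbouring years have a registered value for the hour in question; how the study handled gaps where this was not the case is not stated.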
It should be mentioned that for the i:th day in a year, the time_in_year feature is calculated as i/n, where n is the number of days in the year (either 365 or 366). For our set of input features, we formulated two regression problems (P1 and P2), which differ only in their target variables.

Table 1. Input features used in our two regression problems.

Feature (name)              Type       Value range
year                        Ordinal    {2006, ..., 2014}
time_in_year                Numerical  (0, 1]
is_monday                   Nominal    {0, 1}
is_tuesday                  Nominal    {0, 1}
is_wednesday                Nominal    {0, 1}
is_thursday                 Nominal    {0, 1}
is_friday                   Nominal    {0, 1}
is_saturday                 Nominal    {0, 1}
is_sunday                   Nominal    {0, 1}
is_school_break             Nominal    {0, 1}
is_bridge_day               Nominal    {0, 1}
is_public_holiday           Nominal    {0, 1}
temperature (daily avg.)    Numerical  ℝ
precipitation (daily avg.)  Numerical  ℝ

As mentioned in Section 1, our purpose for using regression was to estimate the number of bicycles registered by a bicycle counter, considering the factors provided in Table 1. Therefore, we chose to use the total number of (daily) registered bicycles as regression target variable in one of our regression problems (P1). In P2, we instead used the deviation from an estimated long-term trend curve as target variable.

Our reason for formulating P2 was that we observed a long-term trend of varying numbers of registered bicycles at the bicycle counter. The diagram to the right in Fig. 1, which presents the moving yearly average of the number of registered bicycles per day, shows an initial increase of bicycle volumes, followed by a decrease, and by another increase at the end of the time series. This is contrary to what one would expect, as there has been a rather linear increase of the population in Malmö, from about 276,000 as of December 31, 2006 to about 318,000 as of December 31, 2014. This means that the number of bicycles registered by the counter most likely does not follow the overall trend in Malmö.
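As a rough illustration of how the calendar features in Table 1 could be derived from a date, consider the following sketch. It rests on assumptions that are not part of the paper: the function name `calendar_features` is hypothetical, the school break, bridge day, and public holiday flags are supplied by the caller (the paper obtained them from external sources), and the two weather features are omitted because they come from the measured data.

```python
import calendar
from datetime import date

DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def calendar_features(day, is_school_break=False, is_bridge_day=False,
                      is_public_holiday=False):
    """Return the calendar-related features of Table 1 for one day."""
    n = 366 if calendar.isleap(day.year) else 365  # days in the year
    i = day.timetuple().tm_yday                    # 1-based day of the year
    features = {"year": day.year, "time_in_year": i / n}
    # One binary indicator per weekday; date.weekday() returns 0 for Monday.
    for index, name in enumerate(DAY_NAMES):
        features["is_" + name] = int(day.weekday() == index)
    features["is_school_break"] = int(is_school_break)
    features["is_bridge_day"] = int(is_bridge_day)
    features["is_public_holiday"] = int(is_public_holiday)
    return features


# First day of the studied period.
first_day = calendar_features(date(2006, 9, 13))
```

Because i runs from 1 to n, time_in_year falls in the half-open interval (0, 1], which matches the value range given in Table 1.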
The observed decrease might, for example, be partly due to the opening of a new railway station in 2010, resulting in a redistribution of the bicycle flows in Malmö. In order to consider this (probably) deviating trend at the bicycle counter, we decided to formulate our regression problem P2, where we used the deviation from a long-term trend estimate at the bicycle counter instead of the absolute number of bicycles as target variable.

[Figure 1: two panels plotting the number of bicycles against year (2007 to 2014); left panel: 3-week average, right panel: yearly average.]
Fig. 1. Average number of bicycles per day using three week moving average (to the left) and yearly moving average (to the right).

We constructed our trend estimate (or trend curve) using the following steps (see also Fig. 2):

1. We calculated monthly indices (over the number of bicycles) using the ratio-to-moving-average method, for which we estimated a seasonal index curve (using splines).
2. For each day, we divided the number of registered bicycles by the index given by the seasonal index curve.
3. Finally, we fitted a 4th degree polynomial to the index-adjusted time series, giving us our long-term trend estimate.

[Figure 2: two panels; left panel: monthly index and seasonal index curve plotted against month; right panel: index-adjusted number of bikes per day and the long-term trend estimate plotted against year.]
Fig. 2. Monthly volume indices, seasonal index curve, and the long-term trend estimate that we used in our regression problem P2.

For each day $d$, the deviation from the long-term trend estimate (used as target variable in P2) is given by

\[
\frac{\text{num\_bicycles}_d - f(d)}{f(d)},
\]

where $f(d)$ is the number of bicycles given by the long-term trend estimate for day $d$.

5. Computational results

In order to compare the performance of different regression approaches, and to investigate whether the use of a long-term trend estimate has the potential to improve the regression accuracy, we implemented and evaluated our regression problems (P1 and P2) using the Weka (Waikato Environment for Knowledge Analysis) machine learning tool [14]. In our study, we included the following six regression algorithms:

• M5 rules (decision list generated using a series of model trees, each generating an M5 rule).
• M5P (model tree where each leaf is an M5 rule).
• Rep tree (regression tree built using information gain/variance).
• Linear regression (linear regression function built using the Akaike criterion for model selection).
• Multi layer perceptron (network model based on back propagation).
• SMOReg (support vector regression with 1st, 2nd, and 3rd degree polynomial kernels).

We chose to include the first three algorithms as they were the most promising algorithms (for a similar regression problem) according to Aspegren and Dahlström [4], and we included the remaining three algorithms in order to make the set of selected algorithms more diverse. Even though some meta algorithms, such as Random subspace and Bagging, have the potential to provide good accuracy [4], we chose not to consider any meta algorithms in the current study. In addition, tuning the selected algorithms (to optimize their performance) was outside the scope of the current study.

For each of the selected algorithms, we constructed regression models (using 10-fold cross validation) for both P1 and P2. We then generated output metrics using the metrics package included in the scikit-learn machine learning library for Python. The reason for exporting the Weka output and further processing it using Python was that we needed to translate the predictions made for P2 into absolute numbers of bicycles, hence providing comparable metrics for our two regression problems.
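The translation of P2 predictions back into absolute numbers follows directly from the definition of the deviation in Section 4: a predicted relative deviation dev for day d corresponds to f(d) · (1 + dev) bicycles. The sketch below illustrates this step together with the four error metrics reported in the result tables. It is written in plain Python rather than with scikit-learn so that the definitions are visible; the function names `to_absolute` and `error_metrics` are hypothetical, and the relative errors use the textbook definition (error relative to a predictor that always outputs the mean of the actual values), which may differ in detail from what Weka computes.

```python
import math


def to_absolute(deviations, trend):
    """Convert predicted relative deviations into absolute bicycle counts.

    deviations[k] is the predicted value of (num_bicycles - f(d)) / f(d)
    and trend[k] is the trend estimate f(d) for the same day.
    """
    return [f * (1.0 + dev) for dev, f in zip(deviations, trend)]


def error_metrics(actual, predicted):
    """Mean absolute, root mean squared, and relative errors."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Errors of the baseline that always predicts the mean of the actuals.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_abs_error": sum(abs_err) / n,
        "root_mean_sq_error": math.sqrt(sum(sq_err) / n),
        "rel_abs_error": sum(abs_err) / base_abs,
        "root_rel_sq_error": math.sqrt(sum(sq_err) / base_sq),
    }


# Made-up illustration values, not data from the study.
trend = [4000.0, 6000.0]
predicted_counts = to_absolute([0.5, -0.25], trend)
metrics = error_metrics([5800.0, 4700.0], predicted_counts)
```

Computing the metrics on the back-transformed counts, rather than on the deviations themselves, is what makes the P1 and P2 results directly comparable.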
Our results (see Table 2 and Table 3) clearly indicate that, among the tested algorithms, SMOReg (support vector regression using 2nd and 3rd degree polynomial kernels) and Rep tree (regression tree) perform best for our regression problems. For each of the considered regression algorithms (except for Multi layer perceptron using the standard Weka configuration), our results also show that all of the considered output metrics are better for P2 than for P1. That is, the problem where we used the deviation from the trend curve as target variable performs best.

Table 2. Results for regression problem P1, using the absolute number of registered bicycles as target variable.

Algorithm                     R² corr. coeff.  Mean abs. error  Root mean sq. error  Rel. abs. error  Root rel. sq. error
M5 rules                      0.853            622.8            865.8                32.4%            38.3%
M5P (model tree)              0.854            619.8            862.3                32.3%            38.2%
Rep tree                      0.868            587.2            820.8                30.6%            36.3%
Linear regression             0.770            817.0            1083                 42.6%            47.9%
Multi layer perceptron        0.830            707.3            932.7                36.8%            41.3%
SMOReg (1 deg. poly. kernel)  0.765            809.7            1095                 42.2%            48.5%
SMOReg (2 deg. poly. kernel)  0.850            628.3            875.0                32.7%            38.7%
SMOReg (3 deg. poly. kernel)  0.869            571.3            818.6                29.8%            36.2%

Table 3. Results for regression problem P2, using the deviation from the trend curve shown in Fig. 2 as target variable. The performance metrics correspond to the absolute number of bicycles.

Algorithm                     R² corr. coeff.  Mean abs. error  Root mean sq. error  Rel. abs. error  Root rel. sq. error
M5 rules                      0.859            599.2            848.1                31.2%            37.5%
M5P (model tree)              0.869            575.9            818.8                30.0%            36.2%
Rep tree                      0.875            563.8            799.2                29.4%            35.4%
Linear regression             0.798            755.3            1015                 39.3%            44.9%
Multi layer perceptron        0.826            695.1            943.5                36.2%            41.7%
SMOReg (1 deg. poly. kernel)  0.792            745.9            1031                 38.9%            45.6%
SMOReg (2 deg. poly. kernel)  0.866            579.3            826.7                30.2%            36.6%
SMOReg (3 deg. poly. kernel)  0.871            559.8            810.6                29.2%            35.9%

6. Conclusions and future work

We have studied how regression can be used to estimate the number of bicycles registered by a bicycle counter located in Malmö, Sweden. In particular, we formulated and compared two regression problems that differ only in their target variables: one with the absolute number of bicycles as target variable and one with the deviation from a long-term trend estimate of the expected number of bicycles as target variable. In addition, we compared a number of regression algorithms (see Section 5) in order to find out which algorithms perform best for the considered problem.

For the two versions of our regression problem, we obtained the results presented in Table 2 and Table 3. In particular, it should be emphasized that, by explicitly modeling day of week, national holidays, school breaks, and bridge days as input features, we managed to significantly improve the prediction accuracy compared to the results recently obtained by Aspegren and Dahlström [4]. For example, in terms of relative absolute error, we managed to improve the results from about 90% to around 30% for the best performing algorithms. This shows that it is important to consider input features representing different day characteristics in the regression problem formulation.

Our results show that using the deviation from the expected number of bicycles provided by a long-term trend estimate as target variable has the potential to improve the prediction accuracy (compared to using the absolute number of bicycles as target variable). For example, for the relative absolute error, we observed an improvement from 29.8% to 29.2% for the best performing algorithm. Obviously, the quality of the trend estimate influences how accurate a prediction can be achieved. As the purpose of our study was not to identify the optimal trend estimate, we considered only one trend estimate (see Fig. 2). We leave the analysis of other trend estimates, which might yield further improvements, for future work.
Our results further show that SMOReg (support vector regression using 2nd and 3rd degree polynomial kernels) and Rep tree (regression tree) perform best for the considered problem.

As future work, we also aim to incorporate the results of this study into the agent-based mode choice model ASIMUT [15], where the decision making of travelers is explicitly modeled. In particular, we plan to explore how knowledge on the preferences of the bicyclists can be used to set the weights of the agent-based utility function (used in ASIMUT) to make mode choices. It is our belief that bicycling is one of the key factors to consider when estimating the mode choice of the travelers, as the bicycle is relatively sensitive to changing weather conditions, which can be clearly seen in Fig. 1.

References

1. Fishman, E., Washington, S., Haworth, N. Bike share: A synthesis of the literature. Transport Reviews 2013;33(2):148–165.
2. O'Brien, O., Cheshire, J., Batty, M. Mining bicycle sharing data for generating insights into sustainable transport systems. Journal of Transport Geography 2014;34:262–273.
3. Jones, T., Harms, L., Heinen, E. Motives, perceptions and experiences of electric bicycle owners and implications for health, wellbeing and mobility. Journal of Transport Geography 2016;53:41–49.
4. Aspegren, S., Dahlström, J. A comparison of machine learning algorithms for estimation of bicycle flows based on bicycle barometer and weather data. Bachelor's thesis; Malmö University; Sweden; 2016.
5. Romanillos, G., Austwick, M.Z., Ettema, D., Kruijf, J.D. Big data and cycling. Transport Reviews 2016;36(1):114–133.
6. Raviv, T., Tzur, M., Forma, I.A. Static repositioning in a bike-sharing system: models and solution approaches. EURO Journal on Transportation and Logistics 2013;2(3):187–229.
7. García-Palomares, J.C., Gutiérrez, J., Latorre, M. Optimizing the location of stations in bike-sharing programs: A GIS approach. Applied Geography 2012;35(1–2):235–246.
8. Datta, A.K. Predicting bike-share usage patterns with machine learning. Master's thesis; University of Oslo; Norway; 2014.
9. Vogel, P., Greiser, T., Mattfeld, D.C. Understanding bike-sharing systems using data mining: Exploring activity patterns. Procedia - Social and Behavioral Sciences 2011;20:514–523.
10. Shen, L., Stopher, P.R. Review of GPS travel survey and GPS data-processing methods. Transport Reviews 2014;34(3):316–334.
11. Hood, J., Sall, E., Charlton, B. A GPS-based bicycle route choice model for San Francisco, California. Transportation Letters 2011;3(1):63–75.
12. Heinen, E., Maat, K., van Wee, B. The effect of work-related factors on the bicycle commute mode choice in the Netherlands. Transportation 2013;40(1):23–43.
13. Corcoran, J., Li, T., Rohde, D., Charles-Edwards, E., Mateo-Babiano, D. Spatio-temporal patterns of a public bicycle sharing program: the effect of weather and calendar events. Journal of Transport Geography 2014;41:292–305.
14. Frank, E., Hall, M.A., Witten, I.H. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Fourth ed.; Morgan Kaufmann; 2016.
15. Hajinasab, B., Davidsson, P., Persson, J.A., Holmgren, J. Towards an agent-based model of passenger transportation. In: Gaudou, B., Sichman, J.S., editors. Multi-Agent Based Simulation XVI: International Workshop, MABS 2015, Istanbul, Turkey, May 5, 2015, Revised Selected Papers. Cham: Springer International Publishing; 2016, p. 132–145.