Travel Time Prediction Model for Heavy Goods Vehicles

Problem Statement

Many researchers have worked on travel time estimation, but to our knowledge no project has addressed travel time estimation for heavy goods vehicles. In particular, no one has described what happens between two GPS points when all the relevant factors are considered, i.e. distance, speed limits, duration, air temperature, wind direction, and many other factors. Before developing a prediction model for travel time estimation of heavy goods vehicles, it is therefore very important to know what happens between one GPS point and the next. This thesis project addresses that problem.

Aim & Objectives

The aim of this thesis is to build a prediction model for travel time estimation of heavy goods vehicles (HGVs) based on sparse GPS data, road network data, and weather data. We also investigate how machine learning approaches can be applied to predict the travel times of HGVs. By travel time, we mean the time a vehicle needs to travel between two points along a given route. In this project, we explore different types of supervised learning algorithms, focusing on regression algorithms, primarily because regression algorithms allow not only for prediction but also for investigating the relationships between the different variables. Developing this prediction model involves many steps, such as gathering the required data, processing the data, identifying the factors that affect the vehicle speed, and feature selection.

Research Questions

(Note: these research questions are written only to give an idea and should not be taken as final; our research questions should focus mainly on feature selection and machine learning.)

To achieve our goal, it is very important to answer our research questions.
1Q) How can machine learning methodologies be used to predict the travel time of Heavy Goods Vehicles (HGVs), considering sparse GPS data, weather data, traffic volume data, and road network data?
2Q) What is feature selection? What are the different ways of selecting features?
3Q) Which input features affect the speed of the vehicle (heavy goods vehicle)?

Data Gathering:

First, we gathered all the data required for our thesis experiment. This is historic data, i.e. sparse GPS data, road network data (NVDB), and weather data (Väg Väder informations Systemet, VViS). Most of the data was provided by a Swedish transport company at our professor's request.

GPS DATA:

The GPS data is historic data collected by two trucks (heavy goods vehicles) over a particular time period. The data format is as follows.
- Latitude
- Longitude
- Altitude
- Speed
- Heading

ROAD NETWORK DATA:

The road network data is NVDB (Nationell VägDataBas) data. NVDB is the national road database of Sweden, which contains basic road information such as the following.
- Name of the road
- Location of the road
- Road speed limits
- Width of the road
- ... etc.

Weather DATA:

The weather data is taken from the Swedish Road Weather Information System, i.e. Väg Väder informations Systemet (VViS), which contains the following information.
- Name of the Weather Station
- Date & Time
- Surface Temperature
- Dew Point
- Freezing Point
- Air Temperature
- Code of precipitation
- Rainfall
- Amount of precipitation
- Maximum Wind Speed
- Average Wind Speed
- Wind Direction
- Average Wind Speed Over 30 Minutes
- Warning Code
- Air Humidity
- Error Code

Data Processing:

The given GPS data (positions deidentified) was sorted so that the exact day, month, and year (time) of each record can be seen. We then divided it into a number of trips based on the Java timestamp.
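The trip-splitting step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code used in the thesis: we assume the Java timestamp is the number of milliseconds since the Unix epoch, that each GPS record is a dictionary with a hypothetical `timestamp_ms` key, and that a new trip starts whenever the gap between two consecutive records exceeds a threshold (the 30-minute default is an arbitrary placeholder).

```python
from datetime import datetime, timezone


def to_datetime(timestamp_ms):
    """Convert a millisecond epoch timestamp (assumed Java format) to a UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def split_into_trips(points, max_gap_minutes=30):
    """Sort GPS records by time and start a new trip after every long time gap.

    `points` is a list of dicts with a hypothetical 'timestamp_ms' key;
    `max_gap_minutes` is an assumed threshold, not a value from the thesis.
    """
    max_gap_ms = max_gap_minutes * 60_000
    ordered = sorted(points, key=lambda p: p["timestamp_ms"])
    trips = []
    for point in ordered:
        previous = trips[-1][-1] if trips else None
        if previous is not None and point["timestamp_ms"] - previous["timestamp_ms"] <= max_gap_ms:
            trips[-1].append(point)   # still the same trip
        else:
            trips.append([point])     # first record, or the gap is too long
    return trips
```

Calling `to_datetime` on each record gives the day, month, and year used for sorting, and `split_into_trips` returns one list of records per trip.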
A KML (Keyhole Markup Language) file is an XML-based data format used to describe geographic data, which can also be visualized in Google Earth, an Earth browser. The website GPS Visualizer automatically converts an uploaded file into KML: http://www.gpsvisualizer.com/map_input?form=googleearth

2.2.1 Map Matching:

Each GPS point is mapped to the nearest road segment (NVDB link). This is done by calculating the distance between each GPS point and the road segments, and is called map matching. It can be done with any one of the following methods.
- IsOn-Segment
- Ray Casting
- SLC method

Researchers who worked on similar projects used the IsOn-Segment method, so we use the same method for map matching in our thesis.

NVDB links are the links for all the road paths (road segments); the complete information about the roads is contained in the NVDB data. A country has many regions, and the NVDB data is provided separately for each region. Among the different regions, we initially chose the Halland region. After finding the nearest road segment for each GPS point, we constructed one full stretch of road for the Halland region. This was done by joining all the road segments between each pair of GPS points, i.e. the end point of one road segment must match the start point of the next road segment. We did this for all the trips.

Figure a) Full stretch of road as viewed in Google Earth

The figure above shows the full stretch of road. The red line on the map shows exactly how the vehicle travelled along the road. To obtain this full stretch without any deviations or errors, we developed a number of classes, as follows.
- Check positions
- Distance measure
- GPS
- Parse KML file
- Get road segments
- Map GPS to road segments
- ... etc.
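The distance computation behind map matching, as described above, can be illustrated with a small sketch. It is not the IsOn-Segment implementation used in the thesis: it simply picks the road segment with the smallest point-to-segment distance, using a local flat-earth (equirectangular) approximation that is reasonable only over short distances. The function names and the dictionary of segments are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used for the flat-earth approximation


def to_local_xy(lat, lon, ref_lat, ref_lon):
    """Project (lat, lon) to metres in a local plane centred on the reference point."""
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_M
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    return x, y


def point_segment_distance(p, a, b):
    """Approximate distance in metres from GPS point p to segment a-b (all (lat, lon))."""
    # The GPS point is the origin of the local plane.
    ax, ay = to_local_xy(a[0], a[1], p[0], p[1])
    bx, by = to_local_xy(b[0], b[1], p[0], p[1])
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(ax, ay)  # degenerate segment: a and b coincide
    # Parameter of the orthogonal projection of the origin, clamped to the segment.
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len_sq))
    return math.hypot(ax + t * dx, ay + t * dy)


def nearest_segment(p, segments):
    """Return the id of the closest segment; `segments` maps id -> (start, end)."""
    return min(segments, key=lambda sid: point_segment_distance(p, *segments[sid]))
```

For a full road network, a spatial index would normally be used so that only nearby segments are compared with each GPS point; the exhaustive `min` here is kept for brevity.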
The road shown in the figure is one full stretch of road within the Halland region. We chose this road for our experiment.

Data Formation:

The complete data was formed from two types of attributes, which are then combined, as follows.
- Time related attributes between two GPS points
- Weather related attributes between two GPS points
- The complete data between two GPS points

Combining the time related and weather related attributes gives us the complete data required to build a prediction model. To form the data, one therefore needs to know how to obtain the time and weather information and how to combine the two.

Time Related Attributes b/w Two GPS points:

Time related data is the combination of GPS data and road network data. As mentioned above, we first converted the given GPS data into a human-readable format, i.e. into time, day, month, and year. The road network data is taken from the NVDB (Nationell VägDataBas) links, which contain all the road information such as names, location, KML number, etc. We combined the GPS data and the road network data programmatically, and then extracted the required information between two GPS points, such as the following.
- Date
- Time
- Weekday
- Part of the Day
- Duration in minutes
- Distance in kilometers
- Weighted average speed limits
- Vehicle speed

For clarity, the time related data is shown in a table (asterisks are placeholders for values).

GPS b/w two Points   Date & Time   Duration in Minutes   Distance in Kilometers   Wgt Avg Speed Limits   Vehicle Speed
GPS: 0-1             *             *                     *                        *                      *
GPS: 1-2             *             *                     *                        *                      *
GPS: 2-3             *             *                     *                        *                      *
*                    *             *                     *                        *                      *

Weather Related Attributes between Two GPS points:

Sweden is divided into northern Norrland, southern Norrland, Svealand, and Götaland; Halland belongs to Götaland.
The map below is taken from the Trafikverket state traffic web page ( http://trafikinfo.trafikverket.se/LIT/#url=Vagtrafiken/Karta ).

Figure: Map taken from the TRAFIKVERKET website

The basic weather data contains all the information listed in the earlier chapters and is very large. We do not need all of it; we only need the weather information within the Halland region of Sweden. The required weather information can be extracted from the entire weather data with the help of the weather station coordinates and the measurement time.

First, we found the number of weather stations located within the Halland region of Sweden from the TRAFIKVERKET webpage ( http://trafikinfo.trafikverket.se/LIT/#url=Vagtrafiken/Sok ); there are 36 weather stations within the Halland region. Each weather station was looked up by name in Google Earth, and the weather station closest to the selected road segment was taken and put into a temporary file. We selected all the weather stations in this way.

To extract the required information from the weather data, we need the coordinates of the weather stations. After some research, we came across Yr.no ( https://www.yr.no/ ), from which we obtained the coordinates of all the selected weather stations. We then matched each historical trip to the corresponding weather conditions. To do that, we find the row in the weather data whose measurement time is closest to the time of the historical trip (by comparing the times of the historical trips with those of the weather data).

The Complete data between two GPS points:

The combination of the historical trip data and the weather data is the complete data between two GPS points. Finally, we arranged all the input variables along with the output variable as follows.
- Surface Temperature
- Dew Point
- Freezing Point
- Air Temperature
- Code of precipitation
- Type of Precipitation
- Amount of precipitation
- Maximum Wind Speed
- Average Wind Speed
- Wind Direction
- Average Wind Speed Over 30 Minutes
- Air Humidity
- Weekday
- Part of the Day
- Duration in Minutes
- Distance in Kilometers
- Weighted Average Speed Limits
- Vehicle Speed (Output)

Data Mining Tools:

Many data mining tools are available for machine learning, such as Weka, R, the SAS Platform, MLJAR, Spark MLlib, mlpack, the Accord.NET Framework, Scikit-Learn, and many more. Why did we select Scikit-Learn among all these tools? The six reasons are as follows [By Ben Lorica October 13, 2015].
- Commitment to documentation and usability.
- Models are chosen and implemented by a dedicated team of experts.
- Covers most of the machine learning tasks.
- Python and PyData (allows users to play with data sets).
- Focus (common algorithms through a consistent interface).
- Scikit-Learn scales to most data problems.

Features Selection:

In the earlier chapters (Data Preparation), we prepared the data needed for building a prediction model. Now we must examine the input variables. Our data contains many input variables, but we need to find the variables that contribute most to the vehicle speed and remove the input variables that contribute poorly to it. To do that, we perform feature selection.

According to Jason Brownlee (July 14, 2014, in Python Machine Learning), feature selection is a process in which you automatically select the features that contribute most to the prediction variable (output variable). Having too many irrelevant features in the data reduces the accuracy of the models. Performing feature selection before modeling the data has three advantages, as follows.
- Reduces Overfitting
- Improves Accuracy
- Reduces Training Time

Many methods are available for selecting the best features from a data set, and these methods can be grouped by task, such as classification and regression. Since our data is a regression data set, we applied several feature selection models for regression to find the best input variables. Among these, we chose the models best suited to our data. Based on the results, we took the input variables that contribute highly across all the feature selection models. The selected feature selection models and their results are listed in the table below.

Performance of the Feature Selection Models

Data Features                       Lin Corr.  Lasso  Linear Reg  RF    RFE   Ridge Reg  Stability  Mean
Surface Temperature                 0.01       0.02   0.01        0.17  0.11  0.01       0          0.05
Dew Point                           0.01       0      0.2         0.19  0.56  0.21       0          0.17
Air Temperature                     0.01       0.07   0.18        0.11  0.44  0.19       0          0.14
Code                                0          0      0.12        0     0.78  0.13       0          0.15
Type of precipitation               0          0      0.14        0     0.33  0.15       0          0.09
Amount of precipitation             0          0      0           0.04  0     0          0          0.01
Max Wind                            0.25       0      0.16        0.15  0.89  0.17       0          0.23
Average Wind                        0.25       0      0.82        0.13  1     0.79       0          0.43
Wind Direction                      0.07       0      0.08        0.11  0.67  0.09       0          0.15
Average wind speed over 30 minutes  0.26       0.03   1           0.1   1     0.97       0          0.48
Humidity                            0.01       0      0.02        0.22  0.22  0.02       0          0.07
Weekday                             0.01       0.07   0.16        0.09  1     0.17       0          0.21
Part of the Day                     0          0      0.97        0.02  1     1          0          0.43
Weighted Avg Speed Limits           1          1      0.36        1     1     0.38       1          0.82

Prediction Model:

In the section above, we applied several feature selection models and used them to select each individual feature (variable) according to its contribution to the vehicle speed.

Machine Learning Algorithms:

To build a prediction model for any kind of data, the complete data must be split into training and testing sets, and the machine learning algorithms are then applied to them.
We chose k-fold cross-validation for our model. It splits the data into k folds of roughly equal size; each fold is used once as the testing set while the remaining k-1 folds are used for training, so the model is trained and tested k times. In k-fold cross-validation, k can be any number such as 2, 5, 7, or 10. We use 10-fold cross-validation, which is a common choice.

Since our data is a regression data set, we applied several regression algorithms to it. Not every algorithm works well on every kind of data, so we tried several, kept those that worked on our data, and from these took the algorithms that gave the best results.

Each machine learning algorithm is evaluated with error measures that quantify the difference between the original (observed) values and the values predicted by the algorithm. Throughout our experiments, we used two error measures on several algorithms: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). In Scikit-Learn, these error measures are not available under their plain names as cross-validation scoring options. Because Scikit-Learn scorers follow a higher-is-better convention, the errors are provided in negated form as 'neg_mean_absolute_error' and 'neg_mean_squared_error'. To obtain the RMSE of a machine learning algorithm, we must first find the Mean Squared Error (MSE) by negating the score; the RMSE is the square root of the MSE.

The MAE and RMSE values obtained for the different machine learning algorithms are reported in the results table.
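The evaluation procedure described above can be sketched in plain Python to show how the folds and the two error measures fit together. This is a minimal illustration, not the thesis code: the model is a hypothetical stand-in that always predicts the mean of the training targets, the folds are contiguous and unshuffled, and the RMSE is taken as the square root of the MSE averaged over folds (averaging per-fold RMSE values is an equally common alternative). In Scikit-Learn, the corresponding scores come from `cross_val_score` with `scoring='neg_mean_absolute_error'` or `scoring='neg_mean_squared_error'`, and they must be negated before use.

```python
import math


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_mean_baseline(y, k=10):
    """Return (MAE, RMSE) over k folds for a predict-the-training-mean baseline."""
    maes, mses = [], []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        # "Training" the stand-in model: remember the mean of the training targets.
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)
        y_test = [y[i] for i in test_idx]
        y_pred = [train_mean] * len(y_test)
        maes.append(mean_absolute_error(y_test, y_pred))
        mses.append(mean_squared_error(y_test, y_pred))
    mae = sum(maes) / len(maes)
    rmse = math.sqrt(sum(mses) / len(mses))  # RMSE is the square root of the MSE
    return mae, rmse
```

Replacing the mean baseline with a real regressor only changes the two lines that fit the model and produce `y_pred`; the fold handling and the error computation stay the same.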
Algorithms                        Mean Absolute Error (MAE)   Root Mean Squared Error (RMSE)

Generalized Linear Models
  Linear Regression               3.63779875681               4.19486747533
  Ridge Regression                3.63760300636               4.19458936256
  Lasso                           3.69988555933               4.22066308116
  Bayesian Regression             3.63968694845               4.19315297563
Support Vector Machines (SVM)
  Linear Kernel                   3.58570478835               4.00621978259
  RBF Kernel                      3.54205119147               3.94564872552
Nearest Neighbors
  KNeighbors Regression           4.08387641508               4.75812284427
Decision Trees
  Decision Tree Regressor         3.44138987533               3.93268706268
Ensemble Methods
  Gradient Boosting Regressor     3.53433612561               4.14030302188
Neural Network Models
  Neural Network MLP Regressor    3.85572774724               4.74282936713

University Requirements:

According to our university's requirements, our thesis must cover the following topics.
- Chapter 1: Introduction (project idea, goals, motivation, research questions, contributions, and outline of the thesis)
- Chapter 2: Research methodology
- Chapter 3: Literature review / related work
- Chapter 4: Results of the application of the research methods
- Chapter 5: Evaluation
- Chapter 6: Conclusions and future work
- References
- Appendix, if any