Prediction of Bus Dwell Time using Time Series Analysis and Machine Learning
Master thesis
Permanent link
https://hdl.handle.net/11250/3148014
Publication date
2024
Collections
- Master's theses (RealTek) [1877]
Abstract
The reliability of public transportation depends strongly on punctual transit trips. Precise bus dwell time (BDT) predictions are important in this regard, as BDT directly influences arrival and departure times at stops. BDT is defined as the duration for which a bus remains stationary at a stop to serve passengers, so accurate BDT estimates can help transit companies optimize bus scheduling across a larger network of stops and routes. This thesis explored the possibility of modeling BDT as a time series with external predictors. SARIMAX, LSTM and XGBoost were used to predict BDT at selected bus stops on line 20 in the Oslo region. We used data from 19 stops in the eastbound direction of line 20. The BDT data at all stops were sampled at 15-minute intervals from 06:00 to 23:00 every day during the first quarter of 2023. Seven predictors were used to model BDT, each belonging to one of two categories: passenger variables or temporal variables. Our results indicated that passenger variables were more important in predicting BDT than temporal variables. Feature importance also varied significantly between stops. XGBoost was the best-performing model in our research: it had the lowest prediction error on our selected evaluation metrics across all stops, with cumulative RMSE, MAE and MAPE scores of 99.33, 78.85 and 3.67, respectively. LSTM yielded prediction errors similar to those of XGBoost, with RMSE, MAE and MAPE scores of 100.34, 79.82 and 3.73, respectively. SARIMAX had the highest prediction errors among the models. These results indicated that a time series approach to modeling BDT may not be optimal compared to other methods. This conclusion is also supported by the fact that the data did not appear to have strong temporal characteristics.
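For readers unfamiliar with the three evaluation metrics named in the abstract, the following is a minimal Python sketch of how RMSE, MAE and MAPE are commonly computed. It is not the thesis code. The function names, the example values and the `cumulative_scores` helper are hypothetical, and the sketch assumes that "cumulative" means summing per-stop scores, which the abstract does not confirm. MAPE is returned here as a fraction; the abstract does not state which scale its MAPE values use.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.1 means 10%).

    Undefined when any true value is zero, so callers must handle
    intervals with zero dwell time before using it.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


def cumulative_scores(per_stop):
    """Sum each metric over stops (assumed meaning of 'cumulative').

    per_stop maps a stop name to a (y_true, y_pred) pair.
    """
    totals = {"rmse": 0.0, "mae": 0.0, "mape": 0.0}
    for y_true, y_pred in per_stop.values():
        totals["rmse"] += rmse(y_true, y_pred)
        totals["mae"] += mae(y_true, y_pred)
        totals["mape"] += mape(y_true, y_pred)
    return totals


# Hypothetical dwell times in seconds for two made-up stops.
example = {
    "stop_a": ([30.0, 40.0, 50.0], [33.0, 36.0, 50.0]),
    "stop_b": ([20.0, 25.0], [22.0, 25.0]),
}
scores = cumulative_scores(example)
```

Because RMSE squares the errors before averaging, it penalizes large individual misses more heavily than MAE does, which is one reason the two are often reported together.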