Urban Transportation Demand Forecasting / Large-Scale Data Analysis

Context & Objectives

This project analyzes and extracts value from large-scale urban transportation data, using the publicly available New York City Yellow Taxi trip records.
The selected topic is urban transportation demand forecasting.

The dataset includes over 119 million trips recorded across three years (2022 to 2024), a large and heterogeneous volume comparable to real-world business challenges (data quality, scale, performance, and structure).

The objective was to build an end-to-end data pipeline, from data ingestion and modeling to exploratory analysis and the delivery of actionable insights to support decision-making.

Tools & Methods
  • DuckDB
  • PostgreSQL, PostGIS
  • Docker
  • DBeaver
  • Python (SQLAlchemy, pandas, NumPy, scikit-learn, matplotlib, seaborn, joblib)
  • Streamlit, GitHub, Anaconda Prompt

Step 1 – Database Environment Setup

  • Download of large-scale data from the official public NYC Yellow Taxi Trip dataset, covering the years 2022 to 2024

  • Storage of files in a RAW zone acting as a local data lake

  • Large-scale structural exploration using DuckDB and SQL queries to understand the dataset structure, volume, and consistency prior to any transformation 

  • Deployment of a PostgreSQL / PostGIS Docker container via the command line, including container configuration 

  • Connection to PostgreSQL from Python after installing the connector
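
The structural exploration step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual queries: the helper names and the example path are hypothetical, and the column names (`tpep_pickup_datetime`, `passenger_count`, `trip_distance`) assume the files follow the published TLC Yellow Taxi schema. The query relies on DuckDB's `read_parquet` table function, which accepts a glob pattern, so the monthly files can be profiled without loading them into a database first.

```python
def build_profile_query(parquet_glob: str) -> str:
    """Build a DuckDB query that profiles all Parquet files matching a glob."""
    # Escape single quotes so the path is a valid SQL string literal.
    safe_glob = parquet_glob.replace("'", "''")
    return (
        "SELECT\n"
        "    date_trunc('month', tpep_pickup_datetime) AS pickup_month,\n"
        "    count(*) AS trips,\n"
        "    count(*) FILTER (WHERE passenger_count IS NULL) AS missing_passengers,\n"
        "    min(trip_distance) AS min_distance,\n"
        "    max(trip_distance) AS max_distance\n"
        f"FROM read_parquet('{safe_glob}')\n"
        "GROUP BY pickup_month\n"
        "ORDER BY pickup_month"
    )


def profile_raw_zone(parquet_glob: str):
    """Run the profiling query on the RAW zone; requires the duckdb package."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect()  # in-memory database, nothing is persisted
    try:
        return con.execute(build_profile_query(parquet_glob)).fetchall()
    finally:
        con.close()
```

Calling `profile_raw_zone("data/raw/yellow_tripdata_*.parquet")` (a hypothetical path) would return one row per pickup month. Months outside the expected 2022 to 2024 range, or implausible distances, are the kind of consistency issues this step is meant to surface before any transformation.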

Step 2 – Data Modeling and Preparation

  • Design of the curated schema based on a star schema architecture (fact–dimension modeling) and definition of an analytical data mart

  • Creation of database schemas and tables using SQL (DBeaver), then loading of Parquet files from the RAW zone into PostgreSQL using Python

  • Enrichment of dimension tables using official reference data, including the Taxi Zone Lookup table

  • Population of analytical tables using SQL queries (including dimension tables, the fact table, and the data mart)
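
As an illustration of the fact–dimension layout described above, the sketch below builds a tiny star schema and an hourly demand mart. Several things here are assumptions: SQLite (from the Python standard library) is swapped in for PostgreSQL so the example runs without a server, the table and column names are hypothetical rather than the project's actual schema, the mart is a view rather than a populated table, and the inserted rows are made-up sample data.

```python
import sqlite3

DDL = """
CREATE TABLE dim_zone (
    zone_id   INTEGER PRIMARY KEY,
    borough   TEXT NOT NULL,
    zone_name TEXT NOT NULL
);
CREATE TABLE fact_trip (
    trip_id        INTEGER PRIMARY KEY,
    pickup_zone_id INTEGER NOT NULL REFERENCES dim_zone (zone_id),
    pickup_ts      TEXT NOT NULL,  -- ISO 8601 timestamp
    trip_distance  REAL,
    total_amount   REAL
);
-- Data mart: number of trips per pickup zone and per hour.
CREATE VIEW mart_hourly_demand AS
SELECT
    f.pickup_zone_id AS zone_id,
    z.zone_name,
    strftime('%Y-%m-%d %H:00:00', f.pickup_ts) AS pickup_hour,
    count(*) AS trip_count
FROM fact_trip AS f
JOIN dim_zone AS z ON z.zone_id = f.pickup_zone_id
GROUP BY f.pickup_zone_id, z.zone_name, pickup_hour;
"""


def build_demo_mart():
    """Create the schema in memory, load sample rows, and query the mart."""
    con = sqlite3.connect(":memory:")
    con.executescript(DDL)
    con.executemany(
        "INSERT INTO dim_zone VALUES (?, ?, ?)",
        [(132, "Queens", "JFK Airport"), (161, "Manhattan", "Midtown Center")],
    )
    con.executemany(
        "INSERT INTO fact_trip (pickup_zone_id, pickup_ts, trip_distance, total_amount) "
        "VALUES (?, ?, ?, ?)",
        [
            (132, "2023-01-01 08:15:00", 17.2, 70.0),
            (132, "2023-01-01 08:45:00", 16.8, 68.5),
            (132, "2023-01-01 09:05:00", 18.1, 72.3),
            (161, "2023-01-01 08:30:00", 1.4, 12.0),
        ],
    )
    rows = con.execute(
        "SELECT zone_id, pickup_hour, trip_count "
        "FROM mart_hourly_demand ORDER BY zone_id, pickup_hour"
    ).fetchall()
    con.close()
    return rows
```

The mart's grain (one row per zone and per hour) matches the forecasting target used later, so the modeling dataset can be read directly from it.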

Step 3 – Data Preparation for Machine Learning (Python)

  • Data cleaning, preprocessing, and exploratory data analysis (EDA) to ensure data quality and to explore temporal and spatial demand patterns

  • Feature engineering to build a final modeling dataset, including the definition of the target variable (number of taxi trips per hour and per zone), creation of temporal and historical features (lags, rolling statistics), and handling of categorical variables
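
The temporal and historical features mentioned above could be built along these lines. This is a sketch under stated assumptions: the column names (`zone_id`, `pickup_hour`, `trip_count`), the chosen lags (1 h and 24 h), and the 3-hour rolling window are illustrative, and each zone is assumed to have exactly one row per hour with no gaps, so that shifting by rows is equivalent to shifting by hours.

```python
from datetime import datetime, timedelta

import pandas as pd


def build_features(hourly: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, lag, and rolling features to an hourly demand table.

    Expects the columns zone_id, pickup_hour (datetime), and trip_count.
    """
    df = hourly.sort_values(["zone_id", "pickup_hour"]).reset_index(drop=True)

    # Calendar features derived from the timestamp.
    df["hour"] = df["pickup_hour"].dt.hour
    df["day_of_week"] = df["pickup_hour"].dt.dayofweek  # Monday = 0
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

    # Lags are computed per zone so that values never leak across zones.
    by_zone = df.groupby("zone_id")["trip_count"]
    df["lag_1h"] = by_zone.shift(1)
    df["lag_24h"] = by_zone.shift(24)

    # Rolling mean of the previous 3 hours. Shifting first keeps the
    # current hour's target out of its own feature.
    df["roll_mean_3h"] = by_zone.transform(
        lambda s: s.shift(1).rolling(window=3).mean()
    )

    # Rows without a full history contain NaN features and are dropped.
    return df.dropna().reset_index(drop=True)


# Usage: two hypothetical zones with 30 consecutive hours each.
base = datetime(2023, 1, 2)  # a Monday
demo = pd.DataFrame(
    [
        {"zone_id": zone, "pickup_hour": base + timedelta(hours=i), "trip_count": offset + i}
        for zone, offset in [(1, 0), (2, 100)]
        for i in range(30)
    ]
)
features = build_features(demo)
```

With a 24-hour lag, the first 24 rows of each zone are dropped, which leaves 6 usable rows per zone in this toy example. On real data, missing hours would need to be filled (for example with zero trips) before shifting.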

Step 4 – Machine Learning 

  • Definition of the forecasting problem: predicting the number of taxi trips per hour and per zone

  • Chronological train / test split to ensure realistic model evaluation

  • Implementation of a baseline model (naive persistence forecast) as a reference

  • Training of an initial Machine Learning model using a Random Forest Regressor

  • Model performance evaluation using MAE and RMSE metrics

  • Feature importance analysis and identification of model limitations
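
The evaluation protocol above (chronological split, naive persistence baseline, MAE and RMSE) can be sketched as follows. The function names, the 20% test fraction, and the toy series are assumptions made for illustration. Persistence is taken here to mean "the forecast for this hour is the count observed one hour earlier"; the project may have used a different lag.

```python
import numpy as np
import pandas as pd


def chronological_split(df: pd.DataFrame, time_col: str, test_fraction: float = 0.2):
    """Split on a cutoff timestamp so every test row is later than every train row."""
    times = df[time_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_test = max(1, round(len(times) * test_fraction))
    cutoff = times.iloc[len(times) - n_test]
    return df[df[time_col] < cutoff], df[df[time_col] >= cutoff]


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Usage on a made-up hourly series for a single zone.
demo = pd.DataFrame(
    {
        "pickup_hour": pd.to_datetime("2023-01-02") + pd.to_timedelta(list(range(10)), unit="h"),
        "trip_count": [10, 12, 11, 15, 14, 13, 16, 18, 17, 20],
    }
)
demo["lag_1h"] = demo["trip_count"].shift(1)  # persistence forecast
demo = demo.dropna()

train, test = chronological_split(demo, "pickup_hour", test_fraction=0.2)
baseline_mae = mae(test["trip_count"], test["lag_1h"])
baseline_rmse = rmse(test["trip_count"], test["lag_1h"])
```

A trained model such as the Random Forest Regressor would be scored with the same two functions on the same test rows. Beating the baseline on both metrics is the minimum bar for the model to be useful.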

Step 5 – Forecast Simulation and Application 

  • Implementation of a forecast simulation to estimate taxi demand for a given zone based on a selected date and time, across different forecast horizons

  • Development of an interactive Streamlit application to demonstrate practical usage of the forecasting model
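
One way to simulate forecasts across several horizons is a recursive strategy, in which each prediction is appended to the history and reused as an input for the next step. The source does not say which strategy the project used, so the sketch below is an assumption. `simulate_forecast` and `mean_of_last_three` are hypothetical names, and the stand-in predictor would be replaced by a function that builds the features and calls the trained model.

```python
from typing import Callable, Dict, List, Sequence


def simulate_forecast(
    history: Sequence[float],
    predict_next: Callable[[List[float]], float],
    horizons: Sequence[int],
) -> Dict[int, float]:
    """Forecast several horizons (in hours) by feeding predictions back as inputs."""
    if not horizons or min(horizons) < 1:
        raise ValueError("horizons must contain positive integers")

    window = list(history)  # copy so the caller's history is not modified
    forecasts: Dict[int, float] = {}
    for step in range(1, max(horizons) + 1):
        y_hat = predict_next(window)
        window.append(y_hat)  # the prediction becomes the newest "observation"
        if step in horizons:
            forecasts[step] = y_hat
    return forecasts


def mean_of_last_three(window: List[float]) -> float:
    """Stand-in predictor used only for this example."""
    return sum(window[-3:]) / 3


# Usage: recent hourly counts for one zone, forecast 1, 2, and 3 hours ahead.
forecasts = simulate_forecast([10, 20, 30], mean_of_last_three, horizons=(1, 2, 3))
```

Because errors compound when predictions are reused as inputs, longer horizons are generally less reliable than short ones under this strategy.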

Results & Deliverables

A taxi demand forecasting model was built to estimate the number of trips per hour and per zone from historical data, with its accuracy measured by MAE and RMSE against a naive persistence baseline.
However, the model remains sensitive to external factors not captured in the data, such as weather conditions or exceptional events.
Results were delivered through a comprehensive notebook, as well as a model simulation and a shareable interactive application.

Visual examples illustrating the project: