End-to-End Pipeline
The project is structured as a sequential, seven-notebook pipeline. Each notebook is self-contained yet feeds into the next, allowing any stage to be re-run independently without rebuilding the entire workflow from scratch.
Goal
Predict hourly rideshare trip demand per H3 hexagon cell across Chicago, enabling drivers, dispatchers, and planners to anticipate where demand will surge.
Input
Chicago Data Portal TNC trip records: trip ID, start/end time, pickup/dropoff community area, fare, tip, distance, shared-trip flag.
Output
A trained demand-prediction model and a lightweight API endpoint that returns predicted trip counts for any given H3 cell and time window.
Scale
Handles 100M+ rows from the Chicago Open Data Portal, down-sampled and aggregated to hourly-spatial bins for tractable modeling.
Four Prediction Tasks
This project is not a single monolithic prediction problem — it defines four distinct tasks, each answering a different operational question for the rideshare ecosystem. Tasks range from city-level macro forecasting down to binary hexagon-level signals for real-time driver decisions.
Task Dependency Flow
[Diagram: task dependency flow. Recovered node labels: Demand, Demand, Reposition, Detection.]
External Factors: CTA & Weather
Rideshare demand doesn't exist in isolation — it is directly shaped by two major external forces: Chicago's public transit network (CTA) and real-time weather conditions. Both are integrated as features in the pipeline.
Chicago Transit Authority (CTA)
The CTA operates the elevated L train and an extensive bus network. Rideshare demand spikes when CTA service is disrupted, delayed, or inadequate — especially late at night when train frequency drops. The following CTA-derived features are engineered and merged with the TNC trip data by timestamp and community area.
CTA L Train Ridership
Daily ridership counts per station from the CTA open data portal. Periods of low CTA ridership correlate with elevated rideshare demand, particularly in neighborhoods far from L stops.
Bus Route Proximity
Each H3 cell is enriched with the number of active CTA bus routes within the cell's boundary. Cells with fewer transit options show systematically higher rideshare demand per capita.
Last Train Feature
A binary flag is_post_last_train marks hours after the last CTA Blue/Red Line departure (~1–2 AM). This is one of the strongest late-night demand predictors in the feature set.
Distance to Nearest L Stop
Computed per H3 centroid. Cells more than 1.2 km from any L station show a significant baseline demand lift — especially for inbound commute-direction trips.
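A minimal sketch of how a distance-to-nearest-stop feature could be computed, assuming the haversine formula and a brute-force search over stations. The function names and the coordinates are hypothetical; the source does not say which distance method the project uses.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_stop_km(cell_latlon, stop_latlon):
    """Distance from each cell centroid to its closest stop (brute force)."""
    cells = np.asarray(cell_latlon, dtype=float)[:, None, :]  # (n_cells, 1, 2)
    stops = np.asarray(stop_latlon, dtype=float)[None, :, :]  # (1, n_stops, 2)
    dist = haversine_km(cells[..., 0], cells[..., 1], stops[..., 0], stops[..., 1])
    return dist.min(axis=1)


# Made-up coordinates for illustration; not real centroid or station data.
cell_centroids = [(41.880, -87.630), (41.750, -87.700)]
l_stops = [(41.880, -87.630), (41.885, -87.640)]

dist_km = nearest_stop_km(cell_centroids, l_stops)
far_from_l = dist_km > 1.2  # the 1.2 km threshold quoted in the text
```

With roughly 120 cells and a few hundred stations, the full distance matrix is small, so a spatial index is probably unnecessary at this scale.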
Weather Conditions
Chicago weather is extreme and highly predictive of rideshare demand. Heavy rain, snowstorms, and sub-zero wind chills reliably push riders off CTA platforms and into ride-hailing apps. Weather data is sourced from the OpenWeather API, joined to the trip data by hour and approximated as uniform across the city (given Chicago's relatively small geographic footprint).
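The hourly, citywide join described above might look like the following sketch. The column names (hour_start, observed_at, precip_mm_hr, apparent_temp_f) and the sample values are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical demand table: one row per (cell, hour).
demand = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_b", "cell_a"],
    "hour_start": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 08:00", "2024-01-01 09:00"]
    ),
    "trip_count": [12, 3, 15],
})

# Hypothetical weather observations, not aligned to the top of the hour.
weather = pd.DataFrame({
    "observed_at": pd.to_datetime(["2024-01-01 08:10", "2024-01-01 09:05"]),
    "precip_mm_hr": [0.0, 2.5],
    "apparent_temp_f": [18.0, 17.0],
})

# Collapse observations to one citywide row per hour.
weather["hour_start"] = weather["observed_at"].dt.floor("h")
hourly_weather = weather.groupby("hour_start", as_index=False)[
    ["precip_mm_hr", "apparent_temp_f"]
].mean()

# Every cell in a given hour receives the same weather values.
features = demand.merge(
    hourly_weather, on="hour_start", how="left", validate="many_to_one"
)
```

The validate="many_to_one" argument makes the merge fail loudly if the weather table ever contains duplicate hours, which would otherwise silently duplicate demand rows.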
Precipitation (mm/hr)
Hourly rainfall and snowfall intensity. Rain events above 2 mm/hr show a 15–35% demand lift vs. dry conditions at the same hour.
Temperature & Wind Chill
Apparent temperature (°F). Temperatures below 15°F or above 90°F both correlate with demand increases — Chicago cold suppresses walking, heat drives convenience use.
Snow Accumulation
Daily snow depth (inches). Heavy snow events are the single biggest demand spike triggers outside of holidays — demand can jump 40–60% during major snowstorms.
Visibility & Wind Speed
Low visibility conditions and high wind speeds (>25 mph) further amplify demand, especially for late-night riders unwilling to wait at exposed CTA stops.
Chicago TNC Dataset
The Chicago Data Portal provides the full Transportation Network Company (TNC) trip dataset — covering all trips by Uber, Lyft, and Via starting November 2018, updated continuously.
Raw Schema
| Column | Type | Description | Used In |
|---|---|---|---|
| Trip ID | string | Unique trip identifier (anonymized) | Dedup |
| Trip Start Timestamp | datetime | Time the trip began (rounded to 15-min intervals) | NB 01–05 |
| Trip End Timestamp | datetime | Time the trip ended | NB 01 |
| Trip Seconds | float | Trip duration in seconds | NB 01, 02 |
| Trip Miles | float | Trip distance | NB 01, 02 |
| Pickup Community Area | int | Chicago community area number (1–77) | NB 01–06 |
| Dropoff Community Area | int | Chicago community area number (1–77) | NB 02, 03 |
| Fare | float | Base fare in USD | NB 02 |
| Tip | float | Tip amount in USD | NB 02 |
| Shared Trip Authorized | bool | Whether rider opted into pool / shared ride | NB 02 |
| Pickup Centroid Latitude | float | Lat of pickup area centroid | NB 03, 04 |
| Pickup Centroid Longitude | float | Lon of pickup area centroid | NB 03, 04 |
Volume at a Glance
is_airport feature during modeling.
Data Cleaning
The first notebook ingests raw CSVs from the Chicago Data Portal and applies a systematic cleaning protocol to produce a reliable, analysis-ready dataframe. Given the sheer volume of records, chunked reading and early column dropping are used to keep memory usage manageable.
Cleaning Steps
- Load & sample: Read the dataset in chunks (or a stratified sample), immediately dropping unused columns to reduce memory footprint.
- Parse timestamps: Convert Trip Start Timestamp and Trip End Timestamp to datetime64; extract hour, day_of_week, month, year.
- Remove nulls: Drop rows missing Pickup Community Area or timestamp; these cannot be geo-located and are not recoverable.
- Filter outliers: Remove trips with duration < 60 seconds or > 3 hours, and distance > 100 miles, which are likely data-entry errors or test rides.
- Deduplicate: Drop exact duplicate Trip ID entries to prevent count inflation in aggregation.
- Type casting: Ensure fare, tip, miles are float32 (halving memory vs. float64); community area as int8.
- Export: Save cleaned DataFrame to a compressed .parquet file for fast downstream reads.
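The cleaning steps above can be sketched as a single function applied to each chunk. This is an illustration under assumptions rather than the notebook's actual code: the function name clean_trips and the sample rows are hypothetical, and fare, tip, and the Parquet export are left out to keep the example short.

```python
import pandas as pd


def clean_trips(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to one chunk of raw TNC trip records."""
    df = raw.copy()
    df["trip_start"] = pd.to_datetime(df["Trip Start Timestamp"], errors="coerce")

    # Rows without a pickup area or a parseable start time cannot be located.
    df = df.dropna(subset=["trip_start", "Pickup Community Area"])

    # Keep trips lasting 60 s to 3 h and covering at most 100 miles.
    plausible = df["Trip Seconds"].between(60, 3 * 3600) & (df["Trip Miles"] <= 100)
    df = df[plausible].drop_duplicates(subset="Trip ID")

    return pd.DataFrame({
        "trip_start": df["trip_start"],
        "hour": df["trip_start"].dt.hour.astype("int8"),
        "dow": df["trip_start"].dt.dayofweek.astype("int8"),
        "month": df["trip_start"].dt.month.astype("int8"),
        "pickup_area": df["Pickup Community Area"].astype("int8"),
        "trip_miles": df["Trip Miles"].astype("float32"),
        "trip_seconds": df["Trip Seconds"].astype("float32"),
    }).reset_index(drop=True)


# Made-up rows: a valid trip, its duplicate, a 30 s trip, a trip with no
# pickup area, a 250-mile trip, and a second valid trip.
raw = pd.DataFrame({
    "Trip ID": ["t1", "t1", "t2", "t3", "t4", "t5"],
    "Trip Start Timestamp": ["2024-01-01 08:15:00"] * 5 + ["2024-01-05 23:45:00"],
    "Pickup Community Area": [8, 8, 8, float("nan"), 8, 32],
    "Trip Seconds": [600, 600, 30, 600, 600, 1200],
    "Trip Miles": [2.5, 2.5, 0.1, 2.5, 250.0, 5.0],
})
cleaned = clean_trips(raw)
```

Note that deduplicating within a chunk does not catch duplicates that straddle chunk boundaries; a second pass over Trip ID after concatenation would be needed for that.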
Output Schema
| Column | Type | Notes |
|---|---|---|
| trip_start | datetime64 | Parsed, validated timestamp |
| hour / dow / month | int8 | Extracted temporal components |
| pickup_area | int8 | 1–77 community area code |
| trip_miles / trip_seconds | float32 | Outlier-filtered |
| fare / tip | float32 | USD values |
| shared_authorized | bool | Pool/shared eligibility flag |
Exploratory Data Analysis
The EDA notebook investigates the statistical and temporal structure of Chicago rideshare demand. Key questions: When do trips peak? Where is demand concentrated? How do fare, distance, and tipping behave?
Temporal Patterns
Rideshare demand in Chicago follows two distinct daily rhythms. A smaller morning commute peak occurs around 7–9 AM, while a far more pronounced evening/nightlife peak dominates 10 PM – 2 AM on Fridays and Saturdays. This bimodal structure directly informs the time-based features constructed in Notebook 05.
Key EDA Findings
Nightlife Dominance
Friday and Saturday nights account for a disproportionate share of all weekly trips. The Loop, River North, and Wicker Park are the primary nightlife origin/destination clusters.
Airport Spikes
O'Hare and Midway show demand spikes pegged to flight arrival windows — a pattern distinct from the rest of the city and important to isolate during modeling.
Fare Distribution
Most trips cluster between $8–$20 with a long right tail (airport runs). Tipping is sparse (~15% of trips) and skewed toward longer, non-shared rides.
Shared Rides
Roughly 26% of trips were shared-eligible. Shared trips concentrate in the Loop and near-north neighborhoods, suggesting higher density and willingness to share.
Geographic Skew
The top 5 community areas account for over 50% of all pickups: The Loop, Near North Side, Lake View, O'Hare, and Near West Side.
Seasonal Signal
Summer months (June–August) see higher overall demand; December shows a dip aside from holiday weekend spikes — a clear seasonal pattern for the model to learn.
Map Visualizations
The map visualizations notebook creates interactive and static choropleth maps that overlay rideshare demand onto Chicago's geography using GeoPandas for spatial joins and Folium for interactive HTML maps. Community area boundary shapefiles from the Chicago Data Portal are merged with aggregated trip counts.
Map Types Produced
Pickup Choropleth
Community areas shaded by total pickup volume. Reveals the stark north-south divide in rideshare adoption across Chicago.
Dropoff Flow Map
Origin-destination pairs as arc flows, showing that the majority of trips converge into and out of the Loop corridor.
Time-Sliced Maps
Separate choropleth layers for AM peak, PM peak, and overnight hours, switchable through Folium's layer control widget.
Fare Density
Average fare per trip in each community area. Shows that per-trip costs are actually fairly uniform city-wide (~$14 avg).
All maps are built on a CartoDB Dark Matter basemap — designed to maximize contrast of demand heat signals against Chicago's street grid. Interactive Folium maps are exported as standalone HTML files in the maps/ folder of the project.
Map 1 — Trip Fare Density · 3D Hex Columns (NYE 2024)
Each H3 cell is rendered as a 3D column — height encodes total trip count, color encodes average fare. Yellow columns represent high-fare cells (airports, long-haul pickups), orange represents the dense mid-fare core. The single purple column marks an anomalous high-fare outlier cell. Captured on New Year's Eve 2024 — the highest-demand night of the year.
Map 2 — Demand Percentile Distribution · H3 Hexagons (NYE 2024)
Flat-top H3 hexagons colored by demand percentile rank — yellow = top 10% highest-demand cells, orange = 50th–90th percentile, purple = bottom 50%. This view makes the spatial inequality stark: a small cluster of lakefront hexagons captures the vast majority of all demand while hundreds of outer cells sit in the lowest decile. Captured on New Year's Eve 2024.
The three time-slice panels below show how this distribution morphs through the day:
Map 3 — Origin–Destination Flow Arcs
Trip origin-destination pairs are visualized as arcs, revealing the dominant spoke-and-hub pattern: nearly all significant OD flows terminate in or originate from the Loop corridor and Near North Side. Outer neighborhoods generate mostly short local trips, while airport communities generate long-distance flows.
Map 4 — Average Fare per Community Area
Despite the massive volume imbalance between neighborhoods, average trip fares remain surprisingly uniform city-wide — hovering near $14 per trip. Longer trips from the far south and suburbs drive fares up slightly, while dense short-hop trips in the Loop keep averages down despite high frequency.
H3 Hexagonal Analysis
A critical design decision in this project is switching from administrative community areas to Uber's H3 hierarchical hexagonal grid as the spatial unit of analysis. H3 cells are approximately equal-area, neighbor-consistent, and resolution-scalable, making them far more suitable for demand modeling than irregular administrative boundaries.
Resolution Selection
| H3 Resolution | Avg Cell Area | Cells in Chicago | Suitability |
|---|---|---|---|
| 6 | 36.1 km² | ~18 | Too coarse — loses neighborhood detail |
| 7 | 5.16 km² | ~120 | ✓ Selected — neighborhood-level granularity |
| 8 | 0.74 km² | ~850 | Too fine — many cells with zero demand |
| 9 | 0.10 km² | ~6,000 | Extreme sparsity, not suitable |
H3 Processing Steps
- Convert pickup centroid coordinates to H3 cell index using h3.geo_to_h3(lat, lon, resolution=7).
- Aggregate trip counts per (H3 cell, hour) bin, producing a sparse demand matrix.
- Fill zero-demand cells with explicit zeros (critical for time-series continuity).
- Compute k-ring neighbors for each cell (k=1, 2) to enable spatial lag features.
- Visualize H3 demand density using pydeck or Folium hexagon layers.
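The aggregation and zero-fill steps above can be sketched with pandas alone. The H3 indexing call is left out so the example runs without the h3 package; the cell IDs below are placeholders rather than real H3 indexes, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical trips already tagged with a cell ID.
trips = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_a", "cell_b", "cell_a"],
    "trip_start": pd.to_datetime([
        "2024-01-01 08:15", "2024-01-01 08:45",
        "2024-01-01 08:30", "2024-01-01 10:00",
    ]),
})

# Step 1: count trips per (cell, hour). Hours with no trips are absent here.
trips["hour_start"] = trips["trip_start"].dt.floor("h")
counts = trips.groupby(["h3_cell", "hour_start"]).size().rename("trip_count")

# Step 2: build the complete cell x hour grid and fill the gaps with zeros.
all_hours = pd.date_range(
    trips["hour_start"].min(), trips["hour_start"].max(), freq="h"
)
full_index = pd.MultiIndex.from_product(
    [sorted(trips["h3_cell"].unique()), all_hours],
    names=["h3_cell", "hour_start"],
)
demand = counts.reindex(full_index, fill_value=0).reset_index()
```

Without the explicit zeros, a lag such as "one hour ago" computed by row position would silently skip over empty hours and point at the wrong timestamp.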
H3 Viz 1 — Trip Count 3D Columns · Citywide (NYE 2024)
Each H3 cell is extruded into a 3D cylinder — height and color both encode total trip count. Purple/blue = lower volume, yellow = extreme outliers (Loop core and airport cells that tower over the rest of the city). This rendering was captured on New Year's Eve 2024. The dramatic height asymmetry between the lakefront cluster and the western/southern neighborhoods is immediately apparent.
The full SVG grid below shows the demand structure in a schematic form for reference:
H3 Viz 2 — Why Hexagons? Equidistance Property
This is the core insight from H3's design, drawn directly from the mathematics of tiling: the hexagon is the only regular polygon that tiles the plane with every neighboring cell equidistant from the center. This makes distance-based smoothing, neighbor aggregation, and k-ring convolution geometrically consistent, a property that square and triangular grids lack.
[Figure: neighbor-distance comparison across three grid shapes. Panel labels: "Unequal distances", "2 different distances", "All equidistant ✓".]
H3 Viz 3 — k-Ring Neighbor Expansion
The h3.k_ring(cell, k) function returns all cells within k steps of a center cell, including the center itself. For CHIride Demand, k=1 (6 direct neighbors) and k=2 (18 neighbors including the k=1 ring) are used to build spatial lag features. The resulting disk approximates a circle more closely than the equivalent neighborhood on a square grid, a property exploited by convolution-style techniques borrowed from computer vision and applied to geospatial ML.
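The neighbor counts quoted above follow from hexagonal-grid geometry and can be checked without the h3 package. The sketch below enumerates a disk in axial coordinates, a standard way of addressing hexagons; hex_disk and neighbors_within are hypothetical helper names, not part of H3. (In h3-py 4.x the k_ring and geo_to_h3 functions are named grid_disk and latlng_to_cell.)

```python
def hex_disk(k: int) -> list:
    """Axial (q, r) offsets of every hexagon within k steps of the origin."""
    return [
        (q, r)
        for q in range(-k, k + 1)
        for r in range(max(-k, -q - k), min(k, -q + k) + 1)
    ]


def neighbors_within(k: int) -> int:
    """Number of cells within k steps, excluding the center cell."""
    return len(hex_disk(k)) - 1


# The closed form for the disk size, center included, is 1 + 3k(k + 1).
disk_sizes = {k: len(hex_disk(k)) for k in range(4)}
```

This holds for ordinary hexagonal cells. H3 also contains twelve pentagon cells at each resolution, which have five direct neighbors instead of six.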
H3 Viz 4 — Hourly Demand Time Series by Cell
Each H3 cell has its own unique demand profile. The Loop cell shows two daily peaks on weekdays and a massive late-night spike on weekends. The O'Hare airport cell has a completely different pattern — driven by flight arrival schedules with broad morning and evening clusters. Outer residential cells show nearly flat, low-amplitude series.
H3 Viz 5 — Cell × Hour Demand Matrix
The full modeling dataset is a (H3 cell) × (hour of week) matrix. This heatmap shows a subset of top cells: each row is one H3 cell, each column one hour of the week (Monday 12 AM to Sunday 11 PM). The weekend nightlife intensification shows clearly as bright vertical bands in the rightmost columns.
Feature Engineering
Feature engineering transforms the raw demand time series into a rich tabular dataset that ML models can consume. The engineered features span four categories: temporal, spatial, lag/window, and contextual.
Temporal Features
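Cyclical features such as hour_sin and hour_cos are typically built by mapping a periodic value onto the unit circle, so that hour 23 and hour 0 end up adjacent. A minimal sketch, assuming that standard sine/cosine encoding; the helper name add_cyclical is hypothetical.

```python
import numpy as np
import pandas as pd


def add_cyclical(df: pd.DataFrame, column: str, period: int) -> pd.DataFrame:
    """Add <column>_sin and <column>_cos for a value that repeats every `period`."""
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


frame = pd.DataFrame({"hour": [0, 6, 12, 23], "dow": [0, 3, 6, 6]})
frame = add_cyclical(frame, "hour", 24)
frame = add_cyclical(frame, "dow", 7)
```

Both components are needed: the sine alone cannot distinguish 6 AM from 6 PM-style mirror points, while the (sin, cos) pair identifies each hour uniquely.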
Spatial Features
Lag & Rolling Window Features
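Lag and rolling features of this kind are usually computed per cell on a gap-free hourly series. The sketch below is one way to do it, assuming the h3_cell, hour_start, and trip_count column names. It also assumes rolling windows are taken over past hours only (shifted by one) so the current hour's target does not leak into its own features; the source does not confirm that detail.

```python
import pandas as pd


def add_lag_features(demand: pd.DataFrame) -> pd.DataFrame:
    """Add per-cell lags and past-only rolling means to an hourly demand table."""
    out = demand.sort_values(["h3_cell", "hour_start"]).reset_index(drop=True)
    by_cell = out.groupby("h3_cell")["trip_count"]

    # Lags are row shifts, so every cell must have one row per hour.
    for lag in (1, 2, 3, 24, 168):
        out[f"lag_{lag}h"] = by_cell.shift(lag)

    # Shift by one hour first so each window ends at t-1, not at t.
    past = by_cell.shift(1)
    past_by_cell = past.groupby(out["h3_cell"])
    for window in (3, 6, 24):
        out[f"roll_mean_{window}h"] = past_by_cell.transform(
            lambda s, w=window: s.rolling(w).mean()
        )
    return out


# Made-up demand for two cells over five consecutive hours.
hours = pd.date_range("2024-01-01", periods=5, freq="h")
demand = pd.DataFrame({
    "h3_cell": ["cell_a"] * 5 + ["cell_b"] * 5,
    "hour_start": list(hours) * 2,
    "trip_count": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})
features = add_lag_features(demand)
```

The leading rows of each cell contain NaN for lags that reach before the start of the series; these are commonly dropped or left for a tree model to handle.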
Final Feature Matrix
After engineering, each row represents a single (H3 cell × hour) observation with approximately 38 features; the table below groups 34 of them. The target variable is trip_count, the number of rides starting in that cell during that hour.
| Feature Group | Count | Key Members |
|---|---|---|
| Temporal (raw) | 6 | hour, dow, month, year, is_weekend, is_holiday |
| Temporal (cyclical) | 6 | hour_sin, hour_cos, dow_sin, dow_cos, month_sin, month_cos |
| Spatial | 5 | h3_cell (encoded), is_downtown, is_airport, dist_to_loop, pop_density |
| Lag features | 5 | lag_1h, lag_2h, lag_3h, lag_24h, lag_168h |
| Rolling windows | 5 | roll_mean_3h, roll_mean_6h, roll_mean_24h, roll_std_6h, ewm |
| Spatial lag | 4 | ring1_mean, ring2_mean, ring1_std, ring1_max |
| Context | 3 | is_rush_hour, is_nightlife_hour, days_since_holiday |
Model Training
Multiple models are trained for the four prediction tasks. The task type determines model selection: a Temporal Fusion Transformer (TFT) for the citywide time-series problem (T1), XGBoost regressors for hex-level demand (T3), and classifiers for the binary repositioning and surge tasks (T2, T4). A time-based train/test split ensures no data leakage from the future across all tasks.
Train / Validation / Test Split
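A time-based split can be sketched as two cutoff timestamps that partition the rows chronologically. The cutoff dates and the helper name time_split below are hypothetical; the project's actual split boundaries are not given in this section.

```python
import pandas as pd


def time_split(df: pd.DataFrame, time_col: str, val_start, test_start):
    """Split rows chronologically into train, validation, and test frames."""
    val_start = pd.Timestamp(val_start)
    test_start = pd.Timestamp(test_start)
    if not val_start < test_start:
        raise ValueError("val_start must be earlier than test_start")

    train = df[df[time_col] < val_start]
    val = df[(df[time_col] >= val_start) & (df[time_col] < test_start)]
    test = df[df[time_col] >= test_start]
    return train, val, test


# Ten days of made-up hourly rows for a single cell.
rows = pd.DataFrame({
    "hour_start": pd.date_range("2024-01-01", periods=240, freq="h"),
    "trip_count": range(240),
})
train, val, test = time_split(rows, "hour_start", "2024-01-08", "2024-01-10")
```

Because the longest lag in the feature set is 168 hours, validation and test features still look back into earlier periods. That is acceptable as long as they use only observed past demand, never future values.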
Models Evaluated
Feature Importance (XGBoost)
Model Evaluation & API
The final notebook evaluates the best model (XGBoost) across multiple dimensions — overall accuracy, per-cell residuals, error by time-of-day, and spatial error patterns — then wraps the model in a lightweight prediction API for real-time inference.
Evaluation Metrics
| Metric | Formula | XGBoost (test) | Interpretation |
|---|---|---|---|
| RMSE | √mean((ŷ−y)²) | 3.24 trips | Typical error of ~3 trips/hour/cell, with large misses weighted more heavily |
| MAE | mean(abs(ŷ−y)) | 2.11 trips | Average absolute error of about 2 trips |
| R² | 1 − SS_res/SS_tot | 0.912 | Model explains 91.2% of variance |
| MAPE | mean(abs(ŷ−y)/y) × 100 | 14.3% | Higher in low-demand cells (few trips) |
| Spearman ρ | rank correlation | 0.958 | Excellent ranking of high-demand cells |
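The formulas in the table can be written out directly with NumPy. This sketch makes two assumptions that the source does not state: MAPE is computed only over rows with non-zero actual demand (it is undefined when y = 0, and the dataset contains explicit zero-demand rows), and Spearman ρ is computed as the Pearson correlation of average ranks. The sample values are made up.

```python
import numpy as np
import pandas as pd


def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE, R², MAPE (non-zero targets only), and Spearman rho."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    nonzero = y_true != 0  # assumption: skip rows where MAPE is undefined

    # Spearman rho is the Pearson correlation between the rank vectors.
    ranks_true = pd.Series(y_true).rank()
    ranks_pred = pd.Series(y_pred).rank()

    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1 - ss_res / ss_tot),
        "mape_pct": float(np.mean(np.abs(err[nonzero]) / y_true[nonzero]) * 100),
        "spearman": float(np.corrcoef(ranks_true, ranks_pred)[0, 1]),
    }


scores = regression_metrics([0, 2, 4, 10], [1, 2, 3, 12])
```

How zero-demand rows are treated changes MAPE substantially, so the 14.3% figure is only comparable to other results computed under the same convention.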
Error Analysis
By time of day: The model performs best during steady off-peak hours (2 AM – 6 AM) and daytime. Error is highest in the 10 PM – 1 AM window when demand is volatile and event-driven.
By community area: Airport cells (O'Hare, Midway) have the highest absolute errors due to flight-schedule dependency not captured in temporal features.
Prediction API
The notebook serializes the trained XGBoost model with joblib and exposes a prediction function that takes an H3 cell ID and timestamp, constructs the feature vector from historical demand, and returns a predicted trip count.
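A prediction function of this shape might look like the sketch below. Everything here is illustrative: the FEATURES subset, the function names, and the stand-in model are assumptions, and the real feature vector has many more columns. In practice the model object would be whatever joblib.load returns for the saved XGBoost model; any object with a predict method fits this sketch.

```python
import numpy as np
import pandas as pd

FEATURES = ["hour", "dow", "lag_1h", "lag_24h", "lag_168h"]  # hypothetical subset


def build_features(history: pd.Series, when: pd.Timestamp) -> pd.DataFrame:
    """One-row feature frame for a cell, from its hourly trip-count history."""
    def lag(hours: int) -> float:
        # NaN when the requested hour is missing from the history.
        return float(history.get(when - pd.Timedelta(hours=hours), np.nan))

    row = {
        "hour": when.hour,
        "dow": when.dayofweek,
        "lag_1h": lag(1),
        "lag_24h": lag(24),
        "lag_168h": lag(168),
    }
    return pd.DataFrame([row], columns=FEATURES)


def predict_trips(model, history_by_cell: dict, h3_cell: str, when) -> float:
    """Predicted trip count for one cell and hour, clipped at zero."""
    when = pd.Timestamp(when).floor("h")
    features = build_features(history_by_cell[h3_cell], when)
    return max(0.0, float(model.predict(features)[0]))


class LastWeekModel:
    """Stand-in for the trained model: repeats demand from 168 hours earlier."""

    def predict(self, features: pd.DataFrame) -> np.ndarray:
        return features["lag_168h"].to_numpy()


# Made-up history: 200 hourly counts for one placeholder cell.
index = pd.date_range("2024-01-01", periods=200, freq="h")
history = pd.Series(np.arange(200, dtype=float), index=index)
prediction = predict_trips(
    LastWeekModel(), {"cell_a": history}, "cell_a", "2024-01-08 05:30"
)
```

Clipping at zero is a design choice for count targets, since a regressor can otherwise return small negative values for quiet cells.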
Results & Key Findings
The project trains separate best-in-class models for each of the four prediction tasks. Below is a consolidated view of model choices, accuracy metrics, and key findings across T1 through T4.
Task-by-Task Model Results
T1 (citywide demand): TFT was selected because citywide aggregate demand is a pure time-series problem with strong temporal dependencies, multi-seasonality (daily + weekly + annual), and covariates known in advance (holidays, events). TFT's attention mechanism explicitly learns which historical timesteps are most relevant for each forecast horizon, outperforming XGBoost by 8 percentage points on this task.
T2 (repositioning): The repositioning label is derived from the T3 predictions: a cell receives a Move=1 signal when its predicted demand in the next hour is below the 25th percentile of its historical distribution AND a neighboring ring-1 cell exceeds the 75th percentile. The spatial lag features let XGBoost capture the demand gradient between adjacent cells.
T3 (hex-level demand): The core spatial task. XGBoost was selected over LightGBM and Random Forest after exhaustive comparison; it handles the mixed feature types (temporal cyclical encodings, spatial lag floats, binary flags) most robustly and converges faster with early stopping on the hourly-aggregated training set.
T4 (surge detection): Random Forest was preferred because the surge label is inherently noisy and threshold-dependent, and ensemble averaging over many trees provided better-calibrated probability estimates than XGBoost's boosted structure. The model uses T3 demand predictions, weather flags, CTA service status, and historical surge frequency as features.
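The Move=1 rule for the repositioning task can be expressed in a few lines. This sketch assumes each neighbor is compared against the 75th percentile of its own history, which the text leaves ambiguous; the function name, cell IDs, and numbers are hypothetical.

```python
import numpy as np


def move_signal(cell: str, predicted: dict, history: dict, ring1: dict) -> int:
    """Return 1 when a cell looks quiet while a ring-1 neighbor looks busy."""
    is_quiet = predicted[cell] < np.percentile(history[cell], 25)
    has_busy_neighbor = any(
        # Assumption: each neighbor is judged against its own history.
        predicted[n] > np.percentile(history[n], 75)
        for n in ring1[cell]
    )
    return int(bool(is_quiet) and has_busy_neighbor)


# Made-up values for two adjacent placeholder cells.
history = {"cell_a": [4, 6, 8, 10, 12], "cell_b": [10, 20, 30, 40, 50]}
predicted = {"cell_a": 5, "cell_b": 45}
ring1 = {"cell_a": ["cell_b"], "cell_b": ["cell_a"]}

labels = {cell: move_signal(cell, predicted, history, ring1) for cell in history}
```

Here cell_a is predicted below its own 25th percentile (6) while cell_b is predicted above its 75th percentile (40), so only cell_a is labeled Move=1.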
Top Spatial Findings
50% of trips in 5 areas
Extreme spatial concentration — just 5 of 77 community areas account for more than half of all rideshare trips city-wide, driven by the Loop, Near North Side, and airports.
H3 Res 7 is optimal
Resolution 6 loses neighborhood detail (only 18 cells city-wide). Resolution 8 produces 850+ cells with extreme sparsity. Res 7's ~120 cells is the Goldilocks zone.
Strong spatial autocorrelation
The k=1 ring neighbor mean demand (ring1_mean) ranks as the 4th most important feature; nearby cells are highly predictive of each other, confirming spatial spillover effects.
Airports are structural outliers
O'Hare and Midway exhibit demand patterns driven by flight schedules, not city rhythms. Their MAPE is 28% vs the city-wide 14.3% — a clear modeling challenge for future work.
Top Temporal Findings
Map Gallery — Heatmap Comparisons Across Event Types
These three annotated heatmaps from the project's map visualization notebook show how Chicago's rideshare demand geography changes dramatically depending on the day type — and how the model must generalize across all of them.
Tech Stack
pandas / numpy
Core data wrangling. Chunked CSV reads, datetime parsing, aggregation, and memory-optimized dtypes.
H3-py (Uber)
Hexagonal hierarchical spatial indexing. Converts lat/lon to H3 cells, computes k-ring neighbors for spatial features.
GeoPandas + Folium
Spatial joins with Chicago shapefiles and interactive choropleth HTML maps rendered with Leaflet under the hood.
XGBoost
Primary model. Gradient-boosted trees with GPU acceleration support. Best overall performance on the spatiotemporal regression task.
scikit-learn
Baseline models (Ridge, Random Forest), preprocessing pipelines, cross-validation utilities, and metrics computation.
LightGBM
Competitive alternative to XGBoost. Faster training on large datasets with leaf-wise tree growth and category feature support.
matplotlib / seaborn
Static visualization for EDA plots — distribution histograms, time-series demand charts, and correlation heatmaps.
joblib / parquet
Model serialization and compressed columnar storage for intermediate datasets. Parquet is ~10× faster to read than CSV at this scale.
Environment
Quickstart
1. Clone & Install
2. Download Data
Download the TNC Trips dataset from the Chicago Data Portal. The direct export URL for the full dataset is:
3. Run the Pipeline
Execute notebooks in order. Each notebook reads from and writes to the data/ folder. All intermediate files are stored as .parquet for fast I/O.
4. Run a Prediction
To work with a 5% random sample, set the SAMPLE_FRAC environment variable: SAMPLE_FRAC=0.05 jupyter notebook 01data_cleaning.ipynb