Chicago Ride Demand Forecast · Spatiotemporal ML Pipeline

Chicago Ride
Demand Forecast

A spatiotemporal machine learning pipeline that predicts rideshare trip demand across Chicago's community areas — combining H3 hexagonal indexing, time-series feature engineering, and ensemble models to answer where and when rides will be needed.

Python 3.10+ | H3 · Uber Hexagons | XGBoost / Random Forest | Chicago Data Portal | GeoPandas · Folium

100M+ TNC trips processed

77 community areas

7 notebook pipeline stages

H3 Res 7 spatial resolution

What is CHIride Demand?

CHIride Demand is a full end-to-end data science project that ingests years of public Transportation Network Company (TNC) trip records from the City of Chicago Open Data Portal and trains machine learning models to forecast how many rideshare trips will originate from each hexagonal grid cell in a given hour.

Unlike academic toy datasets, this pipeline handles real-world messiness: hundreds of millions of rows, missing geo data, irregular temporal gaps, and strong spatial autocorrelation — all requiring deliberate engineering choices at every step.

Why Does This Problem Matter?

Rideshare platforms like Uber and Lyft use demand forecasting to pre-position drivers, calibrate surge pricing, and minimize rider wait times. A model that can predict which neighborhoods will see demand spikes in the next hour lets the platform incentivize drivers to rebalance before surge occurs — reducing both empty miles and passenger frustration.

For the public sector, the same predictions inform transit planning, equity analysis, and infrastructure investment decisions.

Key Design Decisions

SPATIAL UNIT

H3 over Admin Boundaries

Chicago's 77 community areas range from 1.6 to 71 km². H3 Resolution 7 provides uniform ~5.2 km² cells, making trip counts directly comparable across space.

TEMPORAL SPLIT

Chronological, Never Shuffled

Time-series data must be split chronologically. Random cross-validation would leak future information into training and produce wildly optimistic metrics.

TARGET VARIABLE

Aggregated Trip Count

Rather than modeling individual trips, we aggregate to (H3 cell × hour) bins. This converts a point-process problem into a well-posed regression task.

Chicago Rideshare Context

The Chicago Data Portal publishes anonymized rideshare data covering all trips by Uber, Lyft, and Via starting November 2018. As of 2024, it holds over 100 million trip records and continues growing at ~4M trips/month.

Chicago rideshare is geographically concentrated: just 5 of 77 community areas (The Loop, Near North Side, Lake View, O'Hare Airport, Near West Side) account for over 50% of all pickups. This skew is a core challenge for the model — it must perform well in both dense urban cores and sparse outer neighborhoods.

Top 5 Pickup Areas — % of Total Trips

The Loop: 18.4%

Near North Side: 12.1%

Lake View: 8.7%

O'Hare Airport: 7.2%

Near West Side: 5.9%

All other 72 areas: 47.7%

⚡

Architecture

End-to-End Pipeline

The project is structured as a sequential, seven-notebook pipeline. Each notebook is self-contained yet feeds into the next, allowing any stage to be re-run independently without rebuilding the entire workflow from scratch.

NB 01 · Data Cleaning ▶ NB 02 · EDA ▶ NB 03 · Map Viz ▶ NB 04 · H3 Analysis ▶ NB 05 · Feature Eng. ▶ NB 06 · Model Training ▶ NB 07 · Eval & API

🎯

Goal

Predict hourly rideshare trip demand per H3 hexagon cell across Chicago, enabling drivers, dispatchers, and planners to anticipate where demand will surge.

📦

Input

Chicago Data Portal TNC trip records: trip ID, start/end time, pickup/dropoff community area, fare, tip, distance, shared-trip flag.

📤

Output

A trained demand-prediction model and a lightweight API endpoint that returns predicted trip counts for any given H3 cell and time window.

🌐

Scale

Handles 100M+ rows from the Chicago Open Data Portal, down-sampled and aggregated to hourly-spatial bins for tractable modeling.

🎯

Prediction Goals

Four Prediction Tasks

This project is not a single monolithic prediction problem — it defines four distinct tasks, each answering a different operational question for the rideshare ecosystem. Tasks range from city-level macro forecasting down to binary hexagon-level signals for real-time driver decisions.

T1 · Citywide Demand

Total Rides Across All of Chicago — Next Hour

Forecasts the aggregate number of rideshare trips that will be requested across the entire city in the coming hour. This is a pure time-series regression problem — a single scalar output capturing macro-level demand dynamics driven by time-of-day, day-of-week, weather, and special events.

📈 Regression 🏙️ City-level TFT · 95% accuracy

T2 · Driver Repositioning Signal

Should Drivers in THIS Hexagon Move? Yes / No

A binary classification signal per H3 cell: should an idle driver currently in this hexagon reposition to a better area? Derived from predicted demand surplus/deficit. Gives drivers an actionable, real-time cue — move or stay — without requiring them to interpret raw demand numbers.

🔀 Binary Classification ⬡ H3 Cell-level Output: 0 / 1

T3 · H3 Hexagon Demand

Rides in THIS Specific Hexagon — Next Hour

Forecasts the trip count for each individual H3 Resolution-7 cell in the next hour. This is the core spatiotemporal regression task — ~120 simultaneous predictions, one per hexagon. Uses the full suite of lag features, spatial neighbor features, and external factors. Enables precise geographic demand heatmaps.

📊 Regression ⬡ Per-Hex ~120 outputs

T4 · Surge Detection

Is a Surge Pricing Event Imminent in THIS Hexagon?

A binary classification task predicting whether a hexagon will experience a surge pricing condition in the next hour — defined as demand exceeding supply by a threshold ratio. Enables proactive driver routing before surge occurs, rather than reactive repositioning after surge has already spiked fares.

⚡ Binary Classification ⬡ H3 Cell-level Output: 0 / 1
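
The surge definition above (demand exceeding supply by a threshold ratio) can be sketched as a small labeling function. This is a minimal illustration, not the project's implementation: the TNC trip records described here do not include driver supply, so `available_drivers` and the `1.5` threshold are hypothetical.

```python
def surge_label(predicted_demand: float, available_drivers: float,
                threshold_ratio: float = 1.5) -> int:
    """Return 1 if demand exceeds supply by the threshold ratio, else 0.

    available_drivers and threshold_ratio are hypothetical inputs used
    only to illustrate the ratio-based definition.
    """
    if available_drivers <= 0:
        # No supply at all: any positive demand counts as a surge condition.
        return int(predicted_demand > 0)
    return int(predicted_demand / available_drivers > threshold_ratio)


# (predicted demand, available drivers) for four example cells
examples = [(30, 10), (12, 10), (5, 0), (0, 0)]
labels = [surge_label(demand, supply) for demand, supply in examples]
print(labels)
```

Keeping the threshold as a parameter makes it easy to tune how early the surge flag fires relative to the observed demand/supply ratio.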

Why four tasks? Each task serves a different stakeholder. T1 helps operations teams scale infrastructure. T2 gives individual drivers a simple binary signal. T3 powers visualization dashboards and APIs. T4 enables dynamic pricing systems to anticipate surge windows before they occur.

Task Dependency Flow

T1 · Citywide Demand ▶ T3 · Hex-Level Demand ▶ T2 · Driver Reposition ▶ T4 · Surge Detection

T3 hex-level forecasts feed both the repositioning signal (T2) and surge detection (T4)

🌦️

External Data Sources

External Factors: CTA & Weather

Rideshare demand doesn't exist in isolation — it is directly shaped by two major external forces: Chicago's public transit network (CTA) and real-time weather conditions. Both are integrated as features in the pipeline.

Chicago Transit Authority (CTA)

The CTA operates the elevated L train and an extensive bus network. Rideshare demand spikes when CTA service is disrupted, delayed, or inadequate — especially late at night when train frequency drops. The following CTA-derived features are engineered and merged with the TNC trip data by timestamp and community area.

🚊

CTA L Train Ridership

Daily ridership counts per station from the CTA open data portal. Low CTA ridership hours correlate with elevated rideshare demand, particularly in neighborhoods far from L stops.

🚌

Bus Route Proximity

Each H3 cell is enriched with the number of active CTA bus routes within the cell's boundary. Cells with fewer transit options show systematically higher rideshare demand per capita.

⏰

Last Train Feature

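A binary flag is_post_last_train marks the overnight hours (roughly 1–5 AM) after most L lines have ended service for the night. The Red and Blue Lines run around the clock, so the flag captures reduced overnight coverage rather than a full shutdown. This is one of the strongest late-night demand predictors in the feature set.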

📍

Distance to Nearest L Stop

Computed per H3 centroid. Cells more than 1.2 km from any L station show a significant baseline demand lift — especially for inbound commute-direction trips.

python · CTA feature integration
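import geopandas as gpd

# Load CTA L stops and project to a metric CRS (EPSG:26971, NAD83 / Illinois East)
# so that distances come out in meters rather than degrees
cta_stops = gpd.read_file("cta_l_stops.geojson").to_crs(epsg=26971)

# df is assumed to be a GeoDataFrame of H3 cell centroids with a CRS already set
centroids = df.geometry.to_crs(epsg=26971)

# Distance from each H3 cell centroid to the nearest L stop, in km
df["dist_to_l_km"] = centroids.apply(lambda pt: cta_stops.distance(pt).min()) / 1000

# Overnight flag: hours 1-5 AM, after most L lines have ended service
# (the Red and Blue Lines run 24 hours, so this marks reduced coverage)
df["is_post_last_train"] = df["hour"].between(1, 5).astype(int)

# CTA ridership merge (daily); cta_daily and the nearest_l_stop column
# are assumed to have been prepared earlier
df = df.merge(cta_daily, on=["date", "nearest_l_stop"], how="left")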

Weather Conditions

Chicago weather is extreme and highly predictive of rideshare demand. Heavy rain, snowstorms, and sub-zero wind chills reliably push riders off CTA platforms and into ride-hailing apps. Weather data is sourced from the OpenWeather API, joined to the trip data by hour and approximated as uniform across the city (given Chicago's relatively small geographic footprint).

🌧️

Precipitation (mm/hr)

Hourly rainfall and snowfall intensity. Rain events above 2mm/hr show a 15–35% demand lift vs. dry conditions at the same hour.

🌡️

Temperature & Wind Chill

Apparent temperature (°F). Temperatures below 15°F or above 90°F both correlate with demand increases — Chicago cold suppresses walking, heat drives convenience use.

🌨️

Snow Accumulation

Daily snow depth (inches). Heavy snow events are the single biggest demand spike triggers outside of holidays — demand can jump 40–60% during major snowstorms.

👁️

Visibility & Wind Speed

Low visibility conditions and high wind speeds (>25 mph) further amplify demand, especially for late-night riders unwilling to wait at exposed CTA stops.

Avg Demand Lift by Weather Condition vs. Clear Sky Baseline

Heavy Snow (>3 in/day): +52%

Rain >2 mm/hr: +28%

Temp <15°F: +22%

Temp >90°F: +14%

Wind >25 mph: +11%

Light Rain <1 mm/hr: +6%

Clear Sky (baseline): 0%
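
The thresholds in the table above translate naturally into binary weather flags that can be joined to the hourly demand data. The sketch below is illustrative only: the observation field names (`snow_in_day`, `precip_mm_hr`, `apparent_temp_f`, `wind_mph`) and the flag names are assumptions, not the project's actual schema.

```python
def weather_flags(obs: dict) -> dict:
    """Map one hourly weather observation to binary condition flags.

    Thresholds follow the demand-lift table; field names are hypothetical.
    """
    return {
        "is_heavy_snow": int(obs["snow_in_day"] > 3),
        "is_heavy_rain": int(obs["precip_mm_hr"] > 2),
        "is_light_rain": int(0 < obs["precip_mm_hr"] < 1),
        "is_extreme_cold": int(obs["apparent_temp_f"] < 15),
        "is_extreme_heat": int(obs["apparent_temp_f"] > 90),
        "is_high_wind": int(obs["wind_mph"] > 25),
    }


# A cold, rainy hour with light wind and no snow on the ground
obs = {"snow_in_day": 0.0, "precip_mm_hr": 3.5,
       "apparent_temp_f": 10.0, "wind_mph": 12.0}
flags = weather_flags(obs)
print(flags)
```

Because weather is treated as uniform across the city, one set of flags per hour can be broadcast to every H3 cell with a join on the hour column.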

NYE 2024 Validation: The visualizations throughout this documentation were captured on New Year's Eve 2024, one of the highest-demand nights of the year. This date was deliberately chosen for validation because it combines cold winter weather, a major event (countdown crowds), post-last-train conditions, and a citywide surge, making it a demanding test of the model's temporal generalization.

🗄️

Data Source

Chicago TNC Dataset

The Chicago Data Portal provides the full Transportation Network Company (TNC) trip dataset — covering all trips by Uber, Lyft, and Via starting November 2018, updated continuously.

Privacy Note: To protect rider privacy, the city rounds trip timestamps to 15-minute intervals and reports pickup/dropoff locations at the census tract level, suppressing the tract when trip counts in a cell fall below a threshold.

Raw Schema

Column	Type	Description	Used In
Trip ID	string	Unique trip identifier (anonymized)	Dedup
Trip Start Timestamp	datetime	Time the trip began (rounded to 15-min intervals)	NB 01–05
Trip End Timestamp	datetime	Time the trip ended	NB 01
Trip Seconds	float	Trip duration in seconds	NB 01, 02
Trip Miles	float	Trip distance	NB 01, 02
Pickup Community Area	int	Chicago community area number (1–77)	NB 01–06
Dropoff Community Area	int	Chicago community area number (1–77)	NB 02, 03
Fare	float	Base fare in USD	NB 02
Tip	float	Tip amount in USD	NB 02
Shared Trip Authorized	bool	Whether rider opted into pool / shared ride	NB 02
Pickup Centroid Latitude	float	Lat of pickup area centroid	NB 03, 04
Pickup Centroid Longitude	float	Lon of pickup area centroid	NB 03, 04

Volume at a Glance

Total raw trips: 100M+

After dedup / clean: ~82M

Rows w/ valid geo: ~74M

Final model data: aggregated

Airport Note: O'Hare (Community Area 76) and Midway (56) exhibit atypically high demand spikes driven by flight schedules. These are retained but flagged separately with an is_airport feature during modeling.

🧹

Notebook 01 · data_cleaning

Data Cleaning

The first notebook ingests raw CSVs from the Chicago Data Portal and applies a systematic cleaning protocol to produce a reliable, analysis-ready dataframe. Given the sheer volume of records, chunked reading and early column dropping are used to keep memory usage manageable.

Cleaning Steps

Load & sample: Read the dataset in chunks (or a stratified sample), immediately dropping unused columns to reduce memory footprint.
Parse timestamps: Convert Trip Start Timestamp and Trip End Timestamp to datetime64; extract hour, day_of_week, month, year.
Remove nulls: Drop rows missing Pickup Community Area or timestamp — these cannot be geo-located and are not recoverable.
Filter outliers: Remove trips with duration < 60 seconds or > 3 hours, and distance < 0.1 miles or > 100 miles, which are likely data-entry errors or test rides.
Deduplicate: Drop exact duplicate Trip ID entries to prevent count inflation in aggregation.
Type casting: Ensure fare, tip, miles are float32 (halving memory vs. float64); community area as int8.
Export: Save cleaned DataFrame to a compressed .parquet file for fast downstream reads.
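
The memory claim in the type-casting step is easy to check. The snippet below is a small sketch using NumPy arrays as stand-ins for DataFrame columns; the row count is arbitrary.

```python
import numpy as np

n_rows = 1_000_000  # arbitrary size for the comparison

fare64 = np.zeros(n_rows, dtype=np.float64)
fare32 = fare64.astype(np.float32)
area8 = np.full(n_rows, 77, dtype=np.int8)

# float32 stores 4 bytes per value versus 8 for float64
ratio = fare64.nbytes // fare32.nbytes
print(ratio)

# int8 covers -128..127, so community area codes 1-77 fit in one byte each
int8_max = int(np.iinfo(np.int8).max)
print(int8_max, area8.nbytes)
```

One caveat worth keeping in mind: integer dtypes such as int8 cannot represent missing values, so the cast only works after rows with a null Pickup Community Area have been dropped.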

python · 01data_cleaning.ipynb
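import pandas as pd

# Columns kept from the raw CSV (illustrative selection based on the raw schema);
# everything else is dropped at read time
KEEP_COLS = ["Trip ID", "Trip Start Timestamp", "Trip Seconds", "Trip Miles",
             "Pickup Community Area", "Fare", "Tip", "Shared Trip Authorized",
             "Pickup Centroid Latitude", "Pickup Centroid Longitude"]

# Load in chunks to manage memory
chunks = []
for chunk in pd.read_csv("tnp_trips.csv", chunksize=500_000,
                          usecols=KEEP_COLS, low_memory=False):
    # Unparseable timestamps become NaT and are dropped with the other nulls
    chunk["trip_start"] = pd.to_datetime(chunk["Trip Start Timestamp"], errors="coerce")
    chunk = chunk.dropna(subset=["Pickup Community Area", "trip_start"])
    # Outlier filters: 0.1-100 miles, 60 seconds to 3 hours
    chunk = chunk[chunk["Trip Miles"].between(0.1, 100)]
    chunk = chunk[chunk["Trip Seconds"].between(60, 10800)]
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
# Deduplicate after concatenation so duplicates split across chunks are caught
df = df.drop_duplicates(subset="Trip ID")
df.to_parquet("cleaned_trips.parquet", compression="snappy")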

Output Schema

Column	Type	Notes
trip_start	datetime64	Parsed, validated timestamp
hour / dow / month	int8	Extracted temporal components
pickup_area	int8	1–77 community area code
trip_miles / trip_seconds	float32	Outlier-filtered
fare / tip	float32	USD values
shared_authorized	bool	Pool/shared eligibility flag

📊

Notebook 02 · eda

Exploratory Data Analysis

The EDA notebook investigates the statistical and temporal structure of Chicago rideshare demand. Key questions: When do trips peak? Where is demand concentrated? How do fare, distance, and tipping behave?

Temporal Patterns

Rideshare demand in Chicago follows two distinct daily rhythms. A smaller morning commute peak occurs around 7–9 AM, while a far more pronounced evening/nightlife peak dominates 10 PM – 2 AM on Fridays and Saturdays. This bimodal structure directly informs the time-based features constructed in Notebook 05.

Visualization · Hourly trip distribution

Trip Count by Hour of Day × Day of Week

Key EDA Findings

🌙

Nightlife Dominance

Friday and Saturday nights account for a disproportionate share of all weekly trips. The Loop, River North, and Wicker Park are the primary nightlife origin/destination clusters.

✈️

Airport Spikes

O'Hare and Midway show demand spikes pegged to flight arrival windows — a pattern distinct from the rest of the city and important to isolate during modeling.

💵

Fare Distribution

Most trips cluster between $8–$20 with a long right tail (airport runs). Tipping is sparse (~15% of trips) and skewed toward longer, non-shared rides.

🤝

Shared Rides

Roughly 26% of trips were shared-eligible. Shared trips concentrate in the Loop and near-north neighborhoods, suggesting higher density and willingness to share.

📍

Geographic Skew

The top 5 community areas account for over 50% of all pickups: The Loop, Near North Side, Lake View, O'Hare, and Near West Side.

📅

Seasonal Signal

Summer months (June–August) see higher overall demand; December shows a dip aside from holiday weekend spikes — a clear seasonal pattern for the model to learn.

Visualization · Geographic concentration

Top 20 Pickup Community Areas by Trip Volume

🗺️

Notebook 03 · map_visualizations

Map Visualizations

The map visualizations notebook creates interactive and static choropleth maps that overlay rideshare demand onto Chicago's geography using GeoPandas for spatial joins and Folium for interactive HTML maps. Community area boundary shapefiles from the Chicago Data Portal are merged with aggregated trip counts.

Map Types Produced

🔥

Pickup Choropleth

Community areas shaded by total pickup volume. Reveals the stark north-south divide in rideshare adoption across Chicago.

🟠

Dropoff Flow Map

Origin-destination pairs as arc flows, showing that the majority of trips converge into and out of the Loop corridor.

⏱️

Time-Sliced Maps

Separate choropleth layers for AM peak, PM peak, and overnight hours — animatable in Folium's layer control widget.

💙

Fare Density

Average fare per community area, normalized by trip count. Shows that per-trip costs are actually fairly uniform city-wide (~$14 avg).

All maps are built on a CartoDB Dark Matter basemap — designed to maximize contrast of demand heat signals against Chicago's street grid. Interactive Folium maps are exported as standalone HTML files in the maps/ folder of the project.

python · 03map_visualizations.ipynb
import os
import geopandas as gpd
import folium
from folium.plugins import HeatMap

# Load Chicago community areas and attach aggregated trip counts
# (trip_counts: DataFrame with area_number and trip_count, assumed built upstream)
gdf = gpd.read_file("community_areas.geojson")
gdf = gdf.merge(trip_counts, on="area_number")

# Base dark map centered on Chicago
m = folium.Map(location=[41.85, -87.65], zoom_start=11, tiles="CartoDB dark_matter")

# Pickup density choropleth layer
folium.Choropleth(
    geo_data=gdf,
    data=gdf,
    columns=["area_number", "trip_count"],
    key_on="feature.properties.area_number",
    fill_color="YlOrRd",
    fill_opacity=0.75,
    line_opacity=0.3,
    legend_name="Pickup Trip Count",
    name="Pickup Density"
).add_to(m)

# Heatmap layer from centroid coordinates
# (heat_df: DataFrame with lat, lon, weight columns, assumed built upstream)
heat_data = heat_df[["lat", "lon", "weight"]].values.tolist()
# HeatMap is instantiated with the data first, then added to the map
HeatMap(heat_data, radius=12, blur=8, name="Heat Map").add_to(m)

folium.LayerControl().add_to(m)
os.makedirs("maps", exist_ok=True)  # make sure the output folder exists
m.save("maps/pickup_density.html")

Map 1 — Trip Fare Density · 3D Hex Columns (NYE 2024)

Each H3 cell is rendered as a 3D column — height encodes total trip count, color encodes average fare. Yellow columns represent high-fare cells (airports, long-haul pickups), orange represents the dense mid-fare core. The single purple column marks an anomalous high-fare outlier cell. Captured on New Year's Eve 2024 — the highest-demand night of the year.

Visualization · Chicago TNC Trip Fare Density · H3 Resolution 7 · 3D Hex Columns · New Year's Eve 2024

Yellow = High fare (avg $22+), mostly airports & long-distance pickups | Orange = Core demand zone ($10–$18): Loop, Near North, Wicker Park | Purple = Anomalous outlier cell | Column height ∝ trip count | Color scale: Low to High

Map 2 — Demand Percentile Distribution · H3 Hexagons (NYE 2024)

Flat-top H3 hexagons colored by demand percentile rank: yellow = top 10% highest-demand cells, orange = 50th–90th percentile, purple = bottom 50%. This view makes the spatial inequality stark: a small cluster of lakefront hexagons captures the vast majority of all demand, while the outer cells that make up the bottom half of the ranking carry very little. Captured on New Year's Eve 2024.

Visualization · H3 Demand Percentile Distribution · Flat Hex Map · New Year's Eve 2024

Yellow = Top 10% demand cells | Orange = 50th–90th percentile | Purple = Bottom 50% (low demand) | Flat hexagons at H3 Resolution 7 · ~5.2 km² each
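
The color bands in this map can be reproduced by ranking cells by demand and bucketing the percentile ranks. The sketch below uses made-up demand values for ten cells; the tie-handling rule (share of cells with strictly lower demand) is an assumption, since the notebook's exact ranking method is not shown.

```python
from bisect import bisect_left


def percentile_ranks(values):
    """Percentile rank (0-100) per value: share of cells with strictly lower demand."""
    ordered = sorted(values)
    n = len(values)
    return [100.0 * bisect_left(ordered, v) / n for v in values]


def band(pct: float) -> str:
    """Bucket a percentile rank into the three map colors."""
    if pct >= 90:
        return "yellow"   # top 10% of cells
    if pct >= 50:
        return "orange"   # 50th-90th percentile
    return "purple"       # bottom 50%


# Made-up trip counts for ten cells, purely for illustration
demand = [0, 2, 5, 9, 14, 20, 35, 60, 120, 400]
bands = [band(p) for p in percentile_ranks(demand)]
print(bands)
```

Ranking by percentile rather than raw count keeps the color scale readable even when a handful of cells dwarf the rest.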

The three time-slice panels below show how this distribution morphs through the day:

7AM – 9AM · WEEKDAY

Morning Commute

Downtown cluster. Transit hub spillover. Moderate volume.

10PM – 2AM · FRI/SAT

Nightlife Peak ★

Concentrated in nightlife corridors. Highest absolute volumes.

3AM – 6AM · ANY DAY

Dead Hours

Near-zero citywide. Only O'Hare and Midway remain active.

Map 3 — Origin–Destination Flow Arcs

Trip origin-destination pairs are visualized as arcs, revealing the dominant spoke-and-hub pattern: nearly all significant OD flows terminate in or originate from the Loop corridor and Near North Side. Outer neighborhoods generate mostly short local trips, while airport communities generate long-distance flows.

OD Arc Map — Top 30 Community Area Pairs by Trip Volume

Map 4 — Average Fare per Community Area

Despite the massive volume imbalance between neighborhoods, average trip fares remain surprisingly uniform city-wide — hovering near $14 per trip. Longer trips from the far south and suburbs drive fares up slightly, while dense short-hop trips in the Loop keep averages down despite high frequency.

Avg Fare Distribution

Notable Outliers

O'Hare Airport $28.40

Midway Airport $22.10

The Loop $11.20

City Average $14.00

⬡

Notebook 04 · h3_analysis

H3 Hexagonal Analysis

A critical design decision in this project is switching from administrative community areas to Uber's H3 hierarchical hexagonal grid as the spatial unit of analysis. H3 cells are approximately equal-area, neighbor-consistent, and resolution-scalable, making them far more suitable for demand modeling than irregular administrative boundaries.

Why H3? Community areas vary enormously in size (1.6 km² to 71 km²), making raw trip counts incomparable across areas. H3 Resolution 7 cells (~5.16 km² each) provide uniform spatial granularity, and neighbor lookups are O(1) — ideal for lag features.

Resolution Selection

H3 Resolution	Avg Cell Area	Cells in Chicago	Suitability
6	36.1 km²	~18	Too coarse — loses neighborhood detail
7	5.16 km²	~120	✓ Selected — neighborhood-level granularity
8	0.74 km²	~850	Too fine — many cells with zero demand
9	0.10 km²	~6,000	Extreme sparsity, not suitable

H3 Processing Steps

Convert pickup centroid coordinates to H3 cell index using h3.geo_to_h3(lat, lon, resolution=7).
Aggregate trip counts per (H3 cell, hour) bin, producing a sparse demand matrix.
Fill zero-demand cells with explicit zeros (critical for time-series continuity).
Compute k-ring neighbors for each cell (k=1, 2) to enable spatial lag features.
Visualize H3 demand density using pydeck or Folium hexagon layers.

python · 04h3_analysis.ipynb
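import h3  # h3-py v3 API; in v4, geo_to_h3 is latlng_to_cell and k_ring is grid_disk
import pandas as pd

RESOLUTION = 7

# Assign each trip to an H3 cell (a list comprehension avoids slow row-wise apply)
df["h3_cell"] = [
    h3.geo_to_h3(lat, lon, RESOLUTION)
    for lat, lon in zip(df["pickup_lat"], df["pickup_lon"])
]

# Aggregate to hourly demand per cell (use freq="1h" on pandas 2.2+)
demand = (df
    .groupby(["h3_cell", pd.Grouper(key="trip_start", freq="1H")])
    .size()
    .reset_index(name="trip_count")
)

# Compute ring-1 neighbor cells once per unique cell, then map onto the rows
neighbors = {c: list(h3.k_ring(c, 1) - {c}) for c in demand["h3_cell"].unique()}
demand["neighbors"] = demand["h3_cell"].map(neighbors)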
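
The zero-filling step from the processing list is not part of the excerpt above. One way it might look, as a sketch under stated assumptions: the toy frame, the placeholder cell IDs (`cell_a`, `cell_b`), and the choice to span the full observed time range are all illustrative.

```python
import pandas as pd

# Toy sparse demand table: hours with no trips are simply absent
demand = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_a", "cell_b"],
    "trip_start": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 02:00", "2024-01-01 01:00"]
    ),
    "trip_count": [5, 3, 7],
})

# Every hour between the first and last observation
hours = pd.date_range(demand["trip_start"].min(),
                      demand["trip_start"].max(), freq="60min")

# Full (cell x hour) grid; missing combinations become explicit zeros
full_index = pd.MultiIndex.from_product(
    [demand["h3_cell"].unique(), hours], names=["h3_cell", "trip_start"]
)
dense = (demand
    .set_index(["h3_cell", "trip_start"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(dense)
```

Without this step, a `shift(1)` within a cell would point at the previous observed row rather than the previous hour, which silently corrupts lag features.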

H3 Viz 1 — Trip Count 3D Columns · Citywide (NYE 2024)

Each H3 cell is extruded into a 3D cylinder — height and color both encode total trip count. Purple/blue = lower volume, yellow = extreme outliers (Loop core and airport cells that tower over the rest of the city). This rendering was captured on New Year's Eve 2024. The dramatic height asymmetry between the lakefront cluster and the western/southern neighborhoods is immediately apparent.

Visualization · Chicago Ride Demand · H3 3D Trip Count Cylinders · PyDeck · New Year's Eve 2024

Yellow cylinders = Extreme demand (Loop, lakefront near-north) — height exceeds all surrounding cells | Purple = High-mid demand | Blue = Low-volume residential and outer areas | Rendered with PyDeck ColumnLayer · H3 Resolution 7

The full SVG grid below shows the demand structure in a schematic form for reference:

Visualization · H3 Resolution 7 · ~120 cells cover Chicago · Demand aggregated over all hours · Color scale: 0 trips to 500+

H3 Viz 2 — Why Hexagons? Equidistance Property

This is the core insight from H3's design, drawn directly from the mathematics of tiling: of the three regular polygons that tile the plane (triangles, squares, and hexagons), hexagons are the only one where all neighbors are equidistant from the center. This makes distance-based smoothing, neighbor aggregation, and k-ring convolution geometrically consistent, a property squares and triangles don't have.

Triangle: 12 neighbors, unequal distances

Square: 8 neighbors, 2 different distances

Hexagon ✓: 6 neighbors, all equidistant ✓

H3 Viz 3 — k-Ring Neighbor Expansion

The h3.k_ring(cell, k) function returns all cells within k steps of a center cell, including the center itself. For CHIride Demand, k=1 (6 direct neighbors) and k=2 (18 neighbors, including the 6 from k=1) are used to build spatial lag features. The neighborhood stays compact and symmetric around the center as k grows, which is why convolution-style techniques from computer vision carry over to hexagonal grids in geospatial ML.

k = 0 · origin: 1 cell

k = 1 · ring-1: 7 cells total

k = 2 · ring-2: 19 cells total

k = 3 · ring-3: 37 cells total
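
The cell totals above follow the closed form 1 + 3k(k + 1). The sketch below checks that formula against a brute-force enumeration on an idealized hexagonal grid in axial coordinates. It does not call the h3 library, and it ignores H3's pentagon cells, which have only five neighbors.

```python
def hex_disk(k: int) -> set:
    """All axial (q, r) coordinates within k steps of the origin on a hex grid."""
    return {
        (q, r)
        for q in range(-k, k + 1)
        for r in range(max(-k, -q - k), min(k, -q + k) + 1)
    }


def disk_size(k: int) -> int:
    """Closed-form number of cells within k steps, center included."""
    return 1 + 3 * k * (k + 1)


sizes = [len(hex_disk(k)) for k in range(4)]
print(sizes)
```

Each new ring adds 6k cells, which is where the 3k(k + 1) term comes from.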

H3 Viz 4 — Hourly Demand Time Series by Cell

Each H3 cell has its own unique demand profile. The Loop cell shows two daily peaks on weekdays and a massive late-night spike on weekends. The O'Hare airport cell has a completely different pattern — driven by flight arrival schedules with broad morning and evening clusters. Outer residential cells show nearly flat, low-amplitude series.

Visualization · 24-Hour Demand Profile · Average across all days · By selected H3 cells: The Loop, O'Hare Airport, Wicker Park, Outer South

H3 Viz 5 — Cell × Hour Demand Matrix

The full modeling dataset is a (H3 cell) × (hour of week) matrix. This heatmap shows a subset of top cells — each row is one H3 cell, each column one hour of the week (Monday 0AM to Sunday 11PM). The weekend nightlife intensification shows clearly as bright vertical bands in the rightmost columns.

Visualization · H3 Cell × Hour-of-Week Demand Matrix · Top 8 cells by mean demand · 168 columns = Mon 12 AM → Sun 11 PM (one band per day, Mon through Sun)

⚙️

Notebook 05 · feature_engineering

Feature Engineering

Feature engineering transforms the raw demand time series into a rich tabular dataset that ML models can consume. The engineered features span four categories: temporal, spatial, lag/window, and contextual.

Temporal Features

Cyclical encoding: Hours, days of week, and months are encoded as sine/cosine pairs so the model understands their circular nature — e.g., 11 PM and 1 AM are close in time, not far apart.
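
A quick numeric check makes the benefit of cyclical encoding concrete. In the sketch below, the helper name `encode_hour` is illustrative; the encoding itself is the standard sine/cosine pair described above.

```python
import math


def encode_hour(hour: int) -> tuple:
    """Place an hour of day on the unit circle as (sin, cos)."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))


# Raw hour values put 11 PM and 1 AM 22 units apart
gap_raw = abs(23 - 1)

# On the circle, 11 PM -> 1 AM is the same distance as any other 2-hour gap
gap_late_night = math.dist(encode_hour(23), encode_hour(1))
gap_midday = math.dist(encode_hour(10), encode_hour(12))

print(gap_raw, round(gap_late_night, 4), round(gap_midday, 4))
```

Both sine and cosine are needed: with only one of them, two different hours of the day map to the same value.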

Spatial Features

Lag & Rolling Window Features

python · 05feature_eng.ipynb — lag features
import numpy as np

# Sort by cell + time, then create lag features within each cell.
# shift(n) means "n hours ago" only if every cell has a row for every hour,
# i.e., zero-demand hours were filled in during Notebook 04.
df = df.sort_values(["h3_cell", "trip_start"])

for lag in [1, 2, 3, 24, 168]:
    df[f"lag_{lag}h"] = df.groupby("h3_cell")["trip_count"].shift(lag)

# Cyclical encoding for hour
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Spatial lag: mean demand of ring-1 neighbors in the previous hour.
# Using the neighbors' same-hour counts would leak the target period into the features.
prev_demand = df.set_index(["h3_cell", "trip_start"])["lag_1h"]
df["ring1_mean_demand"] = df.apply(
    lambda r: prev_demand.loc[
        [(n, r.trip_start) for n in r.neighbors if (n, r.trip_start) in prev_demand.index]
    ].mean(), axis=1
)
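
The rolling-window group in the feature matrix (roll_mean_3h and similar) is not covered by the excerpt above. A minimal sketch of one such feature follows, using a toy frame with placeholder cell IDs. Shifting by one hour before rolling is an assumption made here so that the window only sees past values.

```python
import pandas as pd

# Toy dense (cell x hour) demand table with placeholder cell IDs
df = pd.DataFrame({
    "h3_cell": ["cell_a"] * 5 + ["cell_b"] * 5,
    "trip_start": list(pd.date_range("2024-01-01", periods=5, freq="60min")) * 2,
    "trip_count": [2, 4, 6, 8, 10, 1, 1, 1, 1, 1],
}).sort_values(["h3_cell", "trip_start"])

# Mean of the previous 3 hours within each cell; shift(1) excludes the current hour
df["roll_mean_3h"] = df.groupby("h3_cell")["trip_count"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
print(df)
```

Grouping by cell before rolling keeps the window from reaching across the boundary between one cell's series and the next.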

Final Feature Matrix

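After engineering, each row represents a single (H3 cell × hour) observation with approximately 38 features. The table below itemizes 34 of them by group; the external CTA and weather features described earlier are not listed. The target variable is trip_count, the number of rides starting in that cell during that hour.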

Feature Group	Count	Key Members
Temporal (raw)	6	hour, dow, month, year, is_weekend, is_holiday
Temporal (cyclical)	6	hour_sin, hour_cos, dow_sin, dow_cos, month_sin, month_cos
Spatial	5	h3_cell (encoded), is_downtown, is_airport, dist_to_loop, pop_density
Lag features	5	lag_1h, lag_2h, lag_3h, lag_24h, lag_168h
Rolling windows	5	roll_mean_3h, roll_mean_6h, roll_mean_24h, roll_std_6h, ewm
Spatial lag	4	ring1_mean, ring2_mean, ring1_std, ring1_max
Context	3	is_rush_hour, is_nightlife_hour, days_since_holiday
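
The flag-style features in the table can be derived directly from the timestamp. The sketch below is illustrative: the morning window (7–9 AM) and the nightlife window (10 PM – 2 AM on Friday and Saturday nights) come from the EDA section, while the evening rush window (4–7 PM) is an assumption, as is the exact boundary handling.

```python
from datetime import datetime


def context_flags(ts: datetime) -> dict:
    """Derive binary time-context flags from a timestamp."""
    hour, dow = ts.hour, ts.weekday()  # Monday = 0 ... Sunday = 6
    is_weekend = dow >= 5
    # Weekday commute windows; the 4-7 PM evening window is an assumption
    is_rush_hour = (not is_weekend) and (7 <= hour < 9 or 16 <= hour < 19)
    # Fri/Sat nights: 10 PM onward, plus the first two hours of the next day
    is_nightlife_hour = (dow in (4, 5) and hour >= 22) or (dow in (5, 6) and hour < 2)
    return {
        "is_weekend": int(is_weekend),
        "is_rush_hour": int(is_rush_hour),
        "is_nightlife_hour": int(is_nightlife_hour),
    }


friday_night = context_flags(datetime(2024, 3, 15, 23, 0))    # Friday 11 PM
saturday_early = context_flags(datetime(2024, 3, 16, 1, 0))   # Saturday 1 AM
monday_morning = context_flags(datetime(2024, 3, 18, 8, 0))   # Monday 8 AM
print(friday_night, saturday_early, monday_morning)
```

Treating the hours after midnight as part of the previous evening keeps a single night out from being split across two different flag values.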

🤖

Notebook 06 · model_training

Model Training

Multiple models are trained for the four prediction tasks. The task type determines model selection: TFT for the citywide time-series problem (T1), XGBoost regressors for hex-level demand (T3), and classifiers for the binary repositioning and surge tasks (T2, T4). A time-based train/test split ensures no data leakage from the future across all tasks.

T1 uses Google's Temporal Fusion Transformer (TFT) — a deep learning architecture combining LSTM encoders, multi-head attention, and variable selection networks. It achieves 95% accuracy on citywide demand forecasting by learning which temporal patterns are most predictive at each horizon.

Train / Validation / Test Split

python · 06model_training.ipynb — temporal split
# Time-based split — NO random shuffling for time series data
cutoff_val  = "2022-09-01"
cutoff_test = "2022-11-01"

train = df[df.trip_start <  cutoff_val]
val   = df[(df.trip_start >= cutoff_val) & (df.trip_start <  cutoff_test)]
test  = df[df.trip_start >= cutoff_test]

X_train, y_train = train[FEATURES], train["trip_count"]
X_val,   y_val   = val[FEATURES],   val["trip_count"]
X_test,  y_test  = test[FEATURES],  test["trip_count"]

Models Evaluated

Model	Configuration	R² (test)	RMSE
XGBoost Regressor (best)	Gradient-boosted trees · n_estimators=500 · max_depth=7 · lr=0.05 · subsample=0.8	0.912	3.24
Random Forest Regressor	Bagged trees · n_estimators=300 · max_depth=15 · min_samples_leaf=4	0.887	4.01
LightGBM Regressor	Leaf-wise gradient boosting · num_leaves=63 · learning_rate=0.05	0.903	3.68
Ridge Regression (baseline)	Linear model · alpha=1.0 · StandardScaler preprocessing	0.741	7.92

Feature Importance (XGBoost)

Feature	Importance
lag_1h	0.213
lag_24h	0.181
lag_168h	0.163
ring1_mean_demand	0.137
hour_sin	0.108
is_nightlife_hour	0.086
roll_mean_6h	0.063
is_weekend	0.047

Key insight: Lag features (1h, 24h, 168h) dominate, showing that demand is highly autocorrelated at hourly, daily, and weekly intervals. Spatial lag (ring-1 neighbors) ranks 4th, confirming that nearby cells are predictive. Temporal features matter most for unusual demand spikes (nights out, events).

✅

Notebook 07 · model_eval_api

Model Evaluation & API

The final notebook evaluates the best model (XGBoost) across multiple dimensions — overall accuracy, per-cell residuals, error by time-of-day, and spatial error patterns — then wraps the model in a lightweight prediction API for real-time inference.

Evaluation Metrics

Metric	Formula	XGBoost (test)	Interpretation
RMSE	√mean((ŷ−y)²)	3.24 trips	Typical prediction error of ~3 trips/hour/cell
MAE	mean|ŷ−y|	2.11 trips	Mean absolute error of about 2 trips
R²	1 − SS_res/SS_tot	0.912	Model explains 91.2% of variance
MAPE	mean(|ŷ−y|/y) × 100	14.3%	Higher in low-demand cells (few trips); undefined where y = 0
Spearman ρ	rank correlation	0.958	Excellent ranking of high-demand cells
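
The formulas in the table can be written out directly. The sketch below computes RMSE, MAE, R², and MAPE on a toy example; the numbers are made up, and skipping zero-demand rows in MAPE is an assumption, since the notebook's handling of y = 0 is not shown.

```python
import math


def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE, R² and MAPE as defined in the evaluation table."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    # MAPE divides by y, so zero-demand rows are skipped (assumption)
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
        "mape": 100 * sum(abs(e) / t for t, e in nonzero) / len(nonzero),
    }


# Toy (cell x hour) observations, including one zero-demand hour
y_true = [0, 10, 20, 30]
y_pred = [1, 12, 18, 30]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Reporting MAE alongside RMSE is useful because RMSE weights large misses more heavily, so a gap between the two hints at a few big errors rather than many small ones.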

Error Analysis

By time of day: The model performs best during steady off-peak hours (2 AM – 6 AM) and daytime. Error is highest in the 10 PM – 1 AM window when demand is volatile and event-driven.

By community area: Airport cells (O'Hare, Midway) have the highest absolute errors due to flight-schedule dependency not captured in temporal features.

Residuals by Hour

Prediction API

The notebook serializes the trained XGBoost model with joblib and exposes a prediction function that takes an H3 cell ID and timestamp, constructs the feature vector from historical demand, and returns a predicted trip count.

python · 07model_eval_api.ipynb — inference function
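import joblib
from datetime import datetime
import h3

# Path follows the project layout (models/ at the repo root).
model = joblib.load("models/xgb_demand_model.joblib")

def predict_demand(lat: float, lon: float, timestamp: datetime) -> dict:
    """
    Predict rideshare demand for a given location and time.
    Returns the trip count estimate plus a fixed error band.
    """
    # h3-py 3.x call; in h3-py 4.x the equivalent is h3.latlng_to_cell(lat, lon, 7).
    cell = h3.geo_to_h3(lat, lon, resolution=7)
    # build_feature_vector (not shown in this excerpt) assembles the model's input features.
    features = build_feature_vector(cell, timestamp)
    prediction = model.predict([features])[0]

    return {
        "h3_cell":    cell,
        "timestamp":  timestamp.isoformat(),
        "predicted_trips": round(float(prediction), 2),
        # Constant band close to the test RMSE, not a per-prediction interval.
        "confidence": "±3.2 trips (1σ)"
    }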

# Example call
result = predict_demand(
    lat=41.8819, lon=-87.6278,     # The Loop
    timestamp=datetime(2024, 3, 15, 23, 0)  # Friday 11 PM
)
# → {"h3_cell": "<res-7 cell ID for the Loop>", "predicted_trips": 47.3, ...}  (illustrative values)
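
The listing calls `build_feature_vector` without showing it. As a rough illustration of what "constructs the feature vector from historical demand" could look like, the sketch below looks up lagged counts in an in-memory history. Everything here is an assumption: the function name, the extra `history` and `ring1_cells` arguments, the feature order, and the zero fallback for missing hours. A real model needs the exact columns and order it was trained on.

```python
import math
from datetime import datetime, timedelta

# Hypothetical store: (cell ID, hour start) -> observed trip count.
History = dict[tuple[str, datetime], float]

def build_feature_vector_sketch(cell: str, timestamp: datetime,
                                history: History, ring1_cells: list[str]) -> list[float]:
    def lag(hours: int) -> float:
        # Demand in this cell `hours` ago; 0.0 when that hour is missing (assumption).
        return history.get((cell, timestamp - timedelta(hours=hours)), 0.0)

    prev_hour = timestamp - timedelta(hours=1)
    neighbor_counts = [history.get((n, prev_hour), 0.0) for n in ring1_cells]
    ring1_mean = sum(neighbor_counts) / len(neighbor_counts) if neighbor_counts else 0.0

    angle = 2 * math.pi * timestamp.hour / 24   # cyclical encoding of hour of day
    is_weekend = float(timestamp.weekday() >= 5)
    return [lag(1), lag(24), lag(168), ring1_mean,
            math.sin(angle), math.cos(angle), is_weekend]

# Made-up history for one cell "A" with two ring-1 neighbors "B" and "C".
ts = datetime(2024, 3, 15, 23, 0)
history = {
    ("A", ts - timedelta(hours=1)): 40.0,
    ("A", ts - timedelta(hours=24)): 35.0,
    ("A", ts - timedelta(hours=168)): 50.0,
    ("B", ts - timedelta(hours=1)): 20.0,
    ("C", ts - timedelta(hours=1)): 30.0,
}
vector = build_feature_vector_sketch("A", ts, history, ["B", "C"])
```

Passing the neighbor list in explicitly keeps the sketch independent of the h3-py version; in practice the ring-1 cells would come from the library's k-ring lookup.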

Visualization · Predicted vs Actual

Scatter: Predicted Trip Count vs Actual (Test Set)

🏆

Summary

Results & Key Findings

The project trains a separate model for each of the four prediction tasks (T1 through T4), keeping the strongest candidate per task. Below is a consolidated view of model choices, accuracy metrics, and key findings.

Task-by-Task Model Results

T1 · CITYWIDE DEMAND FORECAST

Google Temporal Fusion Transformer (TFT)

Multi-horizon time-series forecasting with attention mechanism · 95% accuracy

95%

Accuracy

0.95

R² Test

TFT was selected for T1 because citywide aggregate demand is a pure time-series problem with strong temporal dependencies, multi-seasonality (daily + weekly + annual), and calendar covariates that are known in advance (holidays, events). TFT's attention mechanism explicitly learns which historical timesteps are most relevant for each forecast horizon, and it outperforms XGBoost by 8 percentage points on this task (95% vs 87%).

Attention mechanism · Multi-horizon forecasting · Encoder–Decoder LSTM · Variable selection networks

T1 Model Comparison — Accuracy (%)

Model	Accuracy
TFT (best)	95%
XGBoost	87%
LightGBM	85%

T2 · DRIVER REPOSITIONING SIGNAL

XGBoost Classifier

Binary classification (Move=1 / Stay=0) per H3 cell per hour

88%

F1 Score

91%

AUC-ROC

The repositioning label is derived from the T3 predictions: a cell receives a Move=1 signal when its predicted demand in the next hour is below the 25th percentile of its historical distribution AND a neighboring ring-1 cell exceeds the 75th percentile. XGBoost's tree splits pick up the spatial gradient between adjacent cells, which then shows in its feature importances.
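
The labeling rule can be written as a small function. This is a sketch under one stated assumption: each neighbor's prediction is compared against the 75th percentile of that neighbor's own history, which the text does not spell out. The function and argument names are hypothetical.

```python
import numpy as np

def move_signal(cell_pred: float, cell_history, neighbor_preds, neighbor_histories) -> int:
    """Return 1 (Move) or 0 (Stay) for one cell and one hour."""
    # Low demand here: prediction below the cell's own 25th percentile.
    low_here = cell_pred < np.percentile(cell_history, 25)
    # High demand nearby: any ring-1 neighbor predicted above its own 75th percentile.
    high_nearby = any(
        pred > np.percentile(hist, 75)
        for pred, hist in zip(neighbor_preds, neighbor_histories)
    )
    return int(bool(low_here) and high_nearby)

# Made-up history: hourly counts 1..100 for the cell and for one neighbor.
hist = np.arange(1, 101)
label_move = move_signal(10, hist, [90], [hist])   # quiet cell, busy neighbor
label_stay = move_signal(50, hist, [90], [hist])   # cell itself is not quiet
```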

Precision: 0.86 · Recall: 0.90 · Derived from T3 predictions
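
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 scores should follow from the precision and recall figures. The snippet below applies that formula to the T2 values here and to the T4 values reported in the surge-detection card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

t2_f1 = f1_score(0.86, 0.90)   # T2 repositioning classifier
t4_f1 = f1_score(0.83, 0.87)   # T4 surge classifier
print(f"T2 F1 = {t2_f1:.2f}, T4 F1 = {t4_f1:.2f}")
# prints: T2 F1 = 0.88, T4 F1 = 0.85
```

Both results match the 88% and 85% F1 scores shown on the cards.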

T3 · H3 HEXAGON DEMAND

XGBoost Regressor

Per-cell trip count regression · ~120 H3 cells · R² = 0.912

0.912

R² Test

3.24

RMSE

The core spatial task. XGBoost was selected over LightGBM and Random Forest after exhaustive comparison — it handles the mixed feature types (temporal cyclical encodings, spatial lag floats, binary flags) most robustly and converges faster with early stopping on the hourly-aggregated training set.

MAE: 2.11 trips · MAPE: 14.3% · Spearman ρ: 0.958 · n_estimators: 500

T4 · SURGE DETECTION

Random Forest Classifier

Binary surge event prediction per H3 cell · AUC-ROC = 0.94

0.94

AUC-ROC

85%

F1 Score

Random Forest was preferred for surge detection because the surge label is inherently noisy and threshold-dependent — ensemble averaging over many trees provides better-calibrated probability estimates than XGBoost's boosted structure. The model uses T3 demand predictions, weather flags, CTA service status, and historical surge frequency as features.

Precision: 0.83 · Recall: 0.87 · n_estimators: 300 · Uses weather + CTA

Top Spatial Findings

🔺

50% of trips in 5 areas

Extreme spatial concentration — just 5 of 77 community areas account for more than half of all rideshare trips city-wide, driven by the Loop, Near North Side, and airports.

⬡

H3 Res 7 is optimal

Resolution 6 loses neighborhood detail (only 18 cells city-wide). Resolution 8 produces 850+ cells with extreme sparsity. Res 7's ~120 cells is the Goldilocks zone.
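
A back-of-the-envelope check of these cell counts divides the city's area by the average H3 hexagon area at each resolution. The city area of about 606 km² is an assumption used only for illustration; the hexagon areas are the published H3 averages, rounded.

```python
CHICAGO_AREA_KM2 = 606.0   # assumption: approximate total area of the city

# Average H3 hexagon area in km² by resolution (rounded).
AVG_HEX_AREA_KM2 = {6: 36.129, 7: 5.161, 8: 0.737}

def estimated_cells(area_km2: float, resolution: int) -> float:
    """Rough number of hexagons needed to tile the given area."""
    return area_km2 / AVG_HEX_AREA_KM2[resolution]

estimates = {res: round(estimated_cells(CHICAGO_AREA_KM2, res)) for res in AVG_HEX_AREA_KM2}
print(estimates)
# prints: {6: 17, 7: 117, 8: 822}
```

The counts quoted above are slightly higher than these estimates, which is expected when cells that straddle the city boundary are included.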

🔗

Strong spatial autocorrelation

The k=1 ring neighbor mean demand (ring1_mean_demand) ranks as the 4th most important feature — nearby cells are highly predictive of each other, confirming spatial spillover effects.

✈️

Airports are structural outliers

O'Hare and Midway exhibit demand patterns driven by flight schedules, not city rhythms. Their MAPE is 28% vs the city-wide 14.3% — a clear modeling challenge for future work.

Top Temporal Findings

Total Citywide Trip Volume — by Day of Week + Hour of Day

Each column = one day (Mon–Sun). Y-axis = hour 0–23. Brightness encodes trip volume. Friday/Saturday nights dominate.

Map Gallery — Heatmap Comparisons Across Event Types

These three annotated heatmaps from the project's map visualization notebook show how Chicago's rideshare demand geography changes dramatically depending on the day type — and how the model must generalize across all of them.

Weekday Demand Pattern · Mon–Fri · December 2024 · 22 Weekdays · 200k points sampled · Baseline Pattern

Very High · High · Moderate · Low · Very Low | 2,808,065 total trips · Avg 127,639/day · Dominant pattern: bimodal commute (AM + PM) · Hottest zone: Loop business district

New Year's Eve Midnight · Dec 31 10PM – Jan 1 2AM · 4-Hour Window · Peak Demand Event

Extreme Demand · Very High · High · Moderate · Low | 27,714 trips in 4 hours · 1.2× surge vs normal · Top zone: River North / Near North · Entire city coast lit purple

Christmas Day · Full 24 Hours of December 25, 2024 · Holiday Drop −53%

Highest (still below normal) · Moderate · Low · Very Low · Minimal | 62,768 trips (47% of avg) · 53% below normal · Most active: O'Hare Airport · City nearly dark except airport + Loop

Model Challenge (Holiday Generalization): The Christmas Day map shows the model must handle a 53% demand drop with almost no signal in outer neighborhoods. The TFT's variable selection network and holiday embedding explicitly learned this pattern. The NYE midnight window shows a +23% lift above normal 4-hour blocks, concentrated in a tight geographic corridor; this window tested the T4 surge classifier hardest.

Key takeaway: The lag features (lag_1h, lag_24h, lag_168h) are the dominant predictors by a wide margin — accounting for ~56% of model importance. This confirms that rideshare demand is highly auto-correlated: what happened at the same place last hour, yesterday, and last week is the best predictor of what will happen now. Spatial neighbor features add a further ~14%, reducing prediction error especially in transitional zones between hot and cold cells.
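
For reference, lag features of this kind are commonly built with a grouped shift. The sketch below assumes a frame with hypothetical columns `h3_cell`, `hour`, and `trips`, and one row per cell per hour with no gaps, so that shifting by k rows equals shifting by k hours. If hours are missing, the grid has to be filled in first.

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, lags=(1, 24, 168)) -> pd.DataFrame:
    """Add lag_1h, lag_24h, lag_168h columns computed within each cell."""
    out = df.sort_values(["h3_cell", "hour"]).copy()
    for k in lags:
        # Shift inside each cell so one cell's history never leaks into another's.
        out[f"lag_{k}h"] = out.groupby("h3_cell")["trips"].shift(k)
    return out

# Made-up data: two cells, 200 consecutive hours each.
hours = pd.date_range("2024-12-01", periods=200, freq="h")
frame = pd.DataFrame({
    "h3_cell": ["A"] * 200 + ["B"] * 200,
    "hour": list(hours) * 2,
    "trips": list(range(200)) + list(range(1000, 1200)),
})
features = add_lag_features(frame)
```

The first 168 rows of each cell have no weekly lag and come out as NaN; they are typically dropped or imputed before training.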

🛠️

Reference

Tech Stack

🐼

pandas / numpy

Core data wrangling. Chunked CSV reads, datetime parsing, aggregation, and memory-optimized dtypes.

⬡

H3-py (Uber)

Hexagonal hierarchical spatial indexing. Converts lat/lon to H3 cells, computes k-ring neighbors for spatial features.

🗺️

GeoPandas + Folium

Spatial joins with Chicago shapefiles and interactive choropleth HTML maps rendered with Leaflet under the hood.

🚀

XGBoost

Primary model. Gradient-boosted trees with GPU acceleration support. Best overall performance on the spatiotemporal regression task.

🌲

scikit-learn

Baseline models (Ridge, Random Forest), preprocessing pipelines, cross-validation utilities, and metrics computation.

💡

LightGBM

Competitive alternative to XGBoost. Faster training on large datasets with leaf-wise tree growth and categorical feature support.

📊

matplotlib / seaborn

Static visualization for EDA plots — distribution histograms, time-series demand charts, and correlation heatmaps.

💾

joblib / parquet

Model serialization and compressed columnar storage for intermediate datasets. Parquet is ~10× faster to read than CSV at this scale.

Environment

requirements.txt
pandas>=2.0
numpy>=1.24
h3>=3.7,<4.0           # inference code uses the h3-py 3.x API (geo_to_h3)
geopandas>=0.13
folium>=0.14
xgboost>=1.7
lightgbm>=3.3
scikit-learn>=1.2
matplotlib>=3.7
seaborn>=0.12
joblib>=1.2
pyarrow>=11.0          # parquet engine
jupyter>=1.0

🚀

Reference

Quickstart

1. Clone & Install

bash
git clone https://github.com/Kslaxman/chiride-demand.git
cd chiride-demand
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

2. Download Data

Download the TNC Trips dataset from the Chicago Data Portal. The direct export URL for the full dataset is:

bash
# Chicago TNC Trips (2018 - present, ~8GB CSV)
mkdir -p data/raw
curl -L -o data/raw/tnp_trips.csv \
  "https://data.cityofchicago.org/api/views/m6dm-c72p/rows.csv?accessType=DOWNLOAD"

3. Run the Pipeline

Execute notebooks in order. Each notebook reads from and writes to the data/ folder. All intermediate files are stored as .parquet for fast I/O.

bash
jupyter nbconvert --to notebook --execute 01data_cleaning.ipynb
jupyter nbconvert --to notebook --execute 02eda.ipynb
jupyter nbconvert --to notebook --execute 03map_visualizations.ipynb
jupyter nbconvert --to notebook --execute 04h3_analysis.ipynb
jupyter nbconvert --to notebook --execute 05feature_eng.ipynb
jupyter nbconvert --to notebook --execute 06model_training.ipynb
jupyter nbconvert --to notebook --execute 07model_eval_api.ipynb

4. Run a Prediction

python
from src.predict import predict_demand
from datetime import datetime

result = predict_demand(
    lat=41.8819, lon=-87.6278,
    timestamp=datetime(2024, 6, 21, 22, 0)
)
print(result)
# → dict with "h3_cell", "timestamp", "predicted_trips", and "confidence" keys

Note on memory: Processing the full 100M+ row dataset requires ~16 GB RAM. For development, use the SAMPLE_FRAC=0.05 environment variable to work with a 5% random sample: SAMPLE_FRAC=0.05 jupyter notebook 01data_cleaning.ipynb
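
How the notebook consumes `SAMPLE_FRAC` is not shown here. One plausible pattern, sketched below with a hypothetical helper name, reads the CSV in chunks and samples each chunk, so the full file never has to sit in memory at once. The default chunk size and seed are arbitrary choices.

```python
import os
import pandas as pd

def read_sampled_csv(source, chunksize: int = 1_000_000, seed: int = 42) -> pd.DataFrame:
    """Read a CSV in chunks, keeping a SAMPLE_FRAC share of each chunk."""
    frac = float(os.environ.get("SAMPLE_FRAC", "1.0"))   # unset means use everything
    if not 0.0 < frac <= 1.0:
        raise ValueError("SAMPLE_FRAC must be in (0, 1]")
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Sampling per chunk keeps peak memory near one chunk plus the kept rows.
        parts.append(chunk if frac == 1.0 else chunk.sample(frac=frac, random_state=seed))
    return pd.concat(parts, ignore_index=True)
```

Note that a random sample thins out every cell's hourly counts, so it suits development runs rather than final metrics.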

Project Structure

project layout
chiride-demand/
├── 01data_cleaning.ipynb
├── 02eda.ipynb
├── 03map_visualizations.ipynb
├── 04h3_analysis.ipynb
├── 05feature_eng.ipynb
├── 06model_training.ipynb
├── 07model_eval_api.ipynb
├── data/
│   ├── raw/                  # downloaded CSVs
│   ├── interim/              # cleaned .parquet files
│   └── processed/            # final feature matrices
├── models/
│   └── xgb_demand_model.joblib
├── maps/
│   └── *.html                # Folium interactive maps
├── src/
│   └── predict.py            # inference API
└── requirements.txt

      Chicago Ride Demand Forecast · Sailaxman Kotha · github.com/Kslaxman/chiride-demand
    
