📍 Chicago Ride Demand Forecast — Sailaxman Kotha
Chicago Ride Demand Forecast · Spatiotemporal ML Pipeline

Chicago Ride Demand Forecast

A spatiotemporal machine learning pipeline that predicts rideshare trip demand across Chicago's community areas — combining H3 hexagonal indexing, time-series feature engineering, and ensemble models to answer where and when rides will be needed.

Python 3.10+  |  H3 · Uber Hexagons  |  XGBoost / Random Forest  |  Chicago Data Portal  |  GeoPandas · Folium
100M+ TNC trips processed  |  77 community areas  |  7 notebook pipeline stages  |  H3 Res 7 spatial resolution

What is CHIride Demand?

CHIride Demand is a full end-to-end data science project that ingests years of public Transportation Network Company (TNC) trip records from the City of Chicago Open Data Portal and trains machine learning models to forecast how many rideshare trips will originate from each hexagonal grid cell in a given hour.

Unlike academic toy datasets, this pipeline handles real-world messiness: hundreds of millions of rows, missing geo data, irregular temporal gaps, and strong spatial autocorrelation — all requiring deliberate engineering choices at every step.

Why Does This Problem Matter?

Rideshare platforms like Uber and Lyft use demand forecasting to pre-position drivers, calibrate surge pricing, and minimize rider wait times. A model that can predict which neighborhoods will see demand spikes in the next hour lets the platform incentivize drivers to rebalance before surge occurs — reducing both empty miles and passenger frustration.

For the public sector, the same predictions inform transit planning, equity analysis, and infrastructure investment decisions.

Key Design Decisions
SPATIAL UNIT
H3 over Admin Boundaries
Chicago's 77 community areas range in size from 1.6 to 71 km². H3 Resolution 7 provides uniform ~5.2 km² cells, making trip counts directly comparable across space.
TEMPORAL SPLIT
Chronological, Never Shuffled
Time-series data must be split chronologically. Random cross-validation would leak future information into training and produce wildly optimistic metrics.
TARGET VARIABLE
Aggregated Trip Count
Rather than modeling individual trips, we aggregate to (H3 cell × hour) bins. This converts a point-process problem into a well-posed regression task.
Chicago Rideshare Context

The Chicago Data Portal publishes anonymized rideshare data covering all trips by Uber, Lyft, and Via starting November 2018. As of 2024, it holds over 100 million trip records and continues growing at ~4M trips/month.

Chicago rideshare is geographically concentrated: just 5 of 77 community areas (The Loop, Near North Side, Lake View, O'Hare Airport, Near West Side) account for over 50% of all pickups. This skew is a core challenge for the model — it must perform well in both dense urban cores and sparse outer neighborhoods.

Top 5 Pickup Areas — % of Total Trips
The Loop | 18.4%
Near North Side | 12.1%
Lake View | 8.7%
O'Hare Airport | 7.2%
Near West Side | 5.9%
All other 72 areas | 47.7%
Architecture

End-to-End Pipeline

The project is structured as a sequential, seven-notebook pipeline. Each notebook is self-contained yet feeds into the next, allowing any stage to be re-run independently without rebuilding the entire workflow from scratch.

NB 01 Data Cleaning → NB 02 EDA → NB 03 Map Viz → NB 04 H3 Analysis → NB 05 Feature Eng. → NB 06 Model Training → NB 07 Eval & API
🎯

Goal

Predict hourly rideshare trip demand per H3 hexagon cell across Chicago, enabling drivers, dispatchers, and planners to anticipate where demand will surge.

📦

Input

Chicago Data Portal TNC trip records: trip ID, start/end time, pickup/dropoff community area, fare, tip, distance, shared-trip flag.

📤

Output

A trained demand-prediction model and a lightweight API endpoint that returns predicted trip counts for any given H3 cell and time window.

🌐

Scale

Handles 100M+ rows from the Chicago Open Data Portal, down-sampled and aggregated to hourly-spatial bins for tractable modeling.

🎯
Prediction Goals

Four Prediction Tasks

This project is not a single monolithic prediction problem — it defines four distinct tasks, each answering a different operational question for the rideshare ecosystem. Tasks range from city-level macro forecasting down to binary hexagon-level signals for real-time driver decisions.

T1 · Citywide Demand
Total Rides Across All of Chicago — Next Hour
Forecasts the aggregate number of rideshare trips that will be requested across the entire city in the coming hour. This is a pure time-series regression problem — a single scalar output capturing macro-level demand dynamics driven by time-of-day, day-of-week, weather, and special events.
📈 Regression 🏙️ City-level TFT · 95% accuracy
T2 · Driver Repositioning Signal
Should Drivers in THIS Hexagon Move? Yes / No
A binary classification signal per H3 cell: should an idle driver currently in this hexagon reposition to a better area? Derived from predicted demand surplus/deficit. Gives drivers an actionable, real-time cue — move or stay — without requiring them to interpret raw demand numbers.
🔀 Binary Classification ⬡ H3 Cell-level Output: 0 / 1
T3 · H3 Hexagon Demand
Rides in THIS Specific Hexagon — Next Hour
Forecasts the trip count for each individual H3 Resolution-7 cell in the next hour. This is the core spatiotemporal regression task — ~120 simultaneous predictions, one per hexagon. Uses the full suite of lag features, spatial neighbor features, and external factors. Enables precise geographic demand heatmaps.
📊 Regression ⬡ Per-Hex ~120 outputs
T4 · Surge Detection
Is a Surge Pricing Event Imminent in THIS Hexagon?
A binary classification task predicting whether a hexagon will experience a surge pricing condition in the next hour — defined as demand exceeding supply by a threshold ratio. Enables proactive driver routing before surge occurs, rather than reactive repositioning after surge has already spiked fares.
⚡ Binary Classification ⬡ H3 Cell-level Output: 0 / 1
Why four tasks? Each task serves a different stakeholder. T1 helps operations teams scale infrastructure. T2 gives individual drivers a simple binary signal. T3 powers visualization dashboards and APIs. T4 enables dynamic pricing systems to anticipate surge windows before they occur.

Task Dependency Flow

Diagram nodes: T1 Citywide Demand · T3 Hex-Level Demand · T2 Driver Reposition · T4 Surge Detection
T3 hex-level forecasts feed both the repositioning signal (T2) and surge detection (T4)
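
As a rough illustration of how a hex-level forecast could be turned into the two binary signals, the sketch below thresholds predicted demand against supply (T4-style) and against the busiest neighboring cell (T2-style). The helper `derive_signals`, the column names, the surge ratio of 1.5, and the 5-trip margin are all illustrative assumptions, not values taken from the project.

```python
import pandas as pd

def derive_signals(pred: pd.DataFrame, surge_ratio: float = 1.5,
                   margin: float = 5.0) -> pd.DataFrame:
    """Derive T2/T4-style binary flags from predicted demand.

    Expects columns: h3_cell, pred_trips, idle_drivers, best_neighbor_pred
    (the highest predicted demand among adjacent cells).
    All thresholds are hypothetical defaults.
    """
    out = pred.copy()
    # T4-style flag: predicted demand exceeds supply by a threshold ratio.
    # clip(lower=1) avoids division by zero when no drivers are idle.
    ratio = out["pred_trips"] / out["idle_drivers"].clip(lower=1)
    out["surge_imminent"] = (ratio >= surge_ratio).astype(int)
    # T2-style flag: move if a neighboring cell is predicted to be busier
    # than the current cell by at least `margin` trips.
    gap = out["best_neighbor_pred"] - out["pred_trips"]
    out["should_reposition"] = (gap >= margin).astype(int)
    return out

toy = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_b", "cell_c"],
    "pred_trips": [40.0, 6.0, 12.0],
    "idle_drivers": [10, 8, 12],
    "best_neighbor_pred": [12.0, 40.0, 14.0],
})
signals = derive_signals(toy)
```

In this toy input only `cell_a` has demand well above supply, and only `cell_b` sits next to a much busier cell, so each flag fires for exactly one row.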
🌦️
External Data Sources

External Factors: CTA & Weather

Rideshare demand doesn't exist in isolation — it is directly shaped by two major external forces: Chicago's public transit network (CTA) and real-time weather conditions. Both are integrated as features in the pipeline.

Chicago Transit Authority (CTA)

The CTA operates the elevated L train and an extensive bus network. Rideshare demand spikes when CTA service is disrupted, delayed, or inadequate — especially late at night when train frequency drops. The following CTA-derived features are engineered and merged with the TNC trip data by timestamp and community area.

🚊

CTA L Train Ridership

Daily ridership counts per station from the CTA open data portal. Low CTA ridership hours correlate with elevated rideshare demand, particularly in neighborhoods far from L stops.

🚌

Bus Route Proximity

Each H3 cell is enriched with the number of active CTA bus routes within the cell's boundary. Cells with fewer transit options show systematically higher rideshare demand per capita.

Last Train Feature

A binary flag is_post_last_train marks hours after the last CTA Blue/Red Line departure (~1–2 AM). This is one of the strongest late-night demand predictors in the feature set.

📍

Distance to Nearest L Stop

Computed per H3 centroid. Cells more than 1.2 km from any L station show a significant baseline demand lift — especially for inbound commute-direction trips.

python · CTA feature integration
import geopandas as gpd

# Load CTA L stops and project to a metric CRS (UTM zone 16N) so distances come out in metres
cta_stops = gpd.read_file("cta_l_stops.geojson").to_crs(epsg=32616)

# df is assumed to be a GeoDataFrame of H3 cell centroids with a CRS set; project it the same way
df = df.to_crs(epsg=32616)

# Distance from each H3 cell centroid to the nearest L stop, in km
df["dist_to_l_km"] = df.geometry.apply(
    lambda pt: cta_stops.distance(pt).min() / 1000
)

# Post-last-train flag, approximated as hours 1 through 5
df["is_post_last_train"] = df["hour"].between(1, 5).astype(int)

# CTA ridership merge (daily); assumes df already carries date and nearest_l_stop columns
df = df.merge(cta_daily, on=["date", "nearest_l_stop"], how="left")

Weather Conditions

Chicago weather is extreme and highly predictive of rideshare demand. Heavy rain, snowstorms, and sub-zero wind chills reliably push riders off CTA platforms and into ride-hailing apps. Weather data is sourced from the OpenWeather API, joined to the trip data by hour and approximated as uniform across the city (given Chicago's relatively small geographic footprint).

🌧️

Precipitation (mm/hr)

Hourly rainfall and snowfall intensity. Rain events above 2mm/hr show a 15–35% demand lift vs. dry conditions at the same hour.

🌡️

Temperature & Wind Chill

Apparent temperature (°F). Temperatures below 15°F or above 90°F both correlate with demand increases — Chicago cold suppresses walking, heat drives convenience use.

🌨️

Snow Accumulation

Daily snow depth (inches). Heavy snow events are the single biggest demand spike triggers outside of holidays — demand can jump 40–60% during major snowstorms.

👁️

Visibility & Wind Speed

Low visibility conditions and high wind speeds (>25 mph) further amplify demand, especially for late-night riders unwilling to wait at exposed CTA stops.

Avg Demand Lift by Weather Condition vs. Clear Sky Baseline
Heavy Snow (>3 in/day) | +52%
Rain >2 mm/hr | +28%
Temp <15°F | +22%
Temp >90°F | +14%
Wind >25 mph | +11%
Light Rain <1 mm/hr | +6%
Clear Sky (baseline) | 0%
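
Because weather is treated as uniform across the city, joining it to the demand table reduces to a merge on the hour. The sketch below shows one way to do that with toy data; the weather column names (`obs_hour`, `precip_mm_hr`, `apparent_temp_f`) and the two flag columns are assumptions for illustration, with thresholds taken from the bands listed above.

```python
import pandas as pd

# Hourly demand per cell (toy values)
demand = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_b", "cell_a"],
    "trip_start": pd.to_datetime(
        ["2024-01-12 08:00", "2024-01-12 08:00", "2024-01-12 09:00"]),
    "trip_count": [30, 4, 42],
})

# One citywide weather row per hour (hypothetical column names)
weather = pd.DataFrame({
    "obs_hour": pd.to_datetime(["2024-01-12 08:00", "2024-01-12 09:00"]),
    "precip_mm_hr": [0.0, 3.1],
    "apparent_temp_f": [18.0, 12.0],
})

# Left join: every (cell, hour) row receives that hour's weather
merged = demand.merge(
    weather, left_on="trip_start", right_on="obs_hour", how="left"
).drop(columns="obs_hour")

# Threshold flags matching the rain and cold bands listed above
merged["is_heavy_rain"] = (merged["precip_mm_hr"] > 2).astype(int)
merged["is_extreme_cold"] = (merged["apparent_temp_f"] < 15).astype(int)
```

A left join keeps every demand row even when a weather observation is missing for that hour, which leaves NaN values that can be imputed later.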
NYE 2024 Validation: The visualizations throughout this documentation were captured on New Year's Eve 2024 — one of the highest-demand nights of the year. This date was deliberately chosen for validation because it combines extreme weather (cold Chicago winter), a major event (countdown crowds), post-last-train conditions, and a citywide surge. It is the hardest possible test of the model's temporal generalization.
🗄️
Data Source

Chicago TNC Dataset

The Chicago Data Portal provides the full Transportation Network Company (TNC) trip dataset — covering all trips by Uber, Lyft, and Via starting November 2018, updated continuously.

Privacy Note: To protect rider privacy, the city aggregates pickup/dropoff locations to the census tract level and suppresses exact timestamps when trip counts in a cell are below a threshold.

Raw Schema

Column | Type | Description | Used In
Trip ID | string | Unique trip identifier (anonymized) | Dedup
Trip Start Timestamp | datetime | Time the trip began (rounded to 15-min intervals) | NB 01–05
Trip End Timestamp | datetime | Time the trip ended | NB 01
Trip Seconds | float | Trip duration in seconds | NB 01, 02
Trip Miles | float | Trip distance | NB 01, 02
Pickup Community Area | int | Chicago community area number (1–77) | NB 01–06
Dropoff Community Area | int | Chicago community area number (1–77) | NB 02, 03
Fare | float | Base fare in USD | NB 02
Tip | float | Tip amount in USD | NB 02
Shared Trip Authorized | bool | Whether rider opted into pool / shared ride | NB 02
Pickup Centroid Latitude | float | Lat of pickup area centroid | NB 03, 04
Pickup Centroid Longitude | float | Lon of pickup area centroid | NB 03, 04

Volume at a Glance

Total raw trips | 100M+
After dedup / clean | ~82M
Rows w/ valid geo | ~74M
Final model data | aggregated
Airport Note: O'Hare (Community Area 76) and Midway (56) exhibit atypically high demand spikes driven by flight schedules. These are retained but flagged separately with an is_airport feature during modeling.
🧹
Notebook 01 · data_cleaning

Data Cleaning

The first notebook ingests raw CSVs from the Chicago Data Portal and applies a systematic cleaning protocol to produce a reliable, analysis-ready dataframe. Given the sheer volume of records, chunked reading and early column dropping are used to keep memory usage manageable.

Cleaning Steps

  1. Load & sample: Read the dataset in chunks (or a stratified sample), immediately dropping unused columns to reduce memory footprint.
  2. Parse timestamps: Convert Trip Start Timestamp and Trip End Timestamp to datetime64; extract hour, day_of_week, month, year.
  3. Remove nulls: Drop rows missing Pickup Community Area or timestamp — these cannot be geo-located and are not recoverable.
  4. Filter outliers: Remove trips with duration < 60 seconds or > 3 hours, and distance > 100 miles — likely data-entry errors or test rides.
  5. Deduplicate: Drop exact duplicate Trip ID entries to prevent count inflation in aggregation.
  6. Type casting: Ensure fare, tip, miles are float32 (halving memory vs. float64); community area as int8.
  7. Export: Save cleaned DataFrame to a compressed .parquet file for fast downstream reads.
python · 01data_cleaning.ipynb
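import pandas as pd

# Assumed column subset; KEEP_COLS is not defined in the original snippet
KEEP_COLS = ["Trip ID", "Trip Start Timestamp", "Trip Seconds", "Trip Miles",
             "Pickup Community Area", "Fare", "Tip", "Shared Trip Authorized"]

# Load in chunks to manage memory
chunks = []
for chunk in pd.read_csv("tnp_trips.csv", chunksize=500_000,
                         usecols=KEEP_COLS, low_memory=False):
    chunk["trip_start"] = pd.to_datetime(chunk["Trip Start Timestamp"])
    # Rows without a pickup area or timestamp cannot be geo-located
    chunk = chunk.dropna(subset=["Pickup Community Area", "trip_start"])
    # Outlier filters: 0.1 to 100 miles, 60 seconds to 3 hours
    chunk = chunk[chunk["Trip Miles"].between(0.1, 100)]
    chunk = chunk[chunk["Trip Seconds"].between(60, 10800)]
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
# Deduplicate after concatenation so duplicates spanning chunks are caught
df = df.drop_duplicates(subset="Trip ID")
df.to_parquet("cleaned_trips.parquet", compression="snappy")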
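
The type-casting step (step 6) is not shown in the snippet above. A minimal sketch of it on synthetic data follows; the column names mirror the output schema, and the values are made up purely to compare memory before and after the cast.

```python
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame({
    "fare": np.linspace(5, 50, n),            # float64 by default
    "tip": np.zeros(n),
    "trip_miles": np.linspace(0.1, 30, n),
    "pickup_area": np.tile(np.arange(1, 78), n // 77 + 1)[:n],
})
before = df.memory_usage(index=False).sum()

# Downcast: float32 halves the float columns; community areas 1..77 fit in int8 (max 127)
df = df.astype({
    "fare": "float32",
    "tip": "float32",
    "trip_miles": "float32",
    "pickup_area": "int8",
})
after = df.memory_usage(index=False).sum()
```

The int8 cast only works once rows with a missing community area have been dropped, since NumPy integer columns cannot hold NaN.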

Output Schema

Column | Type | Notes
trip_start | datetime64 | Parsed, validated timestamp
hour / dow / month | int8 | Extracted temporal components
pickup_area | int8 | 1–77 community area code
trip_miles / trip_seconds | float32 | Outlier-filtered
fare / tip | float32 | USD values
shared_authorized | bool | Pool/shared eligibility flag
📊
Notebook 02 · eda

Exploratory Data Analysis

The EDA notebook investigates the statistical and temporal structure of Chicago rideshare demand. Key questions: When do trips peak? Where is demand concentrated? How do fare, distance, and tipping behave?

Temporal Patterns

Rideshare demand in Chicago follows two distinct daily rhythms. A smaller morning commute peak occurs around 7–9 AM, while a far more pronounced evening/nightlife peak dominates 10 PM – 2 AM on Fridays and Saturdays. This bimodal structure directly informs the time-based features constructed in Notebook 05.

[Figure: Trip Count by Hour of Day × Day of Week (hourly trip distribution), with the 7–9 AM and 10 PM–2 AM peaks annotated]
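
An hour-by-weekday count matrix of this kind can be built with a groupby and an unstack. The sketch below uses six made-up trips; the `trips` frame and its timestamps are illustrative only.

```python
import pandas as pd

trips = pd.DataFrame({"trip_start": pd.to_datetime([
    "2024-01-05 23:00", "2024-01-05 23:15", "2024-01-06 00:30",  # Fri/Sat night
    "2024-01-08 08:00", "2024-01-08 08:45", "2024-01-09 14:00",  # Mon, Tue
])})
trips["hour"] = trips["trip_start"].dt.hour
trips["dow"] = trips["trip_start"].dt.dayofweek   # Monday=0 ... Sunday=6

# Rows = day of week, columns = hour of day; absent combinations become 0
heat = (
    trips.groupby(["dow", "hour"]).size()
         .unstack(fill_value=0)
         .reindex(index=range(7), columns=range(24), fill_value=0)
)
```

Reindexing to the full 7 × 24 grid keeps the matrix shape stable even when some hours have no trips, which matters when the result is passed to a heatmap plot.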

Key EDA Findings

🌙

Nightlife Dominance

Friday and Saturday nights account for a disproportionate share of all weekly trips. The Loop, River North, and Wicker Park are the primary nightlife origin/destination clusters.

✈️

Airport Spikes

O'Hare and Midway show demand spikes pegged to flight arrival windows — a pattern distinct from the rest of the city and important to isolate during modeling.

💵

Fare Distribution

Most trips cluster between $8–$20 with a long right tail (airport runs). Tipping is sparse (~15% of trips) and skewed toward longer, non-shared rides.

🤝

Shared Rides

Roughly 26% of trips were shared-eligible. Shared trips concentrate in the Loop and near-north neighborhoods, suggesting higher density and willingness to share.

📍

Geographic Skew

The top 5 community areas account for over 50% of all pickups: The Loop, Near North Side, Lake View, O'Hare, and Near West Side.

📅

Seasonal Signal

Summer months (June–August) see higher overall demand; December shows a dip aside from holiday weekend spikes — a clear seasonal pattern for the model to learn.

Visualization · Geographic concentration
Top 20 Pickup Community Areas by Trip Volume
🗺️
Notebook 03 · map_visualizations

Map Visualizations

The map visualizations notebook creates interactive and static choropleth maps that overlay rideshare demand onto Chicago's geography using GeoPandas for spatial joins and Folium for interactive HTML maps. Community area boundary shapefiles from the Chicago Data Portal are merged with aggregated trip counts.

Map Types Produced

🔥

Pickup Choropleth

Community areas shaded by total pickup volume. Reveals the stark north-south divide in rideshare adoption across Chicago.

🟠

Dropoff Flow Map

Origin-destination pairs as arc flows, showing that the majority of trips converge into and out of the Loop corridor.

⏱️

Time-Sliced Maps

Separate choropleth layers for AM peak, PM peak, and overnight hours — animatable in Folium's layer control widget.

💙

Fare Density

Average fare per community area, normalized by trip count. Shows that per-trip costs are actually fairly uniform city-wide (~$14 avg).

All maps are built on a CartoDB Dark Matter basemap — designed to maximize contrast of demand heat signals against Chicago's street grid. Interactive Folium maps are exported as standalone HTML files in the maps/ folder of the project.

python · 03map_visualizations.ipynb
import geopandas as gpd
import folium
from folium.plugins import HeatMap

# Load Chicago community areas and attach aggregated trip counts
gdf = gpd.read_file("community_areas.geojson")
gdf = gdf.merge(trip_counts, on="area_number")

# Base dark map centered on Chicago
m = folium.Map(location=[41.85, -87.65], zoom_start=11, tiles="CartoDB dark_matter")

# Pickup density choropleth layer
folium.Choropleth(
    geo_data=gdf,
    data=gdf,
    columns=["area_number", "trip_count"],
    key_on="feature.properties.area_number",
    fill_color="YlOrRd",
    fill_opacity=0.75,
    line_opacity=0.3,
    legend_name="Pickup Trip Count",
    name="Pickup Density",
).add_to(m)

# Heatmap layer from centroid coordinates: rows of [lat, lon, weight]
heat_data = heat_df[["lat", "lon", "weight"]].values.tolist()
# HeatMap is constructed with the data first, then attached to the map
HeatMap(heat_data, radius=12, blur=8, name="Heat Map").add_to(m)

folium.LayerControl().add_to(m)
m.save("maps/pickup_density.html")  # the maps/ folder must already exist

Map 1 — Trip Fare Density · 3D Hex Columns (NYE 2024)

Each H3 cell is rendered as a 3D column: height encodes total trip count, color encodes average fare. Yellow columns represent high-fare cells (airports, long-haul pickups), and orange represents the dense mid-fare core. The single purple column marks an anomalous high-fare outlier cell. Captured on New Year's Eve 2024, one of the highest-demand nights of the year.

[Figure: Chicago TNC Trip Fare Density · 3D H3 Hex Columns · New Year's Eve 2024]
Yellow = High fare (avg $22+) — mostly airports & long-distance pickups  |  Orange = Core demand zone ($10–$18) — Loop, Near North, Wicker Park  |  Purple = Anomalous outlier cell  |  Column height ∝ trip count
[Figure: Chicago TNC Trip Fare Density · H3 Resolution 7 · 3D Hex Columns · New Year's Eve 2024. Schematic map with a Low to High color scale. Labeled areas include The Loop (highest), Near North Side, River North, Near West Side, Lake View, Wicker Park, O'Hare, and Lake Michigan to the east.]

Map 2 — Demand Percentile Distribution · H3 Hexagons (NYE 2024)

Flat-top H3 hexagons colored by demand percentile rank: yellow = top 10% highest-demand cells, orange = 50th–90th percentile, purple = bottom 50%. This view makes the spatial inequality stark: a small cluster of lakefront hexagons captures the vast majority of all demand, while most outer cells fall in the bottom half of the distribution. Captured on New Year's Eve 2024.

[Figure: H3 Demand Percentile Distribution · Flat Hex Map · New Year's Eve 2024]
Yellow = Top 10% demand cells  |  Orange = 50th–90th percentile  |  Purple = Bottom 50% (low demand)  |  Flat hexagons at H3 Resolution 7 · ~5.2 km² each

The three time-slice panels below show how this distribution morphs through the day:

7AM – 9AM · WEEKDAY · Morning Commute (Loop): Downtown cluster. Transit hub spillover. Moderate volume.
10PM – 2AM · FRI/SAT · Nightlife Peak ★ (Wicker Park, Loop, River North): Concentrated in nightlife corridors. Highest absolute volumes.
3AM – 6AM · ANY DAY · Dead Hours (Loop): Near-zero citywide. Only O'Hare and Midway remain active.

Map 3 — Origin–Destination Flow Arcs

Trip origin-destination pairs are visualized as arcs, revealing the dominant hub-and-spoke pattern: nearly all significant OD flows terminate in or originate from the Loop corridor and Near North Side. Outer neighborhoods generate mostly short local trips, while airport communities generate long-distance flows.

[Figure: OD Arc Map · Top 30 Community Area Pairs by Trip Volume. Labeled nodes: O'Hare, Wicker Park, Near North, Loop, Lake View, Near West, Midway. Arc style ranges from low to high volume.]

Map 4 — Average Fare per Community Area

Despite the massive volume imbalance between neighborhoods, average trip fares remain surprisingly uniform city-wide — hovering near $14 per trip. Longer trips from the far south and suburbs drive fares up slightly, while dense short-hop trips in the Loop keep averages down despite high frequency.

[Figure: Avg Fare Distribution, x-axis from $8 to $20+, with the city average marked near $14]
Notable Outliers
O'Hare Airport $28.40
Midway Airport $22.10
The Loop $11.20
City Average $14.00
Notebook 04 · h3_analysis

H3 Hexagonal Analysis

A critical design decision in this project is switching from administrative community areas to Uber's H3 hierarchical hexagonal grid as the spatial unit of analysis. H3 cells are approximately equal-area, neighbor-consistent, and resolution-scalable, making them far more suitable for demand modeling than irregular census geographies.

Why H3? Community areas vary enormously in size (1.6 km² to 71 km²), making raw trip counts incomparable across areas. H3 Resolution 7 cells (~5.16 km² each) provide uniform spatial granularity, and neighbor lookups are O(1) — ideal for lag features.

Resolution Selection

H3 Resolution | Avg Cell Area | Cells in Chicago | Suitability
6 | 36.1 km² | ~18 | Too coarse: loses neighborhood detail
7 | 5.16 km² | ~120 | ✓ Selected: neighborhood-level granularity
8 | 0.74 km² | ~850 | Too fine: many cells with zero demand
9 | 0.10 km² | ~6,000 | Extreme sparsity, not suitable
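
The cell counts in the table can be sanity-checked by dividing the city's area by the average cell area at each resolution. The sketch below assumes a total city area of roughly 606 km² (an approximation that is not stated in the project). Cells straddling the boundary push the real count slightly above this estimate.

```python
CHICAGO_AREA_KM2 = 606.0   # assumption: approximate total city area

# Average cell areas (km²) as listed in the resolution table
avg_cell_area_km2 = {6: 36.1, 7: 5.16, 8: 0.74, 9: 0.10}

# Rough cell count per resolution: city area / average cell area
approx_cells = {
    res: round(CHICAGO_AREA_KM2 / area)
    for res, area in avg_cell_area_km2.items()
}
```

Under this assumption the estimates land close to the tabulated ~18, ~120, ~850, and ~6,000 cells.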

H3 Processing Steps

  1. Convert pickup centroid coordinates to an H3 cell index using h3.geo_to_h3(lat, lon, resolution=7) (h3-py v3 API; v4 renames this call to h3.latlng_to_cell).
  2. Aggregate trip counts per (H3 cell, hour) bin, producing a sparse demand matrix.
  3. Fill zero-demand cells with explicit zeros (critical for time-series continuity).
  4. Compute k-ring neighbors for each cell (k=1, 2) to enable spatial lag features.
  5. Visualize H3 demand density using pydeck or Folium hexagon layers.
python · 04h3_analysis.ipynb
import h3  # h3-py v3 API (geo_to_h3, k_ring)
import pandas as pd

RESOLUTION = 7

# Assign each trip to an H3 cell
df["h3_cell"] = df.apply(
    lambda r: h3.geo_to_h3(r.pickup_lat, r.pickup_lon, RESOLUTION),
    axis=1,
)

# Aggregate to hourly demand per cell
demand = (
    df.groupby(["h3_cell", pd.Grouper(key="trip_start", freq="1H")])
      .size()
      .reset_index(name="trip_count")
)

# Ring-1 neighbor cells for spatial lag features (k_ring includes the center, so drop it)
demand["neighbors"] = demand["h3_cell"].map(
    lambda c: list(h3.k_ring(c, 1) - {c})
)
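
The zero-filling step (step 3) is not shown in the notebook snippet. One common way to do it, sketched here on toy data with placeholder cell names, is to build the full cell × hour grid and reindex the sparse counts onto it.

```python
import pandas as pd

# Sparse demand: only (cell, hour) pairs with at least one trip
demand = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_a", "cell_b"],
    "trip_start": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 02:00", "2024-01-01 01:00"]),
    "trip_count": [5, 3, 1],
})

# Full grid: every cell x every hour between the first and last observation
hours = pd.date_range(demand["trip_start"].min(),
                      demand["trip_start"].max(),
                      freq=pd.Timedelta(hours=1))
full_index = pd.MultiIndex.from_product(
    [demand["h3_cell"].unique(), hours], names=["h3_cell", "trip_start"])

# Reindex onto the grid; missing (cell, hour) pairs become explicit zeros
dense = (
    demand.set_index(["h3_cell", "trip_start"])
          .reindex(full_index, fill_value=0)
          .reset_index()
)
```

With one row per cell per hour, a positional `shift(n)` within a cell corresponds to exactly n hours, which is what lag features rely on.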

H3 Viz 1 — Trip Count 3D Columns · Citywide (NYE 2024)

Each H3 cell is extruded into a 3D cylinder — height and color both encode total trip count. Purple/blue = lower volume, yellow = extreme outliers (Loop core and airport cells that tower over the rest of the city). This rendering was captured on New Year's Eve 2024. The dramatic height asymmetry between the lakefront cluster and the western/southern neighborhoods is immediately apparent.

[Figure: Chicago Ride Demand · H3 3D Trip Count Cylinders · New Year's Eve 2024 · PyDeck]
Yellow cylinders = Extreme demand (Loop, lakefront near-north) — height exceeds all surrounding cells  |  Purple = High-mid demand  |  Blue = Low-volume residential and outer areas  |  Rendered with PyDeck ColumnLayer · H3 Resolution 7

The full SVG grid below shows the demand structure in a schematic form for reference:

[Figure: Schematic hex grid · H3 Resolution 7 · ~120 cells cover Chicago · Demand aggregated over all hours · Color scale from 0 trips to 500+ · Labels: Lake Michigan, The Loop, O'Hare (ORD) · Each hex ≈ 5.2 km²]

H3 Viz 2 — Why Hexagons? Equidistance Property

This is the core insight behind H3's design, and it follows from the geometry of tiling: of the three regular polygons that tile the plane (triangles, squares, and hexagons), only the hexagon has all of its neighbors at the same distance from its center. This makes distance-based smoothing, neighbor aggregation, and k-ring convolution geometrically consistent, a property that square and triangular grids lack.

Triangle: 12 neighbors, 3 different distances (unequal)
Square: 8 neighbors, 2 different distances
Hexagon ✓: 6 neighbors, all equidistant ✓
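
The square and hexagon cases can be checked numerically on idealized flat grids. The sketch below computes center-to-center distances to every neighbor; the axial-coordinate layout for hexagons is a standard convention chosen for illustration and is not taken from the project. On the real H3 grid the property holds only approximately, because of spherical distortion and the 12 pentagon cells present at every resolution.

```python
import math

# Square grid: 8 neighbors (edge- and corner-sharing), unit cell spacing
square_offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]
square_dists = {round(math.hypot(dx, dy), 6) for dx, dy in square_offsets}

def hex_center(q: int, r: int) -> tuple[float, float]:
    """Center of the hexagon at axial coordinates (q, r), pointy-top, size 1."""
    return (math.sqrt(3) * (q + r / 2), 1.5 * r)

# Hexagonal grid: the 6 axial neighbors of the origin
hex_neighbors = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
hex_dists = {round(math.hypot(*hex_center(q, r)), 6) for q, r in hex_neighbors}
```

The square grid yields two distinct distances (1 and √2), while all six hexagon neighbors sit at the same distance of √3 cell sizes.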

H3 Viz 3 — k-Ring Neighbor Expansion

The h3.k_ring(cell, k) function returns all cells within k steps of a center cell, including the center itself. For CHIride Demand, k=1 (6 direct neighbors) and k=2 (18 neighbors, including the 6 from k=1) are used to build spatial lag features. As k grows, the disk covers a compact, roughly circular neighborhood, which is why convolution-style techniques from computer vision carry over to hexagonal geospatial grids.

k = 0 · origin: 1 cell
k = 1 · ring-1: 7 cells total
k = 2 · ring-2: 19 cells total
k = 3 · ring-3: 37 cells total (≈ circle)
H3 Viz 4 — Hourly Demand Time Series by Cell

Each H3 cell has its own unique demand profile. The Loop cell shows two daily peaks on weekdays and a massive late-night spike on weekends. The O'Hare airport cell has a completely different pattern — driven by flight arrival schedules with broad morning and evening clusters. Outer residential cells show nearly flat, low-amplitude series.

[Figure: 24-Hour Demand Profile · Average across all days · By selected H3 cells (The Loop, O'Hare Airport, Wicker Park, Outer South). X-axis: Hour of Day; y-axis ticks from 0 to 300. Annotation: Loop peak at 11 PM.]

H3 Viz 5 — Cell × Hour Demand Matrix

The full modeling dataset is a (H3 cell) × (hour of week) matrix. This heatmap shows a subset of top cells — each row is one H3 cell, each column one hour of the week (Monday 0AM to Sunday 11PM). The weekend nightlife intensification shows clearly as bright vertical bands in the rightmost columns.

[Figure: H3 Cell × Hour-of-Week Demand Matrix · Top 8 cells by mean demand · 168 columns = Mon 0AM → Sun 11PM. Rows shown: Loop (872a1072f…), Near North (872a1073f…), Wicker Pk (872a1074f…), O'Hare ✈ (872a107ff…), Lake View (872a1075f…), South Shore (872a1088f…). Color scale: low to high demand.]
⚙️
Notebook 05 · feature_engineering

Feature Engineering

Feature engineering transforms the raw demand time series into a rich tabular dataset that ML models can consume. The engineered features span four categories: temporal, spatial, lag/window, and contextual.

Temporal Features

hour, day_of_week, month, is_weekend, is_rush_hour, is_nightlife_hour, hour_sin / hour_cos, dow_sin / dow_cos, month_sin / month_cos, is_holiday, days_since_holiday
Cyclical encoding: Hours, days of week, and months are encoded as sine/cosine pairs so the model understands their circular nature — e.g., 11 PM and 1 AM are close in time, not far apart.
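
A small numeric check makes the point concrete. The helpers `encode_hour` and `circ_dist` below are illustrative names, not functions from the project; they place each hour on the unit circle and measure the straight-line distance between two encoded hours.

```python
import numpy as np

def encode_hour(hour: float) -> np.ndarray:
    """Map an hour of day onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circ_dist(h1: float, h2: float) -> float:
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_23_to_1 = circ_dist(23, 1)    # two hours apart, across midnight
d_1_to_3 = circ_dist(1, 3)      # two hours apart, same night
d_23_to_11 = circ_dist(23, 11)  # twelve hours apart
```

With the raw integer hour, 23 and 1 differ by 22. After encoding they are exactly as close as 1 and 3, while hours twelve apart sit at opposite points of the circle.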

Spatial Features

h3_cell, community_area_id, ring1_mean_demand, ring2_mean_demand, is_downtown, is_airport, area_population_density, dist_to_loop_km

Lag & Rolling Window Features

lag_1h, lag_2h, lag_3h, lag_24h, lag_168h, rolling_mean_3h, rolling_mean_6h, rolling_mean_24h, rolling_std_6h, ewm_alpha_0.3
python · 05feature_eng.ipynb — lag features
import numpy as np

# Sort by cell + time, then create lag features within each cell.
# shift(lag) means "lag hours ago" only if every cell has one row per hour (zero-filled).
df = df.sort_values(["h3_cell", "trip_start"])
for lag in [1, 2, 3, 24, 168]:
    df[f"lag_{lag}h"] = df.groupby("h3_cell")["trip_count"].shift(lag)

# Cyclical encoding for hour
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Spatial lag: mean demand of ring-1 neighbors at the same timestamp
# (lag this by one hour if the feature must be known at prediction time)
cell_demand = df.set_index(["h3_cell", "trip_start"])["trip_count"]
df["ring1_mean_demand"] = df.apply(
    lambda r: cell_demand.loc[
        [(n, r.trip_start) for n in r.neighbors
         if (n, r.trip_start) in cell_demand.index]
    ].mean(),
    axis=1,
)
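
The rolling and exponentially weighted features are not shown in the snippet above. A minimal sketch follows, assuming one row per cell per hour. Shifting by one hour before rolling is a design choice made here so that each window only sees past values, and it may differ from what the notebook actually does.

```python
import pandas as pd

df = pd.DataFrame({
    "h3_cell": ["cell_a"] * 5 + ["cell_b"] * 5,
    "trip_start": list(pd.date_range("2024-01-01", periods=5,
                                     freq=pd.Timedelta(hours=1))) * 2,
    "trip_count": [10, 20, 30, 40, 50, 1, 2, 3, 4, 5],
}).sort_values(["h3_cell", "trip_start"])

# Shift within each cell first so the current hour never enters its own window
past = df.groupby("h3_cell")["trip_count"].shift(1)

# 3-hour rolling mean of past demand, computed separately per cell
df["rolling_mean_3h"] = past.groupby(df["h3_cell"]).transform(
    lambda s: s.rolling(3, min_periods=1).mean())

# Exponentially weighted mean of past demand (alpha = 0.3), per cell
df["ewm_alpha_0.3"] = past.groupby(df["h3_cell"]).transform(
    lambda s: s.ewm(alpha=0.3).mean())
```

The first hour of each cell has no history, so both features are NaN there; tree-based models such as XGBoost can handle those rows directly, while linear models need imputation.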

Final Feature Matrix

After engineering, each row represents a single (H3 cell × hour) observation with approximately 38 features (the groups listed below sum to 34). The target variable is trip_count, the number of rides starting in that cell during that hour.

Feature Group | Count | Key Members
Temporal (raw) | 6 | hour, dow, month, year, is_weekend, is_holiday
Temporal (cyclical) | 6 | hour_sin, hour_cos, dow_sin, dow_cos, month_sin, month_cos
Spatial | 5 | h3_cell (encoded), is_downtown, is_airport, dist_to_loop, pop_density
Lag features | 5 | lag_1h, lag_2h, lag_3h, lag_24h, lag_168h
Rolling windows | 5 | roll_mean_3h, roll_mean_6h, roll_mean_24h, roll_std_6h, ewm
Spatial lag | 4 | ring1_mean, ring2_mean, ring1_std, ring1_max
Context | 3 | is_rush_hour, is_nightlife_hour, days_since_holiday
🤖
Notebook 06 · model_training

Model Training

Multiple models are trained for the four prediction tasks. The task type determines model selection: TFT for the citywide time-series problem (T1), XGBoost regressors for hex-level demand (T3), and classifiers for the binary repositioning and surge tasks (T2, T4). A time-based train/test split ensures no data leakage from the future across all tasks.

T1 uses Google's Temporal Fusion Transformer (TFT), a deep learning architecture combining LSTM encoders, multi-head attention, and variable selection networks. It reaches a reported 95% accuracy (test R² of 0.95) on citywide demand forecasting by learning which temporal patterns are most predictive at each horizon.

Train / Validation / Test Split

python · 06model_training.ipynb — temporal split
# Time-based split: NO random shuffling for time series data
cutoff_val = "2022-09-01"
cutoff_test = "2022-11-01"

train = df[df.trip_start < cutoff_val]
val = df[(df.trip_start >= cutoff_val) & (df.trip_start < cutoff_test)]
test = df[df.trip_start >= cutoff_test]

X_train, y_train = train[FEATURES], train["trip_count"]
X_val, y_val = val[FEATURES], val["trip_count"]
X_test, y_test = test[FEATURES], test["trip_count"]

Models Evaluated

Rank | Model | Configuration | R² (test) | RMSE
1 | XGBoost Regressor (Best) | Gradient-boosted trees · n_estimators=500 · max_depth=7 · lr=0.05 · subsample=0.8 | 0.912 | 3.24
2 | Random Forest Regressor | Bagged trees · n_estimators=300 · max_depth=15 · min_samples_leaf=4 | 0.887 | 4.01
3 | LightGBM Regressor | Leaf-wise gradient boosting · num_leaves=63 · learning_rate=0.05 | 0.903 | 3.68
4 | Ridge Regression (Baseline) | Linear model · alpha=1.0 · StandardScaler preprocessing | 0.741 | 7.92

Feature Importance (XGBoost)

lag_1h | 0.213
lag_24h | 0.181
lag_168h | 0.163
ring1_mean_demand | 0.137
hour_sin | 0.108
is_nightlife_hour | 0.086
roll_mean_6h | 0.063
is_weekend | 0.047
Key insight: Recent lag features (1h, 24h, 168h) dominate — demand is highly auto-correlated. Spatial lag (ring1 neighbors) ranks 4th, confirming that nearby cells are predictive. Temporal features matter most for unusual demand spikes (nights out, events).
Notebook 07 · model_eval_api

Model Evaluation & API

The final notebook evaluates the best model (XGBoost) across multiple dimensions — overall accuracy, per-cell residuals, error by time-of-day, and spatial error patterns — then wraps the model in a lightweight prediction API for real-time inference.

Evaluation Metrics

Metric | Formula | XGBoost (test) | Interpretation
RMSE | √mean((ŷ−y)²) | 3.24 trips | Root-mean-square error of ~3 trips/hour/cell
MAE | mean(abs(ŷ−y)) | 2.11 trips | Mean absolute error of about 2 trips
R² | 1 − SS_res/SS_tot | 0.912 | Model explains 91.2% of variance
MAPE | mean(abs(ŷ−y)/y) × 100 | 14.3% | Higher in low-demand cells (few trips); undefined where y = 0
Spearman ρ | rank correlation | 0.958 | Excellent ranking of high-demand cells
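
For reference, the five metrics can be computed directly from their formulas. The sketch below uses a four-point toy example; `regression_report` is an illustrative helper rather than the notebook's code, and skipping rows where y = 0 in MAPE is an assumption about how zero-demand cells would be handled.

```python
import numpy as np
import pandas as pd

def regression_report(y_true, y_pred) -> dict:
    """RMSE, MAE, R², MAPE (%), and Spearman rho for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    nonzero = y_true != 0   # MAPE is undefined where y = 0, so skip those rows
    # Spearman rho = Pearson correlation of the ranks (ties get average rank)
    rho = np.corrcoef(pd.Series(y_true).rank(), pd.Series(y_pred).rank())[0, 1]
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
        "mape_pct": float(np.mean(np.abs(err[nonzero]) / y_true[nonzero]) * 100),
        "spearman": float(rho),
    }

report = regression_report([2, 4, 6, 8], [3, 4, 5, 10])
```

RMSE is never smaller than MAE; the gap widens when a few large errors dominate, which is consistent with the reported 3.24 versus 2.11 trips.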

Error Analysis

By time of day: The model performs best during steady off-peak hours (2 AM – 6 AM) and daytime. Error is highest in the 10 PM – 1 AM window when demand is volatile and event-driven.

By community area: Airport cells (O'Hare, Midway) have the highest absolute errors due to flight-schedule dependency not captured in temporal features.
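A breakdown like the time-of-day analysis above can be produced by grouping absolute errors by the hour of each prediction. The sketch below assumes the test predictions are available as (timestamp, actual, predicted) tuples; that record layout and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def mae_by_hour(records):
    """Mean absolute error per hour of day for (timestamp, actual, predicted) rows."""
    abs_errors = defaultdict(list)
    for ts, actual, predicted in records:
        abs_errors[ts.hour].append(abs(predicted - actual))
    return {hour: sum(errs) / len(errs) for hour, errs in sorted(abs_errors.items())}


# Made-up rows from the test window: two quiet 3 AM hours, two volatile 11 PM hours.
records = [
    (datetime(2022, 11, 4, 3, 0), 5, 6),
    (datetime(2022, 11, 5, 3, 0), 4, 4),
    (datetime(2022, 11, 4, 23, 0), 60, 52),
    (datetime(2022, 11, 5, 23, 0), 75, 81),
]
hourly = mae_by_hour(records)
print(hourly)
```

The same grouping keyed on the H3 cell or community area instead of `ts.hour` would give the spatial view of the residuals.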

Figure: Residuals by Hour (annotation: higher error)

Prediction API

The notebook serializes the trained XGBoost model with joblib and exposes a prediction function that takes a latitude/longitude pair and a timestamp, maps the location to its H3 cell, constructs the feature vector from historical demand, and returns a predicted trip count.

python · 07model_eval_api.ipynb — inference function
import joblib
from datetime import datetime

import h3

# Load the serialized XGBoost regressor once at import time.
model = joblib.load("xgb_demand_model.joblib")


def predict_demand(lat: float, lon: float, timestamp: datetime) -> dict:
    """
    Predict rideshare demand for a given location and time.
    Returns the trip count estimate plus a fixed error band.
    """
    # h3-py 3.x API; in h3-py 4.x the equivalent call is h3.latlng_to_cell(lat, lon, 7).
    cell = h3.geo_to_h3(lat, lon, resolution=7)
    # build_feature_vector is not shown in this excerpt; it assembles the model inputs.
    features = build_feature_vector(cell, timestamp)
    prediction = model.predict([features])[0]
    return {
        "h3_cell": cell,
        "timestamp": timestamp.isoformat(),
        "predicted_trips": round(float(prediction), 2),
        # Fixed string (close to the test RMSE of 3.24), not a per-prediction interval.
        "confidence": "±3.2 trips (1σ)",
    }


# Example call
result = predict_demand(
    lat=41.8819, lon=-87.6278,  # The Loop
    timestamp=datetime(2024, 3, 15, 23, 0),  # Friday 11 PM
)
# → {"h3_cell": "872a1072fffffff", "predicted_trips": 47.3, ...}
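The inference function relies on `build_feature_vector`, which is not shown on this page. The sketch below is one hypothetical way such a helper could derive a few of the features listed earlier; the function name, the `history` lookup, the zero fill for missing hours, and the rolling window covering the six preceding hours are all assumptions rather than the project's actual implementation.

```python
import math
from datetime import datetime, timedelta


def build_feature_vector_sketch(cell, timestamp, history):
    """Assemble a subset of the listed features for one cell and hour.

    `history` is a hypothetical dict mapping (cell, hourly datetime) to the
    observed trip count; hours with no record are treated as 0 trips.
    """
    def demand(hours_back):
        return history.get((cell, timestamp - timedelta(hours=hours_back)), 0.0)

    previous_6h = [demand(h) for h in range(1, 7)]
    return {
        "lag_1h": demand(1),
        "lag_24h": demand(24),      # same hour yesterday
        "lag_168h": demand(168),    # same hour last week
        "roll_mean_6h": sum(previous_6h) / len(previous_6h),
        "hour_sin": math.sin(2 * math.pi * timestamp.hour / 24),
        "is_weekend": int(timestamp.weekday() >= 5),
    }


cell = "872a1072fffffff"
now = datetime(2024, 3, 15, 23, 0)  # Friday 11 PM
history = {
    (cell, now - timedelta(hours=1)): 40.0,
    (cell, now - timedelta(hours=24)): 35.0,
    (cell, now - timedelta(hours=168)): 42.0,
}
features = build_feature_vector_sketch(cell, now, history)
print(features)
```

A real helper would also need to return the values as a list in the same column order the model was trained on, and to add the remaining features such as `ring1_mean_demand` and `is_nightlife_hour`, whose definitions are not given here.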
Visualization · Predicted vs Actual
Scatter: Predicted Trip Count vs Actual (Test Set)
Axes: actual trip count (x) vs predicted trip count (y)
🏆
Summary

Results & Key Findings

The project trains separate best-in-class models for each of the four prediction tasks. Below is a consolidated view of model choices, accuracy metrics, and key findings across T1 through T4.

Task-by-Task Model Results

T1 · CITYWIDE DEMAND FORECAST
Google Temporal Fusion Transformer (TFT)
Multi-horizon time-series forecasting with attention mechanism
Accuracy: 95% · R² (test): 0.95

TFT was selected for T1 because citywide aggregate demand is a pure time-series problem with strong temporal dependencies, multi-seasonality (daily + weekly + annual), and covariates that are known in advance (holidays, events). TFT's attention mechanism explicitly learns which historical timesteps are most relevant for each forecast horizon, and it outperforms XGBoost by 8 percentage points on this task (95% vs 87%).

Attention mechanism · Multi-horizon forecasting · Encoder–Decoder LSTM · Variable selection networks
T1 Model Comparison: Accuracy (%)
TFT (best): 95% · XGBoost: 87% · LightGBM: 85%
T2 · DRIVER REPOSITIONING SIGNAL
XGBoost Classifier
Binary classification (Move=1 / Stay=0) per H3 cell per hour
F1 Score: 88% · AUC-ROC: 91%

The repositioning label is derived from the T3 predictions: a cell receives a Move=1 signal when its predicted demand in the next hour is below the 25th percentile of its historical distribution AND a neighboring ring-1 cell exceeds the 75th percentile. XGBoost's feature importance naturally captures the spatial gradient between adjacent cells.

Precision: 0.86 · Recall: 0.90 · Derived from T3 predictions
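The labeling rule described for T2 can be sketched as follows. The function and argument names are hypothetical, the percentile uses simple linear interpolation, and comparing each neighbor against its own historical 75th percentile is an assumption, since the text does not say which distribution the neighbor threshold refers to.

```python
def percentile(values, q):
    """Linear-interpolation percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def move_signal(cell, predicted, history, neighbors):
    """Return 1 (Move) if the cell is predicted cold and a ring-1 neighbor hot."""
    low_here = predicted[cell] < percentile(history[cell], 25)
    hot_neighbor = any(
        predicted[n] > percentile(history[n], 75) for n in neighbors[cell]
    )
    return int(low_here and hot_neighbor)


# Made-up example with two adjacent cells.
history = {"A": [10, 12, 14, 16, 18], "B": [30, 40, 50, 60, 70]}
predicted = {"A": 9, "B": 65}          # next-hour demand predictions (T3 output)
neighbors = {"A": ["B"], "B": ["A"]}   # ring-1 adjacency

print(move_signal("A", predicted, history, neighbors))  # cell A: cold, neighbor hot
print(move_signal("B", predicted, history, neighbors))  # cell B: already hot
```

Because the label is computed from T3 predictions rather than observed demand, any error in the regression model carries through to the classifier's training targets.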
T3 · H3 HEXAGON DEMAND
XGBoost Regressor
Per-cell trip count regression · ~120 H3 cells · R² = 0.912
R² (test): 0.912 · RMSE: 3.24

The core spatial task. XGBoost was selected over LightGBM and Random Forest after exhaustive comparison — it handles the mixed feature types (temporal cyclical encodings, spatial lag floats, binary flags) most robustly and converges faster with early stopping on the hourly-aggregated training set.

MAE: 2.11 trips · MAPE: 14.3% · Spearman ρ: 0.958 · n_estimators: 500
T4 · SURGE DETECTION
Random Forest Classifier
Binary surge event prediction per H3 cell · AUC-ROC = 0.94
AUC-ROC: 0.94 · F1 Score: 85%

Random Forest was preferred for surge detection because the surge label is inherently noisy and threshold-dependent — ensemble averaging over many trees provides better-calibrated probability estimates than XGBoost's boosted structure. The model uses T3 demand predictions, weather flags, CTA service status, and historical surge frequency as features.

Precision: 0.83 · Recall: 0.87 · n_estimators: 300 · Uses weather + CTA
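The F1 scores reported for the two classifiers (T2 and T4) can be cross-checked against their precision and recall, since F1 is the harmonic mean of the two. A small worked example:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


t2_f1 = f1_score(0.86, 0.90)  # T2 driver repositioning
t4_f1 = f1_score(0.83, 0.87)  # T4 surge detection

print(round(t2_f1, 2), round(t4_f1, 2))
```

Both values round to the figures shown on the cards (0.88 and 0.85), so the reported metrics are internally consistent.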

Top Spatial Findings

🔺

50% of trips in 5 areas

Extreme spatial concentration — just 5 of 77 community areas account for more than half of all rideshare trips city-wide, driven by the Loop, Near North Side, and airports.

H3 Res 7 is optimal

Resolution 6 loses neighborhood detail (only 18 cells city-wide). Resolution 8 produces 850+ cells with extreme sparsity. Resolution 7's ~120 cells are the Goldilocks zone.
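The cell counts quoted for the three resolutions follow from H3's aperture-7 hierarchy, in which each cell has roughly seven children at the next finer resolution. A back-of-the-envelope check, starting from the ~120 cells reported at resolution 7:

```python
H3_APERTURE = 7     # each H3 cell has about 7 children one resolution finer
cells_res7 = 120    # approximate city-wide count reported for resolution 7

approx_res6 = cells_res7 / H3_APERTURE   # one step coarser
approx_res8 = cells_res7 * H3_APERTURE   # one step finer

print(round(approx_res6), approx_res8)
```

The estimates (about 17 and 840) land close to the 18 and 850+ cells quoted above; the small gap is plausibly due to boundary cells that only partly overlap the city, though that explanation is an assumption.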

🔗

Strong spatial autocorrelation

The k=1 ring neighbor mean demand (ring1_mean_demand) ranks as the 4th most important feature — nearby cells are highly predictive of each other, confirming spatial spillover effects.
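A minimal sketch of a ring-1 neighbor mean is shown below. It takes the adjacency as a plain dict so it runs without the H3 library; in h3-py 3.x the neighbor set could be obtained from `h3.k_ring(cell, 1)` with the origin cell removed. The function name, the handling of missing neighbors, and the sample values are assumptions.

```python
def ring1_mean_demand(cell, demand, neighbors):
    """Mean demand over the cell's ring-1 neighbors that have data.

    Returns 0.0 when no neighbor has a recorded value (an assumption;
    a project might prefer NaN or the city-wide mean instead).
    """
    values = [demand[n] for n in neighbors.get(cell, []) if n in demand]
    return sum(values) / len(values) if values else 0.0


# Made-up hourly demand; neighbor "n4" has no record for this hour.
demand = {"center": 20, "n1": 30, "n2": 50, "n3": 10}
neighbors = {"center": ["n1", "n2", "n3", "n4"]}

spatial_lag = ring1_mean_demand("center", demand, neighbors)
print(spatial_lag)
```

To avoid leaking the target, the neighbor values used at training time should come from an earlier hour than the one being predicted.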

✈️

Airports are structural outliers

O'Hare and Midway exhibit demand patterns driven by flight schedules, not city rhythms. Their MAPE is 28% vs the city-wide 14.3% — a clear modeling challenge for future work.

Top Temporal Findings

Figure: Total Citywide Trip Volume by Day of Week and Hour of Day
Each column = one day (Mon–Sun). Y-axis = hour 0–23 (12 AM to 11 PM). Brightness encodes trip volume. Friday/Saturday nights dominate.

Map Gallery — Heatmap Comparisons Across Event Types

These three annotated heatmaps from the project's map visualization notebook show how Chicago's rideshare demand geography changes dramatically depending on the day type — and how the model must generalize across all of them.

Weekday Demand Pattern (Baseline Pattern) · Mon–Fri · December 2024 · 22 Weekdays · 200k points sampled
Chicago Weekday Demand Heatmap
Legend: Very High · High · Moderate · Low · Very Low
2,808,065 total trips · Avg 127,639/day · Dominant pattern: bimodal commute (AM + PM) · Hottest zone: Loop business district
New Year's Eve Midnight (Peak Demand Event) · Dec 31 10PM – Jan 1 2AM · 4-Hour Window
New Year's Eve Midnight Demand Heatmap
Legend: Extreme Demand · Very High · High · Moderate · Low
27,714 trips in 4 hours · 1.2× surge vs normal · Top zone: River North / Near North · Entire city coastline lit purple
Christmas Day (Holiday Drop −53%) · Full 24 Hours of December 25, 2024
Christmas Day Demand Heatmap
Legend: Highest (still below normal) · Moderate · Low · Very Low · Minimal
62,768 trips (47% of avg) · 53% below normal · Most active: O'Hare Airport · City nearly dark except airport + Loop
Model Challenge (Holiday Generalization): The Christmas Day map shows the model must handle a 53% demand drop with almost no signal in outer neighborhoods. The TFT's variable selection network and holiday embedding explicitly learned this pattern. The NYE midnight window shows a 23% lift over a normal 4-hour block, concentrated in a tight geographic corridor; this window tested the T4 surge classifier hardest.
Key takeaway: The lag features (lag_1h, lag_24h, lag_168h) are the dominant predictors by a wide margin — accounting for ~56% of model importance. This confirms that rideshare demand is highly auto-correlated: what happened at the same place last hour, yesterday, and last week is the best predictor of what will happen now. Spatial neighbor features add a further ~14%, reducing prediction error especially in transitional zones between hot and cold cells.
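The shares quoted in this takeaway can be reproduced directly from the XGBoost feature-importance values reported earlier on the page:

```python
# Importance values as listed in the Feature Importance (XGBoost) section.
importance = {
    "lag_1h": 0.213,
    "lag_24h": 0.181,
    "lag_168h": 0.163,
    "ring1_mean_demand": 0.137,
    "hour_sin": 0.108,
    "is_nightlife_hour": 0.086,
    "roll_mean_6h": 0.063,
    "is_weekend": 0.047,
}

# Group the three lag features; the spatial group is the single ring-1 feature.
lag_share = sum(v for name, v in importance.items() if name.startswith("lag_"))
spatial_share = importance["ring1_mean_demand"]

print(f"lag features: {lag_share:.1%}, spatial neighbor: {spatial_share:.1%}")
```

The three lag features sum to 0.557 and the ring-1 feature contributes 0.137, matching the ~56% and ~14% figures above.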
🛠️
Reference

Tech Stack

🐼

pandas / numpy

Core data wrangling. Chunked CSV reads, datetime parsing, aggregation, and memory-optimized dtypes.

H3-py (Uber)

Hexagonal hierarchical spatial indexing. Converts lat/lon to H3 cells, computes k-ring neighbors for spatial features.

🗺️

GeoPandas + Folium

Spatial joins with Chicago shapefiles and interactive choropleth HTML maps rendered with Leaflet under the hood.

🚀

XGBoost

Primary model. Gradient-boosted trees with GPU acceleration support. Best overall performance on the spatiotemporal regression task.

🌲

scikit-learn

Baseline models (Ridge, Random Forest), preprocessing pipelines, cross-validation utilities, and metrics computation.

💡

LightGBM

Competitive alternative to XGBoost. Faster training on large datasets with leaf-wise tree growth and native categorical-feature support.

📊

matplotlib / seaborn

Static visualization for EDA plots — distribution histograms, time-series demand charts, and correlation heatmaps.

💾

joblib / parquet

Model serialization and compressed columnar storage for intermediate datasets. Parquet is ~10× faster to read than CSV at this scale.

Environment

requirements.txt
pandas>=2.0
numpy>=1.24
h3>=3.7,<4.0       # geo_to_h3 is the 3.x API; 4.x renames it to latlng_to_cell
geopandas>=0.13
folium>=0.14
xgboost>=1.7
lightgbm>=3.3
scikit-learn>=1.2
matplotlib>=3.7
seaborn>=0.12
joblib>=1.2
pyarrow>=11.0      # parquet engine
jupyter>=1.0
🚀
Reference

Quickstart

1. Clone & Install

bash
git clone https://github.com/Kslaxman/chiride-demand.git
cd chiride-demand
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

2. Download Data

Download the TNC Trips dataset from the Chicago Data Portal. The direct export URL for the full dataset is:

bash
# Chicago TNC Trips (2018 - present, ~8GB CSV)
curl -o data/tnp_trips.csv \
  "https://data.cityofchicago.org/api/views/m6dm-c72p/rows.csv?accessType=DOWNLOAD"

3. Run the Pipeline

Execute notebooks in order. Each notebook reads from and writes to the data/ folder. All intermediate files are stored as .parquet for fast I/O.

bash
jupyter nbconvert --to notebook --execute 01data_cleaning.ipynb
jupyter nbconvert --to notebook --execute 02eda.ipynb
jupyter nbconvert --to notebook --execute 03map_visualizations.ipynb
jupyter nbconvert --to notebook --execute 04h3_analysis.ipynb
jupyter nbconvert --to notebook --execute 05feature_eng.ipynb
jupyter nbconvert --to notebook --execute 06model_training.ipynb
jupyter nbconvert --to notebook --execute 07model_eval_api.ipynb

4. Run a Prediction

python
from src.predict import predict_demand
from datetime import datetime

result = predict_demand(
    lat=41.8819,
    lon=-87.6278,
    timestamp=datetime(2024, 6, 21, 22, 0),
)
print(result)
# → {"h3_cell": "872a1072fffffff", "predicted_trips": 47.3}
Note on memory: Processing the full 100M+ row dataset requires ~16 GB RAM. For development, use the SAMPLE_FRAC=0.05 environment variable to work with a 5% random sample: SAMPLE_FRAC=0.05 jupyter notebook 01data_cleaning.ipynb
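How the notebooks consume `SAMPLE_FRAC` is not shown on this page. One plausible pattern, sketched here with hypothetical helper names, is to read the variable with a default of 1.0 (full dataset) and apply a seeded random filter so that repeated runs select the same rows:

```python
import os
import random


def sample_fraction(default=1.0):
    """Read SAMPLE_FRAC from the environment, falling back to the full dataset."""
    frac = float(os.environ.get("SAMPLE_FRAC", default))
    if not 0.0 < frac <= 1.0:
        raise ValueError(f"SAMPLE_FRAC must be in (0, 1], got {frac}")
    return frac


def sample_rows(rows, frac, seed=42):
    """Keep each row with probability `frac`; a fixed seed keeps runs repeatable."""
    if frac >= 1.0:
        return list(rows)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < frac]


frac = sample_fraction()
subset = sample_rows(range(10_000), frac)
print(f"kept {len(subset)} of 10000 rows at SAMPLE_FRAC={frac}")
```

With pandas, the equivalent step would typically be `df.sample(frac=frac, random_state=42)` applied after each chunk is read.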

Project Structure

project layout
chiride-demand/
├── 01data_cleaning.ipynb
├── 02eda.ipynb
├── 03map_visualizations.ipynb
├── 04h3_analysis.ipynb
├── 05feature_eng.ipynb
├── 06model_training.ipynb
├── 07model_eval_api.ipynb
├── data/
│   ├── raw/          # downloaded CSVs
│   ├── interim/      # cleaned .parquet files
│   └── processed/    # final feature matrices
├── models/
│   └── xgb_demand_model.joblib
├── maps/
│   └── *.html        # Folium interactive maps
├── src/
│   └── predict.py    # inference API
└── requirements.txt
Chicago Ride Demand Forecast · Sailaxman Kotha · github.com/Kslaxman/chiride-demand