
1. Extracting Pedestrian Trajectories from Street Video as JSON

Note: I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.

Motivation

Why extract pedestrian trajectories from smartphone video footage? This approach serves multiple purposes in my research on urban social movements:

  1. GIS-ready data: JSON output integrates seamlessly with geographic information systems and mapping tools
  2. Cost-effective data collection: Eliminates the need for expensive GPS trackers or surveillance infrastructure
  3. Understanding pedestrian behavior: Reveals how people move and interact in urban environments
  4. Measuring protest reactions: Quantifies how standing demonstrations affect surrounding pedestrian flow

This project emphasizes rapid deployment for protest monitoring. The entire setup requires only a smartphone and tripod, enabling quick response to emerging events.

Introduction

In urban planning and transportation studies, understanding pedestrian movement patterns is crucial for designing safer and more efficient public spaces. Previous methods like manual observation or GPS tracking have limitations in coverage and cost. Computer vision offers a scalable alternative through video analysis.

This article demonstrates how to extract pedestrian trajectories from street video footage using open-source tools. I'll use YOLOX-Tiny for real-time person detection and implement a custom centroid-based tracker to generate structured JSON trajectory data. The sample videos used in this project were captured on a smartphone, which keeps the setup lightweight and easy to deploy.

Methodology

Person Detection with YOLOX-Tiny

YOLOX-Tiny is a lightweight object detection model optimized for real-time inference. I use the ONNX export for cross-platform compatibility with OpenCV and ONNX Runtime.

The detection pipeline:

  1. Preprocessing: Letterbox resizing to maintain aspect ratio
  2. Inference: YOLOX model processes the frame
  3. Postprocessing: Convert detections to bounding boxes
  4. Filtering: Confidence thresholding and non-maximum suppression
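
A minimal sketch of the letterbox preprocessing (step 1 above) is shown below. It assumes a BGR frame from OpenCV and the 416x416 input size used later in `detect_persons`; recent YOLOX ONNX exports take raw pixel values, while older exports expect mean/std normalization, so adjust this to match your model.

```python
import cv2
import numpy as np

def preprocess_yolox(frame, input_w=416, input_h=416):
    """Letterbox-resize a BGR frame to the model input size.

    Returns the NCHW float32 blob and the resize ratio needed to map
    detections back to original image coordinates.
    """
    # Gray canvas (value 114 is the padding value YOLOX uses)
    padded = np.full((input_h, input_w, 3), 114, dtype=np.uint8)

    # Scale so the longer side fits while preserving aspect ratio
    ratio = min(input_w / frame.shape[1], input_h / frame.shape[0])
    resized_w = int(frame.shape[1] * ratio)
    resized_h = int(frame.shape[0] * ratio)
    resized = cv2.resize(frame, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR)

    # Paste the resized image into the top-left corner of the canvas
    padded[:resized_h, :resized_w] = resized

    # HWC uint8 -> NCHW float32
    blob = padded.transpose(2, 0, 1)[np.newaxis].astype(np.float32)
    return blob, ratio
```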

Centroid-Based Tracking

For tracking detected persons across frames, I implement a simple but effective centroid tracker:

  • Each detection's bounding box center becomes a centroid
  • Tracks are maintained by matching centroids between frames
  • New tracks are registered for unmatched detections
  • Lost tracks are deregistered after a maximum disappearance threshold
  • For analysis and visualization, the foot point (the bottom center of each bounding box) is also recorded alongside the centroid

Trajectory Analysis

For each complete trajectory, I extract:

  • Duration: Total time the person was tracked
  • Distance: Total pixels traveled
  • Direction: Movement angle in degrees
  • Start/End positions: Entry and exit points
  • Screen exit detection: Whether the person left the frame
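
The sketch below shows one way to derive these metrics from the `(x, y, frame)` tuples the tracker stores per ID; the edge margin used for the screen-exit check is an arbitrary assumption.

```python
import math

def summarize_track(points, fps, frame_w, frame_h, edge_margin=20):
    """Compute summary metrics for one trajectory.

    points: list of (x, y, frame_number) tuples in pixel coordinates.
    """
    (x0, y0, f0), (x1, y1, f1) = points[0], points[-1]

    # Duration in seconds between the first and last observation
    duration = (f1 - f0) / fps

    # Total path length in pixels (sum of step distances)
    distance = sum(
        math.hypot(bx - ax, by - ay)
        for (ax, ay, _), (bx, by, _) in zip(points, points[1:])
    )

    # Overall movement angle in degrees, from the start point to the end point
    direction = math.degrees(math.atan2(y1 - y0, x1 - x0))

    # Screen-exit heuristic: the last point sits close to a frame edge
    exited = (
        x1 < edge_margin or y1 < edge_margin
        or x1 > frame_w - edge_margin or y1 > frame_h - edge_margin
    )

    return {
        "duration": round(duration, 2),
        "total_distance": round(distance, 1),
        "direction_deg": round(direction, 1),
        "start": [x0, y0],
        "end": [x1, y1],
        "exited_frame": exited,
    }
```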

Implementation

Setup and Dependencies

```bash
# Required packages
pip install opencv-python numpy onnxruntime

# Download YOLOX-Tiny ONNX model
# From: https://github.com/Megvii-BaseDetection/YOLOX
```

Core Detection Function

```python
def detect_persons(frame, session):
    # Preprocess frame
    blob, ratio = preprocess_yolox(frame, 416, 416)

    # Run inference
    output = session.run(None, {session.get_inputs()[0].name: blob})[0]

    # Postprocess detections
    # ... (filter by confidence, apply NMS)

    return boxes, confidences
```

Tracking Implementation

Full CentroidTracker Implementation

```python
from collections import defaultdict
import numpy as np


class CentroidTracker:
    """
    Centroid-based tracking algorithm for associating detected bounding
    boxes across frames.

    In addition to tracking centroids, it also maintains trajectories based
    on the foot point of the bounding box (the point where the person
    touches the ground), which is more stable for movement analysis.
    """

    def __init__(self, max_disappeared=50):
        self.next_object_id = 0
        self.objects = {}       # ID: (centroid_x, centroid_y)
        self.disappeared = {}   # ID: disappeared_frame_count
        self.trajectories = defaultdict(list)  # ID: [(x, y, frame), ...]
        self.first_seen = {}    # ID: first frame detected
        self.last_seen = {}     # ID: last frame detected
        self.max_disappeared = max_disappeared

    def register(self, centroid, foot_point, frame_num):
        """Register a new object with a unique ID."""
        self.objects[self.next_object_id] = centroid
        self.disappeared[self.next_object_id] = 0
        self.trajectories[self.next_object_id].append(
            (foot_point[0], foot_point[1], frame_num)
        )
        self.first_seen[self.next_object_id] = frame_num
        self.last_seen[self.next_object_id] = frame_num
        self.next_object_id += 1

    def deregister(self, object_id):
        """Deregister an object and remove it from active tracking."""
        del self.objects[object_id]
        del self.disappeared[object_id]

    def update(self, rects, frame_num):
        """
        Update the tracker with new bounding box detections.

        Args:
            rects: list of detected bounding boxes [(x1, y1, x2, y2), ...]
            frame_num: the current frame number

        Returns:
            objects: a dictionary mapping object IDs to their current
            centroids {id: (cx, cy)}
        """
        # When no detections are present, mark existing objects as disappeared
        if len(rects) == 0:
            for object_id in list(self.disappeared.keys()):
                self.disappeared[object_id] += 1
                if self.disappeared[object_id] > self.max_disappeared:
                    self.deregister(object_id)
            return self.objects

        # Compute centroids and foot points for the current detections
        input_centroids = np.zeros((len(rects), 2), dtype="int")
        input_feet = np.zeros((len(rects), 2), dtype="int")
        for i, (x1, y1, x2, y2) in enumerate(rects):
            cx = int((x1 + x2) / 2.0)
            input_centroids[i] = (cx, int((y1 + y2) / 2.0))
            input_feet[i] = (cx, y2)  # foot point is the bottom center of the bounding box

        # If no existing objects, register all input centroids
        if len(self.objects) == 0:
            for i in range(len(input_centroids)):
                self.register(input_centroids[i], input_feet[i], frame_num)
        # Existing objects are present: match input centroids to existing object centroids
        else:
            object_ids = list(self.objects.keys())
            object_centroids = list(self.objects.values())

            # Compute the distance matrix between existing object centroids and input centroids
            D = np.zeros((len(object_centroids), len(input_centroids)))
            for i, oc in enumerate(object_centroids):
                for j, ic in enumerate(input_centroids):
                    D[i, j] = np.linalg.norm(oc - ic)

            # Find the smallest-distance pairs (existing object to input centroid)
            rows = D.min(axis=1).argsort()
            cols = D.argmin(axis=1)[rows]

            used_rows = set()
            used_cols = set()
            for (row, col) in zip(rows, cols):
                if row in used_rows or col in used_cols:
                    continue
                # Only accept the pair as a match when the distance is below a threshold
                if D[row, col] > 100:  # if the distance is too large, ignore the match (this threshold can be tuned)
                    continue

                object_id = object_ids[row]
                self.objects[object_id] = input_centroids[col]  # use the centroid for tracking
                self.disappeared[object_id] = 0
                self.trajectories[object_id].append(
                    (input_feet[col][0], input_feet[col][1], frame_num)  # use the foot point for trajectory analysis
                )
                self.last_seen[object_id] = frame_num

                used_rows.add(row)
                used_cols.add(col)

            # Existing objects that were not matched
            unused_rows = set(range(D.shape[0])) - used_rows
            for row in unused_rows:
                object_id = object_ids[row]
                self.disappeared[object_id] += 1
                if self.disappeared[object_id] > self.max_disappeared:
                    self.deregister(object_id)

            # Input centroids that were not matched become new objects
            unused_cols = set(range(D.shape[1])) - used_cols
            for col in unused_cols:
                self.register(input_centroids[col], input_feet[col], frame_num)

        return self.objects
```
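
To tie detection and tracking together, a minimal processing loop could look like the sketch below. The model filename is an assumption, and the confidence/NMS details elided in `detect_persons` above still need to be filled in before this runs end to end.

```python
import cv2
import onnxruntime as ort

def process_video(video_path, model_path="yolox_tiny.onnx"):
    """Run detection and tracking over a whole video and return the tracker."""
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    tracker = CentroidTracker(max_disappeared=50)

    cap = cv2.VideoCapture(video_path)
    frame_num = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break

        # Person bounding boxes in original-image coordinates
        boxes, confidences = detect_persons(frame, session)

        # Associate detections with existing tracks
        tracker.update(boxes, frame_num)
        frame_num += 1

    cap.release()
    return tracker
```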

JSON Output Structure

The trajectory data is saved as structured JSON:

```json
{
  "video_name": "street_footage.mp4",
  "fps": 30,
  "resolution": "1920x1080",
  "tracks": [
    {
      "id": 1,
      "duration": 12.5,
      "total_distance": 320.4,
      "trajectory": [
        {"x": 100, "y": 200, "frame": 10, "time_sec": 0.333},
        {"x": 105, "y": 202, "frame": 11, "time_sec": 0.367}
      ],
      "geometry": {
        "type": "LineString",
        "coordinates": [[100, 200], [105, 202]]
      }
    }
  ]
}
```
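
As an illustration, the helper below serializes the tracker's foot-point trajectories into this layout; the function name and rounding choices are assumptions, not the exact script used here.

```python
import json
import math

def export_tracks_json(tracker, video_name, fps, width, height, out_path):
    """Write per-ID trajectories to the JSON structure shown above."""
    tracks = []
    for track_id, points in tracker.trajectories.items():
        # Total path length in pixels
        distance = sum(
            math.hypot(bx - ax, by - ay)
            for (ax, ay, _), (bx, by, _) in zip(points, points[1:])
        )
        tracks.append({
            "id": track_id,
            "duration": round((points[-1][2] - points[0][2]) / fps, 2),
            "total_distance": round(distance, 1),
            "trajectory": [
                {"x": int(x), "y": int(y), "frame": int(f), "time_sec": round(f / fps, 3)}
                for (x, y, f) in points
            ],
            "geometry": {
                "type": "LineString",
                "coordinates": [[int(x), int(y)] for (x, y, _) in points],
            },
        })

    data = {
        "video_name": video_name,
        "fps": fps,
        "resolution": f"{width}x{height}",
        "tracks": tracks,
    }
    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
```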

Results

Example frame captured at Shinbashi station showing pedestrian trajectories during a demonstration

Processing a 5-minute video at 30 FPS yields a set of per-person tracks, and the resulting JSON output provides rich data for further analysis:

  • Spatial patterns of movement
  • Temporal distribution of pedestrian activity
  • Flow direction analysis

Discussion

Advantages of This Approach

  1. Cost-effective: Uses commodity hardware and free software
  2. Scalable: Can process hours of footage automatically
  3. Structured output: JSON format integrates with GIS and analysis tools
  4. Real-time capable: YOLOX-Tiny enables live processing

The sample videos in this project were captured on a smartphone, but the same pipeline can be applied to fixed surveillance cameras for longer-term monitoring.

Interpretation

Processing demonstration videos from Shinbashi station revealed insights about centroid tracking performance and pedestrian behavior during protests:

  1. Commuter indifference: In Japan, individual protests are uncommon, so commuters typically ignore demonstrators. Additionally, most people are busy office workers who tend to focus on their commute rather than noticing activities around them.

  2. Camera height issues: Using a smartphone camera with a low tripod created unreliable detections. People near the camera appeared with unnatural up-and-down trajectories due to the low-angle perspective.

  3. ID swapping during interactions: When pedestrians crossed paths or interacted closely, their tracking IDs would swap, creating fragmented trajectories for the same individuals.

Overall, the system successfully captured general movement patterns. Future improvements could include filtering trajectories with sudden angle changes after intersections or removing outliers based on historical movement differences.
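
As a rough sketch of the angle-based filtering idea, the function below drops trajectory points that would create an implausibly sharp turn between consecutive steps; the 120-degree threshold is an arbitrary assumption and would need tuning against real footage.

```python
import math

def filter_sharp_turns(points, max_turn_deg=120):
    """Drop trajectory points that produce implausibly sharp direction changes.

    points: list of (x, y, frame) tuples; returns a filtered copy.
    """
    if len(points) < 3:
        return list(points)

    kept = [points[0], points[1]]
    for nxt in points[2:]:
        (ax, ay, _), (bx, by, _) = kept[-2], kept[-1]
        cx, cy, _ = nxt

        # Heading before and after the most recent kept point
        h1 = math.atan2(by - ay, bx - ax)
        h2 = math.atan2(cy - by, cx - bx)
        turn = abs(math.degrees(h2 - h1))
        turn = min(turn, 360 - turn)  # wrap into [0, 180]

        if turn <= max_turn_deg:
            kept.append(nxt)
        # otherwise skip the point as likely jitter or an ID swap
    return kept
```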

Limitations and Future Improvements

  1. Occlusion handling: Simple centroid tracking fails in crowds
  2. Camera motion: Assumes static camera position
  3. Identity persistence: No re-identification across camera cuts
  4. Stopping behavior: People who stop moving in videos sometimes lose their tracking ID due to centroid distance thresholds, leading to fragmented trajectories (e.g., ID 5 → 110 → 430 as the same person gets re-detected with new IDs)

For crowded scenes, more sophisticated trackers like DeepSORT or ByteTrack would improve performance. Camera motion compensation using optical flow could extend applicability to moving platforms.

In this project, I prioritized spending time on analysis and visualization rather than implementing the most advanced tracking pipeline; that tradeoff made it easier to iterate quickly with real data.

Applications

This trajectory data serves as input for:

  • Urban planning: Identifying pedestrian flow bottlenecks
  • Safety analysis: Detecting high-risk crossing patterns
  • Traffic engineering: Optimizing signal timing
  • Accessibility studies: Understanding mobility patterns

The structured JSON format makes it easy to integrate with mapping libraries like MapLibre GL JS for visualization, as I'll explore in the next article.

Conclusion

By combining YOLOX-Tiny detection with centroid tracking, I can extract meaningful pedestrian trajectory data from video footage. The resulting JSON structure provides a foundation for spatial analysis of urban movement patterns. While the current implementation works well for moderate-density scenarios, future enhancements could address occlusion and camera motion challenges.

In the next article, I'll visualize these trajectories on an interactive map using MapLibre GL JS.
