This paper focuses on the computer vision stack required for such systems, moving beyond controlled settings to address the challenges of real-world outdoor deployment.
The core perception challenge is multi-faceted: robots must localize themselves without reliable GPS (e.g., in urban canyons or close to tall buildings), identify and segment traversable paths, detect and classify a wide range of static and dynamic obstacles (from parked cars and pedestrians to fallen branches), and do all of this under varying weather and lighting conditions while meeting stringent real-time and power constraints.
A robust perception system for this domain relies on sensor fusion, combining complementary data streams. The typical suite includes visual cameras (color and/or monochrome), which provide rich texture and semantic information but are sensitive to lighting; LiDAR sensors, which offer precise, direct 3D distance measurements (point clouds) and work in the dark but can struggle with reflective surfaces and fog; and often radar, which is excellent for detecting moving objects and is robust to weather but provides lower spatial resolution. The fusion of these modalities, often at a deep feature level using neural networks, creates a more resilient and complete world model than any single sensor could provide.
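To make the fusion step concrete, the following minimal PyTorch sketch shows one common pattern for deep feature-level fusion: camera and LiDAR feature maps are projected into a shared space and combined by a small convolutional head. The module name, channel sizes, and the assumption of spatially aligned bird's-eye-view feature maps are illustrative placeholders rather than a specific published architecture.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Illustrative camera/LiDAR fusion: project both modalities to a shared
    feature space and fuse them with a small convolutional head. Channel sizes
    and the assumption of spatially aligned feature maps are placeholders."""

    def __init__(self, cam_channels=256, lidar_channels=64, fused_channels=128):
        super().__init__()
        self.cam_proj = nn.Conv2d(cam_channels, fused_channels, kernel_size=1)
        self.lidar_proj = nn.Conv2d(lidar_channels, fused_channels, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * fused_channels, fused_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_feat, lidar_feat):
        # Both inputs are assumed to share the same spatial grid (H x W),
        # e.g., after projecting camera features into a bird's-eye view.
        fused = torch.cat([self.cam_proj(cam_feat), self.lidar_proj(lidar_feat)], dim=1)
        return self.fuse(fused)

# Example: fuse a 256-channel camera map with a 64-channel LiDAR BEV map.
cam = torch.randn(1, 256, 100, 100)
lidar = torch.randn(1, 64, 100, 100)
out = FeatureLevelFusion()(cam, lidar)   # -> (1, 128, 100, 100)
```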
A critical task is visual Simultaneous Localization and Mapping (vSLAM). This process allows the robot to build a map of its unknown environment while simultaneously tracking its position within it. Modern vSLAM systems, such as ORB-SLAM3, use features extracted from camera images to create a sparse 3D map and estimate camera pose.
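The tracking front-end of such a system can be illustrated with a minimal OpenCV sketch: ORB features are matched between consecutive frames, and the relative camera motion is recovered from an essential matrix estimated with RANSAC. This is a simplified stand-in for the front-end of a full vSLAM system such as ORB-SLAM3 (which adds keyframe management, local mapping, and loop closure); the intrinsic matrix K and the function name are placeholders.

```python
import cv2
import numpy as np

# Placeholder intrinsics; a real system uses calibrated camera parameters.
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])

def relative_pose(img1, img2, K):
    """Estimate relative camera motion between two grayscale frames from
    ORB feature matches (the core of a feature-based vSLAM tracking step)."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force Hamming matching with cross-checking for reliability.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC-based essential matrix estimation rejects outlier matches;
    # the rotation R and unit-scale translation t are then recovered.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```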
For delivery robots, this map must be semantically enriched—not just a point cloud, but one where points are labeled as "road," "sidewalk," "building," or "vegetation." This is achieved through semantic segmentation models (e.g., based on architectures like DeepLabV3+ or HRNet) that run on the incoming visual data, labeling each pixel. The robot can then plan a path that stays on semantically labeled "drivable area" while avoiding "obstacle" regions.
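A minimal inference sketch is shown below, assuming a torchvision DeepLabV3 model as a stand-in; a deployed robot would use a network fine-tuned on an urban dataset such as Cityscapes, and the class indices treated as "drivable" here are placeholders for that model's actual label map.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Stand-in model; a deployed system would load weights fine-tuned on an
# urban driving dataset (e.g., Cityscapes) rather than the default ones.
model = deeplabv3_resnet50(weights="DEFAULT").eval()

# Placeholder indices: assume the fine-tuned model maps these ids to
# "road" and "sidewalk"; the real mapping depends on the training data.
DRIVABLE_CLASSES = {0, 1}

@torch.no_grad()
def drivable_mask(image_tensor):
    """image_tensor: normalized (1, 3, H, W) batch. Returns a boolean
    (H, W) mask marking pixels the planner may treat as traversable."""
    logits = model(image_tensor)["out"]          # (1, num_classes, H, W)
    labels = logits.argmax(dim=1).squeeze(0)     # per-pixel class ids
    mask = torch.zeros_like(labels, dtype=torch.bool)
    for cls in DRIVABLE_CLASSES:
        mask |= labels == cls
    return mask
```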
Dynamic obstacle detection and tracking present another layer of complexity. This requires real-time object detection (using models like YOLO or efficient CNN derivatives) to identify entities such as pedestrians, cyclists, and vehicles. These detections are then tracked across frames using algorithms like SORT or DeepSORT, which associate detections over time to estimate trajectories and velocities. This predictive capability is essential for safe navigation, allowing the robot to anticipate whether a pedestrian will step into its path.
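The association step at the heart of SORT-style tracking can be sketched as follows: detections are matched to existing tracks by maximizing bounding-box IoU with the Hungarian algorithm. The Kalman-filter motion model and the appearance embeddings used by DeepSORT are omitted for brevity, so this is an illustrative simplification rather than a complete tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match current detections to existing tracks by maximizing total IoU
    (Hungarian algorithm). Returns matched (track_idx, det_idx) pairs plus
    the indices of unmatched tracks and unmatched detections."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))

    # Cost = 1 - IoU, so minimizing cost maximizes overlap.
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)

    matches = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```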
Finally, all of this computation must run on embedded hardware, which necessitates model optimization through techniques such as pruning, quantization, and the use of hardware-specific neural processing units (NPUs) to achieve the required frame rate within a tight power budget (a minimal sketch of this step follows the concluding paragraph).

We conclude that the computer vision stack for autonomous delivery is an integrated system of localization, semantic understanding, and dynamic scene analysis. Its success depends less on breakthroughs in any single algorithm and more on the robust, efficient, and safe integration of multiple perception modalities and reasoning layers to create a reliable and trustworthy autonomous navigator.
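As a minimal illustration of that optimization step, the sketch below applies magnitude-based pruning followed by post-training dynamic quantization in PyTorch. The tiny placeholder network stands in for a real perception model; an actual deployment pipeline would additionally involve fine-tuning after pruning and compilation with the target NPU's toolchain.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder network standing in for a real perception model.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# 1) Magnitude pruning: zero out the 30% smallest weights in each layer.
for module in model.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the sparsity permanent

# 2) Post-training dynamic quantization of the linear layers to int8.
#    (Conv layers typically need static quantization or an NPU toolchain.)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The pruned and quantized model runs the same forward pass with a smaller
# memory footprint and faster int8 linear kernels on supported backends.
out = quantized(torch.randn(1, 3, 32, 32))
```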