YOLO transforms passive CCTV footage into an intelligent detection system — identifying fights, intrusions, falls, and anomalies in real time, at the edge, with millisecond latency.
In 2015, Joseph Redmon, Ali Farhadi, and their co-authors published a paper that changed computer vision forever. Before YOLO, object detection was a two-stage process: first propose regions (R-CNN, Fast R-CNN, Faster R-CNN), then classify each one. Slow. Redundant. Impractical for real-time use.
YOLO reframed detection as a single regression problem. One neural network pass. The entire image. All bounding boxes and class probabilities predicted simultaneously.
The input image is divided into an S×S grid. Each cell predicts B bounding boxes with confidence scores and C class probabilities. Non-maximum suppression filters overlapping detections. The result: real-time detection at 45+ FPS — orders of magnitude faster than anything before it.
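To make that last filtering step concrete, here is a minimal greedy non-maximum suppression sketch in plain NumPy. The IoU threshold and helper names are illustrative, not taken from any particular YOLO release; production implementations are vectorized and class-aware.

import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes in [x1, y1, x2, y2] format
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    # scores = box confidence x class probability, one per predicted box
    order = list(np.argsort(scores)[::-1])  # highest score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep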
This single insight — you only look once — spawned an entire family of models that now power everything from autonomous driving to surveillance to medical imaging.
From a grad student's insight to one of the most widely deployed object detection families on earth. Each version pushed the boundary of what's possible in real-time perception.
Joseph Redmon, University of Washington. Single-pass detection as regression. 45 FPS on Titan X. 24 conv layers + 2 FC. 63.4 mAP on VOC. Changed everything.
Batch normalization, anchor boxes, Darknet-19 backbone, multi-scale training. Detects 9000+ categories via hierarchical WordTree joint training on ImageNet + COCO.
Feature Pyramid Network (FPN) for multi-scale predictions at 3 scales. Darknet-53 backbone (53 conv layers with residual connections). 57.9 AP50 on COCO. The workhorse for years.
Alexey Bochkovskiy takes over. CSPDarknet53 backbone, SPP, PANet. Introduced "bag of freebies" (data aug, CIoU loss) and "bag of specials" (Mish activation, SAM). 43.5 AP on COCO at 65 FPS.
Glenn Jocher's PyTorch implementation. No paper, just code that works. AutoAnchor, mosaic augmentation, model scaling (n/s/m/l/x). Became the most-used YOLO in production.
EfficientRep backbone, Rep-PAN neck. Hardware-friendly reparameterization. Optimized for industrial deployment. 52.5 mAP (large variant); the nano variant reaches 1234 FPS (T4, FP16).
Extended efficient layer aggregation. Trainable bag-of-freebies. Model re-parameterization. 56.8 mAP at 30 FPS (V100). State of the art at the time.
Ultralytics' next gen. Anchor-free detection head. C2f modules. Decoupled head for classification and regression. Supports detection, segmentation, pose, OBB, classification. 50.2 mAP (medium).
Programmable Gradient Information (PGI) prevents information loss in deep networks. Generalized ELAN (GELAN) architecture. 55.6 mAP with efficient parameter use.
Tsinghua University. Consistent dual assignments for NMS-free training. End-to-end latency optimization. Quantization-aware design for edge deployment.
Ultralytics. Optimized CSP blocks, reduced FLOPs, higher mAP. 54.7 mAP (x-large); the nano variant runs in ~1.5 ms on a T4. Fewer parameters than v8 with better accuracy. The current state of the art.
Open-vocabulary detection with text prompts. Detect any object described in natural language without retraining. Vision-language model integration. Zero-shot detection.
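In the Ultralytics API this kind of prompt-driven, open-vocabulary detection is exposed through YOLO-World. A minimal sketch, with an illustrative prompt list (weights download on first use):

from ultralytics import YOLOWorld

# Open-vocabulary model: classes come from text prompts, not a fixed label set
model = YOLOWorld("yolov8s-world.pt")

# Describe the objects of interest in natural language
model.set_classes(["person climbing a fence", "abandoned suitcase", "knife"])

results = model("cctv_frame.jpg", conf=0.25)
results[0].show()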
The average security guard monitors 16 screens at once. Studies show attention degrades after just 20 minutes of continuous viewing, and under sustained monitoring critical events are missed at rates approaching 95%.
A mid-size city generates petabytes of CCTV footage daily. Most of it is never watched. When a crime occurs, investigators scrub through hours of recordings manually — often days after the event.
Motion detection, the legacy approach, triggers on every shadow, swaying tree, and passing car. Alert fatigue sets in fast. Operators learn to ignore alarms — defeating the purpose entirely.
What's needed isn't more cameras. It's intelligent perception — systems that understand what is happening, not just that something moved. YOLO provides exactly this: semantic understanding of scenes at real-time speed.
A trained operator's effective monitoring drops below 50% after 22 minutes of continuous viewing. In a 12-hour shift, most events are effectively invisible.
London alone has 691,000+ CCTV cameras. At 30 FPS, that's 1.79 trillion frames per day. No human workforce can process this. AI must.
A fight escalates in 3-5 seconds. A perimeter breach gives you 10 seconds to respond. Detection must be real-time — not "we found it in yesterday's footage."
From raw video stream to actionable alert in under 50 milliseconds. Here's how every frame flows through the system.
YOLO-based systems don't just detect objects — they detect situations. By combining object detection with tracking, pose estimation, and temporal analysis, these systems identify critical events.
Detect aggressive body poses and rapid person-to-person proximity changes. Pose estimation + velocity vectors identify altercations before they escalate.
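A rough sketch of that proximity-plus-velocity heuristic, built on Ultralytics tracking. The thresholds, file names, and the simple two-frame velocity estimate are illustrative; a production system would also use pose features and temporal smoothing.

from collections import defaultdict
from itertools import combinations
from ultralytics import YOLO

model = YOLO("yolo11s.pt")
history = defaultdict(list)  # track_id -> recent (cx, cy) centers

PROXIMITY_PX = 80      # "close together" threshold, tune per camera
CLOSING_SPEED_PX = 25  # pixels per frame of approach speed

# Track only persons (COCO class 0)
for r in model.track("cam1.mp4", stream=True, persist=True, classes=[0]):
    if r.boxes.id is None:
        continue
    ids = r.boxes.id.int().tolist()
    for tid, (cx, cy) in zip(ids, r.boxes.xywh[:, :2].tolist()):
        history[tid].append((cx, cy))

    # Flag pairs of people who are close and closing in fast
    for a, b in combinations(ids, 2):
        if len(history[a]) < 2 or len(history[b]) < 2:
            continue
        (ax, ay), (bx, by) = history[a][-1], history[b][-1]
        (pax, pay), (pbx, pby) = history[a][-2], history[b][-2]
        dist_now = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
        dist_prev = ((pax - pbx) ** 2 + (pay - pby) ** 2) ** 0.5
        if dist_now < PROXIMITY_PX and dist_prev - dist_now > CLOSING_SPEED_PX:
            print(f"Possible altercation between tracks {a} and {b}")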
Define virtual zones and tripwires. YOLO detects persons entering restricted areas with sub-second latency. Integrates with access control systems.
Critical for elderly care facilities. Pose estimation tracks body keypoints — when a person transitions from upright to horizontal with sudden velocity, an alert fires.
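One simple way to encode that transition with the Ultralytics pose models: track each person's hip keypoints and fire when they drop sharply between frames. The model choice, source name, and threshold below are illustrative; keypoint-confidence filtering and multi-frame smoothing are omitted for brevity.

from collections import defaultdict
from ultralytics import YOLO

# Pose model returns 17 COCO keypoints per person (indices 11/12 = left/right hip)
model = YOLO("yolo11n-pose.pt")
prev_hip_y = defaultdict(lambda: None)  # track_id -> last hip height (pixels)

DROP_PX_PER_FRAME = 40  # illustrative; depends on camera height and FPS

for r in model.track("ward_cam.mp4", stream=True, persist=True):
    if r.boxes.id is None:
        continue
    ids = r.boxes.id.int().tolist()
    for tid, kp in zip(ids, r.keypoints.xy):
        if kp[11][1] == 0 and kp[12][1] == 0:
            continue  # hips not detected in this frame
        hip_y = float((kp[11][1] + kp[12][1]) / 2)  # image y grows downward
        last = prev_hip_y[tid]
        # A sudden large downward jump of the hips suggests a fall
        if last is not None and hip_y - last > DROP_PX_PER_FRAME:
            print(f"Possible fall: track {tid}")
        prev_hip_y[tid] = hip_y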
Track objects across frames. When a bag, package, or item remains stationary without an owner for a configured duration, flag it as potentially suspicious.
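A minimal dwell-timer sketch for that logic, tracking only bag-like COCO classes (IDs 24/26/28 = backpack/handbag/suitcase). The thresholds and wall-clock timing are illustrative, and owner association is omitted for brevity.

import time
from ultralytics import YOLO

model = YOLO("yolo11s.pt")
first_seen = {}  # track_id -> (timestamp, cx, cy) where the object settled

ABANDON_SECONDS = 120   # how long an unattended object may sit
MOVE_TOLERANCE_PX = 30  # small jitter still counts as "stationary"

# Track backpacks, handbags, and suitcases only
for r in model.track("concourse_cam.mp4", stream=True, persist=True,
                     classes=[24, 26, 28]):
    if r.boxes.id is None:
        continue
    now = time.time()  # for recorded footage, use frame_index / fps instead
    for tid, (cx, cy) in zip(r.boxes.id.int().tolist(),
                             r.boxes.xywh[:, :2].tolist()):
        if tid not in first_seen:
            first_seen[tid] = (now, cx, cy)
            continue
        t0, x0, y0 = first_seen[tid]
        moved = ((cx - x0) ** 2 + (cy - y0) ** 2) ** 0.5
        if moved > MOVE_TOLERANCE_PX:
            first_seen[tid] = (now, cx, cy)  # object moved: restart the timer
        elif now - t0 > ABANDON_SECONDS:
            print(f"Possible abandoned object: track {tid}")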
Count persons per zone in real time. When density exceeds thresholds or flow patterns become chaotic, trigger crowd management alerts before crush events.
Detect collisions, wrong-way driving, sudden stops, and erratic lane changes. Velocity estimation from object tracking enables traffic anomaly detection.
Custom-trained YOLO models detect fire and smoke with high accuracy. Early detection in warehouses, forests, and industrial sites saves lives and property.
Track person dwell time within defined zones. When an individual remains stationary or circulates in a small area beyond a time threshold, raise a loitering alert.
Detect hard hats, safety vests, goggles, and masks on construction sites. Ensure workplace safety compliance in real time without manual inspection.
Ultralytics makes YOLO absurdly easy. A unified Python API for detection, segmentation, tracking, pose estimation, and classification. Three lines to inference. Ten to train.
from ultralytics import YOLO

# Load a pretrained model
model = YOLO("yolo11n.pt")

# Run inference on an image
results = model("cctv_frame.jpg")

# Process results
for r in results:
    boxes = r.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        label = model.names[cls]
        print(f"{label}: {conf:.2f}")
from ultralytics import YOLO model = YOLO("yolo11s.pt") # Stream from RTSP camera results = model( "rtsp://admin:pass@192.168.1.64/stream1", stream=True, show=True, conf=0.5 ) for r in results: # Each r contains detections for one frame if len(r.boxes) > 0: process_detections(r)
from ultralytics import YOLO model = YOLO("yolo11m.pt") # Track objects across frames results = model.track( source="cctv_footage.mp4", tracker="bytetrack.yaml", persist=True, show=True ) # Each detection now has a unique ID for r in results: if r.boxes.id is not None: ids = r.boxes.id.int().tolist() print(f"Tracking {len(ids)} objects")
from ultralytics import YOLO
from shapely.geometry import Point, Polygon

model = YOLO("yolo11s.pt")

# Define restricted zone (polygon coords in pixels)
restricted = Polygon([
    (100, 200), (400, 200), (400, 500), (100, 500)
])

for r in model.track("cam1.mp4", stream=True):
    for box in r.boxes:
        cx, cy = box.xywh[0][:2]
        if restricted.contains(Point(float(cx), float(cy))):
            trigger_alert("INTRUSION", box)  # user-defined alerting hook
Pre-trained YOLO models detect 80 COCO classes. But CCTV event detection requires custom classes: weapons, smoke, PPE, specific vehicle types. Transfer learning makes this practical with as few as 500 labeled images.
Annotate images using Roboflow or CVAT. Export in YOLO format: one .txt per image with class x_center y_center width height (normalized 0-1). Split 80/10/10 train/val/test.
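For example, a single label file for a frame containing one person and one abandoned bag (class IDs matching the dataset YAML below; the coordinates are made-up normalized values) would look like:

# frame_000123.txt -- one row per object: class x_center y_center width height
0 0.512 0.634 0.108 0.421
4 0.703 0.811 0.064 0.092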
path: /data/cctv-events
train: images/train
val: images/val
test: images/test

names:
  0: person
  1: weapon
  2: fire
  3: smoke
  4: abandoned_bag
Include nighttime/IR footage. Add motion blur augmentation. Use frames from actual deployment cameras for domain adaptation. Balance classes with oversampling or focal loss.
from ultralytics import YOLO

# Start from pretrained weights (transfer learning)
model = YOLO("yolo11s.pt")

# Fine-tune on custom dataset
results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    patience=20,  # Early stopping
    augment=True,
    mosaic=1.0,
    mixup=0.1,
    device="0",  # GPU index
    project="cctv-events",
    name="yolo11s-custom"
)

# Validate
metrics = model.val()
print(f"mAP50-95: {metrics.box.map}")

# Export for deployment
model.export(format="onnx", dynamic=True)
model.export(format="engine", half=True)  # TensorRT
# ONNX (cross-platform)
model.export(format="onnx")

# TensorRT (NVIDIA GPUs / Jetson)
model.export(format="engine", half=True)

# OpenVINO (Intel CPUs/VPUs)
model.export(format="openvino")

# CoreML (Apple devices)
model.export(format="coreml")

# TFLite (Coral / mobile)
model.export(format="tflite", int8=True)
Cloud inference adds 50-200ms of network latency. For real-time CCTV, that's unacceptable. Edge deployment runs YOLO directly on cameras or local hardware — no internet required, no data leaves the premises.
The gold standard for edge AI. Jetson Orin Nano: 40 TOPS, runs YOLO11s at 60+ FPS with TensorRT FP16. Orin NX pushes 100+ FPS. Full CUDA support, 15W power envelope.
Export YOLO to OpenVINO IR format for Intel CPUs, iGPUs, and VPUs (Movidius). The Neural Compute Stick 2 runs YOLO11n at 15-20 FPS. Ideal for retrofit installations.
Edge TPU runs quantized TFLite models at 4 TOPS. YOLO11n (INT8) achieves 25+ FPS. USB Accelerator plugs into any Linux box. Ultra-low power for battery-powered deployments.
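Whichever target you pick, the exported weights load back through the same Ultralytics API. A minimal sketch, assuming a TensorRT engine exported for a Jetson (the file name is illustrative):

from ultralytics import YOLO

# Load the exported engine produced by model.export(format="engine")
model = YOLO("yolo11s.engine")

# Inference uses the same call as the PyTorch checkpoint
results = model("cctv_frame.jpg", conf=0.5)
print(results[0].boxes.xyxy)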
Quantization: FP32 → FP16 (2x speedup, ~0.5% mAP loss) → INT8 (4x speedup, ~1-2% mAP loss). Pruning: Remove redundant weights. Knowledge distillation: Train small model to mimic large one.
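In Ultralytics terms, those precision choices map onto export flags. A minimal sketch, assuming the fine-tuned weights were saved by the training run above; INT8 export calibrates on the dataset passed via data:

from ultralytics import YOLO

model = YOLO("cctv-events/yolo11s-custom/weights/best.pt")

# FP16: roughly 2x faster on GPU with minimal accuracy loss
model.export(format="engine", half=True)

# INT8: larger speedup, slightly more mAP loss; needs calibration images
model.export(format="engine", int8=True, data="dataset.yaml")
model.export(format="tflite", int8=True, data="dataset.yaml")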
Real numbers. COCO val2017, 640px input, NVIDIA T4 GPU with TensorRT FP16. mAP50-95 and inference speed compared across YOLO generations.
YOLO11 achieves higher mAP with fewer parameters and lower FLOPs than YOLOv8. The nano variant runs at 1.5ms per frame — that's 666 FPS on a single T4 GPU. Even the largest model (11x) runs under 12ms, enabling real-time multi-camera processing.
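To reproduce numbers like these on your own hardware, Ultralytics includes a benchmark utility that exports the model to each supported format and measures accuracy and latency. The dataset and device arguments here are illustrative:

from ultralytics.utils.benchmarks import benchmark

# Benchmark YOLO11n across export formats on GPU 0 at FP16
benchmark(model="yolo11n.pt", data="coco8.yaml", imgsz=640, half=True, device=0)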
YOLO isn't a research toy. It's deployed at scale across industries, processing millions of frames per day in production environments.
Singapore's Safe City initiative deploys YOLO across 90,000+ cameras for crowd monitoring, traffic incident detection, and public safety alerts. Custom models detect fights, jaywalking, and PMD violations. Response times reduced by 40%.
Major retailers use YOLO-based systems to detect shoplifting behaviors: concealment gestures, cart abandonment patterns, and self-checkout fraud. Shrinkage reduced by 15-30% in pilot deployments. Runs on in-store edge servers.
Amazon, DHL, and logistics companies deploy YOLO for PPE compliance (hard hats, vests), forklift-pedestrian collision avoidance, and zone violation detection. OSHA incident rates dropped 25% in equipped facilities.
Departments of Transportation use YOLO for real-time traffic flow analysis, wrong-way driver detection, accident identification, and automated vehicle counting. Deployed on Jetson Orin units mounted alongside existing cameras.
From zero to real-time detection. No PhD required.
pip install ultralytics
yolo detect predict \
    model=yolo11n.pt \
    source="https://youtu.be/example" \
    show=True conf=0.5
# Webcam
yolo detect predict model=yolo11s.pt source=0

# RTSP stream
yolo detect predict model=yolo11s.pt \
    source="rtsp://192.168.1.64:554/stream"
yolo track model=yolo11s.pt \
    source="surveillance.mp4" \
    tracker=bytetrack.yaml show=True
from ultralytics import YOLO
import cv2

# Initialize the detector (use custom-trained weights for classes like
# "fire" or "weapon"; the stock COCO model does not include them)
model = YOLO("yolo11s.pt")

# Open RTSP stream
cap = cv2.VideoCapture(
    "rtsp://admin:pass@cam1/stream"
)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Detect + Track
    results = model.track(
        frame,
        persist=True,
        conf=0.5,
        tracker="bytetrack.yaml"
    )

    # Draw results
    annotated = results[0].plot()

    # Check for events
    for box in results[0].boxes:
        cls = model.names[int(box.cls)]
        if cls in ["fire", "weapon"]:
            send_alert(cls, frame)  # user-defined alerting hook

    cv2.imshow("CCTV", annotated)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()