Real-Time Computer Vision

Seeing Everything. Missing Nothing.

YOLO transforms passive CCTV footage into an intelligent detection system — identifying fights, intrusions, falls, and anomalies in real time, at the edge, with millisecond latency.

1000+ FPS on GPU
54.7 mAP (YOLO11x)
1.5ms inference latency (YOLO11n, T4)

What is YOLO?

In 2015, Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi published a paper that changed computer vision forever. Before YOLO, object detection was a two-stage process: first propose regions (R-CNN, Fast R-CNN, Faster R-CNN), then classify each one. Slow. Redundant. Impractical for real time.

YOLO reframed detection as a single regression problem. One neural network pass. The entire image. All bounding boxes and class probabilities predicted simultaneously.

The input image is divided into an S×S grid. Each cell predicts B bounding boxes with confidence scores and C class probabilities. Non-maximum suppression filters overlapping detections. The result: real-time detection at 45+ FPS, far beyond the two-stage detectors that preceded it.
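The arithmetic is easy to check. For YOLOv1's published VOC configuration (S=7, B=2, C=20), the size of the output tensor falls out directly:

grid_math.py
# YOLOv1 output tensor for the published VOC configuration
S, B, C = 7, 2, 20               # grid size, boxes per cell, class count
per_cell = B * 5 + C             # each box predicts x, y, w, h, confidence
print(f"{S}x{S}x{per_cell}")     # 7x7x30
print(S * S * per_cell)          # 1470 values per image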

This single insight — you only look once — spawned an entire family of models that now power everything from autonomous driving to surveillance to medical imaging.

YOLO Architecture (Simplified): input 448×448 → 24 conv layers + 2 FC → S×S grid with bounding boxes and class probabilities. Single forward pass → all detections. vs. R-CNN: ~2000 region proposals, each through its own CNN forward pass, ≈47s/image. YOLO: 1 pass = 22ms/image (45 FPS).

A Decade of YOLO

From a grad student's insight to the most deployed object detection model on earth. Each version pushed the boundary of what's possible in real-time perception.

2015

YOLOv1 — The Original

Joseph Redmon, University of Washington. Single-pass detection as regression. 45 FPS on Titan X. 24 conv layers + 2 FC. 63.4 mAP on VOC. Changed everything.

2016

YOLOv2 / YOLO9000

Batch normalization, anchor boxes, Darknet-19 backbone, multi-scale training. Detects 9000+ categories via hierarchical WordTree joint training on ImageNet + COCO.

2018

YOLOv3 — Darknet-53

Feature Pyramid Network (FPN)-style multi-scale predictions at 3 scales. Darknet-53 backbone (53 conv layers with residual connections). 57.9 AP50 on COCO (33.0 mAP50-95). The workhorse for years.

2020

YOLOv4 — Bag of Freebies

Alexey Bochkovskiy takes over. CSPDarknet53 backbone, SPP, PANet. Introduced "bag of freebies" (data aug, CIoU loss) and "bag of specials" (Mish activation, SAM). 43.5 AP on COCO at 65 FPS.

2020

YOLOv5 — Ultralytics

Glenn Jocher's PyTorch implementation. No paper, just code that works. AutoAnchor, mosaic augmentation, model scaling (n/s/m/l/x). Became the most-used YOLO in production.

2022

YOLOv6 — Meituan

EfficientRep backbone, Rep-PAN neck. Hardware-friendly reparameterization. Optimized for industrial deployment. Up to 52.5 mAP (large); the nano variant reaches 1234 FPS (T4, FP16).

2022

YOLOv7 — E-ELAN

Extended efficient layer aggregation. Trainable bag-of-freebies. Model re-parameterization. 56.8 mAP at 30 FPS (V100). State of the art at the time.

2023

YOLOv8 — Anchor-Free

Ultralytics' next gen. Anchor-free detection head. C2f modules. Decoupled head for classification and regression. Supports detection, segmentation, pose, OBB, classification. 50.2 mAP (medium).

2024

YOLOv9 — PGI + GELAN

Programmable Gradient Information (PGI) prevents information loss in deep networks. Generalized ELAN (GELAN) architecture. 55.6 mAP with efficient parameter use.

2024

YOLOv10 — NMS-Free

Tsinghua University. Consistent dual assignments for NMS-free training. End-to-end latency optimization. Quantization-aware design for edge deployment.

2024

YOLO11 — Hybrid CNN-Transformer

Ultralytics. Optimized CSP blocks, reduced FLOPs, higher mAP. 54.7 mAP (x-large, 11.3ms on T4); the nano variant runs at 1.5ms. Fewer parameters than v8 with better accuracy. The current state of the art.

2024

YOLO-World — Open Vocabulary

Open-vocabulary detection with text prompts. Detect any object described in natural language without retraining. Vision-language model integration. Zero-shot detection.

Why Traditional Surveillance Fails

The average security guard monitors 16 screens. Studies show attention degrades after just 20 minutes. After 12 hours of footage review, critical events are missed at rates exceeding 95%.

A mid-size city generates petabytes of CCTV footage daily. Most of it is never watched. When a crime occurs, investigators scrub through hours of recordings manually — often days after the event.

Motion detection, the legacy approach, triggers on every shadow, swaying tree, and passing car. Alert fatigue sets in fast. Operators learn to ignore alarms — defeating the purpose entirely.

What's needed isn't more cameras. It's intelligent perception — systems that understand what is happening, not just that something moved. YOLO provides exactly this: semantic understanding of scenes at real-time speed.

👁️

Human Attention Limit

A trained operator's effective monitoring drops below 50% after 22 minutes of continuous viewing. In a 12-hour shift, most events are effectively invisible.

📊

Data Overload

London alone has 691,000+ CCTV cameras. At 30 FPS, that's 1.79 trillion frames per day. No human workforce can process this. AI must.

⚡

Speed Requirement

A fight escalates in 3-5 seconds. A perimeter breach gives you 10 seconds to respond. Detection must be real-time — not "we found it in yesterday's footage."

Event Detection Pipeline

From raw video stream to actionable alert in under 50 milliseconds. Here's how every frame flows through the system.

📹 Camera Feed (RTSP/ONVIF) → 🖼️ Frame Extract (decode + resize) → ⚙️ Preprocess (normalize/pad) → YOLO Inference (~1.5ms on T4) → 🔍 Post-Process (NMS + tracking) → 🧠 Event Classify (rules + ML) → 🚨 Alert (webhook/SMS). Total end-to-end: ~20-50ms per frame.

Events Detected

YOLO-based systems don't just detect objects — they detect situations. By combining object detection with tracking, pose estimation, and temporal analysis, these systems identify critical events.

🥊

Fight / Violence

Detect aggressive body poses and rapid person-to-person proximity changes. Pose estimation + velocity vectors identify altercations before they escalate.

🚧

Intrusion / Perimeter Breach

Define virtual zones and tripwires. YOLO detects persons entering restricted areas with sub-second latency. Integrates with access control systems.

🧓

Fall Detection

Critical for elderly care facilities. Pose estimation tracks body keypoints: when a person transitions from upright to horizontal with sudden velocity, an alert fires. A minimal code sketch follows this grid.

🎒

Abandoned Object

Track objects across frames. When a bag, package, or item remains stationary without an owner for a configured duration, flag it as potentially suspicious.

👥

Crowd Density / Stampede Risk

Count persons per zone in real time. When density exceeds thresholds or flow patterns become chaotic, trigger crowd management alerts before crush events.

🚗

Vehicle Incidents

Detect collisions, wrong-way driving, sudden stops, and erratic lane changes. Velocity estimation from object tracking enables traffic anomaly detection.

🔥

Fire & Smoke

Custom-trained YOLO models detect fire and smoke with high accuracy. Early detection in warehouses, forests, and industrial sites saves lives and property.

🚶

Loitering

Track person dwell time within defined zones. When an individual remains stationary or circulates in a small area beyond a time threshold, raise a loitering alert.

🛡️

PPE Compliance

Detect hard hats, safety vests, goggles, and masks on construction sites. Ensure workplace safety compliance in real time without manual inspection.
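To make the fall-detection logic concrete, here is a minimal sketch using the Ultralytics pose models (yolo11s-pose.pt predicts 17 COCO keypoints per person). The 60° threshold is illustrative; a production system would also check the transition velocity described above and the keypoint confidences:

fall_sketch.py
import numpy as np
from ultralytics import YOLO

# Pose model: COCO keypoints (indices 5/6 = shoulders, 11/12 = hips)
model = YOLO("yolo11s-pose.pt")

def torso_angle(kpts):
    """Angle of the shoulder-to-hip vector from vertical, in degrees."""
    shoulders = kpts[[5, 6]].mean(axis=0)
    hips = kpts[[11, 12]].mean(axis=0)
    dx, dy = hips - shoulders
    return abs(np.degrees(np.arctan2(dx, dy)))  # 0 = upright, 90 = flat

for r in model("cctv_footage.mp4", stream=True):
    if r.keypoints is None:
        continue
    for person in r.keypoints.xy.cpu().numpy():
        if torso_angle(person) > 60:  # illustrative threshold
            print("Possible fall detected")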

The Ultralytics Library

Ultralytics makes YOLO absurdly easy. A unified Python API for detection, segmentation, tracking, pose estimation, and classification. Three lines to inference. Ten to train.

Basic Detection

detect.py
from ultralytics import YOLO

# Load a pretrained model
model = YOLO("yolo11n.pt")

# Run inference on an image
results = model("cctv_frame.jpg")

# Process results
for r in results:
    boxes = r.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        label = model.names[cls]
        print(f"{label}: {conf:.2f}")

Real-Time Video Stream

stream.py
from ultralytics import YOLO

model = YOLO("yolo11s.pt")

# Stream from RTSP camera
results = model(
    "rtsp://admin:pass@192.168.1.64/stream1",
    stream=True,
    show=True,
    conf=0.5
)

for r in results:
    # Each r contains detections for one frame
    if len(r.boxes) > 0:
        process_detections(r)  # your own handler

Object Tracking (ByteTrack / BoTSORT)

track.py
from ultralytics import YOLO

model = YOLO("yolo11m.pt")

# Track objects across frames
results = model.track(
    source="cctv_footage.mp4",
    tracker="bytetrack.yaml",
    persist=True,
    show=True
)

# Each detection now has a unique ID
for r in results:
    if r.boxes.id is not None:
        ids = r.boxes.id.int().tolist()
        print(f"Tracking {len(ids)} objects")

Zone-Based Event Detection

zone_detect.py
from ultralytics import YOLO
from shapely.geometry import Point, Polygon

model = YOLO("yolo11s.pt")

# Define restricted zone (polygon coords in pixels)
restricted = Polygon([
    (100, 200), (400, 200),
    (400, 500), (100, 500)
])

for r in model.track("cam1.mp4", stream=True):
    for box in r.boxes:
        cx, cy = box.xywh[0][:2].tolist()  # box center
        if restricted.contains(Point(cx, cy)):
            trigger_alert("INTRUSION", box)  # your own alert handler

Fine-Tuning for Custom Events

Pre-trained YOLO models detect 80 COCO classes. But CCTV event detection requires custom classes: weapons, smoke, PPE, specific vehicle types. Transfer learning makes this practical with as few as 500 labeled images.

1. Prepare Your Dataset

Annotate images using Roboflow or CVAT. Export in YOLO format: one .txt per image with class x_center y_center width height (normalized 0-1). Split 80/10/10 train/val/test.
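For example, a label file for a frame containing one person (class 0) and one weapon (class 1) might look like this, with coordinates purely illustrative:

labels/train/frame_0001.txt
0 0.412 0.618 0.085 0.240
1 0.455 0.570 0.040 0.065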

2. Dataset YAML

dataset.yaml
path: /data/cctv-events
train: images/train
val: images/val
test: images/test

names:
  0: person
  1: weapon
  2: fire
  3: smoke
  4: abandoned_bag

3. Tips for CCTV Data

Include nighttime/IR footage. Add motion blur augmentation. Use frames from actual deployment cameras for domain adaptation. Balance classes with oversampling or focal loss.
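Several of these tips map directly onto Ultralytics training arguments. A sketch with illustrative starting values (not tuned); folded into the full training call below, these would simply be extra keyword arguments:

augmentation.py
from ultralytics import YOLO

model = YOLO("yolo11s.pt")
model.train(
    data="dataset.yaml",
    epochs=100,
    hsv_v=0.6,    # strong brightness jitter for night/IR robustness
    degrees=5.0,  # slight rotation; CCTV mounts are rarely perfectly level
    scale=0.5,    # scale jitter for near/far subjects
    fliplr=0.5,   # horizontal flips
)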

Training Script

train.py
from ultralytics import YOLO

# Start from pretrained weights (transfer learning)
model = YOLO("yolo11s.pt")

# Fine-tune on custom dataset
results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    patience=20,        # Early stopping
    mosaic=1.0,         # Mosaic augmentation
    mixup=0.1,          # MixUp augmentation
    device="0",         # GPU index
    project="cctv-events",
    name="yolo11s-custom"
)

# Validate
metrics = model.val()
print(f"mAP50-95: {metrics.box.map}")

# Export for deployment
model.export(format="onnx", dynamic=True)
model.export(format="engine", half=True)  # TensorRT

Export Formats

export.py
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # or your fine-tuned weights

# ONNX (cross-platform)
model.export(format="onnx")

# TensorRT (NVIDIA GPUs / Jetson)
model.export(format="engine", half=True)

# OpenVINO (Intel CPUs/VPUs)
model.export(format="openvino")

# CoreML (Apple devices)
model.export(format="coreml")

# TFLite (Coral / mobile)
model.export(format="tflite", int8=True)
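Exported models load back through the same API, so the inference code doesn't change (assuming the ONNX file produced above):

run_exported.py
from ultralytics import YOLO

# Ultralytics runs inference directly on exported formats
onnx_model = YOLO("yolo11s.onnx")
results = onnx_model("cctv_frame.jpg")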

Deployment at the Edge

Cloud inference adds 50-200ms of network latency. For real-time CCTV, that's unacceptable. Edge deployment runs YOLO directly on cameras or local hardware — no internet required, no data leaves the premises.

🟢

NVIDIA Jetson Orin

The gold standard for edge AI. Jetson Orin Nano: 40 TOPS, runs YOLO11s at 60+ FPS with TensorRT FP16. Orin NX pushes 100+ FPS. Full CUDA support, 15W power envelope.

🔵

Intel OpenVINO

Export YOLO to OpenVINO IR format for Intel CPUs, iGPUs, and VPUs (Movidius). The Neural Compute Stick 2 runs YOLO11n at 15-20 FPS. Ideal for retrofit installations.

🪸

Google Coral

Edge TPU runs quantized TFLite models at 4 TOPS. YOLO11n (INT8) achieves 25+ FPS. USB Accelerator plugs into any Linux box. Ultra-low power for battery-powered deployments.

Model Optimization

Quantization: FP32 → FP16 (2x speedup, ~0.5% mAP loss) → INT8 (4x speedup, ~1-2% mAP loss). Pruning: remove redundant weights. Knowledge distillation: train a small model to mimic a large one.
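In the Ultralytics API, both quantization steps are export-time flags; INT8 calibrates on representative images passed via data (a sketch, reusing the dataset.yaml from the training section):

quantize.py
from ultralytics import YOLO

model = YOLO("yolo11s.pt")

# FP16: roughly 2x faster on tensor-core GPUs
model.export(format="engine", half=True)

# INT8: calibrates on the images referenced by the dataset YAML
model.export(format="engine", int8=True, data="dataset.yaml")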

Benchmarks

Real numbers. COCO val2017, 640px input, NVIDIA T4 GPU with TensorRT FP16. mAP50-95 and inference speed compared across YOLO generations.

mAP50-95 (higher is better)

YOLOv5s   37.4
YOLOv7    51.4
YOLOv8s   44.9
YOLOv8m   50.2
YOLOv9c   53.0
YOLO11s   47.0
YOLO11m   51.5
YOLO11x   54.7

Inference Speed — T4 TensorRT (ms, lower is better)

YOLO11n   1.5ms
YOLO11s   2.5ms
YOLO11m   4.7ms
YOLO11x   11.3ms
YOLOv8n   1.47ms
YOLOv8s   2.66ms
YOLOv8m   5.86ms
YOLOv8x   14.37ms

Key Takeaway

YOLO11 achieves higher mAP with fewer parameters and lower FLOPs than YOLOv8. The nano variant runs at 1.5ms per frame — that's 666 FPS on a single T4 GPU. Even the largest model (11x) runs under 12ms, enabling real-time multi-camera processing.
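Those latencies translate directly into per-GPU camera capacity. A rough budget, ignoring decode and post-processing overhead (which the ~20-50ms end-to-end figure above accounts for):

camera_budget.py
# Per-model throughput on one T4 (latencies from the table above)
latency_ms = {"YOLO11n": 1.5, "YOLO11s": 2.5, "YOLO11m": 4.7, "YOLO11x": 11.3}

for name, ms in latency_ms.items():
    fps = 1000 / ms
    # e.g. how many cameras can share one GPU at 10 FPS per camera
    print(f"{name}: {fps:.0f} FPS -> ~{int(fps // 10)} cameras at 10 FPS")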

Case Studies

YOLO isn't a research toy. It's deployed at scale across industries, processing millions of frames per day in production environments.

🏙️ Smart City — Singapore

Singapore's Safe City initiative deploys YOLO across 90,000+ cameras for crowd monitoring, traffic incident detection, and public safety alerts. Custom models detect fights, jaywalking, and PMD violations. Response times reduced by 40%.

🛒 Retail Loss Prevention

Major retailers use YOLO-based systems to detect shoplifting behaviors: concealment gestures, cart abandonment patterns, and self-checkout fraud. Shrinkage reduced by 15-30% in pilot deployments. Runs on in-store edge servers.

🏭 Warehouse Safety

Amazon, DHL, and logistics companies deploy YOLO for PPE compliance (hard hats, vests), forklift-pedestrian collision avoidance, and zone violation detection. OSHA incident rates dropped 25% in equipped facilities.

🚦 Traffic Monitoring

Departments of Transportation use YOLO for real-time traffic flow analysis, wrong-way driver detection, accident identification, and automated vehicle counting. Deployed on Jetson Orin units mounted alongside existing cameras.

Get Started in 5 Minutes

From zero to real-time detection. No PhD required.

1

Install Ultralytics

terminal
pip install ultralytics
2

Run Your First Detection

terminal
yolo detect predict \
  model=yolo11n.pt \
  source="https://youtu.be/example" \
  show=True conf=0.5
3

Stream from Your Camera

terminal
# Webcam
yolo detect predict model=yolo11s.pt source=0

# RTSP stream
yolo detect predict model=yolo11s.pt \
  source="rtsp://192.168.1.64:554/stream"
4

Track Objects

terminal
yolo track model=yolo11s.pt \
  source="surveillance.mp4" \
  tracker=bytetrack.yaml show=True
5

Full Python Pipeline

cctv_pipeline.py
from ultralytics import YOLO
import cv2

# Initialize
model = YOLO("yolo11s.pt")

# Open RTSP stream
cap = cv2.VideoCapture(
    "rtsp://admin:pass@cam1/stream"
)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Detect + Track
    results = model.track(
        frame,
        persist=True,
        conf=0.5,
        tracker="bytetrack.yaml"
    )

    # Draw results
    annotated = results[0].plot()

    # Check for events (fire/weapon come from the custom-trained
    # model above; stock COCO weights don't include them)
    for box in results[0].boxes:
        cls = model.names[int(box.cls)]
        if cls in ["fire", "weapon"]:
            send_alert(cls, frame)  # your own alert handler

    cv2.imshow("CCTV", annotated)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

📚 Resources

Ultralytics Documentation
GitHub Repository
YOLO11 Model Page
Roboflow — Dataset Annotation
Original YOLO Paper (2015)