YOLO transforms passive CCTV footage into an intelligent detection system — identifying fights, intrusions, falls, and anomalies in real time, at the edge, with millisecond latency.
In 2015, Joseph Redmon, Ali Farhadi, and their co-authors published a paper that changed computer vision forever. Before YOLO, object detection was a two-stage process: first propose regions (R-CNN, Fast R-CNN, Faster R-CNN), then classify each one. Slow. Redundant. Impractical for real-time use.
YOLO reframed detection as a single regression problem. One neural network pass. The entire image. All bounding boxes and class probabilities predicted simultaneously.
The input image is divided into an S×S grid. Each cell predicts B bounding boxes with confidence scores and C class probabilities. Non-maximum suppression filters overlapping detections. The result: real-time detection at 45+ FPS — orders of magnitude faster than anything before it.
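To make that last filtering step concrete, here is a minimal greedy non-maximum suppression sketch in plain NumPy. The IoU threshold and helper names are illustrative, not taken from any particular YOLO release; production implementations are vectorized and class-aware.

import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes in [x1, y1, x2, y2] format
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    # scores = box confidence x class probability, one per predicted box
    order = list(np.argsort(scores)[::-1])  # highest score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep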
This single insight — you only look once — spawned an entire family of models that now power everything from autonomous driving to surveillance to medical imaging.
From a grad student's insight to one of the most widely deployed object detection families on earth. Each version pushed the boundary of what's possible in real-time perception.
Joseph Redmon, University of Washington. Single-pass detection as regression. 45 FPS on Titan X. 24 conv layers + 2 FC. 63.4 mAP on VOC. Changed everything.
Batch normalization, anchor boxes, Darknet-19 backbone, multi-scale training. Detects 9000+ categories via hierarchical WordTree joint training on ImageNet + COCO.
Feature Pyramid Network (FPN) for multi-scale predictions at 3 scales. Darknet-53 backbone (53 conv layers with residual connections). 57.9 AP50 on COCO. The workhorse for years.
Alexey Bochkovskiy takes over. CSPDarknet53 backbone, SPP, PANet. Introduced "bag of freebies" (data aug, CIoU loss) and "bag of specials" (Mish activation, SAM). 43.5 AP on COCO at 65 FPS.
Glenn Jocher's PyTorch implementation. No paper, just code that works. AutoAnchor, mosaic augmentation, model scaling (n/s/m/l/x). Became the most-used YOLO in production.
EfficientRep backbone, Rep-PAN neck. Hardware-friendly reparameterization. Optimized for industrial deployment. 52.5 mAP (large variant); the nano variant reaches 1234 FPS (T4, FP16).
Extended efficient layer aggregation. Trainable bag-of-freebies. Model re-parameterization. 56.8 mAP at 30 FPS (V100). State of the art at the time.
Ultralytics' next gen. Anchor-free detection head. C2f modules. Decoupled head for classification and regression. Supports detection, segmentation, pose, OBB, classification. 50.2 mAP (medium).
Programmable Gradient Information (PGI) prevents information loss in deep networks. Generalized ELAN (GELAN) architecture. 55.6 mAP with efficient parameter use.
Tsinghua University. Consistent dual assignments for NMS-free training. End-to-end latency optimization. Quantization-aware design for edge deployment.
Ultralytics. Optimized CSP blocks, reduced FLOPs, higher mAP. 54.7 mAP (x-large); the nano variant runs in ~1.5 ms on a T4. Fewer parameters than v8 with better accuracy. The current state of the art.
Open-vocabulary detection with text prompts. Detect any object described in natural language without retraining. Vision-language model integration. Zero-shot detection.
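In the Ultralytics API this kind of prompt-driven, open-vocabulary detection is exposed through YOLO-World. A minimal sketch, with an illustrative prompt list (weights download on first use):

from ultralytics import YOLOWorld

# Open-vocabulary model: classes come from text prompts, not a fixed label set
model = YOLOWorld("yolov8s-world.pt")

# Describe the objects of interest in natural language
model.set_classes(["person climbing a fence", "abandoned suitcase", "knife"])

results = model("cctv_frame.jpg", conf=0.25)
results[0].show()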
The average security guard monitors 16 screens at once. Studies show attention degrades after just 20 minutes of continuous viewing, and under sustained monitoring critical events are missed at rates approaching 95%.
A mid-size city generates petabytes of CCTV footage daily. Most of it is never watched. When a crime occurs, investigators scrub through hours of recordings manually — often days after the event.
Motion detection, the legacy approach, triggers on every shadow, swaying tree, and passing car. Alert fatigue sets in fast. Operators learn to ignore alarms — defeating the purpose entirely.
What's needed isn't more cameras. It's intelligent perception — systems that understand what is happening, not just that something moved. YOLO provides exactly this: semantic understanding of scenes at real-time speed.
A trained operator's effective monitoring drops below 50% after 22 minutes of continuous viewing. In a 12-hour shift, most events are effectively invisible.
London alone has 691,000+ CCTV cameras. At 30 FPS, that's 1.79 trillion frames per day. No human workforce can process this. AI must.
A fight escalates in 3-5 seconds. A perimeter breach gives you 10 seconds to respond. Detection must be real-time — not "we found it in yesterday's footage."
From raw video stream to actionable alert in under 50 milliseconds. Here's how every frame flows through the system.
YOLO-based systems don't just detect objects — they detect situations. By combining object detection with tracking, pose estimation, and temporal analysis, these systems identify critical events.
Detect aggressive body poses and rapid person-to-person proximity changes. Pose estimation + velocity vectors identify altercations before they escalate.
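A rough sketch of that proximity-plus-velocity heuristic, built on Ultralytics tracking. The thresholds, file names, and the simple two-frame velocity estimate are illustrative; a production system would also use pose features and temporal smoothing.

from collections import defaultdict
from itertools import combinations
from ultralytics import YOLO

model = YOLO("yolo11s.pt")
history = defaultdict(list)  # track_id -> recent (cx, cy) centers

PROXIMITY_PX = 80      # "close together" threshold, tune per camera
CLOSING_SPEED_PX = 25  # pixels per frame of approach speed

# Track only persons (COCO class 0)
for r in model.track("cam1.mp4", stream=True, persist=True, classes=[0]):
    if r.boxes.id is None:
        continue
    ids = r.boxes.id.int().tolist()
    for tid, (cx, cy) in zip(ids, r.boxes.xywh[:, :2].tolist()):
        history[tid].append((cx, cy))

    # Flag pairs of people who are close and closing in fast
    for a, b in combinations(ids, 2):
        if len(history[a]) < 2 or len(history[b]) < 2:
            continue
        (ax, ay), (bx, by) = history[a][-1], history[b][-1]
        (pax, pay), (pbx, pby) = history[a][-2], history[b][-2]
        dist_now = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
        dist_prev = ((pax - pbx) ** 2 + (pay - pby) ** 2) ** 0.5
        if dist_now < PROXIMITY_PX and dist_prev - dist_now > CLOSING_SPEED_PX:
            print(f"Possible altercation between tracks {a} and {b}")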
Define virtual zones and tripwires. YOLO detects persons entering restricted areas with sub-second latency. Integrates with access control systems.
Critical for elderly care facilities. Pose estimation tracks body keypoints — when a person transitions from upright to horizontal with sudden velocity, an alert fires.
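One simple way to encode that transition with the Ultralytics pose models: track each person's hip keypoints and fire when they drop sharply between frames. The model choice, source name, and threshold below are illustrative; keypoint-confidence filtering and multi-frame smoothing are omitted for brevity.

from collections import defaultdict
from ultralytics import YOLO

# Pose model returns 17 COCO keypoints per person (indices 11/12 = left/right hip)
model = YOLO("yolo11n-pose.pt")
prev_hip_y = defaultdict(lambda: None)  # track_id -> last hip height (pixels)

DROP_PX_PER_FRAME = 40  # illustrative; depends on camera height and FPS

for r in model.track("ward_cam.mp4", stream=True, persist=True):
    if r.boxes.id is None:
        continue
    ids = r.boxes.id.int().tolist()
    for tid, kp in zip(ids, r.keypoints.xy):
        if kp[11][1] == 0 and kp[12][1] == 0:
            continue  # hips not detected in this frame
        hip_y = float((kp[11][1] + kp[12][1]) / 2)  # image y grows downward
        last = prev_hip_y[tid]
        # A sudden large downward jump of the hips suggests a fall
        if last is not None and hip_y - last > DROP_PX_PER_FRAME:
            print(f"Possible fall: track {tid}")
        prev_hip_y[tid] = hip_y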
Track objects across frames. When a bag, package, or item remains stationary without an owner for a configured duration, flag it as potentially suspicious.
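A minimal dwell-timer sketch for that logic, tracking only bag-like COCO classes (IDs 24/26/28 = backpack/handbag/suitcase). The thresholds and wall-clock timing are illustrative, and owner association is omitted for brevity.

import time
from ultralytics import YOLO

model = YOLO("yolo11s.pt")
first_seen = {}  # track_id -> (timestamp, cx, cy) where the object settled

ABANDON_SECONDS = 120   # how long an unattended object may sit
MOVE_TOLERANCE_PX = 30  # small jitter still counts as "stationary"

# Track backpacks, handbags, and suitcases only
for r in model.track("concourse_cam.mp4", stream=True, persist=True,
                     classes=[24, 26, 28]):
    if r.boxes.id is None:
        continue
    now = time.time()  # for recorded footage, use frame_index / fps instead
    for tid, (cx, cy) in zip(r.boxes.id.int().tolist(),
                             r.boxes.xywh[:, :2].tolist()):
        if tid not in first_seen:
            first_seen[tid] = (now, cx, cy)
            continue
        t0, x0, y0 = first_seen[tid]
        moved = ((cx - x0) ** 2 + (cy - y0) ** 2) ** 0.5
        if moved > MOVE_TOLERANCE_PX:
            first_seen[tid] = (now, cx, cy)  # object moved: restart the timer
        elif now - t0 > ABANDON_SECONDS:
            print(f"Possible abandoned object: track {tid}")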
Count persons per zone in real time. When density exceeds thresholds or flow patterns become chaotic, trigger crowd management alerts before crush events.
Detect collisions, wrong-way driving, sudden stops, and erratic lane changes. Velocity estimation from object tracking enables traffic anomaly detection.
Custom-trained YOLO models detect fire and smoke with high accuracy. Early detection in warehouses, forests, and industrial sites saves lives and property.
Track person dwell time within defined zones. When an individual remains stationary or circulates in a small area beyond a time threshold, raise a loitering alert.
Detect hard hats, safety vests, goggles, and masks on construction sites. Ensure workplace safety compliance in real time without manual inspection.
Ultralytics makes YOLO absurdly easy. A unified Python API for detection, segmentation, tracking, pose estimation, and classification. Three lines to inference. Ten to train.
from ultralytics import YOLO

# Load a pretrained model
model = YOLO("yolo11n.pt")

# Run inference on an image
results = model("cctv_frame.jpg")

# Process results
for r in results:
    boxes = r.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        label = model.names[cls]
        print(f"{label}: {conf:.2f}")
from ultralytics import YOLO model = YOLO("yolo11s.pt") # Stream from RTSP camera results = model( "rtsp://admin:pass@192.168.1.64/stream1", stream=True, show=True, conf=0.5 ) for r in results: # Each r contains detections for one frame if len(r.boxes) > 0: process_detections(r)
from ultralytics import YOLO model = YOLO("yolo11m.pt") # Track objects across frames results = model.track( source="cctv_footage.mp4", tracker="bytetrack.yaml", persist=True, show=True ) # Each detection now has a unique ID for r in results: if r.boxes.id is not None: ids = r.boxes.id.int().tolist() print(f"Tracking {len(ids)} objects")
from ultralytics import YOLO
from shapely.geometry import Point, Polygon

model = YOLO("yolo11s.pt")

# Define restricted zone (polygon coords in pixels)
restricted = Polygon([
    (100, 200), (400, 200), (400, 500), (100, 500)
])

for r in model.track("cam1.mp4", stream=True):
    for box in r.boxes:
        cx, cy = box.xywh[0][:2]
        if restricted.contains(Point(float(cx), float(cy))):
            trigger_alert("INTRUSION", box)  # user-defined alerting hook
Pre-trained YOLO models detect 80 COCO classes. But CCTV event detection requires custom classes: weapons, smoke, PPE, specific vehicle types. Transfer learning makes this practical with as few as 500 labeled images.
Annotate images using Roboflow or CVAT. Export in YOLO format: one .txt per image with class x_center y_center width height (normalized 0-1). Split 80/10/10 train/val/test.
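For example, a single label file for a frame containing one person and one abandoned bag (class IDs matching the dataset YAML below; the coordinates are made-up normalized values) would look like:

# frame_000123.txt -- one row per object: class x_center y_center width height
0 0.512 0.634 0.108 0.421
4 0.703 0.811 0.064 0.092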
path: /data/cctv-events
train: images/train
val: images/val
test: images/test

names:
  0: person
  1: weapon
  2: fire
  3: smoke
  4: abandoned_bag
Include nighttime/IR footage. Add motion blur augmentation. Use frames from actual deployment cameras for domain adaptation. Balance classes with oversampling or focal loss.
from ultralytics import YOLO

# Start from pretrained weights (transfer learning)
model = YOLO("yolo11s.pt")

# Fine-tune on custom dataset
results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    patience=20,  # Early stopping
    augment=True,
    mosaic=1.0,
    mixup=0.1,
    device="0",  # GPU index
    project="cctv-events",
    name="yolo11s-custom"
)

# Validate
metrics = model.val()
print(f"mAP50-95: {metrics.box.map}")

# Export for deployment
model.export(format="onnx", dynamic=True)
model.export(format="engine", half=True)  # TensorRT
# ONNX (cross-platform)
model.export(format="onnx")

# TensorRT (NVIDIA GPUs / Jetson)
model.export(format="engine", half=True)

# OpenVINO (Intel CPUs/VPUs)
model.export(format="openvino")

# CoreML (Apple devices)
model.export(format="coreml")

# TFLite (Coral / mobile)
model.export(format="tflite", int8=True)
Cloud inference adds 50-200ms of network latency. For real-time CCTV, that's unacceptable. Edge deployment runs YOLO directly on cameras or local hardware — no internet required, no data leaves the premises.
The gold standard for edge AI. Jetson Orin Nano: 40 TOPS, runs YOLO11s at 60+ FPS with TensorRT FP16. Orin NX pushes 100+ FPS. Full CUDA support, 15W power envelope.
Export YOLO to OpenVINO IR format for Intel CPUs, iGPUs, and VPUs (Movidius). The Neural Compute Stick 2 runs YOLO11n at 15-20 FPS. Ideal for retrofit installations.
Edge TPU runs quantized TFLite models at 4 TOPS. YOLO11n (INT8) achieves 25+ FPS. USB Accelerator plugs into any Linux box. Ultra-low power for battery-powered deployments.
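Whichever target you pick, the exported weights load back through the same Ultralytics API. A minimal sketch, assuming a TensorRT engine exported for a Jetson (the file name is illustrative):

from ultralytics import YOLO

# Load the exported engine produced by model.export(format="engine")
model = YOLO("yolo11s.engine")

# Inference uses the same call as the PyTorch checkpoint
results = model("cctv_frame.jpg", conf=0.5)
print(results[0].boxes.xyxy)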
Quantization: FP32 → FP16 (2x speedup, ~0.5% mAP loss) → INT8 (4x speedup, ~1-2% mAP loss). Pruning: Remove redundant weights. Knowledge distillation: Train small model to mimic large one.
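In Ultralytics terms, those precision choices map onto export flags. A minimal sketch, assuming the fine-tuned weights were saved by the training run above; INT8 export calibrates on the dataset passed via data:

from ultralytics import YOLO

model = YOLO("cctv-events/yolo11s-custom/weights/best.pt")

# FP16: roughly 2x faster on GPU with minimal accuracy loss
model.export(format="engine", half=True)

# INT8: larger speedup, slightly more mAP loss; needs calibration images
model.export(format="engine", int8=True, data="dataset.yaml")
model.export(format="tflite", int8=True, data="dataset.yaml")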
Real numbers. COCO val2017, 640px input, NVIDIA T4 GPU with TensorRT FP16. mAP50-95 and inference speed compared across YOLO generations.
YOLO11 achieves higher mAP with fewer parameters and lower FLOPs than YOLOv8. The nano variant runs at 1.5ms per frame — that's 666 FPS on a single T4 GPU. Even the largest model (11x) runs under 12ms, enabling real-time multi-camera processing.
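To reproduce numbers like these on your own hardware, Ultralytics includes a benchmark utility that exports the model to each supported format and measures accuracy and latency. The dataset and device arguments here are illustrative:

from ultralytics.utils.benchmarks import benchmark

# Benchmark YOLO11n across export formats on GPU 0 at FP16
benchmark(model="yolo11n.pt", data="coco8.yaml", imgsz=640, half=True, device=0)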
YOLO isn't a research toy. It's deployed at scale across industries, processing millions of frames per day in production environments.
Singapore's Safe City initiative deploys YOLO across 90,000+ cameras for crowd monitoring, traffic incident detection, and public safety alerts. Custom models detect fights, jaywalking, and PMD violations. Response times reduced by 40%.
Major retailers use YOLO-based systems to detect shoplifting behaviors: concealment gestures, cart abandonment patterns, and self-checkout fraud. Shrinkage reduced by 15-30% in pilot deployments. Runs on in-store edge servers.
Amazon, DHL, and logistics companies deploy YOLO for PPE compliance (hard hats, vests), forklift-pedestrian collision avoidance, and zone violation detection. OSHA incident rates dropped 25% in equipped facilities.
Departments of Transportation use YOLO for real-time traffic flow analysis, wrong-way driver detection, accident identification, and automated vehicle counting. Deployed on Jetson Orin units mounted alongside existing cameras.
From zero to real-time detection. No PhD required.
pip install ultralytics
yolo detect predict \
    model=yolo11n.pt \
    source="https://youtu.be/example" \
    show=True conf=0.5
# Webcam
yolo detect predict model=yolo11s.pt source=0

# RTSP stream
yolo detect predict model=yolo11s.pt \
    source="rtsp://192.168.1.64:554/stream"
yolo track model=yolo11s.pt \
    source="surveillance.mp4" \
    tracker=bytetrack.yaml show=True
from ultralytics import YOLO
import cv2

# Initialize the detector (use custom-trained weights for classes like
# "fire" or "weapon"; the stock COCO model does not include them)
model = YOLO("yolo11s.pt")

# Open RTSP stream
cap = cv2.VideoCapture(
    "rtsp://admin:pass@cam1/stream"
)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Detect + Track
    results = model.track(
        frame,
        persist=True,
        conf=0.5,
        tracker="bytetrack.yaml"
    )

    # Draw results
    annotated = results[0].plot()

    # Check for events
    for box in results[0].boxes:
        cls = model.names[int(box.cls)]
        if cls in ["fire", "weapon"]:
            send_alert(cls, frame)  # user-defined alerting hook

    cv2.imshow("CCTV", annotated)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()