After playing with coding assistants for a while, I decided to let that theme rest. But that doesn’t mean I have to stop exploring uses for my self-hosted AI.
One thing all of us have these days is an ever-growing library of pictures and videos. There are quite a few software suites where you can label and tag them. Some of them do a good job of auto-tagging pictures and face recognition, but can I do that myself? I wanted to test that idea.
One ready-made solution is deepstack. It is built around its own pre-trained model and can recognize 80 different objects and 365 different scenes according to the documentation. It can also do face detection, and that’s the only part of it you can train easily. There is also some support for training custom object detection, but I haven’t tried that yet.
Deepstack is actually pretty decent and quick, does a generally good job, and would be a pretty good match for a general photo/image library. It only analyzes images, though. I decided to set it up in my Kubernetes cluster. It has its own file-based database for training data, and was pretty easy to set up through my standard pattern:
Storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepstack-datastore-pvc
  namespace: deepstack
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-rwo-local-ssd
  resources:
    requests:
      storage: 10Gi
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepstack
  namespace: deepstack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepstack
  template:
    metadata:
      labels:
        app: deepstack
    spec:
      containers:
        - name: deepstack
          image: deepquestai/deepstack:cpu
          imagePullPolicy: Always
          env:
            - name: MODE
              value: "High" # "High" favours accuracy; "Medium" or "Low" are faster
            - name: VISION-FACE
              value: "True"
            - name: VISION-SCENE
              value: "True"
            - name: VISION-DETECTION
              value: "True"
          ports:
            - containerPort: 5000
          volumeMounts:
            - name: datastore
              mountPath: /datastore
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "8"
              memory: "8Gi"
      volumes:
        - name: datastore
          persistentVolumeClaim:
            claimName: deepstack-datastore-pvc
Service
apiVersion: v1
kind: Service
metadata:
  name: deepstack
  namespace: deepstack
spec:
  selector:
    app: deepstack
  ports:
    - name: http
      port: 5000
      targetPort: 5000
Ingressroute
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: deepstack
  namespace: traefik-external
  annotations:
    kubernetes.io/ingress.class: "traefik-external"
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`deepstack.engen.priv.no`)
      kind: Rule
      services:
        - name: deepstack
          namespace: deepstack
          port: 5000
  tls:
    certResolver: letsencrypt
Load balancer
apiVersion: v1
kind: Service
metadata:
  name: traefik-deepstack
  namespace: traefik-external
  annotations:
    projectcalico.org/ipv6pools: '["loadbalancer-ipv6-pool"]'
    external-dns.alpha.kubernetes.io/hostname: deepstack.engen.priv.no
    external-dns/external: "true"
    external-dns.alpha.kubernetes.io/ttl: "300"
    ipchanger.alpha.kubernetes.io/patch: "true"
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ipFamilyPolicy: SingleStack
  ipFamilies:
    - IPv6
  ports:
    - name: web
      port: 80
    - name: websecure
      port: 443
  selector:
    app: traefik-external
DNS for IPv4
apiVersion: v1
kind: Service
metadata:
  name: deepstack-name
  namespace: deepstack
  annotations:
    external-dns.alpha.kubernetes.io/hostname: deepstack.engen.priv.no
    external-dns/external: "true"
    external-dns.alpha.kubernetes.io/ttl: "300"
    ipchanger.alpha.kubernetes.io/patch: "true"
spec:
  type: ExternalName
  externalName: <my external ip>
With this set up, I can call the deepstack API at https://deepstack.engen.priv.no/
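A quick way to check that everything is wired up is to post a single test image straight to the scene endpoint. A minimal sketch (the test.jpg path is just a placeholder):

import requests

DEEPSTACK_URL = "https://deepstack.engen.priv.no"

# Post one image to the scene endpoint and print the raw JSON reply.
with open("test.jpg", "rb") as f:
    r = requests.post(
        f"{DEEPSTACK_URL}/v1/vision/scene",
        files={"image": f},
        timeout=30,
    )
r.raise_for_status()
print(r.json())  # e.g. {"success": true, "label": "...", "confidence": ...}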
Analyzing a video
To put it to use, I can for example run the following script. It fetches one frame from a video and classifies it with deepstack.
import os
import subprocess
import tempfile
from pathlib import Path

import requests

DEEPSTACK_URL = os.environ.get(
    "DEEPSTACK_URL",
    "https://deepstack.engen.priv.no",
)

# change this to a real path in your PVC
TEST_VIDEO = os.environ.get("TEST_VIDEO", "/media/test-workshop.mp4")


def extract_one_frame(video_path: str, seek_seconds: int = 30) -> str:
    """
    Extract a single JPEG frame from the video at seek_seconds.
    Returns the path to the JPEG file.
    """
    tmpdir = tempfile.mkdtemp()
    frame_path = str(Path(tmpdir) / "frame.jpg")
    cmd = [
        "ffmpeg",
        "-loglevel",
        "error",
        "-ss",
        str(seek_seconds),
        "-i",
        video_path,
        "-frames:v",
        "1",
        "-q:v",
        "2",
        frame_path,
    ]
    subprocess.run(cmd, check=True)
    return frame_path


def call_scene_api(image_path: str):
    with open(image_path, "rb") as f:
        r = requests.post(
            f"{DEEPSTACK_URL}/v1/vision/scene",
            files={"image": ("frame.jpg", f, "image/jpeg")},
            timeout=30,
        )
    r.raise_for_status()
    return r.json()


def main():
    print(f"Using video: {TEST_VIDEO}", flush=True)
    if not os.path.exists(TEST_VIDEO):
        raise FileNotFoundError(TEST_VIDEO)
    frame_path = extract_one_frame(TEST_VIDEO, seek_seconds=30)
    print(f"Extracted frame: {frame_path}", flush=True)
    res = call_scene_api(frame_path)
    print("Scene result:", res, flush=True)


if __name__ == "__main__":
    main()
To classify a whole video, you’d have to analyze a set of frames and do some sort of aggregation of the results. I haven’t created any polished scripts for this so far. You could likely use another model, perhaps hosted on ollama, to analyze the results – something I explore in my next blog post, where I do it all in ollama. I decided to split this journey into two parts (so far): this one about deepstack, and the next, where I experiment with building it myself with ollama.
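That said, a rough sketch of the frame-sampling-and-aggregation idea could look like the following. It reuses extract_one_frame() and call_scene_api() from the script above, and uses a simple majority vote over sampled frames – just one possible strategy; the duration has to come from somewhere else (ffprobe, for instance).

import os
from collections import Counter

# Rough sketch: sample one frame every `step` seconds, classify each frame,
# and let the most common scene label win. Reuses extract_one_frame() and
# call_scene_api() from the script above.
def classify_video(video_path: str, duration_seconds: int, step: int = 30) -> str:
    labels = []
    for t in range(0, duration_seconds, step):
        frame = extract_one_frame(video_path, seek_seconds=t)
        if not os.path.exists(frame):  # a seek past the end produces no frame
            continue
        result = call_scene_api(frame)
        if result.get("success"):
            labels.append(result["label"])
    counts = Counter(labels)
    return counts.most_common(1)[0][0] if counts else "unknown"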
People
Recognizing people is essentially a different API endpoint than scene recognition.
You train it by feeding it images labeled with the person’s name. Training works best when there’s only one face in the image and the face is a fairly prominent part of it. I’ve had some success with cropping down to the actual face from an image where it didn’t detect any faces.
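Something along these lines does the trick; the crop box, file names and userid here are just placeholder values:

import requests
from PIL import Image

DEEPSTACK_URL = "https://deepstack.engen.priv.no"

# Crop a hand-picked box around the face, then register it for a person.
img = Image.open("family-photo.jpg")
face = img.crop((400, 150, 900, 700))  # (left, upper, right, lower), placeholders
face.save("face-crop.jpg", "JPEG")

with open("face-crop.jpg", "rb") as f:
    r = requests.post(
        f"{DEEPSTACK_URL}/v1/vision/face/register",
        files={"image": f},
        data={"userid": "some-person"},
        timeout=30,
    )
print(r.json())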
A complete script
A script (mostly AI-generated too, I must admit) that uses the deepstack tools to do object, scene and face recognition on an image could look like the following. It also writes the tags into the image so that they can be used in a decent image management program.
#!/usr/bin/env python3
import argparse
import subprocess
from pathlib import Path
from typing import Dict, Any

import requests
from PIL import Image, ImageDraw, ImageFont

DEEPSTACK_URL = "https://deepstack.engen.priv.no"  # or http://host:port


def add_keywords_with_exiftool(image_path: Path, keywords):
    """
    Append keywords to XMP-dc:Subject so digiKam sees them as tags.
    keywords: list of strings
    """
    if not keywords:
        return
    args = ["exiftool", "-overwrite_original"]
    for kw in keywords:
        args.append(f"-XMP-dc:Subject+={kw}")
    args.append(str(image_path))
    subprocess.run(args, check=True)


def register_face(image_path: Path, userid: str) -> Dict[str, Any]:
    url = f"{DEEPSTACK_URL}/v1/vision/face/register"
    with image_path.open("rb") as f:
        files = {"image": f}
        data = {"userid": userid}
        resp = requests.post(url, files=files, data=data, verify=False)
    print("[face-register]", resp.status_code, resp.text)
    resp.raise_for_status()
    return resp.json()


def recognize_faces(image_path: Path) -> Dict[str, Any]:
    url = f"{DEEPSTACK_URL}/v1/vision/face/recognize"
    with image_path.open("rb") as f:
        files = {"image": f}
        resp = requests.post(url, files=files, verify=False)
    resp.raise_for_status()
    return resp.json()


def detect_objects(image_path: Path) -> Dict[str, Any]:
    """DeepStack object detection (80 COCO classes)."""
    url = f"{DEEPSTACK_URL}/v1/vision/detection"
    with image_path.open("rb") as f:
        files = {"image": f}
        resp = requests.post(url, files=files, verify=False)
    resp.raise_for_status()
    return resp.json()  # {"success":..., "predictions":[{"label":..., "confidence":..., "x_min":...}]}


def recognize_scene(image_path: Path) -> Dict[str, Any]:
    """DeepStack scene recognition (365 scenes)."""
    url = f"{DEEPSTACK_URL}/v1/vision/scene"
    with image_path.open("rb") as f:
        files = {"image": f}
        resp = requests.post(url, files=files, verify=False)
    resp.raise_for_status()
    return resp.json()  # {"success":..., "label": "...", "confidence": ...}


def draw_results(
    image_path: Path,
    face_resp: Dict[str, Any] | None,
    obj_resp: Dict[str, Any] | None,
    scene_resp: Dict[str, Any] | None,
    out_path: Path,
) -> None:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.load_default()
    except Exception:
        font = None

    # Faces (green boxes)
    if face_resp is not None:
        for face in face_resp.get("predictions", []):
            userid = face.get("userid", "unknown")
            x_min = int(face["x_min"])
            y_min = int(face["y_min"])
            x_max = int(face["x_max"])
            y_max = int(face["y_max"])
            draw.rectangle(((x_min, y_min), (x_max, y_max)), outline="green", width=2)
            label = f"{userid}"
            bbox = draw.textbbox((0, 0), label, font=font)
            text_w = bbox[2] - bbox[0]
            text_h = bbox[3] - bbox[1]
            box_top = max(0, y_min - text_h - 4)
            draw.rectangle(
                ((x_min, box_top), (x_min + text_w + 4, box_top + text_h + 4)),
                fill="green",
            )
            draw.text((x_min + 2, box_top + 2), label, fill="white", font=font)

    # Objects (red boxes)
    if obj_resp is not None:
        for obj in obj_resp.get("predictions", []):
            label = obj.get("label", "obj")
            conf = obj.get("confidence", 0)
            x_min = int(obj["x_min"])
            y_min = int(obj["y_min"])
            x_max = int(obj["x_max"])
            y_max = int(obj["y_max"])
            draw.rectangle(((x_min, y_min), (x_max, y_max)), outline="red", width=2)
            text = f"{label} {conf:.2f}"
            bbox = draw.textbbox((0, 0), text, font=font)
            text_w = bbox[2] - bbox[0]
            text_h = bbox[3] - bbox[1]
            box_top = max(0, y_min - text_h - 4)
            draw.rectangle(
                ((x_min, box_top), (x_min + text_w + 4, box_top + text_h + 4)),
                fill="red",
            )
            draw.text((x_min + 2, box_top + 2), text, fill="white", font=font)

    # Scene label (top-left corner)
    if scene_resp is not None and scene_resp.get("success"):
        scene_label = scene_resp.get("label", "unknown")
        scene_conf = scene_resp.get("confidence", 0)
        text = f"Scene: {scene_label} ({scene_conf:.2f})"
        bbox = draw.textbbox((0, 0), text, font=font)
        text_w = bbox[2] - bbox[0]
        text_h = bbox[3] - bbox[1]
        draw.rectangle(
            ((0, 0), (text_w + 8, text_h + 8)),
            fill="blue",
        )
        draw.text((4, 4), text, fill="white", font=font)

    img.save(out_path)
    print(f"[+] Saved annotated image to {out_path}")


def main():
    parser = argparse.ArgumentParser(
        description="DeepStack face + object + scene helper."
    )
    parser.add_argument(
        "--register",
        nargs=2,
        metavar=("IMAGE", "USERID"),
        action="append",
        help="Register a known face: IMAGE USERID (can be used multiple times).",
    )
    parser.add_argument(
        "--recognize",
        metavar="IMAGE",
        help="Run recognition/detection/scene on this image.",
    )
    parser.add_argument(
        "--output",
        default="recognized.jpg",
        help="Output image with boxes/labels drawn.",
    )
    args = parser.parse_args()

    # Register faces
    if args.register:
        for img_path_str, userid in args.register:
            img_path = Path(img_path_str)
            print(f"[+] Registering {img_path} as userid='{userid}'")
            resp = register_face(img_path, userid)
            print("    Response:", resp)

    # Recognize + detect + scene
    if args.recognize:
        test_path = Path(args.recognize)

        print(f"[+] Recognizing faces in {test_path}")
        face_resp = recognize_faces(test_path)
        print("[faces]", face_resp)

        print(f"[+] Detecting objects in {test_path}")
        obj_resp = detect_objects(test_path)
        print("[objects]", obj_resp)

        print(f"[+] Recognizing scene in {test_path}")
        scene_resp = recognize_scene(test_path)
        print("[scene]", scene_resp)

        all_keywords = []
        if scene_resp and scene_resp.get("success"):
            lbl = scene_resp.get("label", "").strip()
            if lbl:
                all_keywords.append(f"deepstack/scene/{lbl}")
        if obj_resp:
            for obj in obj_resp.get("predictions", []):
                lbl = obj.get("label", "").strip()
                if lbl:
                    all_keywords.append(f"deepstack/object/{lbl}")
        if face_resp:
            for face in face_resp.get("predictions", []):
                uid = face.get("userid", "").strip()
                if uid:
                    all_keywords.append(f"deepstack/person/{uid}")
        all_keywords = sorted(set(all_keywords))
        add_keywords_with_exiftool(test_path, all_keywords)

        out_path = Path(args.output)
        draw_results(test_path, face_resp, obj_resp, scene_resp, out_path)


if __name__ == "__main__":
    main()
Conclusion
Deepstack is decent, and can probably do its job for general labeling in an image library – and in a video library, with some scripting to aggregate results over frames. It’s still built around a set of pre-defined tags, except for the faces, which you obviously have to train yourself.
It does a pretty good job at this, and wasn’t hard to set up at all. But if you want to do more custom classification, e.g. detecting activities in a video, you’ll need something more advanced.
Stay tuned for my next blog post, about image and video classification with Ollama!
2 comments on “AI at home: Image and video classification with AI – deepstack”
How many pictures do you have? I tried digiKam, but it just hung for me.
I have about 50,000 pictures.