After playing with coding assistants for a while, I decided to let that theme rest. But that doesn’t mean I have to stop exploring uses for my self-hosted AI.
One thing all of us have these days is an ever-growing library of pictures and videos. There are quite a few software suites where you can label and tag them. Some of them do a good job of auto-tagging pictures and face recognition, but can I do that myself? I wanted to test that idea.
One ready-made solution is deepstack. It is built around its own pre-trained model and can recognize 80 different objects and 365 different scenes according to the documentation. It can also do face detection, and that’s the only part of it you can train easily. There is also some support for training custom object detection, but I haven’t tried that yet.
Deepstack is actually pretty decent and quick, does a generally good job, and would be a pretty good match for a general photo/image library. It only analyzes images, though. I decided to set it up in my Kubernetes cluster. It has its own file-based database for training data, and was pretty easy to set up through my standard pattern:
Storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepstack-datastore-pvc
  namespace: deepstack
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-rwo-local-ssd
  resources:
    requests:
      storage: 10Gi
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepstack
  namespace: deepstack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepstack
  template:
    metadata:
      labels:
        app: deepstack
    spec:
      containers:
        - name: deepstack
          image: deepquestai/deepstack:cpu
          imagePullPolicy: Always
          env:
            - name: MODE
              value: "High" # "High" favours accuracy; "Medium" or "Low" are faster
            - name: VISION-FACE
              value: "True"
            - name: VISION-SCENE
              value: "True"
            - name: VISION-DETECTION
              value: "True"
          ports:
            - containerPort: 5000
          volumeMounts:
            - name: datastore
              mountPath: /datastore
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "8"
              memory: "8Gi"
      volumes:
        - name: datastore
          persistentVolumeClaim:
            claimName: deepstack-datastore-pvc
Service
apiVersion: v1
kind: Service
metadata:
  name: deepstack
  namespace: deepstack
spec:
  selector:
    app: deepstack
  ports:
    - name: http
      port: 5000
      targetPort: 5000
Ingressroute
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: deepstack
  namespace: traefik-external
  annotations:
    kubernetes.io/ingress.class: "traefik-external"
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`deepstack.engen.priv.no`)
      kind: Rule
      services:
        - name: deepstack
          namespace: deepstack
          port: 5000
  tls:
    certResolver: letsencrypt
Load balancer
apiVersion: v1
kind: Service
metadata:
  name: traefik-deepstack
  namespace: traefik-external
  annotations:
    projectcalico.org/ipv6pools: '["loadbalancer-ipv6-pool"]'
    external-dns.alpha.kubernetes.io/hostname: deepstack.engen.priv.no
    external-dns/external: "true"
    external-dns.alpha.kubernetes.io/ttl: "300"
    ipchanger.alpha.kubernetes.io/patch: "true"
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ipFamilyPolicy: SingleStack
  ipFamilies:
    - IPv6
  ports:
    - name: web
      port: 80
    - name: websecure
      port: 443
  selector:
    app: traefik-external
DNS for IPv4
apiVersion: v1
kind: Service
metadata:
  name: deepstack-name
  namespace: deepstack
  annotations:
    external-dns.alpha.kubernetes.io/hostname: deepstack.engen.priv.no
    external-dns/external: "true"
    external-dns.alpha.kubernetes.io/ttl: "300"
    ipchanger.alpha.kubernetes.io/patch: "true"
spec:
  type: ExternalName
  externalName: <my external ip>
With this set up, I can call the deepstack API at https://deepstack.engen.priv.no/
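A quick way to check that everything is wired up is to post a single test image straight to the scene endpoint. A minimal sketch (the test.jpg path is just a placeholder):

import requests

DEEPSTACK_URL = "https://deepstack.engen.priv.no"

# Post one image to the scene endpoint and print the raw JSON reply.
with open("test.jpg", "rb") as f:
    r = requests.post(
        f"{DEEPSTACK_URL}/v1/vision/scene",
        files={"image": f},
        timeout=30,
    )
r.raise_for_status()
print(r.json())  # e.g. {"success": true, "label": "...", "confidence": ...}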
Analyzing a video
To put it to use, I can for example run the following script. It fetches one frame from a video and classifies it with deepstack.
import os
import subprocess
import tempfile
from pathlib import Path

import requests

DEEPSTACK_URL = os.environ.get(
    "DEEPSTACK_URL",
    "https://deepstack.engen.priv.no",
)

# change this to a real path in your PVC
TEST_VIDEO = os.environ.get("TEST_VIDEO", "/media/test-workshop.mp4")


def extract_one_frame(video_path: str, seek_seconds: int = 30) -> str:
    """
    Extract a single JPEG frame from the video at seek_seconds.
    Returns the path to the JPEG file.
    """
    tmpdir = tempfile.mkdtemp()
    frame_path = str(Path(tmpdir) / "frame.jpg")
    cmd = [
        "ffmpeg",
        "-loglevel",
        "error",
        "-ss",
        str(seek_seconds),
        "-i",
        video_path,
        "-frames:v",
        "1",
        "-q:v",
        "2",
        frame_path,
    ]
    subprocess.run(cmd, check=True)
    return frame_path


def call_scene_api(image_path: str):
    with open(image_path, "rb") as f:
        r = requests.post(
            f"{DEEPSTACK_URL}/v1/vision/scene",
            files={"image": ("frame.jpg", f, "image/jpeg")},
            timeout=30,
        )
    r.raise_for_status()
    return r.json()


def main():
    print(f"Using video: {TEST_VIDEO}", flush=True)
    if not os.path.exists(TEST_VIDEO):
        raise FileNotFoundError(TEST_VIDEO)
    frame_path = extract_one_frame(TEST_VIDEO, seek_seconds=30)
    print(f"Extracted frame: {frame_path}", flush=True)
    res = call_scene_api(frame_path)
    print("Scene result:", res, flush=True)


if __name__ == "__main__":
    main()
To classify a whole video, you’d have to analyze a set of frames and do some sort of aggregation of the results. I haven’t created any polished scripts for this so far. You could likely use another model, perhaps hosted on ollama, to analyze the results – something I explore in my next blog post, where I do it all in ollama. I decided to split this journey into two parts (so far): this one about deepstack, and the next, where I experiment with building it myself with ollama.
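That said, a rough sketch of the frame-sampling-and-aggregation idea could look like the following. It reuses extract_one_frame() and call_scene_api() from the script above, and uses a simple majority vote over sampled frames – just one possible strategy; the duration has to come from somewhere else (ffprobe, for instance).

import os
from collections import Counter

# Rough sketch: sample one frame every `step` seconds, classify each frame,
# and let the most common scene label win. Reuses extract_one_frame() and
# call_scene_api() from the script above.
def classify_video(video_path: str, duration_seconds: int, step: int = 30) -> str:
    labels = []
    for t in range(0, duration_seconds, step):
        frame = extract_one_frame(video_path, seek_seconds=t)
        if not os.path.exists(frame):  # a seek past the end produces no frame
            continue
        result = call_scene_api(frame)
        if result.get("success"):
            labels.append(result["label"])
    counts = Counter(labels)
    return counts.most_common(1)[0][0] if counts else "unknown"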
People
Recognizing people is essentially a different API endpoint than scene recognition.
You train it by feeding it images labeled with the person’s name. Training works best when there’s only one face in the image and the face is a fairly prominent part of it. I’ve had some success with cropping down to the actual face from an image where it didn’t detect any faces.
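Something along these lines does the trick; the crop box, file names and userid here are just placeholder values:

import requests
from PIL import Image

DEEPSTACK_URL = "https://deepstack.engen.priv.no"

# Crop a hand-picked box around the face, then register it for a person.
img = Image.open("family-photo.jpg")
face = img.crop((400, 150, 900, 700))  # (left, upper, right, lower), placeholders
face.save("face-crop.jpg", "JPEG")

with open("face-crop.jpg", "rb") as f:
    r = requests.post(
        f"{DEEPSTACK_URL}/v1/vision/face/register",
        files={"image": f},
        data={"userid": "some-person"},
        timeout=30,
    )
print(r.json())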
A complete script
A script (mostly AI-generated too, I must admit) that uses the deepstack tools to do object, scene and face recognition on an image could look like the following. It also writes the tags into the image so that they can be used in a decent image management program.
#!/usr/bin/env python3
import argparse
import subprocess
from pathlib import Path
from typing import Dict, Any

import requests
from PIL import Image, ImageDraw, ImageFont

DEEPSTACK_URL = "https://deepstack.engen.priv.no"  # or http://host:port


def add_keywords_with_exiftool(image_path: Path, keywords):
    """
    Append keywords to XMP-dc:Subject so digiKam sees them as tags.
    keywords: list of strings
    """
    if not keywords:
        return
    args = ["exiftool", "-overwrite_original"]
    for kw in keywords:
        args.append(f"-XMP-dc:Subject+={kw}")
    args.append(str(image_path))
    subprocess.run(args, check=True)


def register_face(image_path: Path, userid: str) -> Dict[str, Any]:
    url = f"{DEEPSTACK_URL}/v1/vision/face/register"
    with image_path.open("rb") as f:
        files = {"image": f}
        data = {"userid": userid}
        resp = requests.post(url, files=files, data=data, verify=False)
    print("[face-register]", resp.status_code, resp.text)
    resp.raise_for_status()
    return resp.json()


def recognize_faces(image_path: Path) -> Dict[str, Any]:
    url = f"{DEEPSTACK_URL}/v1/vision/face/recognize"
    with image_path.open("rb") as f:
        files = {"image": f}
        resp = requests.post(url, files=files, verify=False)
    resp.raise_for_status()
    return resp.json()


def detect_objects(image_path: Path) -> Dict[str, Any]:
    """DeepStack object detection (80 COCO classes)."""
    url = f"{DEEPSTACK_URL}/v1/vision/detection"
    with image_path.open("rb") as f:
        files = {"image": f}
        resp = requests.post(url, files=files, verify=False)
    resp.raise_for_status()
    return resp.json()  # {"success":..., "predictions":[{"label":..., "confidence":..., "x_min":...}]}


def recognize_scene(image_path: Path) -> Dict[str, Any]:
    """DeepStack scene recognition (365 scenes)."""
    url = f"{DEEPSTACK_URL}/v1/vision/scene"
    with image_path.open("rb") as f:
        files = {"image": f}
        resp = requests.post(url, files=files, verify=False)
    resp.raise_for_status()
    return resp.json()  # {"success":..., "label": "...", "confidence": ...}


def draw_results(
    image_path: Path,
    face_resp: Dict[str, Any] | None,
    obj_resp: Dict[str, Any] | None,
    scene_resp: Dict[str, Any] | None,
    out_path: Path,
) -> None:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.load_default()
    except Exception:
        font = None

    # Faces (green boxes)
    if face_resp is not None:
        for face in face_resp.get("predictions", []):
            userid = face.get("userid", "unknown")
            x_min = int(face["x_min"])
            y_min = int(face["y_min"])
            x_max = int(face["x_max"])
            y_max = int(face["y_max"])
            draw.rectangle(((x_min, y_min), (x_max, y_max)), outline="green", width=2)
            label = f"{userid}"
            bbox = draw.textbbox((0, 0), label, font=font)
            text_w = bbox[2] - bbox[0]
            text_h = bbox[3] - bbox[1]
            box_top = max(0, y_min - text_h - 4)
            draw.rectangle(
                ((x_min, box_top), (x_min + text_w + 4, box_top + text_h + 4)),
                fill="green",
            )
            draw.text((x_min + 2, box_top + 2), label, fill="white", font=font)

    # Objects (red boxes)
    if obj_resp is not None:
        for obj in obj_resp.get("predictions", []):
            label = obj.get("label", "obj")
            conf = obj.get("confidence", 0)
            x_min = int(obj["x_min"])
            y_min = int(obj["y_min"])
            x_max = int(obj["x_max"])
            y_max = int(obj["y_max"])
            draw.rectangle(((x_min, y_min), (x_max, y_max)), outline="red", width=2)
            text = f"{label} {conf:.2f}"
            bbox = draw.textbbox((0, 0), text, font=font)
            text_w = bbox[2] - bbox[0]
            text_h = bbox[3] - bbox[1]
            box_top = max(0, y_min - text_h - 4)
            draw.rectangle(
                ((x_min, box_top), (x_min + text_w + 4, box_top + text_h + 4)),
                fill="red",
            )
            draw.text((x_min + 2, box_top + 2), text, fill="white", font=font)

    # Scene label (top-left corner)
    if scene_resp is not None and scene_resp.get("success"):
        scene_label = scene_resp.get("label", "unknown")
        scene_conf = scene_resp.get("confidence", 0)
        text = f"Scene: {scene_label} ({scene_conf:.2f})"
        bbox = draw.textbbox((0, 0), text, font=font)
        text_w = bbox[2] - bbox[0]
        text_h = bbox[3] - bbox[1]
        draw.rectangle(
            ((0, 0), (text_w + 8, text_h + 8)),
            fill="blue",
        )
        draw.text((4, 4), text, fill="white", font=font)

    img.save(out_path)
    print(f"[+] Saved annotated image to {out_path}")


def main():
    parser = argparse.ArgumentParser(
        description="DeepStack face + object + scene helper."
    )
    parser.add_argument(
        "--register",
        nargs=2,
        metavar=("IMAGE", "USERID"),
        action="append",
        help="Register a known face: IMAGE USERID (can be used multiple times).",
    )
    parser.add_argument(
        "--recognize",
        metavar="IMAGE",
        help="Run recognition/detection/scene on this image.",
    )
    parser.add_argument(
        "--output",
        default="recognized.jpg",
        help="Output image with boxes/labels drawn.",
    )
    args = parser.parse_args()

    # Register faces
    if args.register:
        for img_path_str, userid in args.register:
            img_path = Path(img_path_str)
            print(f"[+] Registering {img_path} as userid='{userid}'")
            resp = register_face(img_path, userid)
            print("    Response:", resp)

    # Recognize + detect + scene
    if args.recognize:
        test_path = Path(args.recognize)

        print(f"[+] Recognizing faces in {test_path}")
        face_resp = recognize_faces(test_path)
        print("[faces]", face_resp)

        print(f"[+] Detecting objects in {test_path}")
        obj_resp = detect_objects(test_path)
        print("[objects]", obj_resp)

        print(f"[+] Recognizing scene in {test_path}")
        scene_resp = recognize_scene(test_path)
        print("[scene]", scene_resp)

        all_keywords = []
        if scene_resp and scene_resp.get("success"):
            lbl = scene_resp.get("label", "").strip()
            if lbl:
                all_keywords.append(f"deepstack/scene/{lbl}")
        if obj_resp:
            for obj in obj_resp.get("predictions", []):
                lbl = obj.get("label", "").strip()
                if lbl:
                    all_keywords.append(f"deepstack/object/{lbl}")
        if face_resp:
            for face in face_resp.get("predictions", []):
                uid = face.get("userid", "").strip()
                if uid:
                    all_keywords.append(f"deepstack/person/{uid}")
        all_keywords = sorted(set(all_keywords))
        add_keywords_with_exiftool(test_path, all_keywords)

        out_path = Path(args.output)
        draw_results(test_path, face_resp, obj_resp, scene_resp, out_path)


if __name__ == "__main__":
    main()
Conclusion
Deepstack is decent, and can probably do its job for general labeling in an image library – and in a video library, with some scripting to aggregate results over frames. It’s still built around a set of pre-defined tags, except for the faces, which you obviously have to train yourself.
It does a pretty good job at this, and wasn’t hard to set up at all. But if you want to do more custom classification, e.g. detecting activities in a video, you’ll need something more advanced.
Stay tuned for my next blog post, about image and video classification with Ollama!
2 comments on “AI at home: Image and video classification with AI – deepstack”
How many pictures do you have? I tried digiKam, but it just hung for me.
I have about 50,000 pictures.