AI@home: Classifying images with Ollama – part one.

After having tried deepseek a bit, I quickly decided it was more fun to dig a bit deeper. Deepseek looks like a solid tool for good results with not too much hardware resources, as opposed to what I am exploring in this article, classifying an image with a general purpose LLM. Nevertheless, it’s a fun experiment to see if I can get better and more customizable results with giving my own instructions and customizing it a bit more.

As I’m writing this, I’m finished with the first version of a script to call the LLM with an image to classify it, and have packaged it in a simple API. As with all my articles, I am of course using AI for research this, but I must admit I’m using hosted AIs at least as frequent as I used my own for this. It gives nice and quick results, compared to the sometimes painfully slow experience I have if I want to use larger models on my home node, and it probably also gives more complete results. My long term goal is, however, to be less and less reliant on things hosted on the net and more and more running my own stuff.

But, back to my image (and video) classifier. Let’s first dive a bit into the theory.

Picking the right (vision) model.

There are LLMs that are trained to understand images. They are called vision-capable models. Some examples are llava (Llamas (made by Meta) vision model) and qwen3-vl. As always, in general, a bigger model are better but slower. There are exceptions and models continuously evolve, so sometimes you can see a newer model performing just as well as the bigger older ones.

So far I have mostly been sticking to the llava models, because they have given consistent results and followed my prompts for how to structure the responses.

Architecture and tooling

it pays off to do some research and also build some flexibility into your tooling. In my openweb-ui and ollama setup, I have a nice and easy way to manage ollama and which models it hosts. For my coding assistant, I also set up an API endpoint for ollama, that I can reuse for this purpose. With openweb-ui I can send in an image and tell it to describe it, and it will send it on to ollama with the instructions. We’re going to have to do the same, just without openweb-ui in front, calling ollama directly.

I also knew I wanted to wrap this in an API framework and host it in my kubernetes setup, so I did my AI research around that. I landed on FastAPI for no good reasons except that was the first working example that my AI came up with, and for my simple needs it will do. FastAPI is an ASGI (Asynchronous Server Gateway Interface) implementation, and to run it you need to run it with a web server that can run such applications. The most common one is probably Unicorn. An explanation of how all of this ties together is at https://www.geeksforgeeks.org/python/fastapi-uvicorn/, but I’m not actually going to dive deeper into it in this blog post.

For my service, I need to package unicorn and my script into a container, and run it as a service in Kubernetes, but that isn’t the topic of this blog post. To explain my script, it’s only needed to know that the notion @/some/path means this will be run whenever the web server hits that part.

The program

Most of this program is written by an AI, though it’s built to my specifications, Some people feel this is a bit cheating, but I can tell you: I’d not had as much content on my blog without having an AI or three as a sparring partner, and sometimes, even writing all the code. It’s still a learning experience for me, and hopefully for the readers too. And this is my goal.

The program as of now is around 450 lines long. I’ll post bits and pieces of it, and include a link to the complete one later. It includes stubs for later extensions where I can give feedback on the answers and improve the model, but as of now that’s only dummy stuff that does not work.

The classify image endpoint

@app.post("/classify/image")
async def classify_image(
        file: UploadFile = File(...),
        vision_model: Optional[str] = Query(
            default=None,
          description="Override vision model name (e.g. 'llava:13b-q4')",
          ),
        ):
    logger.info(f"/classify/image filename={file.filename} ct={file.content_type}")

    if not file.content_type.startswith("image/"):
        raise HTTPException(status_code=400, detail="Uploaded file is not an image")

    media_id = str(uuid.uuid4())
    tmpdir = Path(tempfile.mkdtemp(prefix="img_"))
    img_path = tmpdir / file.filename

    with img_path.open("wb") as f:
        shutil.copyfileobj(file.file, f)

    result = classify_frame_with_llava(img_path, model=vision_model)
    logger.info(f"Image {img_path} classified result={result}")

    record = {
        "ts": datetime.utcnow().isoformat(),
        "media_id": media_id,
        "type": "image",
        "path": str(img_path),
        "model": vision_model,
        "frame_result": result,
    }
    _append_feedback_record(record)

    return JSONResponse({"media_id": media_id, "result": result})

This endpoint takes the uploaded file and a model as input and runs the subroutine classify_frame_with_llava, which handles the actual submission of the file and a prompt to ollama. This is the below function

def classify_frame_with_llava(frame_path: Path, model: Optional[str] = None) -> Dict[str, Any]:
    logger.info(f"Classifying frame with LLaVA model={model} path={frame_path}")
    with frame_path.open("rb") as f:
        image_bytes = f.read()

    prompt = """
You are a general image/video frame classification assistant.

Return your answer as VALID JSON only, no markdown, no code fences, no extra text.

Given a single frame, respond ONLY with JSON of the form:

{
  "primary_label": "one short category label for the main content",
  "secondary_labels": ["optional", "extra", "labels"],
  "confidence": 0.0_to_1.0,
  "short_description": "one short sentence describing the frame"
}

Choose primary_label to best summarize the main content (scene, activity, or key objects).
If unsure, use "unknown".
"""

    use_model = model or VISION_MODEL
    try:
        res = ollama.generate(
            model=use_model,
            prompt=prompt,
            images=[image_bytes],
            keep_alive=0,
        )
    except Exception as e:
        logger.exception(f"Ollama generate failed for frame {frame_path}: {e}")
        return {
            "primary_label": "error",
            "secondary_labels": [],
            "confidence": 0.0,
            "short_description": f"Ollama error: {e}",
        }

    raw = res["response"]
    logger.debug(f"Ollama raw response for {frame_path}: {raw!r}")

    try:
        cleaned = _extract_json_block(raw)
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        logger.warning(f"JSON parse failed for {frame_path}, falling back to raw text")
        data = {
            "primary_label": "unknown",
            "secondary_labels": [],
            "confidence": 0.0,
            "short_description": raw.strip(),
        }
    return data

If everything goes well with the Ollama call and it returns the structured JSON I’ve asked it to return, res[«response»] will contain the answer from ollama. By the way, the ollama package is an import to my script, a ready-made package for interacting with Ollama. I have defined the environment variable OLLAMA_HOST to point to my Ollama instance, there’s no more magic to it than that.

Sometimes, it’s easier with a cleanup step after the AI call than actually doing it all in AI, so I have cleaned = _extract_json_block(raw) that will extract the actual answer from the raw answer:

def _extract_json_block(text: str) -> str:
    """
    Remove code fences etc. and return best-effort JSON substring.
    """
    t = text.strip()

    # Strip ```xxx ... ``` blocks if present
    if t.startswith("```"):
        t = re.sub(r"^```[a-zA-Z0-9]*\s*", "", t)
        t = re.sub(r"\s*```$", "", t)
        t = t.strip()

    return t

It just strips away some common AI artifacts and whitespaces etc so we hopefully get a clean JSON output that can be parsed by whatever program needed the image classified.

As the prompt shows, it will return some labels and the confidence by which it returns the label, plus an actual short description of the image. The quality of this depends on two things:

The quality of the AI model
The quality of the prompt

I’m by no means finished with that experimentation, but I’m already at something as usable that it’s natural with an initial blog post, and then we’ll dig further into the theme in followup-posts.

The result of the analysis is packaged and sent as a response:

    return JSONResponse({"media_id": media_id, "result": result})

The classify video endpoint

Classifying videos is done by extracting a sample of the frames. analyze the frames as images, and then do an aggregated analysis over the results.

@app.post("/classify/video")
async def classify_video(
        file: UploadFile = File(...),
        vision_model: Optional[str] = Query(
          default=None,
          description="Override vision model name",
        ),
        desired_fps: float = Query(
          default=1.0,
          ge=0.01,
          le=30.0,
          description="Target FPS for frame extraction",
        ),
        agg_model: Optional[str] = Query(
            default=AGG_MODEL,
            description="Override aggregation model name",
        ),
    ):
    logger.info(f"/classify/video filename={file.filename} ct={file.content_type}")

    # Allow video/* or generic octet-stream from curl
    ct = file.content_type or ""
    if not (ct.startswith("video/") or ct == "application/octet-stream"):
        raise HTTPException(status_code=400, detail=f"Not a video (content_type={ct})")

    media_id = str(uuid.uuid4())
    tmpdir = Path(tempfile.mkdtemp(prefix="vid_"))
    video_path = tmpdir / file.filename

    with video_path.open("wb") as f:
        shutil.copyfileobj(file.file, f)
    logger.info(f"Saved upload to {video_path}")

    effective_fps = choose_fps_with_min_frames(video_path, desired_fps=desired_fps, min_frames=15)
    logger.info(f"Using fps={effective_fps:.3f} for video {video_path}")

    per_frame_results: List[Dict[str, Any]] = []
    with tempfile.TemporaryDirectory(prefix="frames_") as frames_tmp:
        frames_dir = Path(frames_tmp)
        extract_frames(video_path, frames_dir, fps=effective_fps)
        frame_files = sorted(frames_dir.glob("frame_*.jpg"))
        logger.info(f"Extracted {len(frame_files)} frames from {video_path}")

        for idx, frame_path in enumerate(frame_files):
            logger.info(f"Analyzing frame {idx+1}/{len(frame_files)}: {frame_path.name}")
            r = classify_frame_with_llava(frame_path, model=vision_model)
            per_frame_results.append(r)

    label_counts = aggregate_labels(per_frame_results)
    logger.info(f"Label counts for media_id={media_id}: {label_counts}")

    agg_obj = aggregate_with_llm(per_frame_results, model=agg_model)
    logger.info(f"Aggregation done for media_id={media_id}")

    record = {
        "ts": datetime.utcnow().isoformat(),
        "media_id": media_id,
        "type": "video",
        "path": str(video_path),
        "frames": len(per_frame_results),
        "fps": effective_fps,
        "label_counts": label_counts,
        "per_frame_results": per_frame_results,
        "video_summary": agg_obj,
    }
    _append_feedback_record(record)

    return JSONResponse({
        "media_id": media_id,
        "frames": len(per_frame_results),
        "fps": effective_fps,
        "label_counts": label_counts,
        "video_summary": agg_obj,
    })

As you can see, there’s support for selecting both the model for image classification and how many frames per second should be selected. And this is where it starts to become interesting. I have no good answer, but I have some choices to make: Is quality or quantity better? Is it better with running a lighter model on more frames or a heavier model on fewer frames. As you can see, I’ve also done a safeguard, I won’t ever do a video analysis with less than 15 frames to choose from.

With the very lightweight llava-phy3:3.8b model, I can get each image classification done in 9 seconds on my hardware, like a clockwork.

Running llava:13b on the other hand gives you much better answers individially on the frames, at the cost of it taking roughly 1 minute per frame.

The most intesting part in this endpoint is the aggregation, It takes an array with the JSON answers from image classification and asks the aggregation model to summarize it.

def aggregate_with_llm(frame_results: List[Dict[str, Any]], model: Optional[str] = None) -> Dict[str, Any]:
    logger.info(f"Aggregating {len(frame_results)} frames with model={model}")

    lines = []
    for i, r in enumerate(frame_results):
        labels = []
        if r.get("primary_label"):
            labels.append(r["primary_label"])
        labels.extend(r.get("secondary_labels") or [])
        labels_str = ", ".join(labels)
        raw_conf = r.get("confidence", 0)
        conf = _normalize_confidence(raw_conf)
        desc = r.get("short_description", "")
        lines.append(
            f"frame={i} labels=[{labels_str}] confidence={conf:.2f} desc={desc}"
        )
    joined = "\n".join(lines)

    prompt = f"""
You are a general video classification system.

You are given frame-level analyses from a video, one per line.
Each line contains a frame index, labels, confidence, and a short description.

Frame analyses:
{joined}

Infer overall labels that best describe the video (scene types, activities, key objects).

Return ONLY JSON with this structure:

{{
  "video_labels": ["label1", "label2", ...],
  "primary_label": "one best overall label",
  "rationale": "short explanation based on the frames",
  "notable_segments": [
    {{
      "frame_start": 0,
      "frame_end": 0,
      "description": "optional segment description"
    }}
  ]
}}

If you don't have notable segments, return an empty list for "notable_segments".
No extra text outside the JSON.
"""

    use_model = model or AGG_MODEL
    try:
        res = ollama.generate(
            model=model,
            prompt=prompt,
            keep_alive=0,
            # If your Ollama supports it, you can try:
            # options={"thinking": False},
        )
    except Exception as e:
        logger.exception(f"Ollama aggregate failed: {e}")
        raise

    raw = res["response"]
    logger.debug(f"Aggregation raw response: {raw!r}")

    return _extract_think_and_json(raw)

The interesting thing here is really the prompt, which takes the content on the jsons, on JSON per line and does an analysis on it to get an overall result on the video.

As I have been using deepseek-r1:34b as the aggregation model, it also returns the thinking, I have a model that structures the thinking and returns it in its own JSON field along all the other results:


def _extract_think_and_json(text: str) -> Dict[str, Any]:
    """
    For DeepSeek-R1-style output:
      <think>...</think>
      {json}

    Returns { "thinking": "...", "result": { ... } }
    """
    # 1. Extract <think>...</think>
    think_match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL | re.IGNORECASE)
    thinking = think_match.group(1).strip() if think_match else ""

    # 2. Remove the think block
    without_think = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE).strip()

    # 3. Extract JSON
    cleaned = _extract_json_block(without_think)
    try:
        result_obj = json.loads(cleaned)
    except json.JSONDecodeError:
        logger.warning("Failed to parse aggregation JSON, storing raw text")
        result_obj = {"raw": cleaned}

    return {"thinking": thinking, "result": result_obj}

It basically separates the thinking from the description before it’s returned to the caller.

Deployment to kubernetes

To deploy this to kubernetes, I basically need to create the Deployment, a volume for temporary data, a service, an ingressroute, and domain names and certificates etc for mc.engen.priv.no. All of this utilizes my already-made kubernetes infrastructure.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-classifier-data
  namespace: media-ml
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: longhorn-rwo-local-ssd
  resources:
    requests:
      storage: 20Gi


apiVersion: apps/v1
kind: Deployment
metadata:
  name: media-classifier
  namespace: media-ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: media-classifier
  template:
    metadata:
      labels:
        app: media-classifier
    spec:
      containers:
        - name: media-classifier
          image: registry.engen.priv.no/ollama-classifier:3.11-slim
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
          env:
            - name: OLLAMA_HOST
              value: "http://ollama.engen.priv.no:11434"
          volumeMounts:
            - name: data
              mountPath: /data/ollama_classifier
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: media-classifier-data

apiVersion: v1
kind: Service
metadata:
  name: media-classifier
  namespace: media-ml
spec:
  type: ClusterIP
  selector:
    app: media-classifier
  ports:
    - name: http
      port: 80
      targetPort: 8000

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: media-classifier
  namespace: traefik-external
  annotations:
    kubernetes.io/ingress.class: "traefik-external"
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`mc.engen.priv.no`)
      kind: Rule
      middlewares:
        - name: internal-ipallowlist
      services:
        - name: media-classifier
          namespace: media-ml
          port: 80
  tls:
    certResolver: letsencrypt

apiVersion: v1
kind: Service
metadata:
    name: traefik-media-classifier
    namespace: traefik-external
    annotations:
      projectcalico.org/ipv6pools: '["loadbalancer-ipv6-pool-internal"]'
      projectcalico.org/ipv4pools: '["loadbalancer-ipv4-pool-internal"]'
      external-dns.alpha.kubernetes.io/hostname: mc.engen.priv.no
      external-dns/internal: "true"
      external-dns.alpha.kubernetes.io/ttl: "300"
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ipFamilyPolicy: PreferDualStack
  ipFamilies:
    - IPv6
    - IPv4
  ports:
    - name: web
      port: 80
    - name: websecure
      port: 443
  selector:
    app: traefik-external

Complete script

!/usr/bin/env python3
import json
import logging
import re
import shutil
import subprocess
import tempfile
import uuid
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any,Optional
from pydantic import BaseModel

import ollama
from fastapi import FastAPI, UploadFile, File, HTTPException, Query
from fastapi.responses import JSONResponse

# -------------------------------------------------------------------
# Logging
# -------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("media-classifier")

# -------------------------------------------------------------------
# Config
# -------------------------------------------------------------------

# Point to your Ollama instance (override via env if needed)
OLLAMA_HOST = "http://<redacted>:11434"
ollama.host = OLLAMA_HOST

DEFAULT_VISION_MODEL = "llava:13b"
DEFAULT_VIDEO_VISION_MODEL="llava-phi3:3.8b"
DEFAULT_AGG_MODEL = "deepseek-r1:32b"

VISION_MODEL = DEFAULT_VISION_MODEL
VIDEO_VISION_MODEL = DEFAULT_VIDEO_VISION_MODEL
AGG_MODEL = DEFAULT_AGG_MODEL

# Persistent data root (mount a PVC here in Kubernetes)
DATA_ROOT = Path("/data/media_classifier")
FEEDBACK_FILE = DATA_ROOT / "feedback.jsonl"
TRAIN_FLAG = DATA_ROOT / "train_requested"

DATA_ROOT.mkdir(parents=True, exist_ok=True)

app = FastAPI(
    title="General Media Classifier",
    description=(
        "Classify images and videos using Ollama + LLaVA, "
        "with feedback collection for future training."
    ),
)

# -------------------------------------------------------------------
# Helpers: ffmpeg / fps
# -------------------------------------------------------------------

def get_video_duration_seconds(video_path: Path) -> float:
    out = subprocess.check_output([
        "ffprobe", "-v", "quiet",
        "-show_format", "-print_format", "json",
        str(video_path),
    ])
    data = json.loads(out)
    return float(data["format"]["duration"])


def choose_fps_with_min_frames(video_path: Path, desired_fps: float, min_frames: int = 15) -> float:
    duration = get_video_duration_seconds(video_path)
    if duration <= 0:
        return desired_fps
    est_frames = duration * desired_fps
    if est_frames >= min_frames:
        return desired_fps
    return min_frames / duration


def extract_frames(video_path: Path, out_dir: Path, fps: float) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-i", str(video_path),
        "-vf", f"fps={fps}",
        str(out_dir / "frame_%06d.jpg"),
        "-hide_banner", "-loglevel", "error",
    ]
    subprocess.run(cmd, check=True)

# -------------------------------------------------------------------
# Helpers: JSON cleaning / DeepSeek thinking
# -------------------------------------------------------------------

def _extract_json_block(text: str) -> str:
    """
    Remove code fences etc. and return best-effort JSON substring.
    """
    t = text.strip()

    # Strip ```xxx ... ``` blocks if present
    if t.startswith("```"):
        t = re.sub(r"^```[a-zA-Z0-9]*\s*", "", t)
        t = re.sub(r"\s*```$", "", t)
        t = t.strip()

    return t


def _extract_think_and_json(text: str) -> Dict[str, Any]:
    """
    For DeepSeek-R1-style output:
      <think>...</think>
      {json}

    Returns { "thinking": "...", "result": { ... } }
    """
    # 1. Extract <think>...</think>
    think_match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL | re.IGNORECASE)
    thinking = think_match.group(1).strip() if think_match else ""

    # 2. Remove the think block
    without_think = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE).strip()

    # 3. Extract JSON
    cleaned = _extract_json_block(without_think)
    try:
        result_obj = json.loads(cleaned)
    except json.JSONDecodeError:
        logger.warning("Failed to parse aggregation JSON, storing raw text")
        result_obj = {"raw": cleaned}

    return {"thinking": thinking, "result": result_obj}

# -------------------------------------------------------------------
# Helpers: LLaVA + aggregation
# -------------------------------------------------------------------

def _normalize_confidence(value) -> float:
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return 1.0
    if isinstance(value, (list, tuple)) and value:
        return _normalize_confidence(value)
    return 1.0


def classify_frame_with_llava(frame_path: Path, model: Optional[str] = None) -> Dict[str, Any]:
    logger.info(f"Classifying frame with LLaVA model={model} path={frame_path}")
    with frame_path.open("rb") as f:
        image_bytes = f.read()

    prompt = """
You are a general image/video frame classification assistant.

Return your answer as VALID JSON only, no markdown, no code fences, no extra text.

Given a single frame, respond ONLY with JSON of the form:

{
  "primary_label": "one short category label for the main content",
  "secondary_labels": ["optional", "extra", "labels"],
  "confidence": 0.0_to_1.0,
  "short_description": "one short sentence describing the frame"
}

Choose primary_label to best summarize the main content (scene, activity, or key objects).
If unsure, use "unknown".
"""

    use_model = model or VISION_MODEL
    try:
        res = ollama.generate(
            model=use_model,
            prompt=prompt,
            images=[image_bytes],
            keep_alive=0,
        )
    except Exception as e:
        logger.exception(f"Ollama generate failed for frame {frame_path}: {e}")
        return {
            "primary_label": "error",
            "secondary_labels": [],
            "confidence": 0.0,
            "short_description": f"Ollama error: {e}",
        }

    raw = res["response"]
    logger.debug(f"Ollama raw response for {frame_path}: {raw!r}")

    try:
        cleaned = _extract_json_block(raw)
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        logger.warning(f"JSON parse failed for {frame_path}, falling back to raw text")
        data = {
            "primary_label": "unknown",
            "secondary_labels": [],
            "confidence": 0.0,
            "short_description": raw.strip(),
        }
    return data


def aggregate_labels(per_frame_results: List[Dict[str, Any]]) -> List[tuple[str, float]]:
    counter = Counter()
    for r in per_frame_results:
        labels: List[str] = []
        if r.get("primary_label"):
            labels.append(r["primary_label"])
        labels.extend(r.get("secondary_labels") or [])
        raw_conf = r.get("confidence", 1.0)
        conf = _normalize_confidence(raw_conf) or 1.0
        for label in labels:
            counter[label] += conf
    return counter.most_common()


def aggregate_with_llm(frame_results: List[Dict[str, Any]], model: Optional[str] = None) -> Dict[str, Any]:
    logger.info(f"Aggregating {len(frame_results)} frames with model={model}")

    lines = []
    for i, r in enumerate(frame_results):
        labels = []
        if r.get("primary_label"):
            labels.append(r["primary_label"])
        labels.extend(r.get("secondary_labels") or [])
        labels_str = ", ".join(labels)
        raw_conf = r.get("confidence", 0)
        conf = _normalize_confidence(raw_conf)
        desc = r.get("short_description", "")
        lines.append(
            f"frame={i} labels=[{labels_str}] confidence={conf:.2f} desc={desc}"
        )
    joined = "\n".join(lines)

    prompt = f"""
You are a general video classification system.

You are given frame-level analyses from a video, one per line.
Each line contains a frame index, labels, confidence, and a short description.

Frame analyses:
{joined}

Infer overall labels that best describe the video (scene types, activities, key objects).

Return ONLY JSON with this structure:

{{
  "video_labels": ["label1", "label2", ...],
  "primary_label": "one best overall label",
  "rationale": "short explanation based on the frames",
  "notable_segments": [
    {{
      "frame_start": 0,
      "frame_end": 0,
      "description": "optional segment description"
    }}
  ]
}}

If you don't have notable segments, return an empty list for "notable_segments".
No extra text outside the JSON.
"""

    use_model = model or AGG_MODEL
    try:
        res = ollama.generate(
            model=model,
            prompt=prompt,
            keep_alive=0,
            # If your Ollama supports it, you can try:
            # options={"thinking": False},
        )
    except Exception as e:
        logger.exception(f"Ollama aggregate failed: {e}")
        raise

    raw = res["response"]
    logger.debug(f"Aggregation raw response: {raw!r}")

    return _extract_think_and_json(raw)

# -------------------------------------------------------------------
# Helpers: feedback storage
# -------------------------------------------------------------------

def _append_feedback_record(obj: dict) -> None:
    DATA_ROOT.mkdir(parents=True, exist_ok=True)
    with FEEDBACK_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(obj) + "\n")

# -------------------------------------------------------------------
# API Endpoints
# -------------------------------------------------------------------

@app.post("/classify/image")
async def classify_image(
        file: UploadFile = File(...),
        vision_model: Optional[str] = Query(
            default=None,
          description="Override vision model name (e.g. 'llava:13b-q4')",
          ),
        ):
    logger.info(f"/classify/image filename={file.filename} ct={file.content_type}")

    if not file.content_type.startswith("image/"):
        raise HTTPException(status_code=400, detail="Uploaded file is not an image")

    media_id = str(uuid.uuid4())
    tmpdir = Path(tempfile.mkdtemp(prefix="img_"))
    img_path = tmpdir / file.filename

    with img_path.open("wb") as f:
        shutil.copyfileobj(file.file, f)

    result = classify_frame_with_llava(img_path, model=vision_model)
    logger.info(f"Image {img_path} classified result={result}")

    record = {
        "ts": datetime.utcnow().isoformat(),
        "media_id": media_id,
        "type": "image",
        "path": str(img_path),
        "model": vision_model,
        "frame_result": result,
    }
    _append_feedback_record(record)

    return JSONResponse({"media_id": media_id, "result": result})


app.post("/classify/video")
async def classify_video(
        file: UploadFile = File(...),
        vision_model: Optional[str] = Query(
          default=None,
          description="Override vision model name",
        ),
        desired_fps: float = Query(
          default=1.0,
          ge=0.01,
          le=30.0,
          description="Target FPS for frame extraction",
        ),
        agg_model: Optional[str] = Query(
            default=AGG_MODEL,
            description="Override aggregation model name",
        ),
    ):
    logger.info(f"/classify/video filename={file.filename} ct={file.content_type}")

    # Allow video/* or generic octet-stream from curl
    ct = file.content_type or ""
    if not (ct.startswith("video/") or ct == "application/octet-stream"):
        raise HTTPException(status_code=400, detail=f"Not a video (content_type={ct})")

    media_id = str(uuid.uuid4())
    tmpdir = Path(tempfile.mkdtemp(prefix="vid_"))
    video_path = tmpdir / file.filename

    with video_path.open("wb") as f:
        shutil.copyfileobj(file.file, f)
    logger.info(f"Saved upload to {video_path}")

    effective_fps = choose_fps_with_min_frames(video_path, desired_fps=desired_fps, min_frames=15)
    logger.info(f"Using fps={effective_fps:.3f} for video {video_path}")

    per_frame_results: List[Dict[str, Any]] = []
    with tempfile.TemporaryDirectory(prefix="frames_") as frames_tmp:
        frames_dir = Path(frames_tmp)
        extract_frames(video_path, frames_dir, fps=effective_fps)
        frame_files = sorted(frames_dir.glob("frame_*.jpg"))
        logger.info(f"Extracted {len(frame_files)} frames from {video_path}")

        for idx, frame_path in enumerate(frame_files):
            logger.info(f"Analyzing frame {idx+1}/{len(frame_files)}: {frame_path.name}")
            r = classify_frame_with_llava(frame_path, model=vision_model)
            per_frame_results.append(r)

    label_counts = aggregate_labels(per_frame_results)
    logger.info(f"Label counts for media_id={media_id}: {label_counts}")

    agg_obj = aggregate_with_llm(per_frame_results, model=agg_model)
    logger.info(f"Aggregation done for media_id={media_id}")

    record = {
        "ts": datetime.utcnow().isoformat(),
        "media_id": media_id,
        "type": "video",
        "path": str(video_path),
        "frames": len(per_frame_results),
        "fps": effective_fps,
        "label_counts": label_counts,
        "per_frame_results": per_frame_results,
        "video_summary": agg_obj,
    }
    _append_feedback_record(record)

    return JSONResponse({
        "media_id": media_id,
        "frames": len(per_frame_results),
        "fps": effective_fps,
        "label_counts": label_counts,
        "video_summary": agg_obj,
    })

Dockerfile can be for example the below one, which I build and deploy to my registry.engen.priv.no docker registry in my build pipeline.

FROM python:3.11-slim

ENV PYTHONUNBUFFERED=1

# Install ffmpeg + ffprobe
RUN apt-get update && \
    apt-get install -y --no-install-recommends ffmpeg && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Requirements (adjust as needed)
RUN pip install --no-cache-dir fastapi uvicorn[standard] ollama python-multipart python-multipart pydantic

# Copy app
COPY app.py /app/app.py

# Data dir (mount PVC here in Kubernetes)
RUN mkdir -p /data/ollama_classifier
VOLUME ["/data/ollama_classifier"]

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Testing

All this leads to an API I can call and test.

Test 1: An image

curl -X POST "https://mc.engen.priv.no/classify/image" -F "file=@/Users/vegardengen/Nextcloud/Direkteopplasting/2024/20240626_131919.jpg"

{"media_id":"a8425c19-883d-461d-aa5a-c817240ff2bf","result":{"primary_label":"garden","secondary_labels":[],"confidence":0.97,"short_description":"Park in daytime with many trees and shrubs."}}

Ok, fair enough.

Test 2: Another image

curl -X POST "https://mc.engen.priv.no/classify/image" -F "file=@/Users/vegardengen/Nextcloud/Direkteopplasting/2024/20240630_153719.jpg"

{"media_id":"b0a7912b-c17f-4d31-9d6c-3dd2b2ba5c6d","result":{"primary_label":"cityscape with mountain in background","secondary_labels":["town","buildings","sky"],"confidence":0.98,"short_description":"View of town and building with large mountain rising in the distance."}}

Test 3: Video, 1 fps and model llava-phi3:3.8n (lightweight)

curl -X POST "https://mc.engen.priv.no/classify/video?vision_model=llama-phi3:3.8b" -F "file=@/Users/vegardengen/Nextcloud/Direkteopplasting/2024/20240626_163718.mp4"

{
   "fps" : 1,
   "frames" : 51,
   "label_counts" : [
      [
         "colorful",
         4.74
      ],
      [
         "stage",
         4.67
      ],
      [
         "performance",
         4.47
      ],
      [
         "lights",
         3.97
      ],
      [
         "carnival",
         3.88
      ],
      [
         "music",
         2.95
      ],
      [
         "dance",
         2.82
      ],
      [
         "concert",
         2.51
      ],
      [
         "entertainment",
         2.35
      ],
      [
         "Dance performance",
         2.33
      ],
      [
         "circus",
         2.32
      ],
      [
         "stage show",
         1.79
      ],
      [
         "colorful lights",
         1.77
      ],
      [
         "animals",
         1.65
      ],
      [
         "carnival ride",
         1.64
      ],
      [
         "costumes",
         1.6
      ],
      [
         "performance art",
         1.46
      ],
      [
         "ride",
         1.45
      ],
      [
         "parade",
         1.45
      ],
      [
         "theatre",
         1.27
      ],
      [
         "puppet show",
         0.92
      ],
      [
         "children's entertainment",
         0.92
      ],
      [
         "Parade float",
         0.92
      ],
      [
         "game",
         0.91
      ],
      [
         "decorated",
         0.91
      ],
      [
         "Circus",
         0.86
      ],
      [
         "Lights",
         0.86
      ],
      [
         "People",
         0.86
      ],
      [
         "colorful lights and props",
         0.85
      ],
      [
         "video",
         0.85
      ],
      [
         "Performance",
         0.85
      ],
      [
         "Clown Show",
         0.85
      ],
      [
         "Rock Band",
         0.85
      ],
      [
         "music performance",
         0.84
      ],
      [
         "artists",
         0.83
      ],
      [
         "Music concert",
         0.83
      ],
      [
         "Artistic expression",
         0.83
      ],
      [
         "dance show",
         0.82
      ],
      [
         "fire",
         0.82
      ],
      [
         "Masquerade party on stage",
         0.82
      ],
      [
         "nightlife",
         0.82
      ],
      [
         "amusement ride",
         0.81
      ],
      [
         "thanksgiving",
         0.78
      ],
      [
         "Dragon",
         0.78
      ],
      [
         "Masquerade Ball",
         0.78
      ],
      [
         "Halloween party",
         0.78
      ],
      [
         "dance performance",
         0.75
      ],
      [
         "art performance",
         0.75
      ],
      [
         "lighting",
         0.75
      ],
      [
         "light show",
         0.74
      ],
      [
         "show",
         0.74
      ],
      [
         "Stage",
         0.73
      ],
      [
         "parade float",
         0.71
      ],
      [
         "lit",
         0.71
      ],
      [
         "Bizarre",
         0.71
      ],
      [
         "Celebration",
         0.71
      ],
      [
         "Illusion",
         0.71
      ],
      [
         "Theatrical props",
         0.67
      ],
      [
         "Stage decorations",
         0.67
      ],
      [
         "colorful parade float",
         0.67
      ],
      [
         "theme park",
         0.67
      ],
      [
         "festival",
         0.67
      ],
      [
         "colorful train with lighting",
         0.65
      ],
      [
         "stage performance",
         0.65
      ],
      [
         "stages",
         0.65
      ],
      [
         "dolls",
         0.65
      ],
      [
         "halloween",
         0.65
      ],
      [
         "A parade float with people riding on it",
         0.64
      ],
      [
         "party decorations",
         0.64
      ],
      [
         "dance party",
         0.62
      ],
      [
         "performers",
         0.56
      ],
      [
         "lights and stage",
         0.53
      ],
      [
         "fire effect",
         0.53
      ],
      [
         "performance stage",
         0.53
      ],
      [
         "artistic expression",
         0.53
      ]
   ],
   "media_id" : "ba230f73-8e9e-403b-a227-bafed0c12255",
   "video_summary" : {
      "result" : {
         "notable_segments" : [],
         "primary_label" : "Performance",
         "rationale" : "The video captures various types of performances, including parades, carnivals, concerts, and dance shows, with a focus on entertainment and colorful settings.",
         "video_labels" : [
            "Parade",
            "Carnival",
            "Concert",
            "Dance Performance",
            "Stage Show",
            "Art Performance"
         ]
      },
      "thinking" : "Alright, I'm looking at this problem where I need to help infer overall labels for a video based on frame-level analyses. The user provided a lot of data, each frame with its own set of labels and descriptions. My task is to analyze all these frames and come up with a concise JSON output that includes the main labels, the primary label, an explanation, and any notable segments.\n\nFirst, I'll start by reading through each frame's information to get a sense of what the video is about. Let me go through them one by one:\n\n- Frame 0: Talks about a stage with colorful lights and an orchestra.\n- Frames 1-3: Mention dance performances, circuses, parades, and party decorations.\n- Frames 4-7: Focus on carnivals, rides, parades, and people in costumes.\n- Frames 8-9: Back to dance performances and puppet shows for children.\n- Frames 10-21: Include various carnival elements like carousels, lights, decorations, animals, and different types of rides.\n- Frames 22-34: Continue with colorful parade floats, theme parks, festivals, and stages with performances.\n- Frames 35-49: Cover concerts, dance parties, art performances, circuses, clowns, and rock bands.\n\nFrom this, I can see that the video is a mix of different performance types. It includes parades, carnivals, stage shows, dances, music concerts, and even some circus elements. The common threads are colorful settings, stages, performers, and various forms of entertainment.\n\nNext, I need to determine the primary label. Since the video encompasses multiple performance types—parade, carnival, concert, dance—it's clear that it's a multi-faceted event. However, \"performance\" seems to be the overarching theme because all these activities involve some form of performing or entertaining.\n\nNow, compiling the video_labels, I'll include the most prominent categories: Parade, Carnival, Concert, Dance Performance, Stage Show, and Art Performance. These cover the main activities observed across the frames.\n\nFor the rationale, I should explain that the video captures various performance events with a focus on entertainment and colorful settings. Notable segments might highlight key moments like parades or specific performances, but since the user said to return an empty list if none are notable, I'll leave that part empty unless there's something significant.\n\nPutting it all together, the JSON structure should reflect these findings accurately."
   }
}

Test 4: Same video with 15 frames instead of 51, and llava:13b

curl -X POST "https://mc.engen.priv.no/classify/video?vision_model=llava:13b&desired_fps=0.1" -F "file=@/Users/vegardengen/Nextcloud/Direkteopplasting/2024/20240626_163718.mp4"

{
   "fps" : 0.291902808741368,
   "frames" : 15,
   "label_counts" : [
      [
         "lights",
         6.93456792
      ],
      [
         "performance",
         6.023
      ],
      [
         "stage",
         6.01456792
      ],
      [
         "concert",
         4.21456792
      ],
      [
         "music",
         2.62756792
      ],
      [
         "performers",
         2.51456792
      ],
      [
         "entertainment",
         2.473
      ],
      [
         "error",
         2
      ],
      [
         "performer",
         1.75
      ],
      [
         "audience",
         1.75
      ],
      [
         "circus",
         0.95
      ],
      [
         "costume",
         0.95
      ],
      [
         "acrobat",
         0.95
      ],
      [
         "stage show",
         0.92
      ],
      [
         "show",
         0.873
      ],
      [
         "decorations",
         0.85
      ]
   ],
   "media_id" : "1ebe7531-69af-4249-8f3d-ee15c059c215",
   "video_summary" : {
      "result" : {
         "notable_segments" : [],
         "primary_label" : "concert",
         "rationale" : "The video predominantly features scenes of concerts and performances, with consistent elements such as stage settings, performers, vibrant lights, and audience presence.",
         "video_labels" : [
            "concert",
            "performance",
            "stage",
            "lights",
            "performers"
         ]
      },
      "thinking" : "Alright, so I need to figure out how to respond to this query where the user is asking me to act as a general video classification system. They've provided frame-level analyses from a video and want me to infer overall labels that describe the video.\n\nFirst, I'll look at each frame's data. Each frame has a label list, confidence score, and description. The goal is to aggregate these to find common themes across all frames. \n\nLooking at the frames:\n\n- Frames 0,1,2,4,8,13: All mention 'concert', 'stage', 'performers', and 'lights'. These are consistent throughout.\n- Frame 3 stands out with 'circus' elements, but that's just one frame in a series of concert-related frames. So it might be an outlier or maybe part of a special performance within the concert.\n- Some frames mention 'music', 'entertainment', which align with concerts and performances.\n\nThe confidence scores are mostly high except for some errors at frames 5 and 7, but those can be ignored as they're system errors.\n\nSo, the primary label should be something that captures the essence of all these frames. 'Concert' seems to fit because most frames describe a concert setting with performers on stage, lights, etc. \n\nFor video_labels, I'll include related terms like 'performance', 'stage', 'lights', and maybe 'entertainment' or 'music'. But looking at the confidence in each frame, 'concert' is mentioned more frequently.\n\nNotable segments: The video seems consistent across frames without any specific notable segments beyond the primary concert setting. So, I'll leave notable_segments as an empty list unless there's a part that stands out, but here it doesn't seem so.\n\nPutting it all together, the JSON should have 'concert' as the primary label with supporting labels and a rationale explaining why."
   }
}

So, all in all, there wasn’t any clear difference between these two, in my opinion….

Conclusions and improvements

Putting together an image/video classifier based on ollama is clearly viable. Note that in my experiment, I have done nothing with regards to tayloring prompts for better results.

If you are predominantly interested in a particular set of features instead of generalized labels you’ll get with for example deepseek, your possibilities are probably better with ollama.

I’d like to test fine tuning prompts a bit, perhaps setting together a standard set of labels I like.

One thing I have not explored is how to train the model to be better over time. I have a bunch of West Coast Swing instruction video summaries, and I’d like it to consistently recognize west coast swing videos – and a plus would be if it could also recognize dance patterns, that’d be something!

Face recognizing is something I haven’t mentioned either, but it’s definitely on the want-list.

The API calls are blocking, something that’s not good UX for tasks that takes a long time. I’d like to explore a proper queuing mechanism and submitting jobs which you could poll for status on.

Some of this will of course be the topic of future blog posts. Stay tuned!

Vegards Blog

AI@home: Classifying images with Ollama – part one.

Picking the right (vision) model.

Architecture and tooling

The program

The classify image endpoint

The classify video endpoint

Deployment to kubernetes

Complete script

Testing

Test 1: An image

Test 2: Another image

Test 3: Video, 1 fps and model llava-phi3:3.8n (lightweight)

Test 4: Same video with 15 frames instead of 51, and llava:13b

Conclusions and improvements

Legg igjen en kommentar Avbryt svar

AI@home: Classifying images with Ollama – part one.

Picking the right (vision) model.

Architecture and tooling

The program

The classify image endpoint

The classify video endpoint

Deployment to kubernetes

Complete script

Testing

Test 1: An image

Test 2: Another image

Test 3: Video, 1 fps and model llava-phi3:3.8n (lightweight)

Test 4: Same video with 15 frames instead of 51, and llava:13b

Conclusions and improvements

Del dette:

Legg igjen en kommentar Avbryt svar