Building a Privacy-First Local AI System to Solve a Real Problem at Home

Introduction

Most journeys into AI start with tutorials. Mine didn’t.

I had a very real, persistent problem at home: my daughter has a habit of keeping her thumb in her mouth. We tried everything: bitter solutions, pediatric advice, constant reminders. Nothing worked consistently.

Except one thing.

Whenever I said, “remove your thumb”, she would immediately stop.

That raised a simple but powerful question:

Can I automate this intervention using AI?

At the same time, I wanted to deeply understand LLMs and multimodal systems, not just conceptually, but by building something real.


Constraints That Shaped the System

Before writing any code, I defined constraints:

  1. Privacy First

    • No images of my child should leave my machine

    • No cloud APIs

  2. Cost Efficiency

    • No per-request billing

    • Continuous monitoring shouldn’t cost money

  3. Near Real-Time

    • System should react within seconds

These constraints naturally led to one direction:

Run everything locally


High-Level Architecture

[Webcam]
   ↓
[OpenCV Capture]
   ↓
[Image Saved to Disk]
   ↓
[Local Vision LLM (LLaVA via Ollama)]
   ↓
[Structured JSON Output]
   ↓
[Decision Engine]
   ↓
[Audio Feedback (TTS)]

Why I Chose This Stack (After Some Research)

Before jumping into implementation, I spent some time evaluating options. The goal wasn’t just to “make it work,” but to choose tools aligned with my constraints: local, private, and practical.


Why a Vision LLM (and not just Computer Vision)?

I had two possible approaches:

  1. Traditional CV (YOLO / MediaPipe / custom model)

  2. Vision Language Models (VLMs)

I chose a VLM approach because:

  • I didn’t want to train a model from scratch initially

  • I wanted semantic understanding, not just detection

  • I needed flexibility to evolve prompts instead of retraining constantly

Instead of building:

“detect thumb near mouth using coordinates”

I could simply ask:

“Is a child putting their thumb in their mouth?”

That abstraction is incredibly powerful.


Why I Picked LLaVA

I evaluated a few multimodal models and landed on LLaVA.

Reasons:

  • Open-source and actively maintained

  • Works well for general image understanding

  • Compatible with local runtimes

  • Good balance between capability and resource usage

It’s not perfect for fine-grained detection, but it’s excellent for rapid prototyping on real-world problems.


Why I Used Ollama

To run models locally, I chose Ollama.

I explored alternatives like manual Hugging Face setups and llama.cpp, but Ollama stood out because:

  • Extremely simple setup (ollama pull, ollama run)

  • Built-in API server

  • No need to manage inference servers manually

  • Works seamlessly with multimodal models

For a local-first system, this dramatically reduces complexity.
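
As a quick sanity check, Ollama's built-in API can be hit directly from Python before wiring up the full pipeline. A minimal sketch, assuming the default port (11434) and that llava has already been pulled:

import requests

# List the models available locally (Ollama's API server listens on port 11434 by default)
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])

# Send a trivial text-only prompt to confirm the model loads and responds
res = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llava", "prompt": "Reply with one word: ready", "stream": False},
    timeout=120,
)
print(res.json()["response"])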


Design Philosophy

Keep it local → Keep it simple → Keep it controllable

  • Local → for privacy

  • Simple → for faster iteration

  • Controllable → for debugging and learning


Tradeoffs I Accepted

  • LLaVA is not highly precise for subtle gestures

  • CPU inference introduces latency

  • Prompt-based detection is less deterministic than CV

But for solving a real-world problem quickly, this was the right balance.


Step 1: Setting Up Local LLM Runtime

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull LLaVA model:

ollama pull llava

Step 2: Python Environment Setup

python3 -m pip install requests opencv-python pillow pyttsx3

Step 3: Capturing Images from Webcam

import cv2
from datetime import datetime
import os

OUTPUT_DIR = "captures"
os.makedirs(OUTPUT_DIR, exist_ok=True)

def capture_image():
    # CAP_AVFOUNDATION is the macOS camera backend; on Linux, cv2.CAP_V4L2 is the equivalent
    cap = cv2.VideoCapture(0, cv2.CAP_AVFOUNDATION)

    # Warm-up frames (important): the first few reads often return black frames
    for _ in range(10):
        cap.read()

    ret, frame = cap.read()
    cap.release()

    if not ret:
        return None, None

    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"{OUTPUT_DIR}/{timestamp}.jpg"

    cv2.imwrite(filename, frame)
    return filename, timestamp

Step 4: Sending Image to LLM

import requests
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask_llm(image_path):
    prompt = """
Analyze this image.

Return ONLY JSON:
{
  "alert": true or false
}

Set alert=true ONLY if:
- a child is visible AND
- the child has a finger or thumb inside their mouth
"""

    payload = {
        "model": "llava",
        "prompt": prompt,
        "images": [encode_image(image_path)],
        "stream": False
    }

    # CPU inference can take a while, so give the request a generous timeout
    res = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
    return res.json()["response"]

Step 5: Handling LLM Output

import json
import re

def parse_response(text):
    try:
        # The model sometimes wraps the JSON in extra prose, so extract the first {...} block
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group())
    except json.JSONDecodeError:
        pass

    # Fail safe: if parsing fails, never raise an alert
    return {"alert": False}

Step 6: Audio Feedback

import os

def speak(text):
    # macOS built-in text-to-speech; replaced pyttsx3, which went silent after the first call
    os.system(f'say "{text}"')

Step 7: Main Loop

import time

last_alert = False

while True:
    img, ts = capture_image()
    if not img:
        time.sleep(15)
        continue

    response = ask_llm(img)
    data = parse_response(response)

    print(f"[{ts}] → {data}")

    # Speak only on a transition, so the reminder isn't repeated every cycle
    if data.get("alert") and not last_alert:
        speak("Remove your thumb from your mouth")

    last_alert = data.get("alert")
    time.sleep(15)

Step 8: Debugging Strategy

Store both images and responses:

captures/
  2026-04-27_20-13-36.jpg
  2026-04-27_20-13-36.txt

This helps compare:

  • What the camera saw

  • What the model inferred
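
A small helper keeps the two in sync. This is just a sketch of how each capture can be paired with the raw model output, reusing the timestamped filenames from the capture step:

def save_response(image_path, response_text):
    # Write the raw model output next to the captured image,
    # e.g. captures/2026-04-27_20-13-36.jpg -> captures/2026-04-27_20-13-36.txt
    txt_path = image_path.rsplit(".", 1)[0] + ".txt"
    with open(txt_path, "w") as f:
        f.write(response_text)

Calling save_response(img, response) right after ask_llm in the main loop keeps both artifacts together.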


Challenges Faced

1. Black Frames from Camera

  • Fixed with warm-up reads + correct backend

2. JSON Parsing Failures

  • Fixed using regex extraction

3. Audio Playing Only Once

  • Caused by pyttsx3 limitations

  • Replaced with system TTS

4. Model Reliability

  • LLaVA is not pixel-precise

  • Achieves roughly 60–75% accuracy without fine-tuning
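
One way to arrive at a number like that is to hand-label a batch of saved captures and compare them against the stored responses. A rough sketch, assuming a hand-made labels.csv (that file and its format are my own convention, not part of the pipeline):

import csv

# labels.csv: one row per capture, e.g. "2026-04-27_20-13-36,true"
correct = total = 0
with open("labels.csv") as f:
    for timestamp, truth in csv.reader(f):
        with open(f"captures/{timestamp}.txt") as r:
            predicted = parse_response(r.read()).get("alert", False)
        correct += predicted == (truth.strip().lower() == "true")
        total += 1

print(f"Accuracy: {correct / total:.0%} over {total} labeled captures")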


Why Local AI Matters

This project reinforced something important:

  • No data leaves the system

  • No API costs

  • Fully controllable pipeline

This is especially critical when working with sensitive personal data, such as images of your own children.


Key Takeaways

  1. Start with a real problem

  2. Constraints drive better architecture

  3. Local LLMs are production-capable for edge use cases

  4. Multimodal AI is powerful, but needs careful prompting

  5. Debugging AI systems requires visibility into inputs + outputs


What’s Next

Potential improvements:

  • Fine-tuning LLaVA with a custom dataset

  • Adding motion detection to reduce unnecessary inference (sketched below)

  • Multi-camera setup

  • Integration with smart speakers (Alexa, etc.)
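
Motion detection would gate the expensive LLM call: only run inference when consecutive frames differ enough. A minimal sketch using OpenCV frame differencing (the thresholds are placeholders, not tuned values):

import cv2

def has_motion(prev_frame, frame, threshold=25, min_changed_pixels=5000):
    # Grayscale + blur both frames to suppress sensor noise
    prev_gray = cv2.GaussianBlur(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)

    # Pixels whose intensity changed by more than `threshold` count as motion
    diff = cv2.absdiff(prev_gray, gray)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)

    return cv2.countNonZero(mask) > min_changed_pixels

In the main loop, the frame would go to the LLM only when has_motion returns True, saving most of the CPU time during idle periods.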


Closing Thought

Most AI projects try to showcase what models can do.

This one taught me something different:

The real power of AI is not in the model,
it’s in how you apply it to your life, under real constraints.

 

Here’s a simplified version of the core loop and integration: [gist link]

Anil Yanduri

An Indian, Techie, Photography enthusiast, new Rider, new blogger. A Father.