Most journeys into AI start with tutorials. Mine didn’t.
I had a very real, persistent problem at home: my daughter has a habit of keeping her thumb in her mouth. We tried everything: bitter solutions, pediatric advice, constant reminders. Nothing worked consistently.
Except one thing.
Whenever I said, “remove your thumb”, she would immediately stop.
That raised a simple but powerful question:
Can I automate this intervention using AI?
At the same time, I wanted to deeply understand LLMs and multimodal systems, not just conceptually, but by building something real.
Before writing any code, I defined constraints:
Privacy first: no images of my child should leave my machine, and no cloud APIs.
Cost efficiency: no per-request billing; continuous monitoring shouldn't cost money.
Near real-time: the system should react within seconds.
These constraints naturally led to one direction:
Run everything locally
[Webcam]
↓
[OpenCV Capture]
↓
[Image Saved to Disk]
↓
[Local Vision LLM (LLaVA via Ollama)]
↓
[Structured JSON Output]
↓
[Decision Engine]
↓
[Audio Feedback (TTS)]
Before jumping into implementation, I spent some time evaluating options. The goal wasn’t just to “make it work,” but to choose tools aligned with my constraints: local, private, and practical.
I had two possible approaches:
Traditional CV (YOLO / MediaPipe / custom model)
Vision Language Models (VLMs)
I chose a VLM approach because:
I didn’t want to train a model from scratch initially
I wanted semantic understanding, not just detection
I needed flexibility to evolve prompts instead of retraining constantly
Instead of building:
“detect thumb near mouth using coordinates”
I could simply ask:
“Is a child putting their thumb in their mouth?”
That abstraction is incredibly powerful.
I evaluated a few multimodal models and landed on LLaVA.
Reasons:
Open-source and actively maintained
Works well for general image understanding
Compatible with local runtimes
Good balance between capability and resource usage
It’s not perfect for fine-grained detection, but it’s excellent for rapid prototyping on real-world problems.
To run models locally, I chose Ollama.
I explored alternatives like manual Hugging Face setups and llama.cpp, but Ollama stood out because:
Extremely simple setup (ollama pull, ollama run)
Built-in API server
No need to manage inference servers manually
Works seamlessly with multimodal models
For a local-first system, this dramatically reduces complexity.
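As a small illustration of that built-in API, listing the locally pulled models is a single HTTP call. This is a sketch assuming the Ollama daemon is running on its default port (11434):

import requests

# Ask the local Ollama server which models are available (GET /api/tags).
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])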
Keep it local → Keep it simple → Keep it controllable
Local → for privacy
Simple → for faster iteration
Controllable → for debugging and learning
The trade-offs I accepted:
LLaVA is not highly precise for subtle gestures
CPU inference introduces latency
Prompt-based detection is less deterministic than CV
But for solving a real-world problem quickly, this was the right balance.
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Pull LLaVA model:
ollama pull llava
Install the Python dependencies:
python3 -m pip install requests opencv-python pillow pyttsx3
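Before wiring up the camera, a text-only request is a quick way to confirm the model and API server respond. A minimal sketch, using the same endpoint and payload shape as the detection code further down:

import requests

# Smoke test: send a text-only prompt to the local LLaVA model via Ollama.
payload = {"model": "llava", "prompt": "Reply with the single word: ready", "stream": False}
res = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(res.json()["response"])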
import cv2
from datetime import datetime
import os

OUTPUT_DIR = "captures"
os.makedirs(OUTPUT_DIR, exist_ok=True)

def capture_image():
    # CAP_AVFOUNDATION is the macOS camera backend; drop it on other platforms.
    cap = cv2.VideoCapture(0, cv2.CAP_AVFOUNDATION)

    # Warm-up frames (important): the first reads are often dark or empty
    # while the camera adjusts exposure.
    for _ in range(10):
        cap.read()

    ret, frame = cap.read()
    cap.release()

    if not ret:
        return None, None

    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"{OUTPUT_DIR}/{timestamp}.jpg"
    cv2.imwrite(filename, frame)
    return filename, timestamp
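Running the capture step on its own is an easy way to confirm the webcam and warm-up behaviour before adding the model. A hypothetical quick test, not part of the final loop:

if __name__ == "__main__":
    path, ts = capture_image()
    print("capture failed" if path is None else f"saved {path} at {ts}")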
import requests
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask_llm(image_path):
    prompt = """
Analyze this image.
Return ONLY JSON:
{
  "alert": true or false
}
Set alert=true ONLY if:
- a child is visible AND
- the child has a finger or thumb inside their mouth
"""
    payload = {
        "model": "llava",
        "prompt": prompt,
        "images": [encode_image(image_path)],
        "stream": False
    }
    res = requests.post("http://localhost:11434/api/generate", json=payload)
    return res.json()["response"]
import json
import re

def parse_response(text):
    # The model often wraps its JSON in extra prose, so extract the first {...} block.
    try:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group())
    except json.JSONDecodeError:
        pass
    return {"alert": False}
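For example, LLaVA frequently wraps the JSON in conversational text; the regex extraction above copes with that:

raw = 'Sure! Here is the analysis:\n{ "alert": true }\nLet me know if you need more.'
print(parse_response(raw))                      # -> {'alert': True}
print(parse_response("I cannot see a child."))  # -> {'alert': False}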
import os

def speak(text):
    # macOS built-in text-to-speech; replaces pyttsx3, which proved unreliable here.
    os.system(f'say "{text}"')
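The say command is macOS-only. If you are on another platform, a hedged sketch of a fallback might look like this (pyttsx3 is already in the dependencies, though I found it less reliable):

import platform
import subprocess

def speak_portable(text):
    # On macOS, use the built-in `say` command (same behaviour as speak() above).
    if platform.system() == "Darwin":
        subprocess.run(["say", text])
    else:
        # Elsewhere, fall back to pyttsx3 (offline TTS); quality and reliability vary.
        import pyttsx3
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()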
import time

last_alert = False

while True:
    img, ts = capture_image()
    if not img:
        time.sleep(5)  # camera read failed; wait briefly before retrying
        continue

    response = ask_llm(img)
    data = parse_response(response)
    print(f"[{ts}] → {data}")

    # Speak only on a False → True transition so the reminder isn't repeated every cycle.
    if data.get("alert") and not last_alert:
        speak("Remove your thumb from your mouth")

    last_alert = data.get("alert", False)
    time.sleep(15)
Store both images and responses:
captures/
2026-04-27_20-13-36.jpg
2026-04-27_20-13-36.txt
This helps compare:
What the camera saw
What the model inferred
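A small helper keeps the pairing automatic. This is a sketch with assumed naming, mirroring the captures/<timestamp>.jpg and .txt layout above:

def save_response(timestamp, response_text):
    # Write the raw model output next to the image that produced it.
    with open(f"{OUTPUT_DIR}/{timestamp}.txt", "w") as f:
        f.write(response_text)

Calling save_response(ts, response) right after ask_llm in the main loop keeps every frame and its inference side by side.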
A few issues came up along the way:
Dark or blank first frames from the webcam, fixed with warm-up reads and the correct capture backend
Model responses wrapped in extra text instead of pure JSON, fixed using regex extraction
Unreliable speech output caused by pyttsx3 limitations, replaced with system TTS
Accuracy limits: LLaVA is not pixel-precise and works at roughly 60–75% accuracy without fine-tuning
This project reinforced something important:
No data leaves the system
No API costs
Fully controllable pipeline
This is especially critical when working with sensitive personal data (like children).
Start with a real problem
Constraints drive better architecture
Local LLMs are production-capable for edge use cases
Multimodal AI is powerful, but needs careful prompting
Debugging AI systems requires visibility into inputs + outputs
Potential improvements:
Fine-tuning LLaVA with custom dataset
Adding motion detection to reduce unnecessary inference (see the sketch after this list)
Multi-camera setup
Integration with smart speakers (Alexa, etc.)
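As a sketch of the motion-detection idea above (with an assumed, untuned threshold), a simple frame difference could gate inference so the LLM only runs when the scene actually changes:

import cv2
import numpy as np

def scene_changed(prev_frame, curr_frame, threshold=10.0):
    # Compare grayscale frames; skip LLM inference when the mean pixel difference is tiny.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, curr_gray)
    return float(np.mean(diff)) > threshold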
Most AI projects try to showcase what models can do.
This one taught me something different:
The real power of AI is not in the model,
it’s in how you apply it to your life, under real constraints.
Here’s a simplified version of the core loop and integration: [gist link]