Most journeys into AI start with tutorials. Mine didn’t.
I had a very real, persistent problem at home: my daughter has a habit of keeping her thumb in her mouth. We tried everything: bitter solutions, pediatric advice, constant reminders. Nothing worked consistently.
Except one thing.
Whenever I said, “remove your thumb”, she would immediately stop.
That raised a simple but powerful question:
Can I automate this intervention using AI?
At the same time, I wanted to deeply understand LLMs and multimodal systems, not just conceptually, but by building something real.
Before writing any code, I defined constraints:
Privacy first: no images of my child should leave my machine, and no cloud APIs.
Cost efficiency: no per-request billing; continuous monitoring shouldn't cost money.
Near real-time: the system should react within seconds.
These constraints naturally led to one direction:
Run everything locally
[Webcam]
↓
[OpenCV Capture]
↓
[Image Saved to Disk]
↓
[Local Vision LLM (LLaVA via Ollama)]
↓
[Structured JSON Output]
↓
[Decision Engine]
↓
[Audio Feedback (TTS)]
Before jumping into implementation, I spent some time evaluating options. The goal wasn’t just to “make it work,” but to choose tools aligned with my constraints: local, private, and practical.
I had two possible approaches:
Traditional CV (YOLO / MediaPipe / custom model)
Vision Language Models (VLMs)
I chose a VLM approach because:
I didn’t want to train a model from scratch initially
I wanted semantic understanding, not just detection
I needed flexibility to evolve prompts instead of retraining constantly
Instead of building:
“detect thumb near mouth using coordinates”
I could simply ask:
“Is a child putting their thumb in their mouth?”
That abstraction is incredibly powerful.
I evaluated a few multimodal models and landed on LLaVA.
Reasons:
Open-source and actively maintained
Works well for general image understanding
Compatible with local runtimes
Good balance between capability and resource usage
It’s not perfect for fine-grained detection, but it’s excellent for rapid prototyping on real-world problems.
To run models locally, I chose Ollama.
I explored alternatives like manual Hugging Face setups and llama.cpp, but Ollama stood out because:
Extremely simple setup (ollama pull, ollama run)
Built-in API server
No need to manage inference servers manually
Works seamlessly with multimodal models
For a local-first system, this dramatically reduces complexity.
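As a small illustration of that built-in API, listing the locally pulled models is a single HTTP call. This is a sketch assuming the Ollama daemon is running on its default port (11434):

import requests

# Ask the local Ollama server which models are available (GET /api/tags).
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])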
Keep it local → Keep it simple → Keep it controllable
Local → for privacy
Simple → for faster iteration
Controllable → for debugging and learning
The trade-offs I accepted:
LLaVA is not highly precise for subtle gestures
CPU inference introduces latency
Prompt-based detection is less deterministic than CV
But for solving a real-world problem quickly, this was the right balance.
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Pull LLaVA model:
ollama pull llava
Install the Python dependencies:
python3 -m pip install requests opencv-python pillow pyttsx3
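Before wiring up the camera, a text-only request is a quick way to confirm the model and API server respond. A minimal sketch, using the same endpoint and payload shape as the detection code further down:

import requests

# Smoke test: send a text-only prompt to the local LLaVA model via Ollama.
payload = {"model": "llava", "prompt": "Reply with the single word: ready", "stream": False}
res = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(res.json()["response"])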
import cv2
from datetime import datetime
import os

OUTPUT_DIR = "captures"
os.makedirs(OUTPUT_DIR, exist_ok=True)

def capture_image():
    # CAP_AVFOUNDATION is the macOS camera backend; drop it on other platforms.
    cap = cv2.VideoCapture(0, cv2.CAP_AVFOUNDATION)

    # Warm-up frames (important): the first reads are often dark or empty
    # while the camera adjusts exposure.
    for _ in range(10):
        cap.read()

    ret, frame = cap.read()
    cap.release()

    if not ret:
        return None, None

    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"{OUTPUT_DIR}/{timestamp}.jpg"
    cv2.imwrite(filename, frame)
    return filename, timestamp
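Running the capture step on its own is an easy way to confirm the webcam and warm-up behaviour before adding the model. A hypothetical quick test, not part of the final loop:

if __name__ == "__main__":
    path, ts = capture_image()
    print("capture failed" if path is None else f"saved {path} at {ts}")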
import requests
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask_llm(image_path):
    prompt = """
Analyze this image.
Return ONLY JSON:
{
  "alert": true or false
}
Set alert=true ONLY if:
- a child is visible AND
- the child has a finger or thumb inside their mouth
"""
    payload = {
        "model": "llava",
        "prompt": prompt,
        "images": [encode_image(image_path)],
        "stream": False
    }
    res = requests.post("http://localhost:11434/api/generate", json=payload)
    return res.json()["response"]
import json
import re

def parse_response(text):
    # The model often wraps its JSON in extra prose, so extract the first {...} block.
    try:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group())
    except json.JSONDecodeError:
        pass
    return {"alert": False}
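For example, LLaVA frequently wraps the JSON in conversational text; the regex extraction above copes with that:

raw = 'Sure! Here is the analysis:\n{ "alert": true }\nLet me know if you need more.'
print(parse_response(raw))                      # -> {'alert': True}
print(parse_response("I cannot see a child."))  # -> {'alert': False}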
import os

def speak(text):
    # macOS built-in text-to-speech; replaces pyttsx3, which proved unreliable here.
    os.system(f'say "{text}"')
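The say command is macOS-only. If you are on another platform, a hedged sketch of a fallback might look like this (pyttsx3 is already in the dependencies, though I found it less reliable):

import platform
import subprocess

def speak_portable(text):
    # On macOS, use the built-in `say` command (same behaviour as speak() above).
    if platform.system() == "Darwin":
        subprocess.run(["say", text])
    else:
        # Elsewhere, fall back to pyttsx3 (offline TTS); quality and reliability vary.
        import pyttsx3
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()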
import time

last_alert = False

while True:
    img, ts = capture_image()
    if not img:
        time.sleep(5)  # camera read failed; wait briefly before retrying
        continue

    response = ask_llm(img)
    data = parse_response(response)
    print(f"[{ts}] → {data}")

    # Speak only on a False → True transition so the reminder isn't repeated every cycle.
    if data.get("alert") and not last_alert:
        speak("Remove your thumb from your mouth")

    last_alert = data.get("alert", False)
    time.sleep(15)
Store both images and responses:
captures/
2026-04-27_20-13-36.jpg
2026-04-27_20-13-36.txt
This helps compare:
What the camera saw
What the model inferred
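A small helper keeps the pairing automatic. This is a sketch with assumed naming, mirroring the captures/<timestamp>.jpg and .txt layout above:

def save_response(timestamp, response_text):
    # Write the raw model output next to the image that produced it.
    with open(f"{OUTPUT_DIR}/{timestamp}.txt", "w") as f:
        f.write(response_text)

Calling save_response(ts, response) right after ask_llm in the main loop keeps every frame and its inference side by side.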
A few issues came up along the way:
Dark or blank first frames from the webcam, fixed with warm-up reads and the correct capture backend
Model responses wrapped in extra text instead of pure JSON, fixed using regex extraction
Unreliable speech output caused by pyttsx3 limitations, replaced with system TTS
Accuracy limits: LLaVA is not pixel-precise and works at roughly 60–75% accuracy without fine-tuning
This project reinforced something important:
No data leaves the system
No API costs
Fully controllable pipeline
This is especially critical when working with sensitive personal data (like children).
Start with a real problem
Constraints drive better architecture
Local LLMs are production-capable for edge use cases
Multimodal AI is powerful, but needs careful prompting
Debugging AI systems requires visibility into inputs + outputs
Potential improvements:
Fine-tuning LLaVA with custom dataset
Adding motion detection to reduce unnecessary inference (see the sketch after this list)
Multi-camera setup
Integration with smart speakers (Alexa, etc.)
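As a sketch of the motion-detection idea above (with an assumed, untuned threshold), a simple frame difference could gate inference so the LLM only runs when the scene actually changes:

import cv2
import numpy as np

def scene_changed(prev_frame, curr_frame, threshold=10.0):
    # Compare grayscale frames; skip LLM inference when the mean pixel difference is tiny.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, curr_gray)
    return float(np.mean(diff)) > threshold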
Most AI projects try to showcase what models can do.
This one taught me something different:
The real power of AI is not in the model,
it’s in how you apply it to your life, under real constraints.
Here’s a simplified version of the core loop and integration: [gist link]