Building an Offline “Life Memorizer” with Gemini 2.0 & Qdrant Edge

A privacy-first, multimodal memory system that indexes your senses and runs entirely on-device: no cloud retrieval, no server process, no network dependency.

Consider this. You are trying to find where you left an item, like your wallet or keys, inside your own house. You know you saw it earlier today, but you cannot remember the exact spot.

Hardware engineers have mostly solved the problem of capturing this data with wearable smart glasses and continuous POV cameras that log everything you see, hear, and read. But as developers, we are still stuck when it comes to managing that data.

The standard solution is to take all that sensory data and stream it straight to a cloud server for processing and storage. From an infrastructure standpoint, it looks easy. But for a personal product, it introduces a massive mess. A personal log contains incredibly sensitive information: the inside of your home, private conversations, financial documents, and exact locations. Sending all of this to a remote server introduces constant security risks, network latency, and a total system failure the moment your internet connection drops. A personal memory assistant should be useful anywhere, especially when you are offline.

The solution is to move the entire storage and search stack onto the device itself. I built a tool for myself called Life Memorizer. It is a local system that ingests multimodal sensory streams and searches through them without any cloud dependency at runtime. It combines Gemini Embedding 2 to process images, audio, and text into a single unified space, and Qdrant Edge to store and index everything directly within the application process.

The full implementation, including the media processing scripts and command-line interface, is open-source and available on GitHub.

High-Level System Architecture Flowchart | The 4-stage pipeline: User & Sensory Input → Embedding & Optimization Layer → Embedded Storage (Qdrant Edge) → Recall & Generation Pipeline

Why This Stack?
When you build a system that indexes sensory data in real time, hardware limits force you to be highly selective about your resource budget. If you are running your application on an edge device like a Raspberry Pi, an NVIDIA Jetson, or your local mobile device, you cannot afford to waste memory or CPU cycles on unnecessary infrastructure.

1. Gemini Embedding 2: One Model, One Coordinate System

Traditionally, searching across text, audio, and video meant loading three distinct models into memory: Whisper for audio transcription, CLIP for images, and a Sentence Transformer for text. Each of these models outputs vectors in a different dimensional space, trained on entirely different objectives. If you embed a text query like “keys on the table” with the Sentence Transformer, it will not align mathematically with a CLIP vector of a photo unless you build, train, and maintain a custom translation layer to map the coordinate systems together.

Gemini Embedding 2 eliminates this issue. It natively projects text, images, and audio into the same 3072-dimensional space, so different input modalities share a single, unified coordinate system. A text description and a JPEG image of the same scene land near each other based on meaning alone. For an edge application, this removes the need for translation layers, which cuts pipeline bugs and saves critical memory.

2. Qdrant Edge: Works Like a Library, Persists Like a Database

Most production vector databases follow a server-client architecture. You run the database as a separate server process, often inside a Docker container, and your application communicates with it over network ports via HTTP or gRPC. While this makes sense for a distributed cloud setup, it is a massive roadblock for a local developer tool or a wearable edge device. Making a local system depend on a running database service adds overhead, setup friction, and infrastructural complexity.
Qdrant Edge functions conceptually like SQLite, but is built specifically for vector data. You import it as a standard Python library, point it at a local directory on your filesystem, and all vector storage, indexing, and querying happen directly inside the memory space of your Python process.

    pip install qdrant-edge-py

Why I Picked Qdrant Edge Over Others?

When I started mapping out the architecture for Life Memorizer, I evaluated several local storage options. My first instinct was to look at lightweight relational or document-based embedded databases that offer basic vector extensions, such as SQLite (with extensions like sqlite-vec) or DuckDB. While those tools are incredible for structured data analytics, they quickly fall apart when you treat them as dedicated, heavy-duty vector stores on constrained hardware. They often lack native Hierarchical Navigable Small World (HNSW) graph indexing for edge Python environments. This means that as your historical memory log grows, your search latency scales linearly, turning a quick lookup into a slow, sequential table scan that drains your CPU.

I also considered running a full production-grade vector database locally, like a standalone instance of Chroma, Milvus, or Weaviate. But requiring an end-user to manage a running Docker daemon or keep a separate background database server alive just to index their smart glasses feed felt completely wrong. It would solve a data problem while creating three new infrastructure problems for a lightweight wearable project.

Qdrant Edge hit the sweet spot for this project for three specific reasons:

▣ Process-Level Integration: Unlike server-dependent options, it requires zero background daemons, zero open network ports, and zero container management. It lives directly inside my Python code and closes cleanly when the execution ends.

▣ Smart Memory Management: It utilizes memory-mapped (memmap) files. Instead of loading every single vector into active RAM, which would cause an instant Out-of-Memory (OOM) crash when running alongside an on-device language model, the host operating system handles the data paging automatically. It swaps vector segments to disk dynamically.

▣ Production-Grade Filtering: Unlike simple flat-file vector array scripts (like raw FAISS indexes), Qdrant Edge brings the same advanced payload filtering capabilities as its enterprise cloud server. I can restrict searches by metadata, like location_context == 'Home', directly during the HNSW graph traversal rather than filtering results after the search is complete, which keeps lookups fast on low-powered edge hardware.

System Data Flow Architecture | Raw sensory capture → Gemini embedding → EdgeShard write → payload index → search & retrieval loop

Environment Setup & Qdrant Edge Initialization

Tech Stack: Prerequisites

    qdrant-edge-py>=0.7.2        # embedded vector store
    google-genai>=0.3.0          # Gemini API client
    numpy>=1.24                  # vector math
    pydantic>=2.5
    pydantic-settings>=2.1
    typer>=0.12                  # CLI framework
    rich>=13.7                   # terminal output formatting

Optional packages for real media handling (not needed for the mock pipeline):

    opencv-python-headless>=4.9  # frame sampling from video
    imageio-ffmpeg>=0.4.9        # audio track extraction
    Pillow>=10.0                 # image I/O helper
    pytesseract                  # OCR from frames

Install from the project root:

    pip install -e ".[media]"    # includes optional media extras
    # or
    pip install -r requirements.txt

The mock pipeline in this tutorial is structurally identical to the production feed. Every module, every interface, every storage call is the same; the only difference is what feeds data into ingest.py.

Project Structure

The full implementation lives on GitHub.
Before walking through each module, here’s how the project is organized:

    life-memorizer/
    ├── .env.example                    # Template for environment variables
    ├── pyproject.toml                  # Build system, metadata, and dependencies
    ├── requirements.txt                # Pinned requirements file
    ├── ARCHITECTURE-DOCUMENTATION.md   # System architecture & codebase knowledge graph
    ├── samples/                        # Sample video files for quick testing
    │   ├── pov-urban-bike-ride-through-city-streets.mp4
    │   └── vibrant-city-street-with-shops-and-pedestrians.mp4
    ├── life_memorizer/                 # Core source package
    │   ├── cli.py                      # Command-line interface definitions
    │   ├── config.py                   # Configuration settings loader & validator
    │   ├── embeddings.py               # Multi-modal embedding (Gemini / Matryoshka)
    │   ├── ingest.py                   # Ingestion pipeline coordinating media processing
    │   ├── media.py                    # Media processing utils (OpenCV, ffmpeg, Tesseract)
    │   ├── mock_data.py                # Mock dataset for quick seeding and testing
    │   ├── models.py                   # Core Pydantic data schemas & enums
    │   ├── rag.py                      # Local Retrieval-Augmented Generation flows
    │   ├── recall.py                   # Recall engine for vector & hybrid queries
    │   └── store.py                    # Qdrant Edge vector store wrapper
    └── tests/
        ├── conftest.py                 # Shared pytest fixtures
        ├── test_embeddings.py          # Unit tests for embedding layers
        ├── test_rag.py                 # Unit tests for LocalRAG pipeline
        ├── test_step5.py               # Unit tests for quantization & TTL pruning
        └── test_store_and_recall.py    # Unit tests for storage and retrieval

The article walks through the five core modules that form the pipeline:

    [embeddings.py] → [models.py] → [store.py] → [recall.py] → [rag.py]

The ingest.py and media.py modules handle the real-hardware media processing layer (OpenCV frame sampling, ffmpeg audio extraction, Tesseract OCR), which is fully documented in ARCHITECTURE-DOCUMENTATION.md.
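The frame-sampling step is easy to picture without OpenCV. The sketch below is not the project's media.py; it only computes which frame indices a sampler would keep, assuming a fixed frame rate and a hypothetical helper name sample_frame_indices. The five-second default matches the capture cadence used later in the storage estimate.

```python
def sample_frame_indices(
    total_frames: int,
    fps: float,
    every_seconds: float = 5.0,
) -> list[int]:
    """Return the indices of the frames to keep, one per sampling interval.

    Hypothetical helper for illustration; a real sampler would read these
    frames from the video decoder and pass the image bytes to the embedder.
    """
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    # Number of source frames between two kept frames (at least 1).
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# One hour of 30 fps video, sampled every five seconds.
one_hour = sample_frame_indices(total_frames=30 * 3600, fps=30.0)
print(len(one_hour))
```

Sampling by index rather than by decoded timestamp keeps the sketch simple; variable-frame-rate footage would need timestamp-based selection instead.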
Full source: GitHub - satyam671/Life-Memorizer-With-Gemini-Embedding-2-And-Qdrant-Edge: A privacy-first, local digital twin for smart glasses that continuously indexes what a user sees, hears, and reads, allowing instant local semantic recall.

Part 1: Initializing Qdrant Edge

Setting Up the Local Shard

The core primitive in Qdrant Edge is the EdgeShard: a self-contained storage unit that manages its own vector index, payload data, and HNSW graph, all backed by files in a local directory you specify.

HNSW (Hierarchical Navigable Small World) is the approximate nearest-neighbor index that makes fast vector search possible. In Qdrant Edge, this index is built and queried entirely within your process, using the same underlying Rust implementation as the full Qdrant server.

The shard initialization follows a simple rule: if edge_config.json exists in the target directory, the shard already has data and you load it; otherwise you create it fresh with a new config.

    import qdrant_edge as qe
    from pathlib import Path

    db_path = Path("./life_memorizer_db")
    db_path.mkdir(parents=True, exist_ok=True)
    config_json = db_path / "edge_config.json"

    # Load existing or create a new local shard
    if config_json.exists():
        client = qe.EdgeShard.load(str(db_path))
    else:
        client = None  # proceed to create with config below

If the directory already has a shard, .load() reads the configuration and data from disk and returns a ready-to-query instance in seconds. If you call .create() on a directory that already contains data, it raises an error. That's intentional: the shard is a persistent storage unit, not a connection object. Treat it like one. Don't try to recreate it on every startup.

Schema Design via Named Vectors

Here’s the core schema decision. Instead of three separate databases for visual frames, audio transcripts, and OCR text, all three live as named vectors inside a single Qdrant point.
Each point represents a moment in time, a snapshot of what the device captured at a specific second. That moment might carry a visual embedding, an audio embedding, an OCR embedding, or all three, depending on what was detectable. Named vectors let you search any of those spaces independently during targeted queries, or fuse across all of them during hybrid search.

    vector_params = qe.EdgeVectorParams(
        size=768,  # MRL-truncated from 3072, more on this below
        distance=qe.Distance.Cosine,
        on_disk=True,  # offloads vector index to disk, saves RAM
    )

    vectors_config = {
        "video_frame": vector_params,
        "ambient_audio": vector_params,
        "ocr_log": vector_params,
    }

    config = qe.EdgeConfig(
        vectors=vectors_config,
        on_disk_payload=True,
    )

    client = qe.EdgeShard.create(str(db_path), config)

Enabling on_disk=True instructs Qdrant Edge to page the vector indices using the host filesystem rather than keeping them entirely in physical memory. This lowers the baseline RAM usage of the script. On a device that's also running an on-device LLM, that tradeoff is almost always correct.

Reducing Vector Dimensions via Matryoshka Truncation: Storing 3072 Dimensions in 768

Gemini Embedding 2 produces 3072-dimensional float32 vectors, which require 12 KB of storage per vector (3072 × 4 bytes). When you are logging multiple video frames, transcripts, and OCR detections every minute, the storage requirement scales up quickly.

To optimize local storage, the codebase uses Matryoshka Representation Learning (MRL). The embedding model is trained to structure semantic information so that the earliest coordinates contain the highest information density. You can drop the trailing dimensions and truncate the vector to 768 dimensions to achieve a 4x savings in disk space with minimal loss in retrieval accuracy.

When you truncate a vector, you change its length (norm), so it is no longer unit-length, which can skew Cosine similarity scores.
You must re-normalize the truncated vector back to unit length to ensure the vector database calculates similarity scores accurately.

    import math

    import numpy as np


    def _l2_normalize(vec: np.ndarray) -> np.ndarray:
        norm = float(np.linalg.norm(vec))
        # Guard against all-zero or NaN input: return it unchanged instead of dividing.
        if norm == 0.0 or math.isnan(norm):
            return vec
        return vec / norm


    def matryoshka_truncate(vector: np.ndarray, dim: int) -> np.ndarray:
        # Zero-pad vectors that are shorter than the target dimension.
        if vector.shape[0] < dim:
            vector = np.pad(vector, (0, dim - vector.shape[0]))
        truncated = vector[:dim].astype(np.float32)
        return _l2_normalize(truncated)

Two lines of actual logic wrapped in safety checks. The zero-norm guard matters: an all-zeros vector causes a division by zero that produces NaN values in your index, and NaN-contaminated vectors corrupt similarity scores for every query that touches that segment of the HNSW graph. Always guard it.

Part 2: Ingesting the Senses

Simulating the Smart Glasses Feed and Setting Up the Data Pipeline

We need data to feed the pipeline. In production this comes from a camera, a microphone, and an OCR module. The project’s mock_data.py provides a structured dataset that mirrors what a real wearable feed would produce: visual scenes, transcribed audio fragments, and OCR-captured text, each tagged with a location and a timestamp offset.

The MockMoment dataclass mirrors the structure of a real Moment object closely enough that all downstream pipeline code runs unchanged:

    from dataclasses import dataclass, field
    from typing import Optional


    @dataclass(frozen=True)
    class MockMoment:
        minutes_ago: int
        location: str
        scene: Optional[str] = None    # text description of visual frame
        speech: Optional[str] = None   # transcribed audio
        ocr: Optional[str] = None      # text captured from surfaces
        media_file_path: Optional[str] = None
        tags: tuple[str, ...] = field(default_factory=tuple)


    MOCK_MOMENTS = (
        MockMoment(
            minutes_ago=88,
            location="Home",
            scene="a set of brass house keys lying on the wooden hallway table next to a blue ceramic bowl",
            media_file_path="media_cache/home/hallway_table_keys.jpg",
        ),
        MockMoment(
            minutes_ago=64,
            location="Street",
            speech="Sarah says: can you buy oat milk and fresh basil on the way back?",
            media_file_path="media_cache/street/with_sarah.jpg",
        ),
        MockMoment(
            minutes_ago=58,
            location="Cafe",
            ocr="MAPLE & CO\nFlat White 4.20\nCappuccino 4.00\nOat milk +0.60",
            media_file_path="media_cache/cafe/menu_board.jpg",
        ),
    )

Each MockMoment represents a single sensory event with one dominant modality (visual scene, ambient speech, or OCR capture), tagged to a location and a point in time. The scene field holds a text description standing in for actual JPEG bytes. When the real pipeline runs, embed_image() receives raw image bytes from media.py. The downstream ingestion code is identical either way.

Computing the Vectors: Generating Aligned Embeddings

All three modalities go through the same Gemini model and the same _embed_content call. The model determines modality from the input type: a raw string is text, a byte part with image/jpeg is an image, a byte part with audio/wav is audio. What comes back is always a vector in the same 3072-dimensional space.
That’s the entire architecture argument in one wrapper class:

    # Assumes `import numpy as np`, `from pathlib import Path`,
    # and matryoshka_truncate from the earlier snippet.
    class GeminiEmbedder:
        def embed_text(self, text: str) -> list[float]:
            return self._embed_content(text)

        def embed_image(self, image_path: str | Path) -> list[float]:
            part = self._file_part(image_path, default_mime="image/jpeg")
            return self._embed_content(part)

        def embed_audio(self, audio_path: str | Path) -> list[float]:
            part = self._file_part(audio_path, default_mime="audio/wav")
            return self._embed_content(part)

        def _embed_content(self, content) -> list[float]:
            from google.genai import types

            config = types.EmbedContentConfig(output_dimensionality=3072)
            response = self._client.models.embed_content(
                model="gemini-embedding-2",
                contents=content,
                config=config,
            )
            values = response.embeddings[0].values
            vec = np.asarray(values, dtype=np.float32)
            # Truncate to self.dim and re-normalize before anything leaves the embedder.
            return matryoshka_truncate(vec, self.dim).tolist()

Text goes in as a string. Images and audio go in as typed byte parts with a MIME type. The MRL truncation happens at the exit of _embed_content, so every downstream consumer always receives a 768-dim, L2-normalized vector regardless of input modality. That consistency is deliberate: the storage layer doesn’t need to know which embedder produced a given vector, and the retrieval layer doesn’t need to handle different vector sizes per index.

When LIFE_MEMORIZER_FAKE_EMBEDDINGS=1 is set, the project substitutes a deterministic fake embedder that generates random-but-consistent 768-dim vectors from a hash of the input text. This lets you run and test the full pipeline (seeding, storing, searching) without an API key or internet connection.

Part 3: Upserting Multi-Vector Points to Qdrant: Storing the Points

Payload Design

Each point stored in the EdgeShard carries two things: its named vectors, and a metadata payload used for filtering. The payload holds temporal information, location context, and text extracts from the moment.
It deliberately excludes raw image bytes and audio clips; those stay in the media_cache/ directory on disk, referenced only by path if store_media_path is enabled.

    def payload(self, store_media_path: bool = True) -> dict[str, Any]:
        return {
            "timestamp": self.timestamp.isoformat(),
            "timestamp_epoch": int(self.timestamp.timestamp()),
            "location_context": self.location_context,
            "media_file_path": self.media_file_path if store_media_path else None,
            "source_clip": self.source_clip,
            "transcript": self.transcript,
            "ocr_text": self.ocr_text,
            "is_summary": self.is_summary,
            "summary_count": self.summary_count,
        }

The fields is_summary (boolean flag) and summary_count track data compression state. When local storage limits are reached, the system clusters older historical data and merges it into a single mean-pooled "summary" point. These fields tell the retrieval loop whether a query match is a specific historical point or a consolidated overview of a particular timeframe. A digest point with summary_count=12 represents a location-cluster of 12 distinct moments from that time window, and the retrieval layer can surface that context in the response rather than treating it as a single event.

The store_media_path flag gives you a simple privacy lever. Set it to False and the payload no longer carries file references. Note that in the payload() snippet above, transcript and ocr_text are still written, so the shard is only reduced to vectors, timestamps, and location labels if those text fields are dropped as well. We'll come back to where the actual privacy boundary sits, and why this flag alone doesn't give you the guarantee you might assume.

Qdrant Edge Vector Database Point Structure | The Complete Anatomy of a Database Point Carrying Metadata and Named Vectors.
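To see what the privacy lever does and does not remove, here is a minimal, self-contained sketch. MomentSketch is a stand-in for the project's Moment model, and the store_text flag is a hypothetical extension that is not shown in the article's payload(); it illustrates one way to drop the text extracts along with the file path.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class MomentSketch:
    """Illustrative stand-in for the project's Moment model (not the real class)."""

    timestamp: datetime
    location_context: str
    media_file_path: Optional[str] = None
    transcript: Optional[str] = None
    ocr_text: Optional[str] = None

    def payload(
        self,
        store_media_path: bool = True,
        store_text: bool = True,  # hypothetical flag, not part of the article's code
    ) -> dict[str, Any]:
        return {
            "timestamp": self.timestamp.isoformat(),
            "timestamp_epoch": int(self.timestamp.timestamp()),
            "location_context": self.location_context,
            "media_file_path": self.media_file_path if store_media_path else None,
            "transcript": self.transcript if store_text else None,
            "ocr_text": self.ocr_text if store_text else None,
        }


moment = MomentSketch(
    timestamp=datetime(2025, 1, 1, 9, 30, tzinfo=timezone.utc),
    location_context="Cafe",
    media_file_path="media_cache/cafe/menu_board.jpg",
    ocr_text="MAPLE & CO\nFlat White 4.20",
)

path_only_removed = moment.payload(store_media_path=False)
minimal = moment.payload(store_media_path=False, store_text=False)
```

With only store_media_path=False, the OCR text is still present in the payload; the minimal variant keeps just the timestamp and location fields. Dropping the text also removes what the RAG step later uses as context, so this is a tradeoff rather than a free win.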
Upserting to the Shard

    def upsert_moments(self, moments: Iterable[Moment]) -> int:
        store_path = self.settings.store_media_path
        points = []
        for moment in moments:
            # Skip moments that produced no embeddings at all.
            if not moment.vectors:
                continue
            points.append(
                qe.Point(
                    id=moment.id,
                    vector=dict(moment.vectors),  # video_frame, ambient_audio, ocr_log
                    payload=moment.payload(store_media_path=store_path),
                )
            )
        if not points:
            return 0
        # Write the whole batch in a single update operation.
        self.client.update(qe.UpdateOperation.upsert_points(points))
        return len(points)

Moments without any vectors are skipped before the batch is constructed. In a real wearable feed, this is common: frames with no detectable visual content, audio segments that are purely ambient noise below the transcription threshold, surfaces with no readable text. Filtering them out early keeps your point count clean and avoids empty-vector entries that would corrupt hybrid search scoring. Batching the upserts reduces disk write operations noticeably compared to upserting one point at a time.

Running the Project

Before looking at the query output, here’s how to actually run the project. There are two modes depending on whether you have a Gemini API key.

Mode 1: Mock / Offline Mode (No API Key Required)

This is the fastest way to get the full pipeline running. The fake embedder generates deterministic vectors from the mock dataset, so init → seed → query → ask all work without touching the network.

Step 1: Set environment variables

Windows PowerShell:

    $env:LIFE_MEMORIZER_FAKE_EMBEDDINGS="1"
    $env:LIFE_MEMORIZER_FAKE_RAG="1"

Bash (macOS / Linux):

    export LIFE_MEMORIZER_FAKE_EMBEDDINGS=1
    export LIFE_MEMORIZER_FAKE_RAG=1

Step 2: Initialize the local database

    life-memorizer init

This creates the ./life_memorizer_db/ directory and writes the initial edge_config.json with the named vector schema. Running init on an already-initialized database is safe: it detects the existing config and skips re-creation.
Step 3: Seed the mock data

    life-memorizer seed

This loads all MockMoment entries from mock_data.py, generates fake embeddings for each, and upserts them to the local shard. You'll see progress output in the terminal as each moment is processed and stored.

Step 4: Query and ask (covered in detail in Part 4)

    life-memorizer recall "where did I leave my keys?" --modality image
    life-memorizer ask "where did I leave my keys?"

Mode 2: Live Video Ingestion (Real Gemini API)

This mode uses actual Gemini embeddings and ingests real video content from your local files.

Step 1: Configure your .env file

    cp .env.example .env

Open .env and set your credentials:

    GEMINI_API_KEY=AIzaSy...YourActualKeyHere
    LIFE_MEMORIZER_FAKE_EMBEDDINGS=0
    LIFE_MEMORIZER_FAKE_RAG=0

Step 2: Disable offline environment flags

Windows PowerShell:

    $env:LIFE_MEMORIZER_FAKE_EMBEDDINGS="0"
    $env:LIFE_MEMORIZER_FAKE_RAG="0"

Bash (macOS / Linux):

    export LIFE_MEMORIZER_FAKE_EMBEDDINGS=0
    export LIFE_MEMORIZER_FAKE_RAG=0

Step 3: Initialize the database

    life-memorizer init

If you previously ran Mock Mode and want a clean slate, delete ./life_memorizer_db/ before re-initializing.

Step 4: Ingest a video file

    # Point to any video file on your machine (e.g. samples/walk.mp4)
    life-memorizer ingest --video samples/pov-urban-bike-ride-through-city-streets.mp4 --location Home

The ingestion pipeline samples the video into frames, extracts audio chunks, runs OCR where applicable, generates real Gemini embeddings for each, and writes everything to the local shard. Depending on video length and your machine, this can take a few minutes. The --location tag gets stored in each point's payload and used as a metadata filter during retrieval.

Part 4: Querying Your Past: Multi Modal Retrieval

With the database seeded, the recall layer routes your natural language queries to the right vector space. Each scenario below shows the CLI command, the underlying recall code, and a placeholder for the actual terminal output.
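The routing idea can be sketched in a few lines. This is an illustration rather than the project's recall.py: the Modality values and the mapping from target modality to named vector are assumptions (the vector names come from the schema in Part 1, but pairing text with ocr_log is a guess), and route is a hypothetical helper.

```python
from enum import Enum


class Modality(str, Enum):
    """Assumed modality values, mirroring the --modality CLI option."""

    text = "text"
    image = "image"
    audio = "audio"


# Assumed mapping from target modality to the named vectors in the shard schema.
VECTOR_FOR_TARGET = {
    Modality.image: "video_frame",
    Modality.audio: "ambient_audio",
    Modality.text: "ocr_log",
}


def route(target: str) -> str:
    """Resolve a --modality value to the named vector index to search."""
    try:
        return VECTOR_FOR_TARGET[Modality(target)]
    except ValueError as exc:
        raise ValueError(f"unknown modality: {target!r}") from exc


print(route("image"))
print(route("audio"))
```

The query itself is always embedded as text; only the index it is searched against changes, which is what the unified embedding space makes possible.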
Multi-Modal Retrieval Pipeline | A User Query being Embedded → Routed to Visual/Audio/OCR Named Vector Indices → Scores Fused with Weights → Top-K Results Returned with Payload

Scenario A: Visual Search

A natural language query gets embedded as text and searched against the video_frame index. The cross-modal match works because Gemini maps the text description and the visual content into the same coordinate space during training, so you don't need a visual query to search visual memories.

    # Mock Offline Mode
    life-memorizer recall "where did I leave my keys?" --modality image

The underlying recall call in recall.py:

    def visual_search(self, query: str, **kwargs) -> list[RecallHit]:
        kwargs.setdefault("target", Modality.image)
        return self.recall(query, modality=Modality.text, **kwargs)

One function call. The modality routing, the embedding call, and the HNSW search all happen inside self.recall(). The target=Modality.image argument tells the recall engine which named vector index to search against.

Live Video Ingestion (Real Gemini API)

Provide a video file on your system to analyze. The video will be sampled into image frames, audio chunks will be extracted, OCR will be run, and everything will be embedded:

    # Point to any video file on your machine (e.g. samples/walk.mp4)
    life-memorizer ingest --video samples/pov-urban-bike-ride-through-city-streets.mp4 --location Home

Ask questions related to your ingested video:

    life-memorizer recall "where did i see the red car today while i was cycling?" --modality image

Scenario B: Audio Recall

Same routing logic, different named index. The query targets the ambient_audio vector space, which holds embeddings of transcribed speech from the sensory feed.

    life-memorizer recall "what did Sarah say to buy?" --modality audio

The underlying recall call in recall.py:

    def audio_recall(self, query: str, **kwargs) -> list[RecallHit]:
        kwargs.setdefault("target", Modality.audio)
        return self.recall(query, modality=Modality.text, **kwargs)

In the mock dataset, audio embeddings are generated from the speech field text. In production, you'd pass raw .wav bytes to embed_audio() directly. The Gemini model handles audio transcription and embedding in a single API call, and the retrieval path is identical to what you see here.

Scenario C: Hybrid Search with Location Filtering

For broader queries that benefit from evidence across multiple modalities, the hybrid search fuses results from all three named indices, weighted by relevance and optionally filtered to a specific location from the payload.

    life-memorizer recall "the cafe menu" --location Cafe --hybrid

    def hybrid_search(
        self,
        query_vector: list[float],
        weights: dict[str, float],
        limit: int = 5,
        location_context: Optional[str] = None,
    ) -> list[RecallHit]:
        fused = {}
        for vector_name, weight in weights.items():
            if weight <= 0:
                continue
            # Over-fetch per index so the fused pool has enough candidates.
            hits = self.search(
                vector_name=vector_name,
                query_vector=query_vector,
                limit=limit * 3,
                location_context=location_context,
            )
            for hit in hits:
                weighted = hit.score * weight
                existing = fused.get(hit.moment.id)
                if existing is None:
                    fused[hit.moment.id] = RecallHit(
                        moment=hit.moment,
                        score=weighted,
                        matched_vector=hit.matched_vector,
                    )
                else:
                    existing.score += weighted
                    # Relabel the match when this vector's weight exceeds that of
                    # the vector recorded so far (for positive scores).
                    if weighted > hit.score * weights.get(existing.matched_vector, 1.0):
                        existing.matched_vector = hit.matched_vector
        ranked = sorted(fused.values(), key=lambda h: h.score, reverse=True)
        return ranked[:limit]

Each index returns limit * 3 candidates. Scores accumulate per point ID across the three index searches, then the merged pool gets re-sorted by total weighted score. The location filter runs as a payload filter inside each individual .search() call, not post-retrieval on the fused pool.
You're not fetching three hundred candidates and then discarding most of them. You're constraining the HNSW search before it runs.

The weights are tunable via config.py. For most Life Memorizer queries, weighting video_frame and ocr_log higher than ambient_audio gives better precision because visual and text matches are more semantically specific. For voice-first queries such as "what did people say on the street today?", shift weight to ambient_audio.

Asking Grounded Questions via Local RAG

Beyond ranked recall results, you can ask natural language questions and get a grounded answer from the on-device language model.

Mock Mode (offline stub, no model required):

    life-memorizer ask "where did I leave my keys?"
    # or
    life-memorizer ask "what did Sarah ask me to buy?"

Live Mode with Ollama (Gemma-2b, fully local):

    # Pull the model first
    ollama pull gemma2:2b
    life-memorizer ask "where did I spot a white truck while cycling?"

Live Mode with Gemini API backend:

    # Bash
    export LIFE_MEMORIZER_RAG_BACKEND=gemini
    # PowerShell
    $env:LIFE_MEMORIZER_RAG_BACKEND="gemini"

    life-memorizer ask "when did a couple cross me while I was walking on the city streets?"

Part 5: Production Edge Optimization & Privacy Considerations

Quantization: Staying Inside Your RAM Budget

Even at 768 dimensions, float32 vectors consume memory at scale. A device logging one visual frame every five seconds accumulates 720 vectors per hour on the visual channel alone. At 768 floats × 4 bytes each (3,072 bytes per vector), that’s roughly 2.2 MB per hour just for video_frame. That is manageable until you're also holding ambient_audio and ocr_log, plus the HNSW graph structure, plus whatever LLM you're running concurrently.

Qdrant Edge supports two quantization modes, configurable at shard creation time:

▣ Scalar (Int8): Each float32 component (4 bytes) is quantized to an int8 (1 byte). That’s a 4x reduction in vector storage, so the 2.2 MB per hour becomes roughly 550 KB per hour.
Search accuracy is well-retained because the quantization error is small and the Cosine distance ranking is robust to small value noise. This is the right default for most edge applications.

▣ Binary: Each float component becomes a single bit. Up to 32x compression. Bit comparisons are fast, which improves search speed. The tradeoff is a measurable drop in recall accuracy for semantically nuanced queries. The standard mitigation is oversampling: fetching a larger candidate pool (limit * k) and then re-ranking with the original float vectors. More pipeline steps, but it keeps you inside tight RAM budgets.

    def _quantization_config(self):
        q = self.settings.quantization
        if q is Quantization.scalar:
            return qe.ScalarQuantizationConfig(
                type=qe.ScalarType.Int8,
                always_ram=True,
            )
        if q is Quantization.binary:
            return qe.BinaryQuantizationConfig(always_ram=True)
        return None  # no quantization configured

always_ram=True keeps the quantized vectors in physical RAM for fast retrieval while the full-precision vectors are paged to disk. Start with scalar. Move to binary only if scalar quantization still leaves you above your available RAM budget.

Storage Compression (Quantization: Scalar & Binary) | Diagram comparing float32 baseline vs int8 scalar (4x, high recall retention) vs binary (32x, recall tradeoff requiring oversampling + re-rank)

Memory Consolidation: Managing Local Storage Limits

An edge device logging continuously will fill its storage. The naive response is periodic deletion of old records. The problem with simple deletion is that you lose historical context permanently: a moment from three weeks ago might be exactly what a query needs today.

The better approach is mean-pool consolidation. Group expired moments by location context, compute the centroid of their vectors across all named spaces, extractively merge their text logs, and write a single “digest” point before deleting the originals. You compress many points into one while preserving semantic searchability.
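The grouping step that comes before pooling can be sketched with plain dictionaries. This is a hedged illustration, not the project's pruning code: group_expired_by_location is a hypothetical helper, the payloads only carry the fields shown in Part 3, and skipping existing digests is an assumption about the desired behavior.

```python
from collections import defaultdict
from typing import Any


def group_expired_by_location(
    payloads: list[dict[str, Any]],
    now_epoch: int,
    ttl_seconds: int,
) -> dict[str, list[dict[str, Any]]]:
    """Bucket payloads older than the TTL by their location_context."""
    cutoff = now_epoch - ttl_seconds
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for payload in payloads:
        if payload.get("is_summary"):
            continue  # assumption: existing digests are not consolidated again
        if payload["timestamp_epoch"] < cutoff:
            groups[payload["location_context"]].append(payload)
    return dict(groups)


DAY = 24 * 60 * 60
now = 1_700_000_000
payloads = [
    {"timestamp_epoch": now - 30 * DAY, "location_context": "Home"},
    {"timestamp_epoch": now - 25 * DAY, "location_context": "Home"},
    {"timestamp_epoch": now - 22 * DAY, "location_context": "Cafe"},
    {"timestamp_epoch": now - 2 * DAY, "location_context": "Home"},
    {"timestamp_epoch": now - 40 * DAY, "location_context": "Home", "is_summary": True},
]

expired = group_expired_by_location(payloads, now_epoch=now, ttl_seconds=21 * DAY)
print({location: len(items) for location, items in expired.items()})
```

Each resulting group would then be mean-pooled into one digest point with summary_count set to the group size; the project's pooling step follows.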
    @staticmethod
    def _mean_pool_vectors(records: list[qe.Record]) -> dict[str, list[float]]:
        sums = {}
        counts = {}
        for rec in records:
            vectors = rec.vector or {}
            if not isinstance(vectors, dict):
                continue
            # Accumulate a running sum and count per named vector.
            for name, vec in vectors.items():
                arr = np.asarray(vec, dtype=np.float32)
                sums[name] = sums.get(name, np.zeros_like(arr)) + arr
                counts[name] = counts.get(name, 0) + 1
        pooled = {}
        for name, total in sums.items():
            mean = total / max(counts[name], 1)
            # Re-normalize the centroid so Cosine scoring stays consistent.
            norm = float(np.linalg.norm(mean))
            if norm > 0:
                mean = mean / norm
            pooled[name] = mean.astype(np.float32).tolist()
        return pooled

The resulting digest point gets written with is_summary=True and summary_count=N in its payload, so retrieval code can distinguish it from raw moment points and format the response accordingly.

Mean pooling is lossy; that’s the honest version of this. The centroid of twelve distinct visual memories is not a meaningful visual memory itself. What survives consolidation is the approximate semantic location of that memory cluster. Broad topical queries (“did anything happen at home last week?”) still resolve correctly. Precise factual queries (“show me the exact frame where X was visible”) do not.

Design your TTL (time to live) windows with that boundary in mind. If a category of memories needs long-term exact retrieval, archive raw vectors to cold storage or a Qdrant server instance before consolidating.

Memory Consolidation (Summarization) | Diagram showing N expired points grouped by location → mean-pooled centroid vector + extractive text merge → single digest point replaces originals, with storage reduction labeled

Privacy: Where the Real Boundary Is

Storage and retrieval happen entirely on-device. That part is genuinely offline. But the “100% local” framing has a gap worth naming directly. Gemini Embedding 2 is a cloud API.
Every time you generate an embedding during ingestion, the source content (the image bytes, the audio clip, the scene description) goes to Google’s servers to produce the vector. The vector comes back and lives locally, but the raw sensory data made a round trip. This applies to query-time embedding too: when you run life-memorizer recall, your query text is sent to Gemini to generate its embedding before the local HNSW search runs.

Setting store_media_path=False removes file references from the local database, but it doesn't change what happens during the embedding call. The privacy benefit of that flag is about what's stored locally after the fact, not about what left the device during ingestion.

Two real mitigations exist.

▣ First, embed during connected windows and cache the results. The stored vectors are then searchable offline; only a query whose embedding is not already cached still needs the API call.

▣ Second, if you need end-to-end air-gapped operation, replace Gemini Embedding 2 with a local model. Qdrant’s FastEmbed library runs on-device with no API calls. The unified multimodal quality won’t match a cloud model, but your data never leaves the device at any stage.

That’s the actual tradeoff. Pick based on your threat model, not the marketing pitch.

Part 6: Closing the Loop with Local RAG

Retrieval surfaces relevant memory points with scores and payload metadata. Getting from those points to a conversational answer requires a language model. The project supports two RAG backends, configured via LIFE_MEMORIZER_RAG_BACKEND in your .env or environment.

▣ Ollama backend (default): Runs Gemma 2 2B locally via the Ollama HTTP API. Zero cloud calls after setup. Requires ollama pull gemma2:2b before first use. Latency depends on your hardware, but on a modern laptop you're typically looking at 2–5 seconds for a short answer.

▣ Gemini API backend: Uses the Gemini API for generation. Faster and higher quality than Gemma 2 2B, but adds a network round trip and API spend.
Good for development and testing; think carefully before using it in a privacy-sensitive deployment.

Both backends receive the same structured prompt:

```python
# Excerpt: needs `import json` and `import urllib.request`;
# RecallHit and build_context_block come from the project.
def build_prompt(question: str, hits: list[RecallHit]) -> str:
    # Fall back to an explicit marker so the model knows nothing was retrieved.
    context = build_context_block(hits) if hits else "(no relevant memories found)"
    return (
        f"Recalled memories:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer using only the memories above."
    )

# Inside OllamaGenerator.generate:
payload = {
    "model": self.model,
    "system": "Answer using ONLY the recalled memories. Never invent details.",
    "prompt": prompt,
    "stream": False,  # ask for one complete JSON response, not a token stream
}
data = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    f"{self.host}/api/generate",
    data=data,
    headers={"Content-Type": "application/json"},
)
```

The system prompt is load-bearing. “Never invent details” restricts the model to what’s in the context block. Without it, smaller models like Gemma 2 2B fill gaps with plausible-sounding fabrications. The instruction doesn’t guarantee hallucination-free output (nothing does at this scale), but it meaningfully reduces the frequency, especially for factual queries where the correct answer is already in the retrieved context.

The build_context_block() function formats each RecallHit into a structured text block: timestamp, location, matched modality, and the relevant text extract (transcript or OCR). The model sees this as a list of dated, located memory fragments and grounds its answer against them.

This RAG step transforms the pipeline from a vector search tool into a usable assistant. Without it, you get ranked memory points with scores and metadata. With it, you get a natural-language answer grounded entirely in local evidence. With the Ollama backend, generation makes no external API call, and no retrieved context leaves the device.

Architectural Takeaways

▣ Unified multimodal embeddings eliminate alignment engineering.
If you’re building any system that searches across mixed input types, a model that projects all of them into the same coordinate space removes an entire class of architectural problems. The alternative (separate models, separate vector spaces, explicit translation layers) works but creates fragility at every join.

▣ In-process vector storage is the right model for edge hardware. Client-server databases carry implicit assumptions about network reliability and persistent background processes that don’t hold on constrained or intermittently connected devices. Qdrant Edge removes those assumptions at the library level.

▣ Decouple ingestion from storage writes. The pipeline processes media, generates embeddings, and writes to the shard as separate steps. Keeping them decoupled lets you batch writes, add retry logic on the embedding step, and run optimize() explicitly after large ingest batches rather than during hot query paths.

▣ Start with scalar quantization. It gives 4x compression with minimal recall loss. Binary is for when scalar still leaves you over budget, not a default.

▣ Mean pooling is principled memory decay, not deletion. Semantic neighborhoods survive consolidation. Exact moment recall does not. Design TTL windows around which of those properties your application actually needs.

▣ Know your privacy boundary. If the embedding model is cloud-hosted, source data leaves the device during ingestion and at query time. That’s a design constraint, not a deal-breaker. Build around it consciously.

Conclusion: Experiment and Build

Building this project showed me that we no longer need to rely entirely on heavy cloud setups to build smart, contextual software. You can design high-performance, intelligent local pipelines that respect data privacy and run entirely within an application's process space. If you want to see how this performs on your own machine, clone the repository, load up a local sample video, and run through the workflows.
Experiment with the vector weights, try out different quantization modes, or swap in a purely local embedding setup to see how far you can push edge vector search.

The complete codebase is open source and available now. Clone it, run it locally, and experiment with it: https://github.com/satyam671/Life-Memorizer-With-Gemini-Embedding-2-And-Qdrant-Edge

Feel free to explore the modules, open an issue if you encounter bugs, or modify the architecture for your own tools.

GitHub - satyam671/Life-Memorizer-With-Gemini-Embedding-2-And-Qdrant-Edge: A privacy-first, local digital twin for smart glasses that continuously indexes what a user sees, hears, and reads, allowing instant local semantic recall.

References

Qdrant Edge Documentation: https://qdrant.tech/documentation/edge/
Qdrant Edge Quickstart: https://qdrant.tech/documentation/edge/edge-quickstart/
Gemini Embedding 2: https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/gemini/embedding-2

Building an Offline “Life Memorizer” with Gemini 2.0 & Qdrant Edge was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source

https://pub.towardsai.net/building-an-offline-life-memorizer-with-gemini-2-0-qdrant-edge-695ce69d3360?source=rss----98111c9905da---4