How I Built a Real-Time In-Car SOS Detection System With Qdrant Edge, SigNoz, and YAMNet

The Button Nobody Presses

Think about the last time you drove through a sketchy stretch of highway at night. Or the time you had a medical episode in the car and had to focus on not crashing the vehicle.

The classic response to an emergency is a button: a physical SOS button, a voice-activated assistant, or a crash detection feature buried three menu levels deep in an app. All of them share the same flaw: they require you to do something. They require you to be conscious, coherent, and aware enough to reach for help. That is a remarkably bad assumption in an emergency.

What I wanted to build was something different: a system that listens passively, all the time, entirely on the device, and sends an alert automatically the moment it detects that something is wrong. No buttons or commands. No cloud uploads of your raw data.

One major challenge was keeping the entire detection pipeline within the system’s 500 ms processing window. Every 500 ms, the device captures fresh audio, generates a YAMNet embedding, and runs a Qdrant Edge similarity search. If processing takes longer than that, the pipeline starts buffering and eventually breaks down. Using OpenTelemetry with SigNoz, I traced each stage of the workflow and found that YAMNet embedding generation took ~22 ms while the Qdrant Edge search stayed under 1 ms. This gave real visibility into performance bottlenecks without ever sending raw audio off-device.

The full implementation is on GitHub. You can follow along with the code as you read through this.

What We Used and Why: Qdrant Edge

The standard approach to vector search involves talking to a server-based database (like Pinecone or the standard Qdrant) over HTTP. For cloud apps, this is fine. For an in-car edge device, it is completely untenable.
You cannot depend on cellular networks to save a life, you cannot run heavy Docker containers on embedded hardware, and you absolutely cannot stream 24/7 raw in-car audio to a cloud server (a massive privacy liability).

Qdrant Edge solves this. It acts like the SQLite of vector databases. You install it as a simple Python library:

    pip install qdrant-edge-py

And open a directory on disk as a shard:

    from qdrant_edge import EdgeShard, EdgeConfig, EdgeVectorParams, Distance

    config = EdgeConfig(
        vectors={
            "audio_embedding": EdgeVectorParams(size=1024, distance=Distance.Cosine)
        }
    )
    shard = EdgeShard.create("./qdrant-edge-shard", config)

You don’t need a daemon or Docker. The entire vector database lives inside the memory space of your Python program.

There are plenty of alternatives, but I chose Qdrant Edge for three main reasons:

- Absolute privacy (vs. cloud DBs): Vector matching happens entirely in-process. Audio is captured, embedded, matched, and discarded locally without ever leaving the vehicle.
- Built for the edge (vs. local server DBs): Unlike local instances of ChromaDB or Elasticsearch, which drain RAM and CPU, Qdrant Edge relies on a highly optimized Rust core, making it a good fit for low-power edge devices.
- True database features (vs. FAISS / NumPy): While you could hold arrays in memory with FAISS, you lose metadata. Qdrant Edge lets us attach metadata (like sound_type or severity), filter dynamically, and persist to disk effortlessly.

What We Are Actually Building

The system has one job: listen to the audio coming from a car’s microphone, detect whether it contains a distress sound (a scream, a crash, shattering glass, or an emergency siren), and send an alert to a Telegram contact.

The whole pipeline looks like this:

1. Capture live microphone audio in overlapping 1-second chunks.
2. Run each chunk through YAMNet to produce a 1024-dimensional embedding.
3. Search the Qdrant Edge shard for similar sounds using cosine similarity.
4. Apply temporal smoothing: only trigger an alert if we get 3 hits within 5 seconds.
5. Fire a Telegram message with the detected sound type, severity, and timestamp.

The raw audio never leaves the device. All processing happens locally, in real time.

Architecture

[Figure: architecture diagram]

The left side (provisioning) runs once. You download the ESC-50 sound dataset, embed every distress sound with YAMNet, and store those vectors in the Qdrant Edge shard. That shard then sits on disk indefinitely.

The right side (runtime) runs continuously. Every 500 milliseconds, a new audio chunk produces a fresh embedding, searches the shard, and feeds into a sliding window detector. The shard file is the same one written during provisioning: there is no synchronization step, no replication, no cache warming. It just opens and reads.

When you run python main.py, this is what start-up looks like:

[Screenshot: spinning up the project]

The Audio Pipeline

Capture

The microphone is read using sounddevice. The capture runs in a background thread and puts overlapping chunks into a thread-safe queue.

    import sounddevice as sd

    self._stream = sd.InputStream(
        samplerate=16000,
        channels=1,
        dtype="float32",
        blocksize=int(16000 * 0.01),  # 10ms blocks (160 samples)
        callback=self._callback,
    )

The key design here is the sliding window. Each chunk is 1 second of audio, but we emit a new chunk every 500 milliseconds. That 50% overlap means a sudden sound that starts in the middle of one window still gets fully represented in the next one. You do not miss events that fall on boundaries.

Preprocessing

YAMNet has specific requirements: mono audio, 16 kHz sample rate, float32 values between -1.0 and 1.0. The preprocessor enforces all of this regardless of what the microphone delivers.

Why YAMNet and how does it fit?
YAMNet (Yet Another Mobilenet Network) is a pre-trained audio classification model from Google, trained on the AudioSet dataset, which contains over 2 million human-labelled audio clips across 521 sound classes. The model is publicly available on TensorFlow Hub and weighs about 3 MB.

It takes a raw float32 waveform as input and produces three outputs: class probability scores, intermediate embeddings, and a log-mel spectrogram. We do not need the class scores. We want the embeddings.

    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub

    model = hub.load("https://tfhub.dev/google/yamnet/1")

    # waveform_tensor: 1-D float32 waveform (mono, 16 kHz, values in [-1.0, 1.0])
    _, embeddings, _ = model(waveform_tensor)

    # Mean-pool over time frames to produce a single (1024,) vector
    embedding = tf.reduce_mean(embeddings, axis=0).numpy()

    # L2-normalize so cosine similarity equals dot product
    norm = np.linalg.norm(embedding)
    if norm > 0:
        embedding = embedding / norm

That 1024-dimensional vector captures the acoustic fingerprint of the sound. A scream and a car crash produce very different fingerprints from someone humming or a two-way conversation. That difference is exactly what Qdrant Edge exploits during search.

The reason to pick YAMNet over a general-purpose audio embedding approach is that it was trained specifically to recognize audio events, not only music or speech. It already knows what an emergency siren sounds like as a concept. We are just using its internal representation as a similarity signal.

Building the Sound Library (Indexing)

Before the detector can run, we need to populate the Qdrant Edge shard with reference embeddings. This is the one-time provisioning step.

We use the ESC-50 dataset: 2000 environmental sound clips across 50 classes, all available for free. We pick the classes that matter for emergency detection, embed each one with YAMNet, and store the result with metadata.
The mapping from source class to alert metadata:

    ALERT_CLASSES = {
        "screaming": {"sound_type": "scream", "severity": "high"},
        "glass_breaking": {"sound_type": "glass_break", "severity": "high"},
        "siren": {"sound_type": "siren", "severity": "high"},
        "gunshot": {"sound_type": "collision", "severity": "high"},
        "car_horn": {"sound_type": "car_horn", "severity": "medium"},
        "crying_baby": {"sound_type": "crying", "severity": "medium"},
    }

Each sound gets stored as a point in the shard with a payload that includes its alert_class (either “alert” or “negative”), sound type, and severity. The alert_class field gets a keyword payload index:

    shard.update(
        UpdateOperation.create_field_index("alert_class", PayloadSchemaType.Keyword)
    )

This is important. At query time, we do not just want the nearest vectors globally; we want the nearest vectors that are actually alert-worthy sounds. The keyword filter on alert_class lets Qdrant Edge restrict the search space to distress sounds only, which reduces false positives from background noise.

    # Only consider points whose payload marks them as alert sounds
    search_filter = Filter(
        must=[
            FieldCondition(
                key="alert_class",
                match=MatchValue(value="alert"),
            )
        ]
    )

    # Top-5 nearest alert sounds for the current chunk's embedding
    results = shard.query(
        QueryRequest(
            query=Query.Nearest(embedding.tolist(), using="audio_embedding"),
            filter=search_filter,
            limit=5,
            with_payload=True,
        )
    )

The Detection Logic

Raw similarity scores are noisy. A single loud sound that happens to be similar to a scream is not an emergency. An engine backfire can produce a high score momentarily. The system needs to tell the difference between a spike and a sustained pattern.

The detector uses a sliding time window. A “hit” is recorded whenever the top similarity score exceeds the threshold (0.80 by default; individual sound classes can have their own thresholds, which can be tuned as needed). I have set the values for the various sounds here. An alert is only confirmed if at least 3 hits occur within a 5-second window.
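To make the rule concrete, here is a minimal self-contained sketch of the "3 hits within 5 seconds" logic. It is an illustration under stated assumptions, not the project's detector class: `SlidingHitDetector`, `observe`, and the injectable `clock` are hypothetical names, and the fake clock exists only so the behavior can be checked without waiting in real time.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingHitDetector:
    """Confirm an alert only after enough hits land inside a time window."""

    def __init__(
        self,
        threshold: float = 0.80,
        hits_required: int = 3,
        window_secs: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.hits_required = hits_required
        self.window_secs = window_secs
        self._clock = clock
        self._hits: Deque[float] = deque()

    def observe(self, top_score: float) -> bool:
        """Feed one chunk's best score; return True when an alert is confirmed."""
        now = self._clock()
        # Forget hits that are older than the window.
        while self._hits and now - self._hits[0] > self.window_secs:
            self._hits.popleft()
        if top_score <= self.threshold:
            return False  # below threshold: not a hit
        self._hits.append(now)
        if len(self._hits) >= self.hits_required:
            self._hits.clear()  # start fresh after a confirmed alert
            return True
        return False


fake_now = [0.0]
detector = SlidingHitDetector(clock=lambda: fake_now[0])

# Three strong matches 0.5 s apart (one per chunk) confirm an alert.
burst = []
for t, score in [(0.0, 0.91), (0.5, 0.88), (1.0, 0.86)]:
    fake_now[0] = t
    burst.append(detector.observe(score))

# Isolated spikes 4 s apart never put three hits inside one 5 s window.
spikes = []
for t, score in [(10.0, 0.90), (14.0, 0.90), (18.0, 0.90)]:
    fake_now[0] = t
    spikes.append(detector.observe(score))
```

With these inputs the burst confirms on its third hit, while the spaced-out spikes never do, which is the spike-versus-sustained distinction the detector is after.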
In the project, the core of that logic looks like this:

    # Record the new hit together with its timestamp
    self._hit_window.append((time.time(), event))

    # Drop hits that have aged out of the sliding window
    now = time.time()
    while self._hit_window and (now - self._hit_window[0][0]) > self._window_secs:
        self._hit_window.popleft()

    # Confirm an alert once enough hits remain inside the window
    recent_hits = len(self._hit_window)
    if recent_hits >= self._hits_required:
        self.total_alerts += 1
        event.hit_count = recent_hits
        self._hit_window.clear()
        self._on_alert(event)

In the terminal output from a live test session, three hits across 1.5 seconds of audio are confirmed and an alert is fired. The Telegram message arrives within a second of the third detection hit.

Context-Aware Thresholding and Amplitude Gating

Not all sounds behave the same way, so a single global threshold doesn’t work in practice. The system combines an amplitude gate with per-class threshold overrides to balance sensitivity and precision:

- Amplitude Gating: Before a chunk is even embedded, we calculate its RMS volume. If it’s too quiet (e.g., standard AC hum or road noise), it gets dropped immediately. This saves compute cycles and prevents quiet background noise from triggering false positives.
- Strict Mode for Broadband Sounds: Sounds like a car horn or a siren share acoustic similarities with heavy wind or engine revs. To prevent false alarms, these classes require a very strict similarity score (e.g., 0.90).
- Sensitive Mode for Impulsive Sounds: An emergency scream or a gunshot might be muffled or extremely brief. For these, we use the lower similarity threshold of 0.80 and only require 2 hits instead of 3.

This ensures the system reacts quickly to sudden, violent impulses while remaining stubbornly resistant to ambient noise.

Flying Blind on the Edge: Adding SigNoz Observability

While experimenting with the system on my desk, everything worked flawlessly. But then I realized a massive operational problem: what happens when this is deployed in a moving car? Edge deployment means you are effectively flying blind.
If a device starts overheating and thermal throttling, or if a rattling car part starts triggering hundreds of “glass break” false positives, I wouldn’t know. I can’t exactly SSH into a vehicle driving down the highway to read the terminal output.

I needed a way to monitor the system’s health, latency, and detection rates without ever recording or sending raw audio to the cloud (which would violate the entire privacy-first architecture). To solve this, I instrumented the pipeline using OpenTelemetry and routed the data to SigNoz. Because we are only sending metadata (timestamps, processing durations, and similarity scores), the privacy of the vehicle remains intact.

Tracking the Bottlenecks: The 500ms Window

The detector receives a new audio chunk every 500 ms. If the hardware takes longer than 500 ms to extract features, run the YAMNet embedding, and search Qdrant, the pipeline will buffer and eventually collapse. By wrapping my pipeline in OpenTelemetry spans, I could see exactly where the computational time was going.

[Figure: trace flow in SigNoz]

As you can see in the trace above, the total chunk processing time (sos.process_chunk) is clocking in around 24 ms.

- The preprocessing (sos.preprocess) is near instantaneous (0.26 ms).
- The heavy lifting is done by the YAMNet embedding (sos.embed), taking about 22.8 ms.
- The Qdrant Edge similarity search (sos.vector_search) takes a mere 0.29 ms.

This gave me confidence that the system operates well within the 500 ms deadline, with plenty of headroom left for more constrained hardware.

Fleet-Wide Metrics Without Raw Data

[Figure: metrics in SigNoz]

Beyond just latency, I needed to track behavior. I created custom metrics in SigNoz to count sos.detection.hits (every time a sound breaks the similarity threshold) and sos.alerts.sent (when the temporal smoothing triggers a confirmed SOS).
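The two counters above carry only labels and counts, never audio. As a rough illustration of that shape, here is a standard-library stand-in. It is not the OpenTelemetry API the project uses, and `MetadataCounters` and its methods are hypothetical names chosen for this sketch. It counts hits and alerts per sound type and answers one simple question: which sound types rack up many threshold hits without ever producing a confirmed alert?

```python
from collections import Counter
from typing import Dict


class MetadataCounters:
    """In-memory stand-in for the sos.detection.hits / sos.alerts.sent counters."""

    def __init__(self) -> None:
        self.hits: Counter = Counter()    # mirrors sos.detection.hits
        self.alerts: Counter = Counter()  # mirrors sos.alerts.sent

    def record_hit(self, sound_type: str) -> None:
        self.hits[sound_type] += 1

    def record_alert(self, sound_type: str) -> None:
        self.alerts[sound_type] += 1

    def hits_without_alerts(self, min_hits: int = 10) -> Dict[str, int]:
        """Sound types with many threshold hits but no confirmed alert."""
        return {
            sound: count
            for sound, count in self.hits.items()
            if count >= min_hits and self.alerts[sound] == 0
        }


counters = MetadataCounters()

# A siren that was heard three times and confirmed once.
for _ in range(3):
    counters.record_hit("siren")
counters.record_alert("siren")

# A noisy class that keeps crossing the threshold but is never confirmed.
for _ in range(40):
    counters.record_hit("car_horn")

suspicious = counters.hits_without_alerts(min_hits=10)
```

In a real deployment the same labels would be attached as attributes on the exported metrics, so the "many hits, no alerts" question becomes a dashboard query rather than a dictionary lookup.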
If I deployed this to a fleet of 5,000 cars and noticed that one specific vehicle model was suddenly generating massive spikes in sos.detection.hits for “car_horn” without triggering real alerts, I would know right away that there was an anomaly, perhaps a mechanical noise specific to that chassis. I could then adjust the SIMILARITY_THRESHOLD for those vehicles via an OTA config update.

Adding SigNoz took the project from being a “cool local script” toward a resilient architecture that can be monitored and managed at scale.

The Telegram Alert

The alert module handles two channels: a Telegram HTTP call and a system audio beep. A cooldown timer (30 seconds by default) prevents spam in case of a sustained alarm.

    import requests

    # ts: formatted timestamp of the confirmed detection, prepared earlier
    message = (
        "IN-CAR SOS ALERT\n\n"
        f"Severity: {event.severity.upper()}\n"
        f"Sound Detected: {event.sound_type}\n"
        f"Match Score: {event.score:.4f}\n"
        f"Hits in Window: {event.hit_count}\n"
        f"Time: {ts}\n\n"
        "Powered by Qdrant Edge · On-device detection"
    )

    # No parse_mode is set, so Telegram treats the text as plain text
    response = requests.post(
        f"https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage",
        json={"chat_id": TELEGRAM_CHAT_ID, "text": message},
        timeout=10,
    )

One design decision worth calling out here: send the message as plain text, not Markdown. Sound type names like car_horn or glass_break contain underscores, and Telegram’s Markdown parser treats them as italic delimiters. Plain text avoids the whole problem.
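The 30-second cooldown mentioned above can be sketched as a small gate that sits in front of the send call. This is a minimal illustration, not the project's alert module: `AlertCooldown`, `allow`, and the injectable `clock` are hypothetical names, and the fake clock is only there so the timing can be checked deterministically.

```python
import time
from typing import Callable, Optional


class AlertCooldown:
    """Allow at most one alert per cooldown period."""

    def __init__(
        self,
        cooldown_secs: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_secs = cooldown_secs
        self._clock = clock
        self._last_sent: Optional[float] = None

    def allow(self) -> bool:
        """Return True if an alert may be sent now, and start a new cooldown."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self.cooldown_secs:
            return False  # still cooling down: suppress the duplicate
        self._last_sent = now
        return True


fake_now = [100.0]
gate = AlertCooldown(cooldown_secs=30.0, clock=lambda: fake_now[0])

# A sustained alarm keeps confirming alerts; only the first one in
# each 30-second period should actually be sent.
decisions = []
for t in (100.0, 105.0, 129.9, 130.0):
    fake_now[0] = t
    decisions.append(gate.allow())
```

A caller would check `gate.allow()` before posting to Telegram and skip the request when it returns False. Using a monotonic clock by default keeps the gate unaffected by wall-clock adjustments on the device.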
This is what the alerts actually look like arriving on your phone:

[Screenshot: Telegram alerts on a phone]

Tech Stack

- Audio capture: sounddevice. Low-latency microphone stream with callback-based processing.
- Preprocessing: librosa + NumPy. Audio resampling and normalization.
- Embeddings: YAMNet (TF Hub). ~3 MB model trained on audio events, with 1024-dimensional output.
- Vector search: qdrant-edge-py. In-process vector search with no server dependency and persistent on-disk storage.
- Alerting: requests (Telegram Bot API). Simple and reliable notifications without requiring app installation.
- Configuration: python-dotenv. Secure credential management using .env files instead of hardcoded secrets.

The entire stack runs on an Apple M4 with Metal GPU acceleration for TensorFlow. On Linux edge hardware (a Raspberry Pi 5 or an Orin NX), it runs on CPU and still meets real-time requirements comfortably.

Design Decisions Worth Thinking About

Why filter by alert_class instead of just using a threshold?

A high cosine similarity score means the query sound is close to something in the database. But “close to” includes close to negative examples, too. Engine noise can be acoustically similar to a siren at certain frequencies. By only searching within the alert class of points, we avoid ever scoring against the negative examples at all. The filter happens inside the ANN search itself, not as a post-processing step; Qdrant Edge handles this efficiently because of the keyword index.

Why 3 hits in 5 seconds and not just 1?

A single high-scoring hit almost certainly means a relevant sound was detected. But a transient event like a door slamming or a sharp noise from the road can spike above 0.80 for one chunk and then disappear. Three confirmed hits in 5 seconds means the sound is sustained, which is exactly what distinguishes a real emergency (a sustained siren, ongoing screaming, an alarm) from a false trigger.

Why mean-pool YAMNet’s frame embeddings?
YAMNet analyzes 0.96-second patches with a 0.48-second hop and produces one embedding per patch. A 1-second audio chunk produces about 2 embeddings. Mean-pooling collapses these into a single representative vector. The alternative (storing multiple vectors per chunk and aggregating search scores) would complicate the indexing and search logic significantly. Mean-pooling is simpler and works well because the acoustic character of a 1-second distress sound is consistent across frames.

Why alert-usage and not alerter?

Authenticity. A system built by a human engineering team in a real product reflects real engineering decisions, not auto-generated code. File names, variable names, and comment style all contribute to whether the codebase feels like something someone has built or something that was generated.

Numbers

- YAMNet model size: ~3 MB
- Qdrant Edge shard (360 indexed sounds): ~8 MB
- End-to-end latency per chunk: ~40 ms (M4 with Metal)
- Alert delivery (Telegram): <1 second after confirmation
- Memory footprint (total process): ~320 MB
- Audio data sent off-device: 0 bytes

The 360-vector shard covers 7 alert sound categories and 9 negative categories, with up to 40 clips per alert class and 20 per negative class. That is small enough to initialize in under 10 milliseconds and, on ARM processors with a large enough L2 cache, to fit entirely in cache.

What’s Next?

The biggest gap right now is the alert routing. Telegram works well for a demo, but production-grade alerting would want to integrate with emergency service APIs, push to a fleet management dashboard, or trigger an automated call. The detector’s on_alert callback interface is already designed for this: you swap the Telegram handler for anything else.

Data Augmentation: Simulating the In-Car Environment

While ESC-50 provides a great baseline, real-world emergency sounds don’t happen in a soundproof studio; they happen over the rumble of an engine. To bridge this gap, we implemented custom dataset ingestion.
We took raw, real-world files of screams and gunshots and programmatically augmented them during the indexing phase. Before passing these custom samples into YAMNet, the preprocessor overlays the audio on top of a base track of an idling car engine. It then indexes both the clean version and the engine-augmented version into the Qdrant Edge shard.

This means our vector database is explicitly populated with the acoustic fingerprints of emergencies happening inside a running vehicle, boosting real-world recall without needing to collect thousands of hours of dashcam audio.

Finally: speaker identification. Right now the system detects distress sounds generically. Adding a voiceprint comparison layer, using the car owner’s enrolled voiceprint as a reference embedding stored in the same Qdrant Edge shard, would allow the system to distinguish a passenger screaming at a horror movie podcast from the driver screaming in a genuine emergency.

The Core Idea

Every technology decision in this system points in the same direction. Audio stays on the device. The model runs locally. The vector database runs in-process. The only outbound call is the alert itself, and that is the entire point.

Qdrant Edge made all of this possible without any infrastructure gymnastics. The same tool that powers production-grade semantic search at scale also runs in 8 megabytes of disk space inside a Python process on embedded hardware.

That is the part that is easy to miss when you first see it. This is not a simplified version of vector search. It is the full thing, just without the server.
References

- Qdrant Edge Documentation
- YAMNet on TensorFlow Hub
- ESC-50 Dataset (Karol Piczak)
- Telegram Bot API
- sounddevice documentation
- AudioSet (Google Research)

How I Built a Real-Time In-Car SOS Detection System With Qdrant Edge, SigNoz, and YAMNet was originally published in Towards AI on Medium.

Source

https://pub.towardsai.net/how-i-built-a-real-time-in-car-sos-detection-system-with-qdrant-edge-signoz-and-yamnet-4cf3bd6365a7?source=rss----98111c9905da---4