A detector-first, VLM-veto, multi-feed-corroboration architecture, evaluated with operational honesty
Author: Mark Ludwikowski <markl02us@yahoo.com>
Deliverable: ADRIZ → INGV (Istituto Nazionale di Geofisica e Vulcanologia), Mount Etna visual monitoring
Frozen evaluation baseline: commit 4d1b0cca79bf396e99b9a49e8477ae3a36ecfd33 (4d1b0cc), branch master
Verification window (UTC): 2026-06-29 ~02:31 → ~02:56
Operational-classification labels used throughout: LIVE_OPERATIONAL · LIVE_STALE · STAGED_NOT_LIVE · RESEARCH_ONLY · PLANNED · BROKEN_OR_BLOCKED · UNKNOWN_NEEDS_VERIFICATION
Evidence discipline (binding). Every quantitative claim in this thesis traces to a committed Phase-1 evidence artifact in flank_wildfire/reports/thesis/. No number is invented or softened. Where evidence is incomplete the text says so plainly, scopes the claim, or labels the metric UNKNOWN. The system is not claimed to solve early wildfire detection generally; it is a reproducible, evidence-backed architecture for multi-source environmental monitoring, reported with detector-alone-versus-two-stage metrics, confidence intervals, latency, false-positive and false-negative behaviour, and explicit failure modes.
Mount Etna is one of the most challenging environments in the world for camera-based wildfire detection: vegetated, populated flanks that genuinely burn sit directly beneath a persistently active volcano whose degassing plumes, ash columns, lava incandescence, and strombolian ejecta are constant visual confounders, and whose summit is routinely obscured by meteorological cloud and twilight glare. This work presents and evaluates ADRIZ, a public-data-driven visual monitoring system for the Etna / Sicily setting that combines four components: (i) a camera wall ingesting multiple public webcam and institutional video sources on a 75-second model-resident cadence with honest per-source health reporting; (ii) a frozen YOLO11s detector (weights sha256 c0a3d0ead257…cf0a20) that generates per-frame candidates; (iii) a crop-level Qwen3-VL semantic veto (qwen3-vl-32b, temperature 0.0) invoked only on detector-routed hot/bright crops, made recall-safe at night by a durable satellite-corroboration override; and (iv) a 64-feed public-data board with operational staleness classification plus a per-class satellite/weather/geospatial corroboration layer (WF36).
On a held-out, leakage-controlled evaluation (RESEARCH_ONLY, small-n), the two-stage system held the volcanic false-alarm rate at 8.1% (5/62, 95% Wilson CI [3.5–17.5%]) versus 9.7% detector-alone, while preserving 94.4% daytime genuinely-visible wildfire recall (17/18, 95% CI [74.2–99.0%]). The veto's measured effect was strictly one-directional (McNemar b=2, c=0, exact p=0.50, not significant at this sample size): it removed exactly one sensor-artifact false alarm and one borderline daytime ground-lights frame, and improved no genuine-fire decision. At night the veto originally vetoed two true vegetation fires whose flame is single-frame-indistinguishable from lava incandescence; a durable night-safety guard now withholds the volcanic veto on any uncorroborated wildfire-class night detection (surfacing it as uncertain_night rather than dropping it), taking the night true-fire silent false-negative count from 2 to 0 while leaving the daytime numbers and the volcanic false-alarm rate unchanged. A separately-reported quantum due-diligence track found, on the same Etna volcanic-versus-wildfire task, that simulated quantum classifiers are beaten by matched classical baselines with confidence intervals that exclude zero (Q-kernel −0.099 [−0.161, −0.038]) — a clean publishable negative, with the one genuine novelty (volcanic source-inversion as a QUBO) preserved as a research line, no quantum hardware used. We report the system with full operational classification, identify the limitations honestly (3-of-5 cameras online at verification, seven stale feeds, latency tails not instrumented, alert dispatch staged-off), and lay out the next build directions: domain adaptation, IR/thermal fusion, MTG-FCI event tracking, and a fine-tuned wildfire/volcano VLM.
Fixed-camera early smoke detection is an established and operationally valuable lineage. Govil et al. [1] demonstrated, on the HPWREN / AlertWildfire camera network in Southern California, a deep-learning system scanning hundreds of cameras every minute that detects smoke typically within roughly fifteen minutes of ignition at under one false positive per camera per day. That result frames the problem this work inherits: detection value comes not from recall alone but from recall under a strict false-alarm budget. A monitoring system that alarms constantly is operationally useless regardless of its sensitivity, because human reviewers stop trusting it. Dewangan et al. [2] (FIgLib / SmokeyNet) and the PyroNear-2025 benchmark [3] established that single-frame camera detection saturates on confounders and that the task remains hard across camera domains — which is precisely why this work layers a semantic veto and multi-source corroboration on top of a fast detector rather than relying on the detector alone.
Sicily experiences a severe Mediterranean wildfire season; its vegetated terrain, including the populated lower and middle flanks of Mount Etna, genuinely burns. The deliverable target for this system is INGV, Italy's national geophysics and volcanology institute, whose Etna monitoring concern spans both volcanic activity and the wildfire risk on the edifice's flanks. The defining difficulty is co-location: a vegetation wildfire on Etna's flank and the volcano's own thermal and plume activity occupy the same cameras, the same satellite pixels, and frequently the same frame.
The confounder set at Etna is unusually rich and adversarial:
WARP [12] found that both CNN and transformer wildfire detectors fail to distinguish cloud-like patches from real smoke under local adversarial perturbation; Etna supplies that adversarial confounder set naturally, every day. A system for this environment must therefore be engineered around wildfire-versus-volcanic disambiguation, not generic smoke detection.
No single sensor resolves the confounder problem. A camera sees a plume but cannot tell smoke from steam at the summit; a satellite thermal product sees heat but at a pixel far coarser than the camera's region of interest and cannot tell flank wildfire from crater lava when the heat is on-crater; a gas sensor or SO₂ retrieval supports a "volcanic" reading but is a coarse atmospheric column, not pixel-level event proof. The architecture this thesis evaluates therefore treats the problem as a system of independent public data sources — camera candidate detection, semantic VLM reasoning, and satellite/weather/geospatial corroboration — built entirely from public feeds and low-cost commodity compute (a local workstation detector plus remote serverless VLM inference; no edge, Hailo, or DGX hardware is assumed). The contribution is the reproducible, operationally-honest integration of these sources, with every component classified by its true operational status.
This chapter situates each design choice in current, verified literature. Every reference was confirmed to exist against arXiv, the publisher, IEEE/ScienceDirect, or the Hugging Face papers index before citation; all 35 references are VERIFIED (the two highest-stakes, SmokeBench [7] and FCI-FireDyn [26], were independently spot-re-confirmed). Numbers in square brackets index the References section.
The camera-wall lineage is defined by Govil et al. [1] (HPWREN/AlertWildfire, ~15-minute detection, <1 FP/camera/day), Dewangan et al. [2] (FIgLib's ~25,000 labelled fixed-camera smoke images and the spatiotemporal SmokeyNet CNN that exploits frame-to-frame information), and the PyroNear-2025 benchmark [3] (a geographically diverse web-scraped camera dataset, ~150k annotations over 640 wildfires, showing the task remains hard across domains). These define the detector stage's job — per-frame candidate generation under a strict false-alarm budget — and motivate the temporal and multi-source layers added on top.
The incumbent detector is a YOLO11s-class single-stage model chosen for model-resident real-time inference. Newer detectors are treated as future directions, not requirements: YOLOv12 [4] (attention-centric, Area-Attention/R-ELAN, improved mAP at comparable latency but with training-instability/CPU-throughput costs); RT-DETR [5] (the first real-time end-to-end NMS-free detection transformer, CVPR 2024); and for open-vocabulary candidate generation, Grounding DINO 1.5 [6], particularly the TensorRT-optimised Edge variant (~75 FPS). RT-DETR's NMS-free property and Grounding DINO Edge's text-promptable open-set detection are the two most relevant upgrade paths for a camera wall that must add confounder classes without full retraining.
This is the central evidence base for the detector-first + VLM-veto choice. SmokeBench [7] (Qi, Li, Barnes; WACV 2026) evaluates MLLMs (Qwen2.5-VL, InternVL3, GPT-4o, Gemini-2.5 Pro, Grounding DINO, Idefics2, Unified-IO 2) on smoke classification, localization, and detection; its headline finding is that models can often classify large-area smoke but all struggle with accurate localization, especially early-stage, with performance strongly tied to smoke volume. Earth-observation VLM benchmarks corroborate the pattern: GPT-4V-on-EO ("good at captioning, bad at counting") [8] and GEOBench-VLM [9] both show strong open-ended scene knowledge but poor spatial localization/counting (GPT-4o ~40% on GEOBench-VLM MCQs, ~2× chance). This directly supports using a VLM not as the primary localizer but as a second-stage semantic veto on already-localized crops — the regime where MLLMs are strong.
No single canonical paper names "detector-first + VLM-veto for wildfire cameras"; the claim is supported by converging verified evidence and this is stated honestly rather than attributed to an invented source. SmokeBench [7] and the EO-VLM benchmarks [8,9] establish VLM localization weakness; the camera-detector lineage [1,2,3] establishes that fast single-stage detectors localize well but over-fire on confounders. FireCLIP [10] is the closest direct evidence that a vision-language stage adds value specifically as a false-alarm discriminator (cooking smoke, industrial emissions), reporting ≥12.45% zero-shot improvement and better regional generalization via prompt tuning. The two-stage decomposition — fast detector for recall/localization, VLM for precision/semantic veto — is exactly what the WF25 evaluation (Chapter 5) tests empirically.
FireCLIP [10] demonstrates prompt tuning as the mechanism delivering its false-alarm and generalization gains; TuneVLSeg [11] benchmarks textual/visual/multimodal prompt-tuning under domain shift (textual prompts degrade under large shift, visual prompting is a competitive cheaper first attempt); WARP [12] shows prompt/threshold-adjacent design controls the recall-versus-false-alarm operating point. Accordingly, the Qwen3-VL veto prompts in this work are versioned and frozen (Chapter 3, model_prompt_freeze.json) and reported as a tested variable, not an afterthought.
Identified as a likely core next-build pillar. Verified sources: the MCAF multilevel-feature-alignment UDA smoke detector [13]; EDIF [14] for enhanced domain-invariant cross-domain forest-fire smoke detection; a synthetic-to-real UDA study on the AlertWildfire network [15]; and Pesonen et al. [16] on zero-shot foundation-model supervision training small real-time camera segmenters from box labels. Each ADRIZ camera (INGV, Windy, EtnaWalk) is a distinct visual domain (lighting, angle, weather, Etna's plume backdrop), which scopes the Chapter-7 adaptation plan.
WARP [12] (Ide & Yang) is the first model-agnostic framework for adversarial robustness of wildfire detection models, injecting global (Gaussian) and local (cloud-PNG-patch) noise; transformers showed >70% precision degradation under global noise, and both CNN and transformer models failed to distinguish cloud-like patches from real smoke under local attacks. This is the template for the failure-appendix hard-negative battery (Chapter 6) and the direct literature motivation for the VLM veto + satellite corroboration as mitigations.
Verified RGB-thermal fusion sources: a UAV multi-scenario RGB-Thermal forest-fire dataset and fusion model [17]; the MCDet target-aware RGB-T fusion model [18]; a visible+thermal-infrared flame-detection method [19]; and at the VLM level WildFireVQA [20], a large radiometric-thermal VQA benchmark finding RGB remains the strongest single modality for current MLLMs while retrieved thermal context helps stronger models. IR is directly relevant to Etna's lava-versus-flame ambiguity, but WildFireVQA keeps the claim honest: thermal is a contextual gain, not a solved modality for VLMs.
For temporal smoke-motion consistency and plume-growth segmentation: a spatiotemporal bag-of-features early-smoke detector using histogram of oriented optical flow exploiting upward thermal convection [21]; spatiotemporal/dynamic-texture forest-fire smoke video detection [22]; and for modern segmentation, SAM 2 [23] (promptable video segmentation with streaming memory) plus a fire-specific SAM2 study [24] (Box+MP best, mIoU ~0.64). Temporal consistency is the most promising future mitigation for the night-fire↔lava residual (lava is steady; wildfire flickers and spreads).
The corroboration layer's geostationary basis. MTG-FCI detects fires ~4 h earlier than SEVIRI, ~2 h before MODIS, and finds ~5× more active-fire pixels than SEVIRI [25]; the FCI-FireDyn / Fire-Event-Tracker algorithm [26] (Paugam et al., 2026) spatio-temporally clusters FCI hotspots at 10-minute cadence to derive fire-arrival maps, rate of spread, and burnt-area evolution, validated on Southern-European 2024–2025 fires; a feasibility study [27] explores unsupervised MTG-FCI wildfire detection. The directive's insistence that FCI be treated as event-tracking / early-candidate data, not perfect ground truth is grounded here: FCI's strength is temporal evolution and early timing, while its ~1–2 km pixel makes it too coarse for camera-event-level ground truth — exactly the "supports / too coarse" labelling WF36 applies.
Schroeder et al. [28] is the canonical VIIRS 375 m active-fire algorithm behind NASA FIRMS; Xu & Wooster [29] describe the operational SLSTR daytime active-fire / FRP product with global intercomparison to MODIS/VIIRS/Landsat (daytime product operational since March 2022), building on pre-launch algorithm work [30]. These supply the published detection limits that justify the asymmetric corroboration logic in WF36: a thermal hotspot near a camera candidate elevates confidence, while absence is treated as non-disconfirming (sub-pixel early smoke is below the satellite detection floor).
Theys et al. [31] (global TROPOMI volcanic-SO₂ degassing), an Italy-specific Stromboli SO₂ study [32], and the TROPOMI SO₂ retrieval ATBD [33] anchor the gas-context layer. TROPOMI SO₂ supports a "volcanic degassing" classification but its coarse footprint and overpass cadence make it supporting context, never per-pixel camera-event proof — the precise labelling WF36 enforces. CAMS/GFAS plays the analogous aerosol/emission role; TROPOMI is cited as the verified anchor and CAMS-specific event-level proof is flagged context-only.
This is an engineering/operational contribution rather than an academic finding, and that is stated honestly: it is grounded in official documentation — NASA FIRMS product/latency documentation [34] and EUMETSAT MTG instrument documentation [35] — which defines the upstream cadences against which the staleness thresholds are derived. The system's feed-health monitor classifies every feed operationally, with a degraded-response guard ensuring a degraded upstream cannot masquerade as "live-with-zero."
Status of this chapter's claims: every component is labelled with its verified operational status from operational_state_verification.md and system_performance_spec.md, re-verified from current evidence in the 2026-06-29 02:31–02:56 UTC window at commit 4d1b0cc.
ADRIZ is a four-layer pipeline:
Public camera sources (5 configured)
│ 75 s model-resident cycle
▼
[Stage 1] YOLO11s detector (frozen, sha256 c0a3d0ea…)
│ per-frame candidate boxes, 19-class head
├── PASS_THROUGH classes (smoke / ash / degassing) ──────────────┐
│ │
└── ROUTE classes (lava / incandescence / flame) ──► hot/bright crop
│
▼ ▼
[Stage 2] Crop-level Qwen3-VL veto (qwen3-vl-32b, temp 0.0)
│ WILDFIRE | VOLCANIC | NEITHER (+ NIGHT-SAFETY override)
▼
[Stage 3] WF36 multi-feed corroboration (FIRMS / SLSTR / FCI / SEVIRI / TROPOMI / CAMS)
│ on-crater = volcanic; off-crater fresh FIRMS = independent wildfire support
▼
[Stage 4] Alert taxonomy + feed-health board (64 feeds) + bilingual dashboard
The detector and VLM run live in the EtnaCameraWall scheduled task; the feed board refreshes hourly via EtnaFeedsRefresh. Both tasks were Running at verification (schtasks /query @ 02:32 UTC). Status: LIVE_OPERATIONAL for the running pipeline; alert dispatch is STAGED_NOT_LIVE (Section 3.8).
The data-feeds board inventories 64 public sources. At verification (curl https://adr-etna-ingv.pages.dev/data/feeds.json @ 2026-06-29T02:31:33Z, HTTP 200), an independent recount of the 64 group-level entries matched the published summary exactly:
| Feed status | Count | Meaning (from the live banner) | Status label |
|---|---|---|---|
| live | 46 | data returned now | LIVE_OPERATIONAL |
| stale | 7 | real pull, but upstream archive/outage lag | LIVE_STALE |
| catalogued | 10 | reachable but needs a token / no scalar-point API | STAGED_NOT_LIVE |
| auth_pending | 1 | our key not yet configured | STAGED_NOT_LIVE |
| error | 0 | — | — |
| total | 64 | | LIVE_OPERATIONAL (board) |
Two engineering guarantees make this a defensible operational claim rather than a vanity count:
osm_roads_rail.py (L64–72, commit 4d1b0cc, originally 398e94d) applies a plausibility guard: an empty/zero road count is flagged stale with validated_pull=False, so a degraded Overpass response cannot masquerade as live-with-a-bogus-zero. The current live read is 23,379 roads / 341 rail → status=live.
A representative live numeric is the Fire Weather Index: the most recent EFFIS daily FWI analysis (CEMS EWDS GEFF 4.1) at the Etna-summit cell was 13.06 (moderate) at 02:31:33Z. Honest latency caveat: GEFF 4.1 daily analysis carries ~3–4 day latency, so this is the most recent daily analysis, not an instantaneous reading.
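The plausibility guard can be illustrated with a minimal sketch. The `FeedReading` type and `classify_osm_pull` helper are hypothetical names introduced here for illustration; the committed guard in osm_roads_rail.py may differ in detail, for example in how it treats the rail count.

```python
from dataclasses import dataclass


@dataclass
class FeedReading:
    status: str           # "live" or "stale"
    validated_pull: bool  # False when the payload failed the plausibility check
    roads: int
    rail: int


def classify_osm_pull(roads: int, rail: int) -> FeedReading:
    """Flag an empty or zero road count as stale instead of reporting a live zero.

    Hypothetical helper: only the road count is checked, which is an assumption.
    """
    if roads <= 0:
        return FeedReading("stale", False, roads, rail)
    return FeedReading("live", True, roads, rail)


healthy = classify_osm_pull(23379, 341)  # the live read quoted in the text
degraded = classify_osm_pull(0, 0)       # what a degraded Overpass reply might yield
print(healthy.status, degraded.status, degraded.validated_pull)
```

The point of the design is that a zero is never trusted as a measurement: it is reported as a failed pull, so the board cannot count it as live.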
The seven stale feeds at verification are surfaced as failures, not hidden (full table in Chapter 6 / failure_appendix.md §8): cams_gfas_fire (archive 208 days behind), effis_active_fire (EFFIS WFS Oracle backend failure, self-heals), era5_land (~5-day production latency, stale-by-design), gwis (JRC WFS Oracle backend failure), ingv_oe_bulletin (no Etna item in this week's GVP RSS), meteostat (bulk-archive lag), opensky_adsb (HTTP 429 anonymous rate-limit).
Five camera sources are configured (/api/cams sources[] @ 2026-06-29T02:44:37Z): the INGV EtnaTVChn mosaic (garr.tv PeerTube HLS), three Windy webcams (Milo East 9.1 km, Trecastagni 16.8 km, Catania Jonio 26.8 km from summit), and the EtnaWalk YouTube live stream. The wall runs a 75-second model-resident cycle (cadence_s:75, confirmed in both /api/cams and the local publisher artifact cameras_wall.json).
Honest multi-source caveat (the camera wall is not "5 live cameras"). At the verification timestamp only 3 of the 5 sources were online (n_online:3) — the three Windy cameras (all CLEAR, DAY_RGB). The INGV EtnaTVChn mosaic and the EtnaWalk stream were OFFLINE (online=False, badge OFFLINE). The wall reports OFFLINE honestly rather than serving a frozen frame. Thesis-wide wording is therefore scoped to "5 configured camera sources, 3 online at the verification timestamp," never "5 live cameras." Status: LIVE_OPERATIONAL (3/5 online); the two OFFLINE sources are a LIVE_STALE sub-component honestly flagged.
Camera health detection is itself a LIVE_OPERATIONAL feature: per-source online:false / badge OFFLINE is emitted truthfully (multi_cam_service.py L251–261), and a stale/frozen-frame watchdog enforces an age-based guard (STALE_FRAME_MAX_S = 1800 s). A per-frame perceptual-hash identical-frame check (to catch a recent but frozen camera) is not yet implemented and is queued in the roadmap. The /api/cams endpoint additionally showed transient TLS resets (two of three fetches) during verification before succeeding; the locally published cameras_wall.json corroborated the same content. This is recorded as a monitoring flag (Chapter 6), not a hard stop.
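A minimal sketch of the age-based guard follows. Only `STALE_FRAME_MAX_S` and the OFFLINE badge come from the text; the `frame_badge` helper and the STALE and LIVE badge names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

STALE_FRAME_MAX_S = 1800  # age threshold quoted in the text


def frame_badge(online: bool, last_frame_utc: datetime, now_utc: datetime) -> str:
    """Return a per-source badge from reachability and frame age.

    Badge names other than OFFLINE are illustrative assumptions.
    """
    if not online:
        return "OFFLINE"
    age_s = (now_utc - last_frame_utc).total_seconds()
    return "STALE" if age_s > STALE_FRAME_MAX_S else "LIVE"


now = datetime(2026, 6, 29, 2, 44, 37, tzinfo=timezone.utc)
fresh = frame_badge(True, now - timedelta(seconds=75), now)
old = frame_badge(True, now - timedelta(seconds=3600), now)
down = frame_badge(False, now, now)
print(fresh, old, down)
```

An age check of this kind cannot detect a camera that keeps delivering recent timestamps on an unchanged image, which is why the perceptual-hash check remains a roadmap item.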
The Stage-1 detector is frozen for the thesis (WF19 KEEP-INCUMBENT decision):
| Parameter | Value |
|---|---|
| Architecture | YOLO11s, 19-class head |
| Weights | models/ingv_v1b_best.pt, 19,261,267 bytes |
| Weights sha256 | c0a3d0ead257d318e70bec3bb84feaec7b99e9e3d55b132fc5f1ffd405cf0a20 |
| Inference size / confidence | imgsz 960 / conf 0.25 |
| Crop pad / max side | 0.25 fraction / 768 px |
| Dedup / cooldown | IoU 0.4 / 1800 s per class-location |
| Device | auto (CPU or GPU; no edge-hardware assumption) |
The model-resident detector loop (multi_cam_task.py, PID 31700, ~308 MB resident) was running at verification. The 19-class head emits five wildfire/volcanic-relevant buckets; note that data.yaml shows nc:5, which is stale — the operative head is 19-class, confirmed against the weights and service/config.py CLASS_NAMES. Status: LIVE_OPERATIONAL.
Crucially, the detector's class is used to route:
PASS_THROUGH classes: wildfire_smoke, ash_plume, forced_ashladen_degassing, fumarolic_steam, passive_degassing_steam. Smoke is passed through to preserve wildfire recall: a grey distant plume is not red-hot and must never be vetoable.
ROUTE classes: wildfire_flame, active_lava_flow, incandescent_ejecta, lava_fountain, lava_incandescence, strombolian_explosion. The hot/bright classes are exactly where the lava-versus-flame disambiguation lives.
The Stage-2 veto is the exact, frozen configuration recorded in model_prompt_freeze.json (no thesis result references "Qwen-VL" generically):
| Field | Value |
|---|---|
| Model | qwen3-vl-32b (Qwen3-VL 32B) |
| Serving | PHOENIX Model-Vault RunPod serverless, OpenAI-compatible endpoint (remote; crash-resilient by design) |
| Routing | crop-level — a padded crop of each routed detector box; the full frame is not sent for the veto |
| Temperature | 0.0 (deterministic) |
| max_tokens | 120 |
| top_p | server default (not pinned in the request payload — a reproducibility gap, see Chapter 6) |
| Retries / backoff / timeout | 3 / 8.0 s / 180 s |
| Image encoding | crop downscaled to long-side ≤768 px, JPEG q88, base64 data-URL |
| Output | strict JSON {"label":"WILDFIRE|VOLCANIC|NEITHER","confidence":0.0-1.0,"reason":"<=14 words"}; on parse failure, label PARSE_FAIL |
| Cache policy | deterministic (temp 0) verdicts cached per crop; WF25 reproduced Config A bit-for-bit from cache (0 new calls, 245/245 decisions match, 0 phash leakage) |
Two prompts are frozen. The CROP_PROMPT (hot/bright disambiguation, primary veto) explicitly instructs the model "Do NOT assume it is volcanic just because Etna is a volcano — vegetation wildfires occur on Etna's flanks," and forces a one-of-three choice WILDFIRE / VOLCANIC / NEITHER (the last covering sunset glow, sunlit cloud, artificial lights, lens flare, reflection, sensor artifact). The SMOKE_PROMPT (degassing-versus-plume, used for ambiguous large/summit smoke) distinguishes a denser browner/greyer wildfire column rising from vegetated ground from a white/blue crater-rooted degassing plume from diffuse sky-wide cloud/haze. The exact verbatim text of both prompts is in model_prompt_freeze.json (vlm_prompt_exact_text_CROP_PROMPT, vlm_prompt_exact_text_SMOKE_PROMPT).
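The strict-JSON output contract can be sketched as a small parser. This is an illustration under stated assumptions, not the code in service/crop_veto.py: the `parse_verdict` name, the validation strictness, and the contents of the failure record beyond the PARSE_FAIL label are all hypothetical.

```python
import json

VALID_LABELS = {"WILDFIRE", "VOLCANIC", "NEITHER"}


def parse_verdict(raw):
    """Parse the model reply; any deviation from the contract maps to PARSE_FAIL."""
    fail = {"label": "PARSE_FAIL", "confidence": 0.0, "reason": ""}
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fail
    if not isinstance(obj, dict):
        return fail
    label = obj.get("label")
    if not isinstance(label, str) or label not in VALID_LABELS:
        return fail
    conf = obj.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return fail
    if not 0.0 <= conf <= 1.0:
        return fail
    return {"label": label, "confidence": float(conf), "reason": str(obj.get("reason", ""))}


ok = parse_verdict('{"label":"VOLCANIC","confidence":0.9,"reason":"crater-rooted glow"}')
bad = parse_verdict("The image shows lava.")
print(ok["label"], bad["label"])
```

Mapping every malformed reply to a single explicit label keeps parse failures countable in the evaluation instead of silently becoming one of the three real verdicts.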
The VLM is invoked only on detector-routed hot/bright crops: measured at ~0.151 calls/frame over the held-out set and 0 on quiet frames (vlm_call_rate), corroborated live by vlm_calls_this_cycle: 0 on a quiet cycle. The call rate is the one performance figure that can safely be classified LIVE_OPERATIONAL. Status: LIVE_OPERATIONAL (detector + crop-veto run live in EtnaCameraWall; WF25 metrics are RESEARCH_ONLY).
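The routing rule that keeps the call rate low can be sketched as a set-membership test. The class names are those listed in this chapter; the `needs_vlm` and `vlm_calls_for_frame` helpers are hypothetical, and the sketch assumes one VLM call per routed box and no call for any class outside the routed set.

```python
# Classes that bypass the veto (smoke, ash, degassing).
PASS_THROUGH = {
    "wildfire_smoke", "ash_plume", "forced_ashladen_degassing",
    "fumarolic_steam", "passive_degassing_steam",
}
# Hot/bright classes whose crops are sent to the VLM.
ROUTE = {
    "wildfire_flame", "active_lava_flow", "incandescent_ejecta",
    "lava_fountain", "lava_incandescence", "strombolian_explosion",
}


def needs_vlm(class_name: str) -> bool:
    """True only for hot/bright classes; everything else skips the veto."""
    return class_name in ROUTE


def vlm_calls_for_frame(detected_classes) -> int:
    """One call per routed box (an assumption); a quiet frame costs zero calls."""
    return sum(1 for c in detected_classes if needs_vlm(c))


quiet = vlm_calls_for_frame([])
smoke_only = vlm_calls_for_frame(["wildfire_smoke", "passive_degassing_steam"])
mixed = vlm_calls_for_frame(["wildfire_flame", "lava_incandescence", "ash_plume"])
print(quiet, smoke_only, mixed)
```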
The VLM veto's most consequential design element is its night-safety override, a durable guard (not a one-off) in service/crop_veto.py + service/config.py. It exists because the volcano-context prompt that gives the system its low volcanic false-alarm rate is exactly what mis-routes a bright nighttime vegetation fire — whose flame is single-frame-indistinguishable from lava incandescence — to VOLCANIC.
Rule as implemented. When a wildfire-class detection (wildfire_flame / wildfire_smoke) is routed and the VLM verdict would SUPPRESS it (VOLCANIC or NEITHER):
NIGHT_PANEL_MEAN_MAX = 12): unchanged; daytime recall was already preserved, and lava confusion is a night problem.
firms_corroborated), or the hot crop sits inside the summit ROI (inside_summit_roi), reusing the WF36 on-crater logic. With no such corroboration the alarm is not silently dropped: it is downgraded to a still-surfaced WILDFIRE_UNCERTAIN_NIGHT / needs_review state (alert feed + tile).
A real off-crater night fire can therefore never be erased by the VLM alone. The re-scored effect is quantified in Chapter 5: night true-fire silent false-negative 2 → 0, daytime and volcanic-FA unchanged. Status: LIVE_OPERATIONAL guard logic (in the live service); the WF25 re-score demonstrating its effect is RESEARCH_ONLY.
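A minimal sketch of the guard's decision flow follows. It is an illustration under loudly stated assumptions, not the code in service/crop_veto.py: night is taken to mean a panel mean at or below `NIGHT_PANEL_MEAN_MAX`, either corroboration flag alone is assumed sufficient for the veto to stand, and the return labels other than WILDFIRE_UNCERTAIN_NIGHT are hypothetical.

```python
NIGHT_PANEL_MEAN_MAX = 12  # threshold named in the text
WILDFIRE_CLASSES = {"wildfire_flame", "wildfire_smoke"}
SUPPRESSING_VERDICTS = {"VOLCANIC", "NEITHER"}


def night_guard(class_name, vlm_label, panel_mean,
                firms_corroborated=False, inside_summit_roi=False):
    """Decide what happens when the VLM would suppress a wildfire-class detection.

    Assumptions: night is panel_mean <= NIGHT_PANEL_MEAN_MAX, and either
    corroboration flag on its own lets the veto stand.
    """
    if class_name not in WILDFIRE_CLASSES or vlm_label not in SUPPRESSING_VERDICTS:
        return "GUARD_NOT_APPLICABLE"      # rule only covers suppressed wildfire classes
    if panel_mean > NIGHT_PANEL_MEAN_MAX:
        return "VETO_STANDS"               # daytime behaviour is unchanged
    if firms_corroborated or inside_summit_roi:
        return "VETO_STANDS"               # veto has independent support
    return "WILDFIRE_UNCERTAIN_NIGHT"      # surfaced for review, never dropped


day = night_guard("wildfire_flame", "VOLCANIC", panel_mean=90)
night_summit = night_guard("wildfire_flame", "VOLCANIC", panel_mean=5, inside_summit_roi=True)
night_alone = night_guard("wildfire_flame", "VOLCANIC", panel_mean=5)
print(day, night_summit, night_alone)
```

The key property is the last branch: at night, a VLM verdict with no independent support can change the alert's state but cannot remove it from the review surface.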
Stage 3 places a candidate in independent context via service/corroboration.py (corroborate_decision, gate_alerts, _volcanic_scene), evaluated against a live feed snapshot (snapshot_utc 2026-06-29T02:56:39Z). The corroboration logic implements rules for the five genuinely-corroborable detector classes:
firms_near_summit_km (min of fresh VIIRS/MODIS). A hit in the annulus CRATER_KM(3) < near ≤ NEAR_SUMMIT_KM(15) → corroborated (independent wildfire signal). A hit ≤3 km is treated as the volcano itself → not a wildfire confirmation. >15 km or no fresh FIRMS → uncorroborated; a fresh FRP granule with no co-located value contributes only granule-recency (supports). Load-bearing assumption: on-crater FIRMS is volcanic, not wildfire, because FIRMS cannot distinguish lava from a wildfire on the crater itself. Because FIRMS/SLSTR have hours-scale latency, a real early wildfire will routinely be uncorroborated (camera-only early warning), so the system must alarm on high detector+VLM confidence in that window rather than wait for satellite.
explains_volcanic=True, which suppresses any wildfire alert for the same scene.
The worked examples in wf36_corroboration_matrix.md §3 are the actual output of python service/corroboration.py against the live snapshot. On that cycle, firms_near_summit_km = 0.4 km (on-crater): lava_incandescence was corroborated VOLCANIC, while wildfire_flame/wildfire_smoke were correctly held as uncorroborated (the 0.4 km hit is Etna's own crater thermal, not an independent wildfire, which is exactly the trap WF36 exists to avoid).
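The annulus rule for wildfire-class corroboration can be sketched as follows. The two distance constants are taken from the text; the `firms_wildfire_support` helper and its return labels are hypothetical, with `on_crater_volcanic` introduced only to make visible why an on-crater hit leaves the wildfire classes uncorroborated.

```python
from typing import Optional

CRATER_KM = 3.0        # at or inside this radius a hit is the volcano itself
NEAR_SUMMIT_KM = 15.0  # outer edge of the corroboration annulus


def firms_wildfire_support(firms_near_summit_km: Optional[float]) -> str:
    """Classify the nearest fresh FIRMS hit as wildfire support or not.

    None stands for "no fresh FIRMS hit"; return labels are illustrative.
    """
    if firms_near_summit_km is None:
        return "uncorroborated"
    if firms_near_summit_km <= CRATER_KM:
        return "on_crater_volcanic"   # never counts as wildfire confirmation
    if firms_near_summit_km <= NEAR_SUMMIT_KM:
        return "corroborated"         # independent off-crater thermal signal
    return "uncorroborated"


snapshot = firms_wildfire_support(0.4)   # the verification-cycle value
flank = firms_wildfire_support(9.1)
far = firms_wildfire_support(26.8)
print(snapshot, flank, far)
```

The asymmetry is deliberate: a hit inside the annulus raises confidence, while a missing hit is not evidence against a fire, since an early wildfire is expected to precede any satellite detection.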
The honest 5-corroborable / 8-STAGED split. Eight Gate-C classes are STAGED_NOT_LIVE corroboration targets, not live detection or corroboration, and the thesis must not imply otherwise: meteorological cloud, glare/sun/reflection, and black/frozen/stale frame are detector context labels only (they never alarm and have no corroboration branch); fog/haze, industrial smoke, dust/quarry, camera artifact, and unknown are not detector classes at all. For industrial smoke and dust, the OSM industrial/power and roads/rail data are on the board but not wired into corroboration.py — they are available-but-unwired columns, not corroboration the thesis can claim. Status: LIVE_OPERATIONAL for the five rule-backed classes (with their cycle-level qualifiers); STAGED_NOT_LIVE for the other eight.
Alerts are deduplicated spatially (IoU 0.4) with an 1800 s per-class/location cooldown; frames older than 1800 s and feeds older than 24 h are treated as stale. The output surface is an internal/preview self-assessment dashboard that explicitly carries a "Not a public product" banner. Automated public alert dispatch is STAGED_NOT_LIVE: email dispatch was gated off (alert_email.enabled=false) at verification, and the human-in-loop review workflow is not yet a live operational pipeline. This is stated plainly: the system is not claimed to operate an alerting pipeline.
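The deduplication and cooldown rule can be sketched with a plain IoU test. The two thresholds are from the text; the `iou` and `should_alert` helpers, the (x1, y1, x2, y2) box layout, and the reading of "per-class/location" as "same class with IoU at or above the threshold" are assumptions.

```python
IOU_DEDUP = 0.4    # spatial dedup threshold
COOLDOWN_S = 1800  # per class/location cooldown


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def should_alert(cls, box, t_s, history):
    """Return False if a same-class, overlapping alert fired within the cooldown.

    history is a list of (cls, box, t_s) tuples for alerts already emitted.
    """
    for h_cls, h_box, h_t in history:
        if h_cls == cls and iou(box, h_box) >= IOU_DEDUP and t_s - h_t < COOLDOWN_S:
            return False
    return True


history = [("wildfire_smoke", (0, 0, 10, 10), 0)]
repeat = should_alert("wildfire_smoke", (1, 1, 11, 11), 600, history)
after_cooldown = should_alert("wildfire_smoke", (1, 1, 11, 11), 1800, history)
elsewhere = should_alert("wildfire_smoke", (50, 50, 60, 60), 600, history)
print(repeat, after_cooldown, elsewhere)
```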
The dashboard is fully bilingual (Italian default, English toggle) with browser-detected language and localStorage persistence. Translation-key parity is exact: the en: and it: dictionaries in public/i18n.js each contain exactly 240 keys (240/240). Per-string translation quality was not separately audited; the LIVE_OPERATIONAL claim is key-count parity. Status: LIVE_OPERATIONAL (bilingual UI); STAGED_NOT_LIVE (alert dispatch / human-in-loop).
Operational classification of all Chapter-4/5 metrics: RESEARCH_ONLY (held-out offline evaluation; small-n flagged throughout). The headline system performance is a held-out benchmark, not a live alert-dispatch measurement.
Two disjoint held-out sets are used, both real frames only:
MarkL02/ingv-etna-camera-historical, all ground-truth-negative for wildfire (summit degassing, ash, lava glow, strombolian activity, cloud, glare/sunset, snow). These are the confounders the system must not alarm on.
Operating-point confusion definition: GT-negative = the 62 volcanic frames; GT-positive = the 18 daytime genuinely-visible camera-fire frames (the conservative denominator).
The headline conservatively excludes the entire four-frame night↔lava residual category from the daytime denominator → n = 18, 94.4% (17/18), matching the committed WF25_system_scoring. Under Config A only 2 of those 4 night frames are actually lost (07871, 07875); the other 2 (07723, 07773) pass through and alarm. Excluding only the 2 genuinely-lost frames gives the alternative n = 20, 95.0% (19/20), CI [76.4–99.1%]. Both are disclosed; the headline uses the conservative 17/18.
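The quoted intervals can be checked with the standard Wilson score formula. The sketch below is a plain re-implementation for verification, not the thesis scoring script; the `wilson_interval` name is introduced here and z = 1.96 is assumed for the 95% level.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Two-sided Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


for label, k, n in [("headline 17/18", 17, 18),
                    ("alternative 19/20", 19, 20),
                    ("volcanic FA 5/62", 5, 62)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: [{100 * lo:.1f}%, {100 * hi:.1f}%]")
```

The Wilson interval is a reasonable choice at these sample sizes because, unlike the normal approximation, it stays inside [0, 1] and remains informative when the observed proportion is near 0 or 1.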
eval_external_v1b.json). Honest caveat: the v1b train images live off-repo (DGX/RunPod workspace), so the train↔held-out diff is taken on documented provenance; the 62 held-out frames are independently confirmed internally distinct (62 unique pHashes). To fully close it, the train↔held-out collision list (expected empty) should be committed.
reports/crop_veto_outputs/te_crop_level_cpu_configA.json). WF25 scoring spent 0 new VLM calls, a stub was wired to raise on any cache miss (none occurred), and 245/245 per-frame decisions matched the stored Config A result bit-for-bit. Determinism is scoped to the served qwen3-vl-32b build (temperature 0 gave ±0 swing across three fresh-query repeats), not guaranteed in perpetuity if the served model changes.
CONTAM_016, CONTAM_017).
The evaluation is built around the wildfire/volcanic confounder taxonomy: wildfire smoke and flame (positives), against volcanic ash, lava/incandescence, strombolian, degassing/steam, plus meteorological cloud, fog/haze, glare/sunset, snow, sensor/lens artifact. A representative hard-negative library is exported (Chapter 6), with web hard-negatives (dust, fog, industrial smoke, glare) listed for category coverage even where the images live off-repo (paths left blank, not fabricated).
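The cache-only replay described above, where a stub raises on any cache miss, can be sketched as follows. All names here (`CacheMiss`, `make_cache_only_vlm`, `replay_matches`) and the dictionary layout are hypothetical; the actual WF25 harness and cache schema are not reproduced.

```python
class CacheMiss(RuntimeError):
    """Raised when a replay would need a fresh VLM call."""


def make_cache_only_vlm(cache: dict):
    """Return a VLM stand-in that serves stored verdicts and raises on any miss."""
    def vlm(crop_key: str) -> dict:
        if crop_key not in cache:
            raise CacheMiss(f"no cached verdict for {crop_key}")
        return cache[crop_key]
    return vlm


def replay_matches(stored_decisions: dict, cache: dict) -> int:
    """Count stored decisions reproduced exactly from the cache alone."""
    vlm = make_cache_only_vlm(cache)
    return sum(1 for key, label in stored_decisions.items() if vlm(key)["label"] == label)


# Synthetic keys and verdicts, for illustration only.
cache = {"crop_a": {"label": "VOLCANIC"}, "crop_b": {"label": "WILDFIRE"}}
stored = {"crop_a": "VOLCANIC", "crop_b": "WILDFIRE"}
n_match = replay_matches(stored, cache)
print(f"{n_match}/{len(stored)} decisions reproduced with 0 new calls")
```

Raising on a miss, instead of falling back to a live call, is what makes the "0 new VLM calls" claim checkable: a replay either completes from cache or fails loudly.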
All Chapter-5 metrics are RESEARCH_ONLY (held-out offline, small-n), reproduced cache-only from the post-night-guard artifacts. The detector is frozen; the VLM verdicts are deterministic temperature-0 cache reads.
The complete performance specification for the shipping two-stage system — detector → crop-level Qwen3-VL veto, Config A (smoke pass-through) — measured end-to-end on the same real held-out frames, with Wilson 95% CIs:
| Metric | Detector alone | Two-stage system | Δ | 95% CI (two-stage, Wilson) | Evidence |
|---|---|---|---|---|---|
| wildfire smoke recall (daytime) | 100% (17/17) | 100% (17/17) | 0.0 pp | [81.6 – 100%] | per_frame_recall, wildfire_smoke |
| wildfire flame recall (daytime) | 100% (10/10) | 90.0% (9/10) | −10.0 pp | [59.6 – 98.2%] | per_frame_recall, wildfire_flame |
| volcanic-plume FP rate | 9.7% (6/62) | 8.1% (5/62) | −1.6 pp | [3.5 – 17.5%] | survivors = summit-degassing smoke |
| steam/cloud/fog FP rate | 0% (0/62) | 0% (0/62) | 0.0 pp | [0 – 5.8%] | no steam/cloud frame alarmed |
| artifact FP rate | 1.6% (1/62) | 0% (0/62) | −1.6 pp | [0 – 5.8%] | 9243ab lens/sensor artifact removed |
| overall volcanic FP rate | 9.7% (6/62) | 8.1% (5/62) | −1.6 pp | [3.5 – 17.5%] | A_external_volcanic_FA |
| precision / PPV (operating) | 0.750 | 0.7727 | +0.023 | — | §5.3 confusion |
| recall / sensitivity (daytime, n=18) | 100% (18/18) | 94.4% (17/18) | −5.6 pp | [74.2 – 99.0%] | recall_daytime_only |
| F1 (operating) | 0.857 | 0.850 | −0.007 | — | §5.3 confusion |
| F2, recall-first (operating) | 0.9375 | 0.9043 | −0.033 | — | §5.3 confusion |
| specificity (operating) | 0.9032 | 0.9194 | +0.016 | — | §5.3 confusion |
| false negatives introduced by VLM | — | 1 (borderline ground-lights, not a true fire) | — | — | §5.3 paired change |
| latency p50 / p90 / p99 (detector CPU) | 243.6 / ~310 / ~335 ms | + amortised VLM | — | n=245 | detector_latency.json |
| VLM per routed crop (p50 / p95 / max) | — | 992 / 1421 / 2053 ms | — | n=37 | vlm_per_routed_crop_ms |
| VLM calls per frame / quiet / active | — | 0.151 / 0 / 0.032 | — | measured | vlm_call_rate |
| estimated cost per alert | — | ~$5–15/mo all-in ($0 quiet, scale-to-zero) | — | — | cost_model.json |
Reading note on per-class denominators. A frame can carry both a smoke box and a flame box, so the class counts (17 smoke + 10 flame) exceed the 18 unique daytime frames. The single daytime recall loss (dfire_AoF07872) is a flame box (settlement ground-lights), which is why flame recall shows −10 pp while smoke recall is untouched.
Reading note on latency tails. p90/p99 were not separately computed; the committed cache stores detector-CPU p50/p95/max (243.6 / 334.5 / 1876.5 ms). p90 ≈ 310 ms by interpolation; p99 ≈ the max-tail (the 1876 ms max is a single GC/IO outlier). These tail estimates and the frame-capture success rate are UNKNOWN_NEEDS_VERIFICATION and queued in the roadmap. End-to-end CPU mean ≈ 426 ms/frame (detector mean + 0.151 × VLM mean); a frame with one routed crop ≈ 1756 ms p95; quiet frames add 0.
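The operating-point rows of the table (precision, recall, F1, F2, specificity) follow from a single 2×2 confusion matrix. The sketch below assumes that the operating point pools the 62 GT-negative volcanic frames with the 18 daytime GT-positive frames, which gives detector-alone counts of TP 18 / FP 6 / FN 0 / TN 56 and two-stage counts of TP 17 / FP 5 / FN 1 / TN 57. Those counts are inferred from the reported rates rather than copied from the §5.3 artifact, and `operating_metrics` is a hypothetical helper name.

```python
def operating_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, F1 and F2 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)

    def f_beta(beta):
        # F-beta weights recall beta times as heavily as precision.
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }


# Counts inferred from the table (assumption: 62 negatives + 18 daytime positives).
detector_alone = operating_metrics(tp=18, fp=6, fn=0, tn=56)
two_stage = operating_metrics(tp=17, fp=5, fn=1, tn=57)

for name in ("precision", "recall", "specificity", "f1", "f2"):
    delta = two_stage[name] - detector_alone[name]
    print(f"{name:12s} {detector_alone[name]:.4f} -> {two_stage[name]:.4f} ({delta:+.4f})")
```

Under these counts the sketch agrees with the tabulated values, and it makes the trade visible: removing one false positive and one (borderline) true positive raises precision and specificity slightly while lowering recall, F1 and F2.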
| Quantity | Value | Status |
|---|---|---|
| Routed hot crops in held-out set | 37 | reproduced |
| Served from temperature-0 cache | 37 | ✅ |
| New VLM calls this run | 0 | ✅ |
| Cache-miss stub raised | No | ✅ deterministic |
| pHash leakage (external volcanic ↔ v1b train) | 0 collisions | ✅ |
Reproduce: python service/thesis_wf25_scorecard.py.
Volcanic FA set (n=62, GT-negative):
| Transition | Count | Files |
|---|---|---|
| FP → TN (veto suppressed a false alarm) | 1 | 9243ab… (sensor/lens artifact, VLM ruled NEITHER) |
| FP → FP (false alarm survived) | 5 | the 5 summit-degassing wildfire_smoke plumes (pass-through by design) |
| TN → FP (veto created a false alarm) | 0 | — |
| TN → TN (unchanged) | 56 | — |
Recall set (GT-positive):
| Transition | Count | Files |
|---|---|---|
| TP → TP (fire kept) | 17 (daytime) | — |
| TP → FN (veto vetoed a fire) | 0 daytime · 2 night-residual (BEFORE guard) → 0 silent FN (AFTER guard) | dfire_AoF07871, dfire_AoF07875 — now surfaced uncertain_night, not dropped |
| FN → TP (veto recovered a fire) | 0 | — |
| FN → FN (unchanged) | 0 | — |
McNemar (exact, two-sided) over the full paired decision set (62 volcanic + 18 daytime recall frames): discordant b (det-alone alarm, two-stage no-alarm) = 2; discordant c (det-alone no-alarm, two-stage alarm) = 0; exact two-sided p = 0.50 — not significant. All discordant pairs are one-directional (b>0, c=0): the veto only ever removes alarms, never adds one. Its entire measured effect on this held-out set is the removal of 2 alarms — 1 volcanic sensor-artifact FP and 1 borderline daytime ground-lights frame.
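The exact McNemar p-value depends only on the two discordant counts, so it can be recomputed by hand. The following is a minimal sketch, assuming the standard exact (binomial) form of the test in which the discordant pairs are distributed Binomial(b + c, 0.5) under the null; `mcnemar_exact` is an illustrative name, not a function from the thesis code.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant counts b and c.

    Under the null hypothesis each discordant pair is equally likely to fall
    in either direction, so min(b, c) is compared against Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    k = min(b, c)
    one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * one_sided)


# b = detector-alone alarm removed by the two-stage system, c = alarm added.
p_value = mcnemar_exact(b=2, c=0)
print(f"exact two-sided McNemar p = {p_value:.2f}")
```

With only two discordant pairs the smallest achievable two-sided p-value is 0.50, so the test cannot reach significance on this held-out set regardless of direction; the informative part of the result is that c = 0.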
On this held-out set the Qwen3-VL veto changes exactly 2 of 80 paired decisions, both removals. It suppresses 1 volcanic sensor-artifact false alarm (9243ab: 6/62 → 5/62 FA) and drops 1 borderline daytime ground-lights frame (dfire_AoF07872: 18/18 → 17/18 recall). It does not improve precision or recall on any genuine smoke or flame case, it creates no new false alarm, and it does not reach 0% volcanic FA: the 5 survivors are summit-degassing smoke that passes through by design. The veto's honest, measured value is the removal of one artifact false alarm; on every genuine-fire and genuine-degassing decision it leaves the detector unchanged. A larger FA reduction (8.1% → 3.2%) is available only by also routing smoke (Config B), at the cost of routing the contamination frames and incurring 2 additional night-lava recall losses; the prior whole-frame veto reached 0% FA but cost ~20.8% recall. Config A is the recommended default.
Gate-B question: does the VLM ever veto a true wildfire smoke/flame case?
dfire_AoF07872) is settlement ground-lights, not a fire.
dfire_AoF07871 @0.818, dfire_AoF07875 @0.878). By single-frame appearance their flame is near-identical to lava incandescence, and the volcano-primed VLM sided VOLCANIC.
uncertain_night / needs_review instead of being dropped.
Re-scored result (same temperature-0 cache, 0 new VLM calls):
| | Night true-fire FN | Daytime true-fire FN | Volcanic FA (n=62) |
|---|---|---|---|
| Before guard | 2 (07871, 07875 silently vetoed VOLCANIC) | 0 | 8.1% (5/62) |
| After guard | 0 (both surfaced as uncertain_night) | 0 | 8.1% (5/62), unchanged |
The volcanic FA is provably unchanged: 0 of the 62 volcanic frames are dark enough (all bright daytime, mean > 12) to trip the night guard, so the guard structurally cannot touch the 5 daytime-degassing survivors. The veto is therefore recall-safe at night via the corroboration override: by day it is an advisory precision layer, and by night it is corroboration-gated so that it cannot produce a silent off-crater night false negative. (Reproduce: python service/rescore_wf25_night_guard.py.)
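The guard's effect on a single decision can be sketched as a small function. This is an illustration under stated assumptions, not the shipped logic in service/rescore_wf25_night_guard.py: it assumes the guard keys on mean frame brightness with the threshold of 12 quoted above, and that "corroboration-gated" means a night veto only stands when satellite evidence confirms volcanic activity. The names `guarded_decision`, `NIGHT_MEAN_THRESHOLD`, and `volcanic_corroborated` are hypothetical.

```python
NIGHT_MEAN_THRESHOLD = 12.0  # assumption: frames with mean brightness > 12 count as daytime


def guarded_decision(vlm_vetoes, frame_mean, volcanic_corroborated=False):
    """Return 'alarm', 'suppressed', or 'uncertain_night' for one detector candidate.

    vlm_vetoes            -- True when the VLM would veto the detector alarm
    frame_mean            -- mean brightness of the frame (assumed 0-255 scale)
    volcanic_corroborated -- True when satellite evidence confirms volcanic
                             activity (assumed meaning of the override)
    """
    if not vlm_vetoes:
        return "alarm"  # no veto: the detector alarm passes through unchanged
    is_night = frame_mean <= NIGHT_MEAN_THRESHOLD
    if is_night and not volcanic_corroborated:
        # A night veto without corroboration is surfaced for review, never dropped.
        return "uncertain_night"
    return "suppressed"


# A dark frame vetoed by the VLM is surfaced instead of silently dropped.
print(guarded_decision(vlm_vetoes=True, frame_mean=5.0))
# A bright daytime frame is unaffected by the guard, so a veto still suppresses.
print(guarded_decision(vlm_vetoes=True, frame_mean=80.0))
```

Because the night branch is reachable only when the frame mean is at or below the threshold, any set of frames that are all brighter than 12 produces identical decisions with and without the guard, which is the structural argument made above for the unchanged 5/62.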
False-negative bound: daytime true-fire FN introduced by the VLM = 0 (Wilson upper bound on the observed 1/18 daytime FN — a non-fire frame — is 25.8%, which the small sample cannot tighten).
Cell legend: ✔ confirms · ◐ supports · ✗ contradicts · – unavailable · ∅ not-applicable · ⏳ stale · ≈ too-coarse. (Source-column keys as in wf36_corroboration_matrix.md §1.)
| Class (Gate-C) | detector class? | FCI/SEV | FIRMS | SLSTR | TROPOMI | CAMS | OSM | CAM |
|---|---|---|---|---|---|---|---|---|
| wildfire smoke | YES (wildfire_smoke) | – | ✔/◐ | ◐ | ∅ | ∅ | – | – |
| wildfire flame | YES (wildfire_flame) | – | ✔/◐ | ◐ | ∅ | ∅ | – | – |
| volcanic ash plume | YES (ash_plume) | ✔(SEV) | ∅ | ∅ | ∅ | ◐⏳ | ∅ | – |
| volcanic steam/degassing | YES (3 classes) | – | ∅ | ∅ | ✔(SO₂) | ◐⏳ | ∅ | – |
| lava / incandescence | YES (5 classes) | ◐(FCI) | ✔(on-crater) | ◐(FRP) | ∅ | ∅ | ∅ | – |
| meteorological cloud | partial (context-only) | – | ∅ | ∅ | ∅ | – | ∅ | ∅ |
| fog / haze | NO | – | ∅ | ∅ | ∅ | – | ∅ | ∅ |
| industrial smoke | NO | – | – | – | – | – | (avail, NOT wired) | – |
| dust / quarry / road dust | NO | – | ∅ | ∅ | ∅ | ≈ | (avail, NOT wired) | – |
| glare / sun / reflection | partial (context-only) | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ |
| camera artifact | NO | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ | (target, NOT wired) |
| black / frozen / stale frame | partial (context-only) | ∅ | ∅ | ∅ | ∅ | ∅ | ∅ | (target, NOT wired) |
| unknown | NO | – | – | – | – | – | – | – |
Genuinely corroborable now (5): wildfire smoke, wildfire flame (LIVE rules; uncorroborated this cycle — on-crater 0.4 km FIRMS, correctly not a wildfire confirmation), lava/incandescence (corroborated VOLCANIC this cycle), volcanic ash plume (corroborated via SEVIRI coverage only — weak/contextual, CAMS AOD was stale), volcanic steam/degassing (LIVE rule; uncorroborated this cycle — TROPOMI null + CAMS stale + value below the elevated floor). STAGED_NOT_LIVE (8): meteorological cloud, glare, frozen-frame (context labels only), fog/haze, industrial smoke, dust, camera artifact, unknown (no rule / no detector class). Per the directive's §8 hard-stop check, no over-claim is required and no hard-stop is triggered, provided the thesis restricts corroboration claims to the five rule-backed classes with their cycle-level qualifiers and labels the other eight STAGED — which it does.
Key LIVE_OPERATIONAL rows (full table in system_performance_spec.md): 5 configured cameras / 3 online; 75 s cadence; VLM trigger = routed hot/bright crops only; 64 feeds (46 live / 7 stale / 10 catalogued / 1 key-pending / 0 error); FWI 13.06 moderate; hourly EtnaFeedsRefresh Running; dedup IoU 0.4 / cooldown 1800 s; 240/240 bilingual parity; OFFLINE camera health honest; Overpass degradation guard active. STAGED_NOT_LIVE: human-review / alert-email dispatch. UNKNOWN_NEEDS_VERIFICATION: frame-capture success rate; p90/p99 latency.
Operational classification: RESEARCH_ONLY. SIMULATION ONLY: no QPU was contacted and the IBM Quantum key was not read. All compute was light local CPU (statevector simulation of ≤4 qubits, ~46 s wall). A quantum win is asserted only where a paired-difference CI excludes zero.
A fresh, real-data quantum-versus-classical benchmark was run directly on the INGV task: discriminate VOLCANIC vs WILDFIRE thermal-anomaly events near Etna — the populated-flank problem INGV's own literature calls spectrally hard. Data: 33 Etna-edifice volcanic events (GVP/INGV weekly state oracle + FIRMS FRP) and 404 vegetated-flank wildfire events (real FIRMS active fire in the ≤25 km annulus), n = 437, 182 date groups, GroupKFold by date (leakage-guarded). The hard near-field intrinsic regime uses thermal magnitude / FIRMS multiplicity only (log_frp_max, log_frp_sum, log_firms_count, n_firms_sensors), with no geometry and no source-availability proxies (which correlate perfectly with class by construction and are stripped). With 4 features = 4 qubits, the quantum map sees the full signal with no PCA information loss — the fairest possible footing.
Out-of-fold AUC (grouped-by-date, n=437):
| Model | Type | OOF AUC | 95% CI |
|---|---|---|---|
| Classical RBF-SVM (matched) | classical | 0.936 | [0.889, 0.973] |
| Classical HistGB (matched) | classical | 0.918 | [0.867, 0.964] |
| Classical HistGB (full features) | classical | 0.892 | [0.820, 0.947] |
| Quantum fidelity kernel (ZZ, 4q) | quantum | 0.837 | [0.767, 0.896] |
| Quantum VQC (4q, 2-layer) | quantum | 0.702 | [0.608, 0.790] |
Paired AUC deltas (quantum − classical, bootstrap 95% CI):
| Comparison | Δ AUC | 95% CI | Read |
|---|---|---|---|
| Q-kernel − RBF (matched) | −0.099 | [−0.161, −0.038] | quantum worse, CI excludes 0 |
| Q-kernel − HistGB (matched) | −0.081 | [−0.139, −0.028] | quantum worse, CI excludes 0 |
| VQC − RBF (matched) | −0.234 | [−0.325, −0.150] | quantum much worse, CI excludes 0 |
| Q-kernel − HistGB (full) | −0.055 | [−0.129, +0.021] | tie/worse (CI brackets 0) |
Honest interpretation. This is a clean, CI-backed publishable negative: on the same real features and the same leakage-guarded split, the quantum fidelity kernel (0.837) is beaten by matched classical RBF-SVM (0.936) by −0.099 [−0.161, −0.038]; the VQC (0.702) is the worst model tested. Notably the loss is not a dimensionality-truncation artifact — with only 4 features the 4-qubit map sees the full signal — it is the encoding/kernel-geometry mismatch and the VQC generalisation ceiling themselves. Because statevector simulation is exact, the classification verdict will not improve on real hardware (device noise only hurts). The right tool for this classification task is classical.
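The "CI excludes 0" reading rests on a paired bootstrap of the AUC difference, in which both models are scored on the same resampled events so that their errors stay correlated. The following is a minimal sketch of that idea using a percentile bootstrap over individual events. Whether the thesis resampled events or date groups is not stated here, so the resampling unit is an assumption, and the data in the usage lines are synthetic, not the Etna events.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_auc_delta_ci(labels, scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for AUC(a) - AUC(b) on paired data."""
    rng = random.Random(seed)
    n = len(labels)
    point = auc(labels, scores_a) - auc(labels, scores_b)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both models
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lost a class; AUC is undefined, draw again
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        deltas.append(auc(y, a) - auc(y, b))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Synthetic illustration only: two noisy scorers on 60 made-up events.
demo_rng = random.Random(1)
y = [1] * 20 + [0] * 40
model_a = [label + demo_rng.gauss(0, 0.6) for label in y]
model_b = [label + demo_rng.gauss(0, 1.2) for label in y]
delta, lo, hi = paired_auc_delta_ci(y, model_a, model_b, n_boot=500)
print(f"delta AUC = {delta:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

A comparison is read as a genuine difference only when both CI endpoints share a sign, which is the rule applied in the table; with date-grouped data a group-level resample would be the more conservative choice.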
The one genuine novelty (preserved as a research line, not an operational claim): the formulation of volcanic deformation source inversion (Mogi/Okada) as a QUBO/Ising problem. On a synthetic-realistic Etna GNSS geometry, the multi-source / model-selection variant solved by simulated annealing matches the exact optimum 100% [89–100%] where multi-start Levenberg–Marquardt traps at 60% and greedy/OMP at 0%. Honest caveat: this win is shared by a classical simulated-annealing sampler — it is a QUBO-formulation success, not quantum-hardware advantage. The Mogi-single-source-QUBO and Dozier-sub-pixel-as-QAOA mappings are, to our knowledge, literature firsts (pending peer confirmation). Overall quantum verdict: worth evaluating and worth formulating, but on the classification tasks that actually run the monitor it does not add operational value — classical wins with CIs that exclude zero. The QPU gate remains BLOCKED pending explicit approval; no quantum hardware was used anywhere. (Reproduce: python quantum_disambiguator.py, ~46 s, statevector sim only.)
This chapter is exhaustive by design (Gate F). The machine-readable manifest is failure_case_manifest.csv (38 rows); 24 source frames + 24 thumbnails are exported to failure_crops/. Read-only build: the live multi_cam_service was not disturbed.
Two frames are the only true wildfire-recall losses in the shipping Config A, reported separately and excluded from the daytime denominator:
| case_id | frame | detector | VLM verdict | crop |
|---|---|---|---|---|
| FN_007 | dfire_pos_AoF07871 | wildfire_flame @0.818 | VOLCANIC "glowing, irregularly shaped incandescence consistent with lava flow or vent activity" | failure_crops/FN_007.jpg |
| FN_009 | dfire_pos_AoF07875 | wildfire_flame @0.878 | VOLCANIC "bright, diffuse glow … consistent with summit incandescence or strombolian activity" | failure_crops/FN_009.jpg |
A bright nighttime vegetation-fire line and lava incandescence are not separable from single-frame appearance; the volcano-context prompt that gives the system its low volcanic FA is exactly what mis-routes these two. This is mitigated, not solved. The night-safety guard (Chapter 3.6 / 5.5) converts these from silent false-negatives into surfaced uncertain_night alerts (night silent FN 2 → 0), but the underlying single-frame ambiguity remains. Recommended next steps (future work, not claimed operational): multi-frame temporal persistence (lava is steady; wildfire flickers and spreads) and a hard FIRMS/SLSTR night co-location override. Config B incurs 2 additional night-lava losses (AoF07723, AoF07773; FNB_010, FNB_011) — the documented recall price of pushing volcanic FA from 8.1% to 3.2%, and the reason Config A is the default.
FN_008 (dfire_pos_AoF07872, wildfire_flame @0.858/0.713 → NEITHER, "artificial ground lights, likely from settlement") is the single frame that moves daytime recall 100% → 94.4% (17/18). It is an arguably-correct rejection of distant settlement/ground lights, counted conservatively as a recall loss so the headline is not inflated. Crop: failure_crops/FN_008.jpg.
The Config-A veto suppresses exactly one volcanic FP (FP_003 / 9243ab, sensor-artifact/lens-flare boxed as flame, VLM ruled NEITHER, 9.7% → 8.1%). The 5 survivors (FP_001, FP_002, FP_004, FP_005, FP_006) are summit passive-degassing steam / cloud / ash mis-boxed as wildfire_smoke; smoke-class crops pass through by design (this is what preserves wildfire recall), so the veto cannot suppress them in Config A. All six source frames are exported (failure_crops/FP_001.jpg … FP_006.jpg).
The prior whole-frame veto reached 0/62 volcanic FA but destroyed ~20.8% of genuinely-visible fire recall (FRAMEVETO_012–015: HPWREN/Roboflow faint distant horizon smoke called "haze/cloud" by the volcano-context whole-frame prompt). The shipping crop-level Config A recovers every one — a grey distant plume is not red-hot, is never routed to the VLM, and can never be vetoed. Retained as the honest upper-bound failure mode of the alternative architecture.
Volcanic confounder library (exported real frames): HN_018 steam/degassing, HN_019 ash plume, HN_020 lava glow, HN_021 strombolian, HN_022 cloud obscuration, HN_023 glare/sunset, HN_024 snow cover. Web hard negatives (dust, fog, industrial smoke, glare, other; HNWEB_025–029) carry blank frame paths because the images live on the eval pod, not in this checkout — stated as fact, not fabricated. Categories with 0 locally-exportable instances (industrial smoke, dust, compression-artifact) are stated as such rather than invented.
failure_appendix.md §8) are excluded from any "live" count and treated as no-evidence by corroboration. The 10 catalogued + 1 key-pending feeds are not live.
ALERT_EMAIL_ENABLED=0); the dashboard is internal/preview ("Not a public product"); human-in-loop review is not a live workflow.
/api/cams transient TLS resets observed (2 of 3 fetches; recovered on retry, local publisher corroborated). A monitoring item, not a hard stop.
top_p not pinned in the VLM request payload (server default), a reproducibility gap to close.
qwen3-vl-32b build, not guaranteed in perpetuity.
FROZEN_037, SATCON_038); none is invented. The night dark-RGB condition is handled as a true quiet-scene negative, not mis-read as a frozen failure.
Every item here is PLANNED until it has live health evidence; nothing is described as operational. The thesis baseline stays frozen at 4d1b0cc / ingv_v1b_best.pt / qwen3-vl-32b temp 0.0; new models are challengers, not baseline swaps. No edge/Hailo/DGX hardware is assumed (ops-room target: detector CPU/GPU auto, VLM remote serverless).
Instrument frame-capture success rate (rolling per-source online/offline + frame-age counter, replacing the point-in-time "3 of 5 online"); compute p90/p99 latency from existing per-frame arrays including the routed-crop VLM tail, separately for CPU and GPU; pin top_p in the VLM request; harden /api/cams with a retry/health probe and a freshness-stamped fallback to cameras_wall.json; add a per-frame pHash-identical-consecutive-frame guard (catch a frozen-but-recent camera); lock WF36 STAGED wording.
Enlarge the held-out sets (more bulletin-confirmed volcanic frames and clean-source daytime wildfire frames; keep pHash-leakage-zero and the Pyronear-sequence separation); commit the train↔held-out collision list; run the WARP-style hard-negative robustness battery [12] (Gaussian noise, JPEG compression, blur, cloud-like patches, fog/haze, glare, rain-on-lens, timestamp overlays, black/frozen frames) and report detector + two-stage degradation curves.
Domain adaptation [13,14,15,16] is the core next pillar: collect unlabeled frames from every live camera, mine detector positives + detector–VLM disagreements + VLM vetoes, human-review a small hard set, train with UDA/self-training, and test on held-out camera/date/weather/volcanic episodes, guarding against catastrophic false-positive drift. Detector/segmentation challengers (RESEARCH_ONLY): YOLO11/YOLO12 [4], RT-DETR [5], Grounding DINO 1.5/Edge [6], SAM 2 video plume masks [23,24], optical-flow smoke-motion-consistency [21,22], sky/terrain masking — benchmarked against the frozen YOLO11s, not swapped in. Prompt-as-evaluated-component study [10,11]: measure the recall-versus-FA tradeoff across CROP/SMOKE prompt variants, SmokeBench-style [7].
Fine-tuned Etna/wildfire VLM [20]: fine-tune a Qwen2.5/Qwen3-VL-style model on Etna degassing-vs-wildfire-vs-cloud crops, keeping the frozen 32B as the thesis baseline. IR/thermal fusion [17,18,19,20]: RGB+IR fusion, thermal/night detection, fire/lava/industrial-heat discrimination, satellite-thermal corroboration — directly attacking the night-fire↔lava residual. MTG-FCI event tracking [25,26,27]: move FCI from coverage-only corroboration to FireDyn-style fire-pixel extraction, rate-of-spread, FRP evolution, and fire-arrival maps, treating FCI as early-candidate + event-tracking data, not perfect ground truth. Active-learning dashboard + data flywheel: detector fires → candidates, VLM disagreements → hard examples, human decisions → labels, satellite corroboration → weak labels, INGV bulletins → volcanic labels, stale/frozen frames → camera-health labels.
VLM/MLLM early-smoke localization limits (keep detector-first + VLM-veto per current evidence [7]); FireCLIP-style multimodal prompt tuning [10]; TROPOMI SO₂ + CAMS as contextual-only evidence [31,32,33]; FCI-vs-SEVIRI sensitivity study [25] and SLSTR active-fire characterization [29,30]; multi-season dataset expansion across lighting/weather/angle/episode shift; operational human-in-loop alerting protocol design (currently STAGED) before any public-facing claim; and the volcanic-source-inversion-as-QUBO research line (Chapter 5.8) — a parallel research thread, QPU BLOCKED.
Verification status: all 35 references VERIFIED (paper/source confirmed to exist via arXiv / publisher / IEEE / ScienceDirect / Hugging Face papers index, matching title and authors); [34] and [35] are official documentation (verified, non-paper), cited deliberately for the operational feed-reliability concept.
Every headline claim traces to thesis_evidence_ledger.csv (claim_id, operational_status, evidence_path, commit, command/query, timestamp, metric, CI, limitation, thesis-safe wording). Summary of the 16 ledger rows:
| Claim | Status | Metric | Evidence |
|---|---|---|---|
| C01 Frozen baseline | LIVE_OPERATIONAL | commit 4d1b0cc (master) | git rev-parse HEAD @ 02:31Z |
| C02 64-feed board | LIVE_OPERATIONAL | 46/7/10/1/0 of 64 | feeds.json @ 02:31:33Z |
| C03 Overpass guard | LIVE_OPERATIONAL | 23,379 roads / 341 rail (live) | osm_roads_rail.py L64-72 |
| C04 Live FWI | LIVE_OPERATIONAL | 13.06 (moderate) | feeds.json effis_fwi |
| C05 Camera wall | LIVE_OPERATIONAL (3/5) | 75 s cadence, 3 online | /api/cams @ 02:44:37Z |
| C06 Camera sources | LIVE_OPERATIONAL | 5 (INGV + 3 Windy + EtnaWalk) | /api/cams sources[] |
| C07 VLM trigger | LIVE_OPERATIONAL | ~0.151/frame, 0 quiet | /api/cams + WF25 |
| C08 WF25 volcanic FA | RESEARCH_ONLY | 8.1% (5/62), CI [3.5–17.5%] | WF25_system_scoring.json |
| C09 WF25 daytime recall | RESEARCH_ONLY | 94.4% (17/18), CI [74.2–99.0%] | WF25_system_scoring.json |
| C10 Deterministic cache | RESEARCH_ONLY | 245/245 match, 0 leakage | WF25 provenance |
| C11 WF36 matrix | LIVE_OPERATIONAL | 13 classes; >24 h = no-evidence | corroboration.py @ 02:56:39Z |
| C12 Bilingual parity | LIVE_OPERATIONAL | 240 EN = 240 IT | i18n.js |
| C13 Scheduled jobs | LIVE_OPERATIONAL | both Running; feeds hourly | schtasks @ 02:32Z |
| C14 Model/prompt freeze | LIVE_OPERATIONAL | sha256 c0a3d0ea…; temp 0.0 | model_prompt_freeze.json |
| C15 Dedup/staleness guards | LIVE_OPERATIONAL | IoU 0.4; 1800 s; 24 h | config.py |
| C16 Alert dispatch | STAGED_NOT_LIVE | alert_email.enabled=false | banner + cameras_wall.json |
failure_case_manifest.csv (38 rows); failure_crops/ (24 source frames + 24 thumbnails). Key illustrative crops: night-fire↔lava residual FN_007.jpg, FN_009.jpg; borderline town-lights FN_008.jpg; surviving volcanic FA FP_001.jpg … FP_006.jpg (FP_003 = the one vetoed sensor-artifact); Config-B-only night losses FNB_010.jpg, FNB_011.jpg; prior frame-level-veto losses FRAMEVETO_012–015.jpg; volcanic confounder library HN_018 … HN_024; contamination CONTAM_016 (false-colour satellite), CONTAM_017 (painting) — both correctly VLM-rejected and excluded from the recall denominator. Web hard-negatives HNWEB_025–029 carry blank paths (images off-repo, stated not fabricated). Vetoed-crop originals: crops/VETOED_dfire_pos_AoF07871_…_VOLCANIC.jpg, …AoF07875_…_VOLCANIC.jpg, …AoF07872_…crop0/crop1_NEITHER.jpg.
Full manifest: model_prompt_freeze.json (commit 4d1b0cc, frozen 2026-06-29T02:45:00Z). Detector: YOLO11s 19-class, models/ingv_v1b_best.pt, sha256 c0a3d0ead257d318e70bec3bb84feaec7b99e9e3d55b132fc5f1ffd405cf0a20, 19,261,267 bytes, imgsz 960, conf 0.25, crop pad 0.25 / max 768 px, dedup IoU 0.4, cooldown 1800 s. VLM: qwen3-vl-32b, Model-Vault RunPod serverless OpenAI-compatible endpoint, crop-level routing, temperature 0.0, max_tokens 120, top_p server-default, retries 3 / backoff 8.0 s / timeout 180 s, image crop long-side ≤768 px JPEG q88 base64. Two frozen prompts (CROP_PROMPT primary veto, SMOKE_PROMPT degassing-vs-plume) with verbatim text in the manifest; strict-JSON output {label, confidence, reason} with PARSE_FAIL fallback. Cache: deterministic temp-0 per-crop (reports/crop_veto_outputs/te_crop_level_cpu_configA.json). Hardware: local OFFICE workstation (CPU/GPU auto) + remote serverless VLM; no DGX, no PHOENIX prod, no edge/Hailo assumed.
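The dedup parameters in the manifest (IoU 0.4, cooldown 1800 s) can be read as a simple suppression rule. The sketch below is one plausible interpretation, not the manifest's implementation: it assumes a new candidate is suppressed when it overlaps a previously alerted box from the same camera with IoU at or above 0.4 and that earlier alert is less than 1800 s old. The class name `AlertDeduplicator` and the per-camera scoping are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class AlertDeduplicator:
    """Suppress repeat alerts for overlapping boxes inside a cooldown window.

    One instance is assumed per camera; timestamps are seconds on any
    monotonically increasing clock.
    """

    def __init__(self, iou_threshold=0.4, cooldown_s=1800.0):
        self.iou_threshold = iou_threshold
        self.cooldown_s = cooldown_s
        self.recent = []  # (timestamp_s, box) for alerts still inside the cooldown

    def should_alert(self, box, now_s):
        # Forget alerts whose cooldown has expired.
        self.recent = [(t, b) for t, b in self.recent if now_s - t < self.cooldown_s]
        if any(iou(box, b) >= self.iou_threshold for _, b in self.recent):
            return False  # duplicate of a recent alert: suppress
        self.recent.append((now_s, box))
        return True


dedup = AlertDeduplicator()
print(dedup.should_alert((0, 0, 10, 10), now_s=0.0))     # first sighting
print(dedup.should_alert((1, 1, 11, 11), now_s=75.0))    # overlapping box one cycle later
print(dedup.should_alert((1, 1, 11, 11), now_s=1800.0))  # same region after the cooldown
```

With a 75 s cadence, a persistent plume would otherwise generate 24 candidate alerts in one cooldown window, so a rule of this shape keeps a single event from flooding a reviewer while still re-alerting if the event outlasts the window.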
Full per-class matrix, per-class implemented rule + worked example on current live feeds + honest caveat: wf36_corroboration_matrix.md (implementation service/corroboration.py @ 4d1b0cc, feed snapshot 2026-06-29T02:56:39Z). Five rule-backed classes (wildfire smoke, wildfire flame, lava/incandescence, ash plume, steam/degassing) with cycle-level qualifiers; eight STAGED classes (cloud, glare, frozen-frame, fog/haze, industrial smoke, dust, camera artifact, unknown). Freshness rule: feeds older than FEED_MAX_AGE_H = 24 h are treated as no current evidence (uncorroborated, never contradicted). Reproduce: ETNA_FEEDS_OUT=../etna_dashboard/feeds/out python corroboration.py.
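The freshness rule can be stated compactly in code. This is a minimal sketch, assuming each feed carries a timezone-aware fetch timestamp and a per-class observation such as "confirms", "supports", or "contradicts"; the function name `effective_evidence`, the returned label `no_evidence`, and the treatment of a feed aged exactly 24 h as still fresh are assumptions. Only `FEED_MAX_AGE_H = 24` comes from the text above.

```python
from datetime import datetime, timedelta, timezone

FEED_MAX_AGE_H = 24  # from the freshness rule: older feeds carry no current evidence


def effective_evidence(observation, fetched_at, now):
    """Return the evidence a feed contributes to corroboration at time `now`.

    A stale feed is downgraded to 'no_evidence' whatever it observed, so it can
    leave a class uncorroborated but can never contradict a detection.
    """
    if now - fetched_at > timedelta(hours=FEED_MAX_AGE_H):
        return "no_evidence"
    return observation


snapshot = datetime(2026, 6, 29, 2, 56, 39, tzinfo=timezone.utc)
fresh_feed = snapshot - timedelta(hours=3)
stale_feed = snapshot - timedelta(hours=30)

print(effective_evidence("confirms", fresh_feed, snapshot))
print(effective_evidence("contradicts", stale_feed, snapshot))
```

The asymmetry is deliberate in this reading: dropping stale evidence can only remove support, so an outdated feed cannot veto a live camera detection.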
Prepared for the INGV deliverable. Frozen baseline commit 4d1b0cc. Every quoted number traces to a committed Phase-1 evidence artifact in flank_wildfire/reports/thesis/. The system is reported as an evidence-backed multi-source monitoring architecture with explicit operational classification, confidence intervals, and failure modes; it is not claimed to solve early wildfire detection generally.