Author: Mark Ludwikowski <markl02us@yahoo.com> · INTERNAL / PREVIEW — ADRIZ self-assessment for INGV review. Use your browser’s Print → Save as PDF for a PDF copy.

A Public-Data-Driven Wildfire and Volcanic-Confounder Visual Monitoring System for Sicily / Mount Etna

A detector-first, VLM-veto, multi-feed-corroboration architecture, evaluated with operational honesty

Author: Mark Ludwikowski <markl02us@yahoo.com> Deliverable: ADRIZ → INGV (Istituto Nazionale di Geofisica e Vulcanologia), Mount Etna visual monitoring Frozen evaluation baseline: commit 4d1b0cca79bf396e99b9a49e8477ae3a36ecfd33 (4d1b0cc), branch master Verification window (UTC): 2026-06-29 ~02:31 → ~02:56 Operational-classification labels used throughout: LIVE_OPERATIONAL · LIVE_STALE · STAGED_NOT_LIVE · RESEARCH_ONLY · PLANNED · BROKEN_OR_BLOCKED · UNKNOWN_NEEDS_VERIFICATION

Evidence discipline (binding). Every quantitative claim in this thesis traces to a committed Phase-1 evidence artifact in flank_wildfire/reports/thesis/. No number is invented or softened. Where evidence is incomplete the text says so plainly, scopes the claim, or labels the metric UNKNOWN. The system is not claimed to solve early wildfire detection generally; it is a reproducible, evidence-backed architecture for multi-source environmental monitoring, reported with detector-alone-versus-two-stage metrics, confidence intervals, latency, false-positive and false-negative behaviour, and explicit failure modes.


Abstract

Mount Etna is one of the most challenging environments in the world for camera-based wildfire detection: vegetated, populated flanks that genuinely burn sit directly beneath a persistently active volcano whose degassing plumes, ash columns, lava incandescence, and strombolian ejecta are constant visual confounders, and whose summit is routinely obscured by meteorological cloud and twilight glare. This work presents and evaluates ADRIZ, a public-data-driven visual monitoring system for the Etna / Sicily setting that combines four components: (i) a camera wall ingesting multiple public webcam and institutional video sources on a 75-second model-resident cadence with honest per-source health reporting; (ii) a frozen YOLO11s detector (weights sha256 c0a3d0ead257…cf0a20) that generates per-frame candidates; (iii) a crop-level Qwen3-VL semantic veto (qwen3-vl-32b, temperature 0.0) invoked only on detector-routed hot/bright crops, made recall-safe at night by a durable satellite-corroboration override; and (iv) a 64-feed public-data board with operational staleness classification plus a per-class satellite/weather/geospatial corroboration layer (WF36).

On a held-out, leakage-controlled evaluation (RESEARCH_ONLY, small-n), the two-stage system held the volcanic false-alarm rate at 8.1% (5/62, 95% Wilson CI [3.5–17.5%]) versus 9.7% detector-alone, while preserving 94.4% daytime genuinely-visible wildfire recall (17/18, 95% CI [74.2–99.0%]). The veto's measured effect was strictly one-directional (McNemar b=2, c=0, exact p=0.50, not significant at this sample size): it removed exactly one sensor-artifact false alarm and one borderline daytime ground-lights frame, and improved no genuine-fire decision. At night the veto originally vetoed two true vegetation fires whose flame is single-frame-indistinguishable from lava incandescence; a durable night-safety guard now withholds the volcanic veto on any uncorroborated wildfire-class night detection (surfacing it as uncertain_night rather than dropping it), taking the night true-fire silent false-negative count from 2 to 0 while leaving the daytime numbers and the volcanic false-alarm rate unchanged. A separately-reported quantum due-diligence track found, on the same Etna volcanic-versus-wildfire task, that simulated quantum classifiers are beaten by matched classical baselines with confidence intervals that exclude zero (Q-kernel −0.099 [−0.161, −0.038]) — a clean publishable negative, with the one genuine novelty (volcanic source-inversion as a QUBO) preserved as a research line, no quantum hardware used. We report the system with full operational classification, identify the limitations honestly (3-of-5 cameras online at verification, seven stale feeds, latency tails not instrumented, alert dispatch staged-off), and lay out the next build directions: domain adaptation, IR/thermal fusion, MTG-FCI event tracking, and a fine-tuned wildfire/volcano VLM.


Chapter 1 — Motivation and Problem

1.1 Wildfire early detection and the false-alarm budget

Fixed-camera early smoke detection is an established and operationally valuable lineage. Govil et al. [1] demonstrated, on the HPWREN / AlertWildfire camera network in Southern California, a deep-learning system scanning hundreds of cameras every minute that detects smoke typically within roughly fifteen minutes of ignition at under one false positive per camera per day. That result frames the problem this work inherits: detection value comes not from recall alone but from recall under a strict false-alarm budget. A monitoring system that alarms constantly is operationally useless regardless of its sensitivity, because human reviewers stop trusting it. Dewangan et al. [2] (FIgLib / SmokeyNet) and the PyroNear-2025 benchmark [3] established that single-frame camera detection saturates on confounders and that the task remains hard across camera domains — which is precisely why this work layers a semantic veto and multi-source corroboration on top of a fast detector rather than relying on the detector alone.

1.2 Sicily and Mount Etna as a hard monitoring environment

Sicily experiences a severe Mediterranean wildfire season; its vegetated terrain, including the populated lower and middle flanks of Mount Etna, genuinely burns. The deliverable target for this system is INGV, Italy's national geophysics and volcanology institute, whose Etna monitoring concern spans both volcanic activity and the wildfire risk on the edifice's flanks. The defining difficulty is co-location: a vegetation wildfire on Etna's flank and the volcano's own thermal and plume activity occupy the same cameras, the same satellite pixels, and frequently the same frame.

1.3 Volcanic and meteorological confounders

The confounder set at Etna is unusually rich and adversarial:

WARP [12] found that both CNN and transformer wildfire detectors fail to distinguish cloud-like patches from real smoke under local adversarial perturbation; Etna supplies that adversarial confounder set naturally, every day. A system for this environment must therefore be engineered around wildfire-versus-volcanic disambiguation, not generic smoke detection.

1.4 The need for a public-feed, multi-source, low-cost system

No single sensor resolves the confounder problem. A camera sees a plume but cannot tell smoke from steam at the summit; a satellite thermal product sees heat but at a pixel far coarser than the camera's region of interest and cannot tell flank wildfire from crater lava when the heat is on-crater; a gas sensor or SO₂ retrieval supports a "volcanic" reading but is a coarse atmospheric column, not pixel-level event proof. The architecture this thesis evaluates therefore treats the problem as a system of independent public data sources — camera candidate detection, semantic VLM reasoning, and satellite/weather/geospatial corroboration — built entirely from public feeds and low-cost commodity compute (a local workstation detector plus remote serverless VLM inference; no edge, Hailo, or DGX hardware is assumed). The contribution is the reproducible, operationally-honest integration of these sources, with every component classified by its true operational status.


Chapter 2 — Related Work

This chapter situates each design choice in current, verified literature. Every reference was confirmed to exist against arXiv, the publisher, IEEE/ScienceDirect, or the Hugging Face papers index before citation; all 35 references are VERIFIED (the two highest-stakes, SmokeBench [7] and FCI-FireDyn [26], were independently spot-re-confirmed). Numbers in square brackets index the References section.

2.1 Camera-based wildfire / smoke detection foundations

The camera-wall lineage is defined by Govil et al. [1] (HPWREN/AlertWildfire, ~15-minute detection, <1 FP/camera/day), Dewangan et al. [2] (FIgLib's ~25,000 labelled fixed-camera smoke images and the spatiotemporal SmokeyNet CNN that exploits frame-to-frame information), and the PyroNear-2025 benchmark [3] (a geographically diverse web-scraped camera dataset, ~150k annotations over 640 wildfires, showing the task remains hard across domains). These define the detector stage's job — per-frame candidate generation under a strict false-alarm budget — and motivate the temporal and multi-source layers added on top.

2.2 YOLO-style detectors and transformer challengers

The incumbent detector is a YOLO11s-class single-stage model chosen for model-resident real-time inference. Newer detectors are treated as future directions, not requirements: YOLOv12 [4] (attention-centric, Area-Attention/R-ELAN, improved mAP at comparable latency but with training-instability/CPU-throughput costs); RT-DETR [5] (the first real-time end-to-end NMS-free detection transformer, CVPR 2024); and for open-vocabulary candidate generation, Grounding DINO 1.5 [6], particularly the TensorRT-optimised Edge variant (~75 FPS). RT-DETR's NMS-free property and Grounding DINO Edge's text-promptable open-set detection are the two most relevant upgrade paths for a camera wall that must add confounder classes without full retraining.

2.3 VLMs / MLLMs for wildfire smoke — strong semantically, weak as primary localizers

This is the central evidence base for the detector-first + VLM-veto choice. SmokeBench [7] (Qi, Li, Barnes; WACV 2026) evaluates MLLMs (Qwen2.5-VL, InternVL3, GPT-4o, Gemini-2.5 Pro, Grounding DINO, Idefics2, Unified-IO 2) on smoke classification, localization, and detection; its headline finding is that models can often classify large-area smoke but all struggle with accurate localization, especially early-stage, with performance strongly tied to smoke volume. Earth-observation VLM benchmarks corroborate the pattern: GPT-4V-on-EO ("good at captioning, bad at counting") [8] and GEOBench-VLM [9] both show strong open-ended scene knowledge but poor spatial localization/counting (GPT-4o ~40% on GEOBench-VLM MCQs, ~2× chance). This directly supports using a VLM not as the primary localizer but as a second-stage semantic veto on already-localized crops — the regime where MLLMs are strong.

2.4 Detector-first + VLM-veto versus VLM-only monitoring

No single canonical paper names "detector-first + VLM-veto for wildfire cameras"; the claim is supported by converging verified evidence and this is stated honestly rather than attributed to an invented source. SmokeBench [7] and the EO-VLM benchmarks [8,9] establish VLM localization weakness; the camera-detector lineage [1,2,3] establishes that fast single-stage detectors localize well but over-fire on confounders. FireCLIP [10] is the closest direct evidence that a vision-language stage adds value specifically as a false-alarm discriminator (cooking smoke, industrial emissions), reporting ≥12.45% zero-shot improvement and better regional generalization via prompt tuning. The two-stage decomposition — fast detector for recall/localization, VLM for precision/semantic veto — is exactly what the WF25 evaluation (Chapter 5) tests empirically.

2.5 Prompt design as an evaluated system component

FireCLIP [10] demonstrates prompt tuning as the mechanism delivering its false-alarm and generalization gains; TuneVLSeg [11] benchmarks textual/visual/multimodal prompt-tuning under domain shift (textual prompts degrade under large shift, visual prompting is a competitive cheaper first attempt); WARP [12] shows prompt/threshold-adjacent design controls the recall-versus-false-alarm operating point. Accordingly, the Qwen3-VL veto prompts in this work are versioned and frozen (Chapter 3, model_prompt_freeze.json) and reported as a tested variable, not an afterthought.

2.6 Domain adaptation for camera networks

Identified as a likely core next-build pillar. Verified sources: the MCAF multilevel-feature-alignment UDA smoke detector [13]; EDIF [14] for enhanced domain-invariant cross-domain forest-fire smoke detection; a synthetic-to-real UDA study on the AlertWildfire network [15]; and Pesonen et al. [16] on zero-shot foundation-model supervision training small real-time camera segmenters from box labels. Each ADRIZ camera (INGV, Windy, EtnaWalk) is a distinct visual domain (lighting, angle, weather, Etna's plume backdrop), which scopes the Chapter-7 adaptation plan.

2.7 Robustness and adversarial / hard-negative testing

WARP [12] (Ide & Yang) is the first model-agnostic framework for adversarial robustness of wildfire detection models, injecting global (Gaussian) and local (cloud-PNG-patch) noise; transformers showed >70% precision degradation under global noise, and both CNN and transformer models failed to distinguish cloud-like patches from real smoke under local attacks. This is the template for the failure-appendix hard-negative battery (Chapter 6) and the direct literature motivation for the VLM veto + satellite corroboration as mitigations.

2.8 IR / thermal fusion (future capability)

Verified RGB-thermal fusion sources: a UAV multi-scenario RGB-Thermal forest-fire dataset and fusion model [17]; the MCDet target-aware RGB-T fusion model [18]; a visible+thermal-infrared flame-detection method [19]; and at the VLM level WildFireVQA [20], a large radiometric-thermal VQA benchmark finding RGB remains the strongest single modality for current MLLMs while retrieved thermal context helps stronger models. IR is directly relevant to Etna's lava-versus-flame ambiguity, but WildFireVQA keeps the claim honest: thermal is a contextual gain, not a solved modality for VLMs.

2.9 Optical-flow / temporal plume tracking and segmentation

For temporal smoke-motion consistency and plume-growth segmentation: a spatiotemporal bag-of-features early-smoke detector using histogram of oriented optical flow exploiting upward thermal convection [21]; spatiotemporal/dynamic-texture forest-fire smoke video detection [22]; and for modern segmentation, SAM 2 [23] (promptable video segmentation with streaming memory) plus a fire-specific SAM2 study [24] (Box+MP best, mIoU ~0.64). Temporal consistency is the most promising future mitigation for the night-fire↔lava residual (lava is steady; wildfire flickers and spreads).

2.10 Satellite corroboration and geostationary event tracking (MTG-FCI)

The corroboration layer's geostationary basis. MTG-FCI detects fires ~4 h earlier than SEVIRI, ~2 h before MODIS, and finds ~5× more active-fire pixels than SEVIRI [25]; the FCI-FireDyn / Fire-Event-Tracker algorithm [26] (Paugam et al., 2026) spatio-temporally clusters FCI hotspots at 10-minute cadence to derive fire-arrival maps, rate of spread, and burnt-area evolution, validated on Southern-European 2024–2025 fires; a feasibility study [27] explores unsupervised MTG-FCI wildfire detection. The directive's insistence that FCI be treated as event-tracking / early-candidate data, not perfect ground truth is grounded here: FCI's strength is temporal evolution and early timing, while its ~1–2 km pixel makes it too coarse for camera-event-level ground truth — exactly the "supports / too coarse" labelling WF36 applies.

2.11 Point corroboration: FIRMS VIIRS/MODIS and Sentinel-3 SLSTR

Schroeder et al. [28] is the canonical VIIRS 375 m active-fire algorithm behind NASA FIRMS; Xu & Wooster [29] describe the operational SLSTR daytime active-fire / FRP product with global intercomparison to MODIS/VIIRS/Landsat (daytime product operational since March 2022), building on pre-launch algorithm work [30]. These supply the published detection limits that justify the asymmetric corroboration logic in WF36: a thermal hotspot near a camera candidate elevates confidence, while absence is treated as non-disconfirming (sub-pixel early smoke is below the satellite detection floor).

2.12 Sentinel-5P/TROPOMI and CAMS as contextual gas/plume evidence

Theys et al. [31] (global TROPOMI volcanic-SO₂ degassing), an Italy-specific Stromboli SO₂ study [32], and the TROPOMI SO₂ retrieval ATBD [33] anchor the gas-context layer. TROPOMI SO₂ supports a "volcanic degassing" classification but its coarse footprint and overpass cadence make it supporting context, never per-pixel camera-event proof — the precise labelling WF36 enforces. CAMS/GFAS plays the analogous aerosol/emission role; TROPOMI is cited as the verified anchor and CAMS-specific event-level proof is flagged context-only.

2.13 Operational data-feed reliability and staleness classification

This is an engineering/operational contribution rather than an academic finding, and that is stated honestly: it is grounded in official documentation — NASA FIRMS product/latency documentation [34] and EUMETSAT MTG instrument documentation [35] — which defines the upstream cadences against which the staleness thresholds are derived. The system's feed-health monitor classifies every feed operationally, with a degraded-response guard ensuring a degraded upstream cannot masquerade as "live-with-zero."


Chapter 3 — System Architecture

Status of this chapter's claims: every component is labelled with its verified operational status from operational_state_verification.md and system_performance_spec.md, re-verified from current evidence in the 2026-06-29 02:31–02:56 UTC window at commit 4d1b0cc.

3.1 Overview

ADRIZ is a four-layer pipeline:

  Public camera sources (5 configured)
        │  75 s model-resident cycle
        ▼
  [Stage 1]  YOLO11s detector  (frozen, sha256 c0a3d0ea…)
        │  per-frame candidate boxes, 19-class head
        ├── PASS_THROUGH classes (smoke / ash / degassing) ──────────────┐
        │                                                                │
        └── ROUTE classes (lava / incandescence / flame) ──► hot/bright crop
                                                                │
        ▼                                                       ▼
  [Stage 2]  Crop-level Qwen3-VL veto (qwen3-vl-32b, temp 0.0)
        │  WILDFIRE | VOLCANIC | NEITHER   (+ NIGHT-SAFETY override)
        ▼
  [Stage 3]  WF36 multi-feed corroboration  (FIRMS / SLSTR / FCI / SEVIRI / TROPOMI / CAMS)
        │  on-crater = volcanic;  off-crater fresh FIRMS = independent wildfire support
        ▼
  [Stage 4]  Alert taxonomy + feed-health board (64 feeds) + bilingual dashboard

The detector and VLM run live in the EtnaCameraWall scheduled task; the feed board refreshes hourly via EtnaFeedsRefresh. Both tasks were Running at verification (schtasks /query @ 02:32 UTC). Status: LIVE_OPERATIONAL for the running pipeline; alert dispatch is STAGED_NOT_LIVE (Section 3.8).

3.2 The 64-feed public-data board and feed-health monitoring

The data-feeds board inventories 64 public sources. At verification (curl https://adr-etna-ingv.pages.dev/data/feeds.json @ 2026-06-29T02:31:33Z, HTTP 200), an independent recount of the 64 group-level entries matched the published summary exactly:

Feed status Count Meaning (from the live banner) Status label
live 46 data returned now LIVE_OPERATIONAL
stale 7 real pull, but upstream archive/outage lag LIVE_STALE
catalogued 10 reachable but needs a token / no scalar-point API STAGED_NOT_LIVE
auth_pending 1 our key not yet configured STAGED_NOT_LIVE
error 0
total 64 LIVE_OPERATIONAL (board)

Two engineering guarantees make this a defensible operational claim rather than a vanity count:

A representative live numeric is the Fire Weather Index: the most recent EFFIS daily FWI analysis (CEMS EWDS GEFF 4.1) at the Etna-summit cell was 13.06 (moderate) at 02:31:33Z. Honest latency caveat: GEFF 4.1 daily analysis carries ~3–4 day latency, so this is the most recent daily analysis, not an instantaneous reading.

The seven stale feeds at verification are surfaced as failures, not hidden (full table in Chapter 6 / failure_appendix.md §8): cams_gfas_fire (archive 208 days behind), effis_active_fire (EFFIS WFS Oracle backend failure, self-heals), era5_land (~5-day production latency, stale-by-design), gwis (JRC WFS Oracle backend failure), ingv_oe_bulletin (no Etna item in this week's GVP RSS), meteostat (bulk-archive lag), opensky_adsb (HTTP 429 anonymous rate-limit).

3.3 Camera ingestion — including the multi-source reality

Five camera sources are configured (/api/cams sources[] @ 2026-06-29T02:44:37Z): the INGV EtnaTVChn mosaic (garr.tv PeerTube HLS), three Windy webcams (Milo East 9.1 km, Trecastagni 16.8 km, Catania Jonio 26.8 km from summit), and the EtnaWalk YouTube live stream. The wall runs a 75-second model-resident cycle (cadence_s:75, confirmed in both /api/cams and the local publisher artifact cameras_wall.json).

Honest multi-source caveat (the camera wall is not "5 live cameras"). At the verification timestamp only 3 of the 5 sources were online (n_online:3) — the three Windy cameras (all CLEAR, DAY_RGB). The INGV EtnaTVChn mosaic and the EtnaWalk stream were OFFLINE (online=False, badge OFFLINE). The wall reports OFFLINE honestly rather than serving a frozen frame. Thesis-wide wording is therefore scoped to "5 configured camera sources, 3 online at the verification timestamp," never "5 live cameras." Status: LIVE_OPERATIONAL (3/5 online); the two OFFLINE sources are a LIVE_STALE sub-component honestly flagged.

Camera health detection is itself a LIVE_OPERATIONAL feature: per-source online:false / badge OFFLINE is emitted truthfully (multi_cam_service.py L251–261), and a stale/frozen-frame watchdog enforces an age-based guard (STALE_FRAME_MAX_S = 1800 s). A per-frame perceptual-hash identical-frame check (to catch a recent but frozen camera) is not yet implemented and is queued in the roadmap. The /api/cams endpoint additionally showed transient TLS resets (two of three fetches) during verification before succeeding; the locally published cameras_wall.json corroborated the same content. This is recorded as a monitoring flag (Chapter 6), not a hard stop.

3.4 The detector — frozen YOLO11s, exact provenance

The Stage-1 detector is frozen for the thesis (WF19 KEEP-INCUMBENT decision):

Parameter Value
Architecture YOLO11s, 19-class head
Weights models/ingv_v1b_best.pt, 19,261,267 bytes
Weights sha256 c0a3d0ead257d318e70bec3bb84feaec7b99e9e3d55b132fc5f1ffd405cf0a20
Inference size / confidence imgsz 960 / conf 0.25
Crop pad / max side 0.25 fraction / 768 px
Dedup / cooldown IoU 0.4 / 1800 s per class-location
Device auto (CPU or GPU; no edge-hardware assumption)

The model-resident detector loop (multi_cam_task.py, PID 31700, ~308 MB resident) was running at verification. The 19-class head emits five wildfire/volcanic-relevant buckets; note that data.yaml shows nc:5, which is stale — the operative head is 19-class, confirmed against the weights and service/config.py CLASS_NAMES. Status: LIVE_OPERATIONAL.

Crucially, the detector's class is used to route:

3.5 Crop-level Qwen3-VL veto — exact model, prompt, and settings

The Stage-2 veto is the exact, frozen configuration recorded in model_prompt_freeze.json (no thesis result references "Qwen-VL" generically):

Field Value
Model qwen3-vl-32b (Qwen3-VL 32B)
Serving PHOENIX Model-Vault RunPod serverless, OpenAI-compatible endpoint (remote; crash-resilient by design)
Routing crop-level — a padded crop of each routed detector box; the full frame is not sent for the veto
Temperature 0.0 (deterministic)
max_tokens 120
top_p server default (not pinned in the request payload — a reproducibility gap, see Chapter 6)
Retries / backoff / timeout 3 / 8.0 s / 180 s
Image encoding crop downscaled to long-side ≤768 px, JPEG q88, base64 data-URL
Output strict JSON {"label":"WILDFIRE|VOLCANIC|NEITHER","confidence":0.0-1.0,"reason":"<=14 words"}; on parse failure, label PARSE_FAIL
Cache policy deterministic (temp 0) verdicts cached per crop; WF25 reproduced Config A bit-for-bit from cache (0 new calls, 245/245 decisions match, 0 phash leakage)

Two prompts are frozen. The CROP_PROMPT (hot/bright disambiguation, primary veto) explicitly instructs the model "Do NOT assume it is volcanic just because Etna is a volcano — vegetation wildfires occur on Etna's flanks," and forces a one-of-three choice WILDFIRE / VOLCANIC / NEITHER (the last covering sunset glow, sunlit cloud, artificial lights, lens flare, reflection, sensor artifact). The SMOKE_PROMPT (degassing-versus-plume, used for ambiguous large/summit smoke) distinguishes a denser browner/greyer wildfire column rising from vegetated ground from a white/blue crater-rooted degassing plume from diffuse sky-wide cloud/haze. The exact verbatim text of both prompts is in model_prompt_freeze.json (vlm_prompt_exact_text_CROP_PROMPT, vlm_prompt_exact_text_SMOKE_PROMPT).

The VLM is invoked only on detector-routed hot/bright crops: measured at ~0.151 calls/frame over the held-out set and 0 on quiet frames (vlm_call_rate), corroborated live by vlm_calls_this_cycle: 0 on a quiet cycle. This is the one performance figure safe to classify LIVE_OPERATIONAL for the rate itself. Status: LIVE_OPERATIONAL (detector + crop-veto run live in EtnaCameraWall; WF25 metrics are RESEARCH_ONLY).

3.6 The NIGHT-SAFETY corroboration override

The VLM veto's most consequential design element is its night-safety override, a durable guard (not a one-off) in service/crop_veto.py + service/config.py. It exists because the volcano-context prompt that gives the system its low volcanic false-alarm rate is exactly what mis-routes a bright nighttime vegetation fire — whose flame is single-frame-indistinguishable from lava incandescence — to VOLCANIC.

Rule as implemented. When a wildfire-class detection (wildfire_flame / wildfire_smoke) is routed and the VLM verdict would SUPPRESS it (VOLCANIC or NEITHER):

A real off-crater night fire can therefore never be erased by the VLM alone. The re-scored effect is quantified in Chapter 5: night true-fire silent false-negative 2 → 0, daytime and volcanic-FA unchanged. Status: LIVE_OPERATIONAL guard logic (in the live service); the WF25 re-score demonstrating its effect is RESEARCH_ONLY.

3.7 WF36 multi-feed corroboration

Stage 3 places a candidate in independent context via service/corroboration.py (corroborate_decision, gate_alerts, _volcanic_scene), evaluated against a live feed snapshot (snapshot_utc 2026-06-29T02:56:39Z). The corroboration logic implements rules for the five genuinely-corroborable detector classes:

  1. Wildfire smoke / flame. Take firms_near_summit_km (min of fresh VIIRS/MODIS). A hit in the annulus CRATER_KM(3) < near ≤ NEAR_SUMMIT_KM(15)corroborated (independent wildfire signal). A hit ≤3 km is treated as the volcano itselfnot a wildfire confirmation. >15 km or no fresh FIRMS → uncorroborated; a fresh FRP granule with no co-located value contributes only granule-recency (supports). Load-bearing assumption: on-crater FIRMS is volcanic, not wildfire — FIRMS cannot distinguish lava from a wildfire on the crater itself. Because FIRMS/SLSTR have hours-scale latency, a real early wildfire will routinely be uncorroborated (camera-only early warning), so the system must alarm on high detector+VLM confidence in that window rather than wait for satellite.
  2. Lava / incandescence. On-crater fresh FIRMS (≤3 km), else fresh FRP, else fresh FCI coverage → corroborated as VOLCANIC with explains_volcanic=True, which suppresses any wildfire alert for the same scene.
  3. Volcanic ash plume. Fresh CAMS AOD value → corroborated; else fresh SEVIRI coverage → corroborated (ash/IR-window context). This is the weakest corroborated verdict in the module: SEVIRI coverage existing is not evidence a plume is present, so the thesis-safe wording is "ash context available (geostationary coverage)," not "ash plume confirmed."
  4. Volcanic steam / degassing. Fresh TROPOMI SO₂ ≥ 5e-4 mol/m², else fresh CAMS SO₂ ≥ 5e-5 kg/m², else fresh EMIT granule → corroborated. The rule and thresholds are real, but on the verification cycle no SO₂ value was usable (TROPOMI summit-box mean null + CAMS stale), so degassing was uncorroborated this cycle — correctly.

The worked examples in wf36_corroboration_matrix.md §3 are the actual output of python service/corroboration.py against the live snapshot. On that cycle, firms_near_summit_km = 0.4 km (on-crater): lava_incandescence was corroborated VOLCANIC, while wildfire_flame/wildfire_smoke were correctly held as uncorroborated (the 0.4 km hit is Etna's own crater thermal, not an independent wildfire — exactly the trap WF36 exists to avoid).

The honest 5-corroborable / 8-STAGED split. Eight Gate-C classes are STAGED_NOT_LIVE corroboration targets, not live detection or corroboration, and the thesis must not imply otherwise: meteorological cloud, glare/sun/reflection, and black/frozen/stale frame are detector context labels only (they never alarm and have no corroboration branch); fog/haze, industrial smoke, dust/quarry, camera artifact, and unknown are not detector classes at all. For industrial smoke and dust, the OSM industrial/power and roads/rail data are on the board but not wired into corroboration.py — they are available-but-unwired columns, not corroboration the thesis can claim. Status: LIVE_OPERATIONAL for the five rule-backed classes (with their cycle-level qualifiers); STAGED_NOT_LIVE for the other eight.

3.8 Alert taxonomy, human review, and the bilingual dashboard

Alerts are deduplicated spatially (IoU 0.4) with an 1800 s per-class/location cooldown; frames older than 1800 s and feeds older than 24 h are treated as stale. The output surface is an internal/preview self-assessment dashboard that explicitly carries a "Not a public product" banner. Automated public alert dispatch is STAGED_NOT_LIVE: email dispatch was gated off (alert_email.enabled=false) at verification, and the human-in-loop review workflow is not yet a live operational pipeline. This is stated plainly: the system is not claimed to operate an alerting pipeline.

The dashboard is fully bilingual (Italian default, English toggle) with browser-detected language and localStorage persistence. Translation-key parity is exact: the en: and it: dictionaries in public/i18n.js each contain exactly 240 keys (240/240). Per-string translation quality was not separately audited; the LIVE_OPERATIONAL claim is key-count parity. Status: LIVE_OPERATIONAL (bilingual UI); STAGED_NOT_LIVE (alert dispatch / human-in-loop).


Chapter 4 — Dataset and Evaluation Design

Operational classification of all Chapter-4/5 metrics: RESEARCH_ONLY (held-out offline evaluation; small-n flagged throughout). The headline system performance is a held-out benchmark, not a live alert-dispatch measurement.

4.1 Held-out sets

Two disjoint held-out sets are used, both real frames only:

Operating-point confusion definition: GT-negative = the 62 volcanic frames; GT-positive = the 18 daytime genuinely-visible camera-fire frames (the conservative denominator).

4.2 The daytime recall denominator (both reported, honest reconciliation)

The headline conservatively excludes the entire four-frame night↔lava residual category from the daytime denominator → n = 18, 94.4% (17/18), matching the committed WF25_system_scoring. Under Config A only 2 of those 4 night frames are actually lost (07871, 07875); the other 2 (07723, 07773) pass through and alarm. Excluding only the 2 genuinely-lost frames gives the alternative n = 20, 95.0% (19/20), CI [76.4–99.1%]. Both are disclosed; the headline uses the conservative 17/18.

4.3 No-leakage controls

4.4 Taxonomy and confounders

The evaluation is built around the wildfire/volcanic confounder taxonomy: wildfire smoke and flame (positives), against volcanic ash, lava/incandescence, strombolian, degassing/steam, plus meteorological cloud, fog/haze, glare/sunset, snow, sensor/lens artifact. A representative hard-negative library is exported (Chapter 6), with web hard-negatives (dust, fog, industrial smoke, glare) listed for category coverage even where the images live off-repo (paths left blank, not fabricated).


Chapter 5 — Results

All Chapter-5 metrics are RESEARCH_ONLY (held-out offline, small-n), reproduced cache-only from the post-night-guard artifacts. The detector is frozen; the VLM verdicts are deterministic temperature-0 cache reads.

5.1 The full WF25 Gate-A table (detector-alone vs two-stage)

The complete performance specification for the shipping two-stage system — detector → crop-level Qwen3-VL veto, Config A (smoke pass-through) — measured end-to-end on the same real held-out frames, with Wilson 95% CIs:

Metric Detector alone Two-stage system Δ 95% CI (two-stage, Wilson) Evidence
wildfire smoke recall (daytime) 100% (17/17) 100% (17/17) 0.0 pp [81.6 – 100%] per_frame_recall, wildfire_smoke
wildfire flame recall (daytime) 100% (10/10) 90.0% (9/10) −10.0 pp [59.6 – 98.2%] per_frame_recall, wildfire_flame
volcanic-plume FP rate 9.7% (6/62) 8.1% (5/62) −1.6 pp [3.5 – 17.5%] survivors = summit-degassing smoke
steam/cloud/fog FP rate 0% (0/62) 0% (0/62) 0.0 pp [0 – 5.8%] no steam/cloud frame alarmed
artifact FP rate 1.6% (1/62) 0% (0/62) −1.6 pp [0 – 5.8%] 9243ab lens/sensor artifact removed
overall volcanic FP rate 9.7% (6/62) 8.1% (5/62) −1.6 pp [3.5 – 17.5%] A_external_volcanic_FA
precision / PPV (operating) 0.750 0.7727 +0.023 §5.3 confusion
recall / sensitivity (daytime, n=18) 100% (18/18) 94.4% (17/18) −5.6 pp [74.2 – 99.0%] recall_daytime_only
F1 (operating) 0.857 0.850 −0.007 §5.3 confusion
F2, recall-first (operating) 0.9375 0.9043 −0.033 §5.3 confusion
specificity (operating) 0.9032 0.9194 +0.016 §5.3 confusion
false negatives introduced by VLM 1 (borderline ground-lights, not a true fire) §5.3 paired change
latency p50 / p90 / p99 (detector CPU) 243.6 / ~310 / ~335 ms + amortised VLM n=245 detector_latency.json
VLM per routed crop (p50 / p95 / max) 992 / 1421 / 2053 ms n=37 vlm_per_routed_crop_ms
VLM calls per frame / quiet / active 0.151 / 0 / 0.032 measured vlm_call_rate
estimated cost per alert ~$5–15/mo all-in ($0 quiet, scale-to-zero) cost_model.json

Reading note on per-class denominators. A frame can carry both a smoke box and a flame box, so the class counts (17 smoke + 10 flame) exceed the 18 unique daytime frames. The single daytime recall loss (dfire_AoF07872) is a flame box (settlement ground-lights), which is why flame recall shows −10 pp while smoke recall is untouched.

Reading note on latency tails. p90/p99 were not separately computed; the committed cache stores detector-CPU p50/p95/max (243.6 / 334.5 / 1876.5 ms). p90 ≈ 310 ms by interpolation; p99 ≈ the max-tail (the 1876 ms max is a single GC/IO outlier). These tail estimates and the frame-capture success rate are UNKNOWN_NEEDS_VERIFICATION and queued in the roadmap. End-to-end CPU mean ≈ 426 ms/frame (detector mean + 0.151 × VLM mean); a frame with one routed crop ≈ 1756 ms p95; quiet frames add 0.

5.2 Reproduction and no-leakage (re-confirmed)

Quantity Value Status
Routed hot crops in held-out set 37 reproduced
Served from temperature-0 cache 37
New VLM calls this run 0
Cache-miss stub raised No ✅ deterministic
pHash leakage (external volcanic ↔ v1b train) 0 collisions

Reproduce: python service/thesis_wf25_scorecard.py.

5.3 Paired-change confusion (detector-alone → two-stage) and McNemar

Volcanic FA set (n=62, GT-negative):

Transition Count Files
FP → TN (veto suppressed a false alarm) 1 9243ab… (sensor/lens artifact, VLM ruled NEITHER)
FP → FP (false alarm survived) 5 the 5 summit-degassing wildfire_smoke plumes (pass-through by design)
TN → FP (veto created a false alarm) 0
TN → TN (unchanged) 56

Recall set (GT-positive):

Transition Count Files
TP → TP (fire kept) 17 (daytime)
TP → FN (veto vetoed a fire) 0 daytime · 2 night-residual (BEFORE guard) → 0 silent FN (AFTER guard) dfire_AoF07871, dfire_AoF07875 — now surfaced uncertain_night, not dropped
FN → TP (veto recovered a fire) 0
FN → FN (unchanged) 0

McNemar (exact, two-sided) over the full paired decision set (62 volcanic + 18 daytime recall frames): discordant b (det-alone alarm, two-stage no-alarm) = 2; discordant c (det-alone no-alarm, two-stage alarm) = 0; exact two-sided p = 0.50 — not significant. All discordant pairs are one-directional (b>0, c=0): the veto only ever removes alarms, never adds one. Its entire measured effect on this held-out set is the removal of 2 alarms — 1 volcanic sensor-artifact FP and 1 borderline daytime ground-lights frame.

5.4 Plain finding (as the directive requires)

On this held-out set the Qwen3-VL veto changes exactly 2 of 80 paired decisions, both removals. It suppresses 1 volcanic sensor-artifact false alarm (9243ab: 6/62 → 5/62 FA) and drops 1 borderline daytime ground-lights frame (dfire_AoF07872: 18/18 → 17/18 recall). It does not improve precision or recall on any genuine smoke or flame case, it creates no new false alarm, and it does not reach 0% volcanic FA — the 5 survivors are summit-degassing smoke that passes through by design. The veto's honest, measured value is the removal of one artifact false alarm; on every genuine-fire and genuine-degassing decision it leaves the detector unchanged. A larger FA reduction (8.1% → 3.2%) is available only by also routing smoke (Config B), at the cost of routing the contamination frames and incurring 2 additional night-lava recall losses; the prior whole-frame veto reached 0% FA but cost ~20.8% recall. Config A is the recommended default.

5.5 Recall preservation (Gate B) and the night-safety re-score

Gate-B question: does the VLM ever veto a true wildfire smoke/flame case?

Re-scored result (same temperature-0 cache, 0 new VLM calls):

Night true-fire FN Daytime true-fire FN Volcanic FA (n=62)
Before guard 2 (07871, 07875 silently vetoed VOLCANIC) 0 8.1% (5/62)
After guard 0 (both surfaced as uncertain_night) 0 8.1% (5/62) — unchanged

The volcanic FA is provably unchanged: 0 of the 62 volcanic frames are dark enough (all bright daytime, mean > 12) to trip the night guard, so the guard structurally cannot touch the 5 daytime-degassing survivors. The veto is consequently a recall-safe veto at night via the corroboration override — a daytime advisory precision layer AND a night veto that is corroboration-gated so it can never produce a silent off-crater night false-negative. (Reproduce: python service/rescore_wf25_night_guard.py.)

False-negative bound: daytime true-fire FN introduced by the VLM = 0 (Wilson upper bound on the observed 1/18 daytime FN — a non-fire frame — is 25.8%, which the small sample cannot tighten).

5.6 WF36 per-class corroboration matrix (the 5-corroborable / 8-STAGED split)

Cell legend: ✔ confirms · ◐ supports · ✗ contradicts · – unavailable · ∅ not-applicable · ⏳ stale · ≈ too-coarse. (Source-column keys as in wf36_corroboration_matrix.md §1.)

Class (Gate-C) detector class? FCI/SEV FIRMS SLSTR TROPOMI CAMS OSM CAM
wildfire smoke YES (wildfire_smoke) ✔/◐
wildfire flame YES (wildfire_flame) ✔/◐
volcanic ash plume YES (ash_plume) ✔(SEV) ◐⏳
volcanic steam/degassing YES (3 classes) ✔(SO₂) ◐⏳
lava / incandescence YES (5 classes) ◐(FCI) ✔(on-crater) ◐(FRP)
meteorological cloud partial (context-only)
fog / haze NO
industrial smoke NO (avail, NOT wired)
dust / quarry / road dust NO (avail, NOT wired)
glare / sun / reflection partial (context-only)
camera artifact NO (target, NOT wired)
black / frozen / stale frame partial (context-only) (target, NOT wired)
unknown NO

Genuinely corroborable now (5): wildfire smoke, wildfire flame (LIVE rules; uncorroborated this cycle — on-crater 0.4 km FIRMS, correctly not a wildfire confirmation), lava/incandescence (corroborated VOLCANIC this cycle), volcanic ash plume (corroborated via SEVIRI coverage only — weak/contextual, CAMS AOD was stale), volcanic steam/degassing (LIVE rule; uncorroborated this cycle — TROPOMI null + CAMS stale + value below the elevated floor). STAGED_NOT_LIVE (8): meteorological cloud, glare, frozen-frame (context labels only), fog/haze, industrial smoke, dust, camera artifact, unknown (no rule / no detector class). Per the directive's §8 hard-stop check, no over-claim is required and no hard-stop is triggered, provided the thesis restricts corroboration claims to the five rule-backed classes with their cycle-level qualifiers and labels the other eight STAGED — which it does.

5.7 Operational spec snapshot (Gate E)

Key LIVE_OPERATIONAL rows (full table in system_performance_spec.md): 5 configured cameras / 3 online; 75 s cadence; VLM trigger = routed hot/bright crops only; 64 feeds (46 live / 7 stale / 10 catalogued / 1 key-pending / 0 error); FWI 13.06 moderate; hourly EtnaFeedsRefresh Running; dedup IoU 0.4 / cooldown 1800 s; 240/240 bilingual parity; OFFLINE camera health honest; Overpass degradation guard active. STAGED_NOT_LIVE: human-review / alert-email dispatch. UNKNOWN_NEEDS_VERIFICATION: frame-capture success rate; p90/p99 latency.

5.8 Quantum evaluation (edge-of-research due diligence — honest negative)

Operational classification: RESEARCH_ONLY. SIMULATION ONLY — no QPU was contacted, the IBM Quantum key was not read. All compute was light local CPU (statevector simulation of ≤4 qubits, ~46 s wall). A quantum win is asserted only where a paired-difference CI excludes zero.

A fresh, real-data quantum-versus-classical benchmark was run directly on the INGV task: discriminate VOLCANIC vs WILDFIRE thermal-anomaly events near Etna — the populated-flank problem INGV's own literature calls spectrally hard. Data: 33 Etna-edifice volcanic events (GVP/INGV weekly state oracle + FIRMS FRP) and 404 vegetated-flank wildfire events (real FIRMS active fire in the ≤25 km annulus), n = 437, 182 date groups, GroupKFold by date (leakage-guarded). The hard near-field intrinsic regime uses thermal magnitude / FIRMS multiplicity only (log_frp_max, log_frp_sum, log_firms_count, n_firms_sensors), with no geometry and no source-availability proxies (which correlate perfectly with class by construction and are stripped). With 4 features = 4 qubits, the quantum map sees the full signal with no PCA information loss — the fairest possible footing.

Out-of-fold AUC (grouped-by-date, n=437):

Model Type OOF AUC 95% CI
Classical RBF-SVM (matched) classical 0.936 [0.889, 0.973]
Classical HistGB (matched) classical 0.918 [0.867, 0.964]
Classical HistGB (full features) classical 0.892 [0.820, 0.947]
Quantum fidelity kernel (ZZ, 4q) quantum 0.837 [0.767, 0.896]
Quantum VQC (4q, 2-layer) quantum 0.702 [0.608, 0.790]

Paired AUC deltas (quantum − classical, bootstrap 95% CI):

Comparison Δ AUC 95% CI Read
Q-kernel − RBF (matched) −0.099 [−0.161, −0.038] quantum worse, CI excludes 0
Q-kernel − HistGB (matched) −0.081 [−0.139, −0.028] quantum worse, CI excludes 0
VQC − RBF (matched) −0.234 [−0.325, −0.150] quantum much worse, CI excludes 0
Q-kernel − HistGB (full) −0.055 [−0.129, +0.021] tie/worse (CI brackets 0)

Honest interpretation. This is a clean, CI-backed publishable negative: on the same real features and the same leakage-guarded split, the quantum fidelity kernel (0.837) is beaten by matched classical RBF-SVM (0.936) by −0.099 [−0.161, −0.038]; the VQC (0.702) is the worst model tested. Notably the loss is not a dimensionality-truncation artifact — with only 4 features the 4-qubit map sees the full signal — it is the encoding/kernel-geometry mismatch and the VQC generalisation ceiling themselves. Because statevector simulation is exact, the classification verdict will not improve on real hardware (device noise only hurts). The right tool for this classification task is classical.

The one genuine novelty (preserved as a research line, not an operational claim): the formulation of volcanic deformation source inversion (Mogi/Okada) as a QUBO/Ising problem. On a synthetic-realistic Etna GNSS geometry, the multi-source / model-selection variant solved by simulated annealing matches the exact optimum 100% [89–100%] where multi-start Levenberg–Marquardt traps at 60% and greedy/OMP at 0%. Honest caveat: this win is shared by a classical simulated-annealing sampler — it is a QUBO-formulation success, not quantum-hardware advantage. The Mogi-single-source-QUBO and Dozier-sub-pixel-as-QAOA mappings are, to our knowledge, literature firsts (pending peer confirmation). Overall quantum verdict: worth evaluating and worth formulating, but on the classification tasks that actually run the monitor it does not add operational value — classical wins with CIs that exclude zero. The QPU gate remains BLOCKED pending explicit approval; no quantum hardware was used anywhere. (Reproduce: python quantum_disambiguator.py, ~46 s, statevector sim only.)


Chapter 6 — Failure Modes and Limitations

This chapter is exhaustive by design (Gate F). The machine-readable manifest is failure_case_manifest.csv (38 rows); 24 source frames + 24 thumbnails are exported to failure_crops/. Read-only build: the live multi_cam_service was not disturbed.

6.1 The night-fire↔lava residual (the only true wildfire-recall loss)

Two frames are the only true wildfire-recall losses in the shipping Config A, reported separately and excluded from the daytime denominator:

case_id frame detector VLM verdict crop
FN_007 dfire_pos_AoF07871 wildfire_flame @0.818 VOLCANIC "glowing, irregularly shaped incandescence consistent with lava flow or vent activity" failure_crops/FN_007.jpg
FN_009 dfire_pos_AoF07875 wildfire_flame @0.878 VOLCANIC "bright, diffuse glow … consistent with summit incandescence or strombolian activity" failure_crops/FN_009.jpg

A bright nighttime vegetation-fire line and lava incandescence are not separable from single-frame appearance; the volcano-context prompt that gives the system its low volcanic FA is exactly what mis-routes these two. This is mitigated, not solved. The night-safety guard (Chapter 3.6 / 5.5) converts these from silent false-negatives into surfaced uncertain_night alerts (night silent FN 2 → 0), but the underlying single-frame ambiguity remains. Recommended next steps (future work, not claimed operational): multi-frame temporal persistence (lava is steady; wildfire flickers and spreads) and a hard FIRMS/SLSTR night co-location override. Config B incurs 2 additional night-lava losses (AoF07723, AoF07773; FNB_010, FNB_011) — the documented recall price of pushing volcanic FA from 8.1% to 3.2%, and the reason Config A is the default.

6.2 The borderline town-lights veto

FN_008 (dfire_pos_AoF07872, wildfire_flame @0.858/0.713 → NEITHER, "artificial ground lights, likely from settlement") is the single frame that moves daytime recall 100% → 94.4% (17/18). It is an arguably-correct rejection of distant settlement/ground lights, counted conservatively as a recall loss so the headline is not inflated. Crop: failure_crops/FN_008.jpg.

6.3 The five surviving volcanic false alarms

The Config-A veto suppresses exactly one volcanic FP (FP_003 / 9243ab, sensor-artifact/lens-flare boxed as flame, VLM ruled NEITHER, 9.7% → 8.1%). The 5 survivors (FP_001, FP_002, FP_004, FP_005, FP_006) are summit passive-degassing steam / cloud / ash mis-boxed as wildfire_smoke; smoke-class crops pass through by design (this is what preserves wildfire recall), so the veto cannot suppress them in Config A. All six source frames are exported (failure_crops/FP_001.jpg … FP_006.jpg).

6.4 Prior frame-level-veto losses (motivating the crop-level design)

The prior whole-frame veto reached 0/62 volcanic FA but destroyed ~20.8% of genuinely-visible fire recall (FRAMEVETO_012–015: HPWREN/Roboflow faint distant horizon smoke called "haze/cloud" by the volcano-context whole-frame prompt). The shipping crop-level Config A recovers every one — a grey distant plume is not red-hot, is never routed to the VLM, and can never be vetoed. Retained as the honest upper-bound failure mode of the alternative architecture.

6.5 Representative hard negatives

Volcanic confounder library (exported real frames): HN_018 steam/degassing, HN_019 ash plume, HN_020 lava glow, HN_021 strombolian, HN_022 cloud obscuration, HN_023 glare/sunset, HN_024 snow cover. Web hard negatives (dust, fog, industrial smoke, glare, other; HNWEB_025–029) carry blank frame paths because the images live on the eval pod, not in this checkout — stated as fact, not fabricated. Categories with 0 locally-exportable instances (industrial smoke, dust, compression-artifact) are stated as such rather than invented.

6.6 Honesty flags and known limitations (consolidated)


Chapter 7 — Future Work

Every item here is PLANNED until it has live health evidence; nothing is described as operational. The thesis baseline stays frozen at 4d1b0cc / ingv_v1b_best.pt / qwen3-vl-32b temp 0.0 — new models are challengers, not baseline swaps. No edge/Hailo/DGX hardware is assumed (ops-room target: detector CPU/GPU auto, VLM remote serverless).

7.1 Phase 0–2 weeks — close Phase-1 gaps and harden the freeze

Instrument frame-capture success rate (rolling per-source online/offline + frame-age counter, replacing the point-in-time "3 of 5 online"); compute p90/p99 latency from existing per-frame arrays including the routed-crop VLM tail, separately for CPU and GPU; pin top_p in the VLM request; harden /api/cams with a retry/health probe and a freshness-stamped fallback to cameras_wall.json; add a per-frame pHash-identical-consecutive-frame guard (catch a frozen-but-recent camera); lock WF36 STAGED wording.

7.2 Phase 1 month — evaluation rigor

Enlarge the held-out sets (more bulletin-confirmed volcanic frames and clean-source daytime wildfire frames; keep pHash-leakage-zero and the Pyronear-sequence separation); commit the train↔held-out collision list; run the WARP-style hard-negative robustness battery [12] (Gaussian noise, JPEG compression, blur, cloud-like patches, fog/haze, glare, rain-on-lens, timestamp overlays, black/frozen frames) and report detector + two-stage degradation curves.

7.3 Phase 3 months — domain adaptation and segmentation challengers

Domain adaptation [13,14,15,16] is the core next pillar: collect unlabeled frames from every live camera, mine detector positives + detector–VLM disagreements + VLM vetoes, human-review a small hard set, train with UDA/self-training, and test on held-out camera/date/weather/volcanic episodes, guarding against catastrophic false-positive drift. Detector/segmentation challengers (RESEARCH_ONLY): YOLO11/YOLO12 [4], RT-DETR [5], Grounding DINO 1.5/Edge [6], SAM 2 video plume masks [23,24], optical-flow smoke-motion-consistency [21,22], sky/terrain masking — benchmarked against the frozen YOLO11s, not swapped in. Prompt-as-evaluated-component study [10,11]: measure the recall-versus-FA tradeoff across CROP/SMOKE prompt variants, SmokeBench-style [7].

7.4 Phase 6 months — fine-tuned VLM, IR/thermal fusion, FCI event tracking

Fine-tuned Etna/wildfire VLM [20]: fine-tune a Qwen2.5/Qwen3-VL-style model on Etna degassing-vs-wildfire-vs-cloud crops, keeping the frozen 32B as the thesis baseline. IR/thermal fusion [17,18,19,20]: RGB+IR fusion, thermal/night detection, fire/lava/industrial-heat discrimination, satellite-thermal corroboration — directly attacking the night-fire↔lava residual. MTG-FCI event tracking [25,26,27]: move FCI from coverage-only corroboration to FireDyn-style fire-pixel extraction, rate-of-spread, FRP evolution, and fire-arrival maps, treating FCI as early-candidate + event-tracking data, not perfect ground truth. Active-learning dashboard + data flywheel: detector fires → candidates, VLM disagreements → hard examples, human decisions → labels, satellite corroboration → weak labels, INGV bulletins → volcanic labels, stale/frozen frames → camera-health labels.

7.5 Research-only (no committed date)

VLM/MLLM early-smoke localization limits (keep detector-first + VLM-veto per current evidence [7]); FireCLIP-style multimodal prompt tuning [10]; TROPOMI SO₂ + CAMS as contextual-only evidence [31,32,33]; FCI-vs-SEVIRI sensitivity study [25] and SLSTR active-fire characterization [29,30]; multi-season dataset expansion across lighting/weather/angle/episode shift; operational human-in-loop alerting protocol design (currently STAGED) before any public-facing claim; and the volcanic-source-inversion-as-QUBO research line (Chapter 5.8) — a parallel research thread, QPU BLOCKED.


References

Verification status: all 35 references VERIFIED (paper/source confirmed to exist via arXiv / publisher / IEEE / ScienceDirect / Hugging Face papers index, matching title and authors); [34] and [35] are official documentation (verified, non-paper), cited deliberately for the operational feed-reliability concept.

  1. Govil, K.; Welch, M.L.; Ball, J.T.; Pennypacker, C.R. (2020). Preliminary Results from a Wildfire Detection System Using Deep Learning on Remote Camera Images. Remote Sensing 12(1):166. https://doi.org/10.3390/rs12010166
  2. Dewangan, A.; Pande, Y.; Braun, H.-W.; Vernon, F.; Perez, I.; Altintas, I.; Cottrell, G.W.; Nguyen, M.H. (2021). FIgLib & SmokeyNet: Dataset and Deep Learning Model for Real-Time Wildland Fire Smoke Detection. arXiv:2112.08598. https://arxiv.org/abs/2112.08598
  3. Lostanlen, M.; Veith, F.; Buc, C.; Barriere, V. (2024). Constructing a Real-World Benchmark for Early Wildfire Detection (PyroNear-2025 Dataset). arXiv:2402.05349. https://arxiv.org/abs/2402.05349
  4. Tian, Y.; Ye, Q.; Doermann, D. (2025). YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv:2502.12524 (NeurIPS 2025). https://arxiv.org/abs/2502.12524
  5. Zhao, Y.; Lv, W.; Xu, S.; et al. (2024). DETRs Beat YOLOs on Real-time Object Detection (RT-DETR). CVPR 2024; arXiv:2304.08069. https://arxiv.org/abs/2304.08069
  6. Ren, T.; et al. / IDEA-Research (2024). Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. arXiv:2405.10300. https://arxiv.org/abs/2405.10300
  7. Qi, T.; Li, W.; Barnes, N. (2025). SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection. WACV 2026; arXiv:2512.11215. https://arxiv.org/abs/2512.11215
  8. Zhang, C.; Wang, S. (2024). Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data. arXiv:2401.17600. https://arxiv.org/abs/2401.17600
  9. Danish, M.S.; Munir, M.A.; Shah, S.R.A.; Kuckreja, K.; Khan, F.S.; Fraccaro, P.; Lacoste, A.; Khan, S. (2024). GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks. arXiv:2411.19325. https://arxiv.org/abs/2411.19325
  10. FireCLIP: Enhancing Forest Fire Detection with Multimodal Prompt Tuning and Vision-Language Understanding. Fire (MDPI) 8(6):237, 2025. https://www.mdpi.com/2571-6255/8/6/237
  11. Adhikari, R.; Thapaliya, S.; Dhakal, M.; Khanal, B. (2024). TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models. arXiv:2410.05239. https://arxiv.org/abs/2410.05239
  12. Ide, R.; Yang, L. (2024/2025). Adversarial Robustness for Deep Learning-based Wildfire Detection Models (WARP). arXiv:2412.20006; Fire (MDPI) 8(2):50. https://arxiv.org/abs/2412.20006
  13. Multilevel feature cooperative alignment and fusion for unsupervised domain adaptation smoke detection. Frontiers in Physics 11:1136021, 2023. https://www.frontiersin.org/articles/10.3389/fphy.2023.1136021/full
  14. EDIF: boosting unsupervised cross-domain forest fire smoke detection with enhanced domain-invariant features. Geomatics, Natural Hazards and Risk, 2025. https://www.tandfonline.com/doi/full/10.1080/19475705.2025.2556144
  15. Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap. arXiv:2511.16617, 2025. https://arxiv.org/abs/2511.16617
  16. Pesonen, J.; Hakala, T.; Karjalainen, V.; Koivumäki, N.; Markelin, L.; Raita-Hakola, A.-M.; Suomalainen, J.; Pölönen, I.; et al. (2024). Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models. arXiv:2408.10843. https://arxiv.org/abs/2408.10843
  17. A UAV-Based Multi-Scenario RGB-Thermal Dataset and Fusion Model for Enhanced Forest Fire Detection. Remote Sensing 17(15):2593, 2025. https://www.mdpi.com/2072-4292/17/15/2593
  18. MCDet: Target-Aware Fusion for RGB-T Fire Detection. Forests 16(7):1088, 2025. https://www.mdpi.com/1999-4907/16/7/1088
  19. A Study on Flame Detection Method Combining Visible Light and Thermal Infrared Multimodal Images. Fire Technology, 2024. https://link.springer.com/article/10.1007/s10694-024-01676-9
  20. Habibpour, M.; Alipour Talemi, N.; Spodnik, J.; Khoury, C.J.; Afghah, F. (2026). WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring. arXiv:2604.20190. https://arxiv.org/abs/2604.20190
  21. Yuan, F. (2014). Spatiotemporal bag-of-features for early wildfire smoke detection (HOOF temporal feature). Image and Vision Computing 32(1):24–33. https://doi.org/10.1016/j.imavis.2013.08.001
  22. Zhao, Y.; et al. (2015). Forest Fire Smoke Video Detection Using Spatiotemporal and Dynamic Texture Features. Journal of Electrical and Computer Engineering 2015:706187. https://onlinelibrary.wiley.com/doi/10.1155/2015/706187
  23. Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; et al. (2024). SAM 2: Segment Anything in Images and Videos. arXiv:2408.00714. https://arxiv.org/abs/2408.00714
  24. Ugwu, E.U.; Xinming, Z. (2025). Promptable Fire Segmentation: Unleashing SAM2's Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance. arXiv:2510.21782. https://arxiv.org/abs/2510.21782
  25. Major improvements in spaceborne early fire detection and small-fire FRP retrieval with the Meteosat Third Generation Flexible Combined Imager. Science of Remote Sensing, 2026. https://www.sciencedirect.com/science/article/pii/S2666017226000040
  26. Paugam, R.; Filippi, J.-B.; Benali, A.; Gomes, J.; Xu, W.; Dutra, E.; Andre, F.; Boulanger, D.; Retornard, V.; Meraner, A.; Harvie, J.; Penot, V.; Denjean, C. (2026). Leveraging MTG-FCI fire observations for event-based fire behavior monitoring (FCI-FireDyn / Fire Event Tracker). arXiv:2606.06016. https://arxiv.org/abs/2606.06016
  27. Unsupervised Wildfire Detection Using Multispectral MTG-FCI Data: A Feasibility Study. Journal of Imaging 12(6):229, 2026. https://doi.org/10.3390/jimaging12060229
  28. Schroeder, W.; Oliva, P.; Giglio, L.; Csiszar, I.A. (2014). The New VIIRS 375 m active fire detection data product: Algorithm description and initial assessment. Remote Sensing of Environment 143:85–96. https://doi.org/10.1016/j.rse.2013.12.008
  29. Xu, W.; Wooster, M.J. (2023). Sentinel-3 SLSTR Active Fire (AF) Detection and FRP Daytime Product — Algorithm Description and Global Intercomparison to MODIS, VIIRS and Landsat AF Data. Science of Remote Sensing 7:100087. https://www.sciencedirect.com/science/article/pii/S2666017223000123
  30. Wooster, M.J.; Xu, W.; Nightingale, T. (2012). Sentinel-3 SLSTR active fire detection and FRP product: pre-launch algorithm development and performance evaluation using MODIS and ASTER datasets. Remote Sensing of Environment 120:236–254. https://doi.org/10.1016/j.rse.2011.09.033
  31. Theys, N.; Hedelt, P.; De Smedt, I.; Lerot, C.; Yu, H.; Vlietinck, J.; Pedergnana, M.; et al. (2019). Global monitoring of volcanic SO2 degassing with unprecedented resolution from TROPOMI onboard Sentinel-5 Precursor. Scientific Reports 9:2643. https://www.nature.com/articles/s41598-019-39279-y
  32. Exploiting Sentinel-5P TROPOMI and Ground Sensor Data for the Detection of Volcanic SO2 Plumes and Activity in 2018–2021 at Stromboli, Italy. Sensors 21(21):6991, 2021. https://www.mdpi.com/1424-8220/21/21/6991
  33. Theys, N.; et al. Sulfur dioxide retrievals from TROPOMI onboard Sentinel-5 Precursor: Algorithm Theoretical Basis. Atmospheric Measurement Techniques. https://amt.copernicus.org/articles/10/119/2017/
  34. NASA FIRMS — Fire Information for Resource Management System: product and latency documentation (VIIRS/MODIS active fire). NASA Earthdata. https://www.earthdata.nasa.gov/data/tools/firms/faq
  35. EUMETSAT — Meteosat Third Generation Instruments (FCI) documentation. https://www.eumetsat.int/meteosat-third-generation-instruments

Appendices

Appendix A — Evidence ledger summary

Every headline claim traces to thesis_evidence_ledger.csv (claim_id, operational_status, evidence_path, commit, command/query, timestamp, metric, CI, limitation, thesis-safe wording). Summary of the 16 ledger rows:

Claim Status Metric Evidence
C01 Frozen baseline LIVE_OPERATIONAL commit 4d1b0cc (master) git rev-parse HEAD @ 02:31Z
C02 64-feed board LIVE_OPERATIONAL 46/7/10/1/0 of 64 feeds.json @ 02:31:33Z
C03 Overpass guard LIVE_OPERATIONAL 23,379 roads / 341 rail (live) osm_roads_rail.py L64-72
C04 Live FWI LIVE_OPERATIONAL 13.06 (moderate) feeds.json effis_fwi
C05 Camera wall LIVE_OPERATIONAL (3/5) 75 s cadence, 3 online /api/cams @ 02:44:37Z
C06 Camera sources LIVE_OPERATIONAL 5 (INGV + 3 Windy + EtnaWalk) /api/cams sources[]
C07 VLM trigger LIVE_OPERATIONAL ~0.151/frame, 0 quiet /api/cams + WF25
C08 WF25 volcanic FA RESEARCH_ONLY 8.1% (5/62), CI [3.5–17.5%] WF25_system_scoring.json
C09 WF25 daytime recall RESEARCH_ONLY 94.4% (17/18), CI [74.2–99.0%] WF25_system_scoring.json
C10 Deterministic cache RESEARCH_ONLY 245/245 match, 0 leakage WF25 provenance
C11 WF36 matrix LIVE_OPERATIONAL 13 classes; >24 h = no-evidence corroboration.py @ 02:56:39Z
C12 Bilingual parity LIVE_OPERATIONAL 240 EN = 240 IT i18n.js
C13 Scheduled jobs LIVE_OPERATIONAL both Running; feeds hourly schtasks @ 02:32Z
C14 Model/prompt freeze LIVE_OPERATIONAL sha256 c0a3d0ea…; temp 0.0 model_prompt_freeze.json
C15 Dedup/staleness guards LIVE_OPERATIONAL IoU 0.4; 1800 s; 24 h config.py
C16 Alert dispatch STAGED_NOT_LIVE alert_email.enabled=false banner + cameras_wall.json

Appendix B — Failure appendix and crop references

failure_case_manifest.csv (38 rows); failure_crops/ (24 source frames + 24 thumbnails). Key illustrative crops: night-fire↔lava residual FN_007.jpg, FN_009.jpg; borderline town-lights FN_008.jpg; surviving volcanic FA FP_001.jpg … FP_006.jpg (FP_003 = the one vetoed sensor-artifact); Config-B-only night losses FNB_010.jpg, FNB_011.jpg; prior frame-level-veto losses FRAMEVETO_012–015.jpg; volcanic confounder library HN_018 … HN_024; contamination CONTAM_016 (false-colour satellite), CONTAM_017 (painting) — both correctly VLM-rejected and excluded from the recall denominator. Web hard-negatives HNWEB_025–029 carry blank paths (images off-repo, stated not fabricated). Vetoed-crop originals: crops/VETOED_dfire_pos_AoF07871_…_VOLCANIC.jpg, …AoF07875_…_VOLCANIC.jpg, …AoF07872_…crop0/crop1_NEITHER.jpg.

Appendix C — Model and prompt freeze manifest

Full manifest: model_prompt_freeze.json (commit 4d1b0cc, frozen 2026-06-29T02:45:00Z). Detector: YOLO11s 19-class, models/ingv_v1b_best.pt, sha256 c0a3d0ead257d318e70bec3bb84feaec7b99e9e3d55b132fc5f1ffd405cf0a20, 19,261,267 bytes, imgsz 960, conf 0.25, crop pad 0.25 / max 768 px, dedup IoU 0.4, cooldown 1800 s. VLM: qwen3-vl-32b, Model-Vault RunPod serverless OpenAI-compatible endpoint, crop-level routing, temperature 0.0, max_tokens 120, top_p server-default, retries 3 / backoff 8.0 s / timeout 180 s, image crop long-side ≤768 px JPEG q88 base64. Two frozen prompts (CROP_PROMPT primary veto, SMOKE_PROMPT degassing-vs-plume) with verbatim text in the manifest; strict-JSON output {label, confidence, reason} with PARSE_FAIL fallback. Cache: deterministic temp-0 per-crop (reports/crop_veto_outputs/te_crop_level_cpu_configA.json). Hardware: local OFFICE workstation (CPU/GPU auto) + remote serverless VLM; no DGX, no PHOENIX prod, no edge/Hailo assumed.

Appendix D — Full corroboration matrix

Full per-class matrix, per-class implemented rule + worked example on current live feeds + honest caveat: wf36_corroboration_matrix.md (implementation service/corroboration.py @ 4d1b0cc, feed snapshot 2026-06-29T02:56:39Z). Five rule-backed classes (wildfire smoke, wildfire flame, lava/incandescence, ash plume, steam/degassing) with cycle-level qualifiers; eight STAGED classes (cloud, glare, frozen-frame, fog/haze, industrial smoke, dust, camera artifact, unknown). Freshness rule: feeds older than FEED_MAX_AGE_H = 24 h are treated as no current evidence (uncorroborated, never contradicted). Reproduce: ETNA_FEEDS_OUT=../etna_dashboard/feeds/out python corroboration.py.


Prepared for the INGV deliverable. Frozen baseline commit 4d1b0cc. Every quoted number traces to a committed Phase-1 evidence artifact in flank_wildfire/reports/thesis/. The system is reported as an evidence-backed multi-source monitoring architecture with explicit operational classification, confidence intervals, and failure modes; it is not claimed to solve early wildfire detection generally.