u/Latter_Panda4439

I've been building an open dataset of geopolitical and supply chain risk events scraped from public feeds (GDELT, ACLED, GDACS, NASA FIRMS, WHO DON) for the past few months. Around 70k events at this point. The thing that surprised me when I cross-checked against mainstream news coverage: only about 28% of those events have any major-outlet article attached.

The other ~72% are silent. Flagged in at least one public feed but never picked up by major news. I'd assumed those would all be low-severity noise (small protests, minor weather flags, single-source rumors). They're not. Roughly a quarter of the silent set is still rated critical or high severity by the source feed itself, which works out to ~14k events nobody covered. ACLED specifically dominates the silent set — local conflict events that don't make English-language outlets.

The cross-check has obvious limits worth flagging up front: my "news coverage" is a Google News fetch (so paywalled or non-English coverage gets undercounted), and the severity is graded after the fact by an LLM step (so it can misjudge ambiguous events). Both are best-effort. But the headline gap (~28% news overlap) is just a SQL join, not LLM-dependent. Events are geocoded by region, no PII. Actor names from ACLED are excluded per their license.
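
For anyone who wants to sanity-check the overlap number, here's roughly the shape of that join. This is a minimal sketch using SQLite from Python, not the actual pipeline query: the table and column names (`events`, `news_matches`, `event_id`, `severity`) and the toy rows are made up for illustration and won't match the published schema exactly.

```python
import sqlite3

# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT PRIMARY KEY, source TEXT, severity TEXT);
CREATE TABLE news_matches (event_id TEXT, outlet TEXT);
""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("e1", "ACLED", "high"),
    ("e2", "GDELT", "low"),
    ("e3", "GDACS", "critical"),
    ("e4", "FIRMS", "medium"),
])
conn.executemany("INSERT INTO news_matches VALUES (?, ?)", [
    ("e2", "outlet_a"),
    ("e2", "outlet_b"),  # two articles on one event still count as one covered event
])

def coverage_stats(conn):
    """Count covered vs. silent events with a LEFT JOIN against news matches."""
    total, covered, silent_severe = conn.execute("""
        SELECT COUNT(*),
               SUM(covered),
               SUM(CASE WHEN covered = 0 AND severity IN ('critical', 'high')
                        THEN 1 ELSE 0 END)
        FROM (
            SELECT e.event_id, e.severity,
                   COUNT(n.event_id) > 0 AS covered
            FROM events e
            LEFT JOIN news_matches n ON n.event_id = e.event_id
            GROUP BY e.event_id, e.severity
        ) AS per_event
    """).fetchone()
    return {
        "total": total,
        "covered": covered,
        "silent": total - covered,
        "silent_severe": silent_severe,
        "overlap": covered / total,
    }

stats = coverage_stats(conn)
print(stats)
```

Grouping per event before counting matters: without it, an event with several matched articles would be counted several times and inflate the overlap.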

The deduplicated event/chokepoint/entity tables are up on Hugging Face and Kaggle as Parquet + a 10% CSV sample, CC-BY-NC-SA. Browsable map version is at tremorwatch.com if you want to poke at individual events first. Curious if anyone has tried something similar at this scale and how you'd refine the coverage definition (different news source mix, embedding-based fuzzy match, etc).
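
To make the "fuzzy match" idea concrete, here's one possible starting point. It's a sketch only, and it uses plain string similarity from the standard library (`difflib.SequenceMatcher`) as a stand-in for embedding similarity. The function name, the 0.6 threshold, and the 3-day window are all assumptions I picked for illustration, not tuned values or anything the pipeline currently does.

```python
from datetime import date
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hurt the score."""
    return " ".join(text.lower().split())

def is_covered(event_title, event_date, headlines, threshold=0.6, window_days=3):
    """Return True if any headline within the date window is similar enough.

    headlines: iterable of (headline_text, publication_date) pairs.
    threshold and window_days are illustrative defaults, not tuned values.
    """
    target = normalize(event_title)
    for headline, pub_date in headlines:
        if abs((pub_date - event_date).days) > window_days:
            continue  # too far from the event date to plausibly be coverage of it
        score = SequenceMatcher(None, target, normalize(headline)).ratio()
        if score >= threshold:
            return True
    return False

headlines = [
    ("Port Strike In Busan", date(2024, 5, 2)),
    ("zzzz", date(2024, 5, 1)),
]
print(is_covered("port strike in busan", date(2024, 5, 1), headlines))
```

Swapping `SequenceMatcher` for cosine similarity over sentence embeddings would keep the same structure; the date window does a lot of the work either way, since it stops an old headline about the same place from counting as coverage.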

Disclosure: I built this — part of an early-stage startup (Volt AI). Dataset is free under CC-BY-NC-SA, no paid tier exists yet. Posting under r/datasets self-promo guidelines; happy to adjust format if mods prefer.

Been running an automated public-feed pull (GDELT, ACLED, GDACS, FIRMS) around the East Asia battery cell cluster for a couple months and figured the patterns are worth sharing. The site auto-aggregates and writes the briefs with an LLM step, so what surfaces is what the firehose catches — not what an analyst cherry-picks. Take it with that caveat.

A few things stood out:

- The signal density isn't where I expected. The big established sites generate fewer flagged events than the newer assembly clusters spinning up further inland. Could be reporting bias (foreign press covers coastal more), could be that the new sites genuinely run hotter on labor and grid friction. Honestly hard to tell without ground truth.

- Wildfires near logistics corridors keep showing up as a third-order risk that nobody really prices in. A FIRMS hotspot a couple dozen km from a port turns into shipping delays that nobody attributes to climate. It happens more often than I expected.

- Strategic / diplomatic events (CAMEO 17x and 19x codes if you've used GDELT) cluster in patterns that don't track the news cycle. Quiet weeks of mainstream coverage, then a burst of flagged events in a few days that no major outlet ran. The under-covered ones tend to be the ones that move things — which is also where running an LLM over every flag earns its keep: it surfaces patterns a human-curated brief would skip as too small.
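
On the wildfire point, the proximity check is simple enough to sketch. This is a minimal version assuming hotspots and ports are plain (lat, lon) pairs; the 25 km radius is my stand-in for "a couple dozen km", and the coordinates below are made up rather than real FIRMS detections or port locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def hotspots_near(port, hotspots, radius_km=25.0):
    """Return the hotspots within radius_km of the port (radius is an illustrative choice)."""
    return [h for h in hotspots
            if haversine_km(port[0], port[1], h[0], h[1]) <= radius_km]

# Hypothetical coordinates for illustration.
port = (35.0, 129.0)
hotspots = [(35.1, 129.0), (36.0, 129.0)]
print(hotspots_near(port, hotspots))
```

A fixed radius is crude: it ignores wind, terrain, and whether the fire actually sits on a road or rail corridor, so treat hits as candidates to look at rather than confirmed disruptions.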

Curious if anyone here has a better methodology for separating "actual disruption" from "GDELT noise" without manually reading every article. The mention-dedup change in '18 broke a lot of older filters and I haven't found a clean replacement.

(fwiw I help build tremorwatch.com, which is the automated pipeline described above. Public sources, LLM-written briefs in EN/JA/KO, no human in the loop per event. Open, no signup. The trade-off versus an analyst-curated site: you get firehose coverage, but occasional LLM hallucinations slip through and only get corrected when caught. Map + raw event list there if curious.)

[self-promotion] I scraped ~70k geopolitical risk events from public feeds. Only about a quarter made the news. (Parquet + CSV on HF/Kaggle)