r/datasets — reddlx | reddlx-frontend

▲ 6 r/civictech+1 crossposts

Building with congressional data in 2026... what am I missing? Because everything is dead

I’m building an open source tool to track congressional stock trades, donors, travel, and voting records. One platform, all the data, free and open. Simple idea.

Except I can’t find data that works.

I’ve spent the last 48 hours wiring up pipelines and every single source I try is either dead, broken, paywalled, or publishing PDFs like it’s 2004. I have to be missing something because this can’t be the actual state of civic data in 2026.

Here’s what I’ve tried:

Dead:

∙	ProPublica Congress API – shut down, repo archived Feb 2025

∙	OpenSecrets API – discontinued April 2025, now “contact sales”

∙	GovTrack bulk data – shut down, told everyone to use ProPublica (which then died)

∙	Sunlight Foundation – dead for years, tools lived on through ProPublica (which then died)

∙	timothycarambat/senate-stock-watcher-data – the repo everyone’s senate stock trade scrapers point to. Last updated 2021. Data stops around Tuberville’s first year. The guy who was literally the poster child for congressional insider trading isn’t in the dataset.

Barely functional:

∙	Congress.gov API – returning empty responses right now. Changelog says they’re deploying tomorrow. Also went fully dark last August with no communication.

∙	Senate eFD (efdsearch.senate.gov) – 503 errors on weekends. Runs on a Django app behind a consent gate. When it works, it works. It just doesn’t work on weekends.

∙	House financial disclosures – ASPX form with ViewState tokens. Feels like scraping a government intranet from 2005.

∙	SEC EDGAR – “works” but there’s no crosswalk between congressional bioguide IDs and SEC CIK numbers. Common names return false positives. You’re matching by name and hoping for the best.
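For anyone stuck on the same EDGAR problem, here is a minimal sketch of the name-matching fallback, assuming you already have a list of members and a list of SEC filer names. The function names, the placeholder IDs, and the suffix list are all made up for illustration. It does not solve the false-positive problem. It only makes ambiguous matches visible so they can be reviewed by hand.

```python
import re
import unicodedata

# Generational suffixes to ignore when comparing names (illustrative, not exhaustive).
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents, punctuation and suffixes, then sort tokens
    so that 'Smith, John Q.' and 'John Q. Smith Jr.' compare equal."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in SUFFIXES]
    return " ".join(sorted(tokens))


def match_members(members, filers):
    """members: list of (bioguide_id, name); filers: list of (cik, name).

    Returns {bioguide_id: [cik, ...]}. An empty list means no match;
    more than one CIK means the name is ambiguous and needs manual review.
    """
    by_name = {}
    for cik, name in filers:
        by_name.setdefault(normalize_name(name), []).append(cik)
    return {bio: by_name.get(normalize_name(name), []) for bio, name in members}


if __name__ == "__main__":
    # Placeholder IDs and names, not real bioguide IDs or CIKs.
    members = [("X000001", "John Q. Smith Jr."), ("X000002", "José Díaz")]
    filers = [
        ("0000000001", "SMITH JOHN Q"),
        ("0000000002", "Smith, John Q."),
        ("0000000003", "Diaz Jose"),
    ]
    for bio, ciks in match_members(members, filers).items():
        status = "ambiguous" if len(ciks) > 1 else "ok" if ciks else "no match"
        print(bio, ciks, status)
```

Anything flagged as ambiguous still needs a second signal (state, filing address, employer) before it can be trusted.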

Not even trying:

∙	House travel disclosures – PDF only. Quarterly scanned documents. No API, no XML, no structured data of any kind. Just PDFs you parse with pdfplumber and pray the table formatting is consistent.

∙	Senate travel – published in the Congressional Record as text dumps. Good luck.
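A rough sketch of the "pray the table formatting is consistent" step, assuming the travel PDFs contain extractable text tables (scanned pages would need OCR first, which this does not do). The column names are hypothetical, and `split_rows` and `parse_pdf` are names invented here. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_tables()` come from pdfplumber itself.

```python
# Hypothetical column layout; the real disclosures may differ from file to file.
EXPECTED_HEADER = ["Traveler", "Destination", "Sponsor", "Cost"]


def clean_cell(cell):
    """pdfplumber returns None for empty cells and keeps embedded newlines."""
    return " ".join((cell or "").split())


def split_rows(table, expected_header=EXPECTED_HEADER):
    """Split one extracted table into (good_rows, rejected_rows).

    Rows with the wrong number of columns, or with no content at all,
    are set aside instead of being silently loaded.
    """
    header = [h.lower() for h in expected_header]
    good, rejected = [], []
    for row in table:
        cells = [clean_cell(c) for c in row]
        if [c.lower() for c in cells] == header:
            continue  # header row, possibly repeated on every page
        if len(cells) != len(header) or not any(cells):
            rejected.append(cells)
        else:
            good.append(dict(zip(expected_header, cells)))
    return good, rejected


def parse_pdf(path, expected_header=EXPECTED_HEADER):
    """Run split_rows over every table on every page of one PDF."""
    import pdfplumber  # third-party dependency, imported only when needed

    good, rejected = [], []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                g, r = split_rows(table, expected_header)
                good.extend(g)
                rejected.extend(r)
    return good, rejected
```

Keeping the rejected rows around makes it possible to see when a quarter's layout changed, rather than discovering it later as missing data.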

Actually works:

∙	FEC API – functional, rate limited, but real data

∙	That’s basically it
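Since the FEC API is rate limited, a minimal sketch of a retry wrapper may help, assuming the service answers with HTTP 429 when the limit is hit (503 is included for flaky sources like the Senate eFD). The base URL is the public OpenFEC one. The helper names, retry counts, and delays are assumptions, and the exact paths and parameters should be checked against the FEC documentation.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BASE = "https://api.open.fec.gov/v1"


def build_url(path, api_key, **params):
    """Build a request URL with the API key and any extra query parameters."""
    query = urllib.parse.urlencode({"api_key": api_key, **params})
    return f"{BASE}{path}?{query}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def get_with_backoff(url, fetch=fetch_json, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on 429/503 with exponential backoff.

    Other HTTP errors, and the final failed attempt, are re-raised.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            if err.code not in (429, 503) or attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


if __name__ == "__main__":
    # The path and the "q" parameter are assumptions; DEMO_KEY is heavily throttled.
    url = build_url("/candidates/", "DEMO_KEY", q="tuberville")
    print(get_with_backoff(url).get("results", [])[:1])
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic testable without touching the network.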

Every GitHub repo I find for congressional data scraping is archived, abandoned, or points to APIs that no longer exist. Every nonprofit that used to aggregate this data has either shut down or gone behind a paywall. The raw government sources exist but they’re spread across six different agencies using six different formats with six different auth methods and zero shared identifiers.
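One way to cope with the missing shared identifiers is to keep a hand-maintained crosswalk keyed on the bioguide ID and attach every other source's ID to it. This is only a sketch of that idea: the class names, the source labels, and the placeholder IDs are invented for illustration, not taken from any existing project.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    bioguide_id: str
    name: str
    # Maps a source label to the IDs that source uses,
    # e.g. {"fec": [...], "sec_cik": [...]}.
    external_ids: dict = field(default_factory=dict)


class Crosswalk:
    """Resolve any (source, id) pair back to a single member record."""

    def __init__(self):
        self._members = {}
        self._index = {}

    def add(self, member):
        self._members[member.bioguide_id] = member
        for source, ids in member.external_ids.items():
            for ext_id in ids:
                self._index[(source, ext_id)] = member.bioguide_id

    def resolve(self, source, ext_id):
        """Return the Member for this source ID, or None if it is unknown."""
        return self._members.get(self._index.get((source, ext_id)))


if __name__ == "__main__":
    xwalk = Crosswalk()
    # Placeholder IDs, not real identifiers.
    xwalk.add(Member("X000001", "Jane Doe", {"fec": ["FEC-0001"], "sec_cik": ["0000000001"]}))
    print(xwalk.resolve("sec_cik", "0000000001"))
    print(xwalk.resolve("fec", "does-not-exist"))
```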

I can’t be the only person who needs this data. What am I missing? Is there a source or project I haven’t found? Is someone maintaining scrapers that actually work in 2026?

I’m building it anyway (github.com/OpenSourcePatents/Congresswatch) but right now it feels like I’m assembling a car engine from parts scattered across different junkyards, and half the junkyards are closed on weekends.

What do you all use?

reddit.com

u/-Darkened-Soul — 6 hours ago

▲ 4 r/datasets

[Self Promotion] Feature Extracted Human and Synthetic Voice datasets - free research use, legally clean, no audio.

tl;dr: Feature-extracted human and synthetic speech datasets, free for research and non-commercial use.

Hello,

I am building a pair of datasets. The first, the Human Speech Atlas, has prosody and voice telemetry extracted from Mozilla Data Collective datasets: currently 90+ languages and 500k samples of normalized data, with all PII scrubbed. I plan to expand to 200+ languages.

The second, the Synthetic Speech Atlas, has features extracted from synthetic voices, covering a wide variety of vocoders, codecs, deepfake attack types, etc. It passed 1 million samples a little while ago and should top 2 million by completion.

The data dictionary and methods are up on Hugging Face.

https://huggingface.co/moonscape-software

This is my first real foray into dataset construction, so I'd love some feedback.

u/Wooden_Leek_7258 — 18 hours ago

▲ 1 r/datasets

Sources for European energy / weather data?

Around 2018, towards the end of my PhD in math, I got hired by my university to work on a European Horizon 2020 project whose goal was to predict energy consumption and price.

I would like to release into the public domain some updated predictions using the models we built. The problem is that I can't reuse the original data to validate the models, because it was commercially sourced. My question is: where can I find reliable historical data on weather, energy consumption, and energy production in the European Union?

u/servermeta_net — 8 hours ago

▲ 2 r/datasets

Indian language speech datasets available (explicit consent from contributors)

Hi all,

I’m part of a team collecting speech datasets in several Indian languages. All recordings are collected directly from contributors who provide explicit consent for their audio to be used and licensed.

The datasets can be offered with either exclusive or non-exclusive rights depending on the requirement.

If you’re working on speech recognition, text-to-speech, voice AI, or other audio-related ML projects and are looking for Indian language data, feel free to get in touch. Happy to share more information about availability and languages covered.

— Divyam Bhatia
Founder, DataCatalyst

u/Trick-Praline6688 — 15 hours ago

▲ 0 r/datasets

[Self-Promotion] Aggregating Prediction Market Data for Investor Insights

Implied Data helps investors make sense of prediction markets. We transform live market odds on stocks, earnings, and major events into structured dashboards that show what the crowd expects, what could change the view, and where the strongest signals are emerging.

u/BadBoyBrando — 20 hours ago