u/justme_cliff

I spent 2 weeks crawling and cleaning SEC EDGAR filings so you don't have to: free dataset, 5k+ chunks, LLM-ready

I'm a solo dev and I got tired of finding financial datasets on HuggingFace that were either metadata-only, tiny, outdated, or just not actually usable for training.

So I built my own pipeline. It has 10 stages, including crawl, normalize, dedup (exact and near), quality filter, and a paragraph-aware chunker. The result is 5,179 clean chunks from 261 SEC filings, all in JSONL with a consistent schema: ticker, CIK, filing type, date, exchange, SIC code, and the full text.
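
For a rough idea of what paragraph-aware chunking and exact/near dedup can look like, here is a minimal sketch. It is not the actual pipeline code: the function names, the 1,000-character limit, the 5-word shingles, the 0.8 Jaccard threshold, and the JSON field names are all assumptions picked for illustration.

```python
import hashlib
import json
import re


def chunk_paragraphs(text, max_chars=1000):
    """Pack whole paragraphs into chunks of up to max_chars.

    Paragraphs are never split, so a single paragraph longer than
    max_chars ends up as its own oversize chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, n=5):
    """Set of n-word windows, used to compare chunks for near-duplication."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedup(chunks, near_threshold=0.8):
    """Drop exact duplicates (by hash) and near duplicates (by Jaccard).

    The pairwise comparison is O(n^2); fine for a few thousand chunks,
    but a larger corpus would want something like MinHash/LSH instead.
    """
    seen_hashes, kept, kept_shingles = set(), [], []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(chunk)
        if any(jaccard(sh, other) >= near_threshold for other in kept_shingles):
            continue  # near duplicate of a chunk we already kept
        seen_hashes.add(digest)
        kept.append(chunk)
        kept_shingles.append(sh)
    return kept


def to_jsonl_line(chunk, meta):
    """Serialize one chunk plus its metadata as a single JSONL line."""
    return json.dumps({**meta, "text": chunk}, ensure_ascii=False)
```

Usage would be roughly `dedup(chunk_paragraphs(filing_text))`, then one `to_jsonl_line` call per surviving chunk with that filing's metadata (hypothetical keys such as `ticker`, `cik`, `filing_type`).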

It's free. No email wall, no gating.

I also do custom datasets on demand. If you're training on something specific and need clean data for a domain, just DM me and we can figure it out.

Built this as the first dataset under Zorynthiq, a training data company I'm bootstrapping. Finance is just the first vertical.

Happy to answer questions about the pipeline or the schema.

u/justme_cliff — 17 hours ago