Hi folks,
I’m designing a large-scale SFTP ingestion + distribution platform on Apache NiFi (3-node cluster) and I’m looking for guidance on how to structure the flow end-to-end.
Context
We operate a data exchange platform where hundreds of external systems communicate exclusively via SFTP.
Each system can:
- produce files (sources)
- consume files (destinations)
Each source may:
- expose multiple directories
- produce multiple file types
- use different filename conventions (often embedding dates in inconsistent formats)
- require different ingestion schedules per directory/file type
What I need to build
A scalable, modular NiFi flow that handles:
Ingestion
- Poll hundreds of SFTP sources
- Support multiple directories and patterns per source
- Support different schedules per file type
Metadata enrichment
Each file must be enriched with:
- extracted file date (multiple filename patterns, fallback to ingestion time; rough sketch below)
- source system identifier
- file type classification
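For the date extraction specifically, the logic I have in mind is roughly the following (the pattern list and example filenames are made up; whether this ends up as an UpdateAttribute chain or a scripted processor is exactly the kind of thing I'm unsure about):

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern list; the real sources use many more conventions than this.
FILENAME_DATE_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),  # e.g. report_2024-05-31.csv
    (re.compile(r"\d{8}"),             "%Y%m%d"),    # e.g. EXTRACT_20240531.txt
    (re.compile(r"\d{2}-\d{2}-\d{4}"), "%d-%m-%Y"),  # e.g. invoice_31-05-2024.xml
]

def extract_file_date(filename: str) -> datetime:
    """Try each known filename pattern; fall back to ingestion time if none match."""
    for pattern, fmt in FILENAME_DATE_PATTERNS:
        match = pattern.search(filename)
        if match:
            try:
                return datetime.strptime(match.group(0), fmt)
            except ValueError:
                continue  # digits matched but not a valid date, try the next pattern
    return datetime.now(timezone.utc)  # fallback: ingestion time
```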
Backup & Archival
Immediately after download:
- compress file (.gz)
- store in partitioned path:
  /backup/{file_type}/year=YYYY/month=MM/day=DD/
- Storage must be immutable
- Must support replay from archive
Important constraint: post-download actions on the source (delete/move/etc.) must only happen AFTER the file is successfully stored in backup.
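Concretely, the archive step I'm picturing is something like this (paths and names are purely illustrative):

```python
import gzip
import shutil
from datetime import datetime
from pathlib import Path

BACKUP_ROOT = Path("/backup")  # hypothetical mount; planned to be write-once

def archive_file(local_copy: Path, file_type: str, file_date: datetime) -> Path:
    """Gzip the downloaded file into the partitioned archive path.
    Any delete/move/rename on the source only happens after this returns."""
    target_dir = (BACKUP_ROOT / file_type
                  / f"year={file_date:%Y}" / f"month={file_date:%m}" / f"day={file_date:%d}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (local_copy.name + ".gz")
    with local_copy.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```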
Post-download source actions (configurable per flow)
- delete file
- move file
- rename file
- or do nothing
Transformation layer
- Optional transformations per destination (example after this list):
- PII masking
- format conversion (e.g.: CSV → Parquet)
- One input file may produce multiple outputs:
- raw → destination A
- transformed → destination B
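To give an idea of what one per-destination transform might look like if it ran in a scripted step or a small external service (pandas/pyarrow are just what I'd reach for first, not a settled choice):

```python
import hashlib
import pandas as pd  # assuming pandas + pyarrow for CSV -> Parquet; not a decision

PII_COLUMNS = ["email", "phone"]  # hypothetical; would come from per-destination config

def mask_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Replace PII values with a stable hash so downstream joins still work."""
    masked = df.copy()
    for col in PII_COLUMNS:
        if col in masked.columns:
            masked[col] = masked[col].astype(str).map(
                lambda v: hashlib.sha256(v.encode()).hexdigest()[:16]
            )
    return masked

def csv_to_parquet(csv_path: str, parquet_path: str, mask: bool) -> None:
    """One transformed output; the raw copy for destination A is delivered untouched."""
    df = pd.read_csv(csv_path)
    if mask:
        df = mask_pii(df)
    df.to_parquet(parquet_path, index=False)
```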
Distribution
- Deliver to multiple destination SFTP servers
- Support parallel delivery
- Different formats per destination
Operational requirements
- Replay from archive (single file, batch, time range; sketch after this list)
- Full observability at file level (ingested, transformed, delivered, failures)
- Clear failure isolation (source vs transform vs destination)
- Retry + dead-letter handling
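For replay, I'd want something that walks the partition layout from the archive sketch above, e.g. for a time range:

```python
from datetime import datetime, timedelta
from pathlib import Path

BACKUP_ROOT = Path("/backup")  # same hypothetical layout as the archive sketch above

def list_archive_files(file_type: str, start: datetime, end: datetime) -> list[Path]:
    """Collect archived files in a date range so they can be re-injected into the flow."""
    hits: list[Path] = []
    day = start
    while day.date() <= end.date():
        partition = (BACKUP_ROOT / file_type
                     / f"year={day:%Y}" / f"month={day:%m}" / f"day={day:%d}")
        if partition.is_dir():
            hits.extend(sorted(partition.glob("*.gz")))
        day += timedelta(days=1)
    return hits
```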
Current approach
Right now, I’ve started by creating end-to-end flows per file type, where each Process Group contains:
- ingestion
- enrichment
- archive
- transformations
- delivery
So essentially: one Process Group = one file type pipeline
The problem
I can already see this approach won’t scale:
- number of Process Groups grows quickly (file types × sources × directories)
- a lot of duplicated logic across flows
- hard to keep everything consistent and maintainable
- unclear how to make this truly config-driven
Also, I’m not sure where the right boundary is between:
- reusable/shared components
- per-file-type logic
What I’m looking for
For those who’ve built something similar at scale:
- How would you structure this architecture in NiFi?
- Is “one PG per file type” a bad pattern long-term?
- How do you avoid flow duplication while still supporting different transformations per destination?
- How do you model {source + directory + schedule + file type} without exploding the number of processors? (rough config sketch below)
- Any proven patterns for keeping this config-driven and maintainable?
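To make those last questions concrete, this is roughly the shape of the per-feed record I've been imagining driving a small set of generic process groups (all field names are mine, not an existing NiFi construct):

```python
# Entirely hypothetical field names; the open question is where this lives
# (Parameter Contexts, a lookup service, a database) and which components read it.
FEEDS = [
    {
        "source_system": "bank_x",
        "sftp": {"host": "sftp.bank-x.example", "port": 22, "credential_ref": "bank_x_key"},
        "directory": "/outbound/trades",
        "filename_pattern": r"TRADES_\d{8}\.csv",
        "file_type": "trades",
        "schedule": "*/5 * * * *",  # per-directory / per-file-type polling cadence
        "post_download": {"action": "move", "target": "/outbound/trades/processed"},
        "destinations": [
            {"name": "dest_a", "transform": None},  # raw passthrough
            {"name": "dest_b", "transform": {"mask_pii": True, "format": "parquet"}},
        ],
    },
    # ...hundreds more records, generated and validated outside NiFi
]
```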