Hi folks,

I’m designing a large-scale SFTP ingestion + distribution platform on Apache NiFi (3-node cluster) and I’m looking for guidance on how to structure the flow end-to-end.

Context
We operate a data exchange platform where hundreds of external systems communicate exclusively via SFTP.

Each system can:

  • produce files (sources)
  • consume files (destinations)

Each source may:

  • expose multiple directories
  • produce multiple file types
  • use different filename conventions (often embedding dates in inconsistent formats)
  • require different ingestion schedules per directory/file type

What I need to build
A scalable, modular NiFi flow that handles:

Ingestion

  • Poll hundreds of SFTP sources
  • Support multiple directories and patterns per source
  • Support different schedules per file type

Metadata enrichment
Each file must be enriched with:

  • extracted file date (multiple filename patterns, fallback to ingestion time; see the sketch after this list)
  • source system identifier
  • file type classification
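
To make the date-extraction point concrete, the logic I have in mind is roughly this (a sketch only; the patterns are invented examples, and in NiFi it would probably live in an ExecuteScript processor or an UpdateAttribute chain):

```python
import re
from datetime import datetime

# Invented example conventions; the real table would cover every source's formats.
FILENAME_DATE_PATTERNS = [
    (re.compile(r"(\d{4}-\d{2}-\d{2})"), "%Y-%m-%d"),  # report_2024-05-31.csv
    (re.compile(r"(\d{8})"), "%Y%m%d"),                # export_20240531.txt
    (re.compile(r"(\d{2}-\d{2}-\d{4})"), "%d-%m-%Y"),  # inv_31-05-2024.xml
]

def extract_file_date(filename: str) -> datetime:
    """Try each known pattern in order; fall back to ingestion time."""
    for pattern, fmt in FILENAME_DATE_PATTERNS:
        match = pattern.search(filename)
        if match:
            try:
                return datetime.strptime(match.group(1), fmt)
            except ValueError:
                continue  # digits matched but not a valid date
    return datetime.now()  # fallback: treat ingestion time as the file date
```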

Backup & Archival
Immediately after download:

  • compress file (.gz)
  • store in partitioned path: /backup/{file_type}/year=YYYY/month=MM/day=DD/
  • storage must be immutable
  • support replay from archive

Important constraint: post-download actions on the source (delete/move/etc.) must only happen AFTER the file is successfully stored in backup.
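
Spelled out, the archive step I'm describing looks like this (layout per the spec above; in the real flow the source-side delete/move would be wired strictly after the success relationship of this step):

```python
import gzip
import shutil
from datetime import datetime
from pathlib import Path

def archive_file(local_path: Path, file_type: str, file_date: datetime,
                 backup_root: Path = Path("/backup")) -> Path:
    """Gzip a downloaded file into the partitioned, write-once layout.
    Source-side cleanup must only run after this returns successfully."""
    partition = backup_root / file_type / (
        f"year={file_date:%Y}/month={file_date:%m}/day={file_date:%d}"
    )
    partition.mkdir(parents=True, exist_ok=True)
    archive_path = partition / (local_path.name + ".gz")
    with open(local_path, "rb") as src, gzip.open(archive_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return archive_path
```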

Post-download source actions (configurable per flow)

  • delete file
  • move file
  • rename file
  • or do nothing

Transformation layer

  • Optional transformations per destination:
    • PII masking
    • format conversion (e.g. CSV → Parquet; see the sketch after this list)
  • One input file may produce multiple outputs:
    • raw → destination A
    • transformed → destination B
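
For the conversion piece, I believe NiFi's ConvertRecord (CSVReader → ParquetRecordSetWriter) covers this natively; the minimal standalone equivalent I have in mind:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Read a CSV with schema inference and rewrite it as Parquet."""
    table = pv.read_csv(csv_path)        # Arrow infers column types
    pq.write_table(table, parquet_path)
```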

Distribution

  • Deliver to multiple destination SFTP servers
  • Support parallel delivery
  • Different formats per destination
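
Conceptually the delivery side is a fan-out with bounded parallelism. The hosts and credentials below are placeholders, and inside NiFi this would be parameterized PutSFTP processors rather than hand-rolled code:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
import paramiko

# Hypothetical destination registry; in practice this comes from config.
DESTINATIONS = [
    {"host": "dest-a.example.com", "user": "exchange", "password": "***", "remote_dir": "/inbox"},
    {"host": "dest-b.example.com", "user": "exchange", "password": "***", "remote_dir": "/drop"},
]

def deliver(local_path: str, dest: dict) -> str:
    """Upload one file to one destination over SFTP."""
    transport = paramiko.Transport((dest["host"], 22))
    try:
        transport.connect(username=dest["user"], password=dest["password"])
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.put(local_path, f"{dest['remote_dir']}/{os.path.basename(local_path)}")
        return dest["host"]
    finally:
        transport.close()

def deliver_all(local_path: str) -> None:
    """Fan out to every destination in parallel; surface per-destination failures."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(deliver, local_path, d): d for d in DESTINATIONS}
        for fut in as_completed(futures):
            fut.result()  # raises here if that destination failed
```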

Operational requirements

  • Replay from archive (single file, batch, time range)
  • Full observability at file level (ingested, transformed, delivered, failures)
  • Clear failure isolation (source vs transform vs destination)
  • Retry + dead-letter handling
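
The partition layout above should make replay cheap to implement; the enumeration I picture for a single file type over a date range:

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Iterator

def archived_files(file_type: str, start: date, end: date,
                   backup_root: Path = Path("/backup")) -> Iterator[Path]:
    """Yield archives for one file type over an inclusive date range,
    walking only the matching year=/month=/day= partitions."""
    day = start
    while day <= end:
        partition = backup_root / file_type / (
            f"year={day:%Y}/month={day:%m}/day={day:%d}"
        )
        if partition.is_dir():
            yield from sorted(partition.glob("*.gz"))
        day += timedelta(days=1)
```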

Current approach
Right now, I’ve started by creating end-to-end flows per file type, where each Process Group contains:

  • ingestion
  • enrichment
  • archive
  • transformations
  • delivery

So essentially: one Process Group = one file type pipeline

The problem
I can already see that this approach won’t scale:

  • number of Process Groups grows quickly (file types × sources × directories)
  • a lot of duplicated logic across flows
  • hard to keep everything consistent and maintainable
  • unclear how to make this truly config-driven

Also, I’m not sure where the right boundary is between reusable/shared components and per-file-type logic.

What I’m looking for
For those who’ve built something similar at scale:

  • How would you structure this architecture in NiFi?
  • Is “one PG per file type” a bad pattern long-term?
  • How do you avoid flow duplication while still supporting different transformations per destination?
  • How do you model {source + directory + schedule + file type} without exploding the number of processors?
  • Any proven patterns for keeping this config-driven and maintainable?
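
To make that last question concrete: what I’d like is for the whole {source, directory, schedule, file type} matrix to be plain data consumed by a small number of generic flows, roughly this shape (all names hypothetical):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PostDownloadAction(Enum):
    DELETE = "delete"
    MOVE = "move"
    RENAME = "rename"
    NONE = "none"

@dataclass
class IngestionRule:
    source_id: str
    directory: str
    filename_pattern: str                  # regex for the listing filter
    file_type: str
    schedule: str                          # Quartz cron driving the poll
    post_download: PostDownloadAction = PostDownloadAction.NONE
    destinations: List[str] = field(default_factory=list)
    transforms: List[str] = field(default_factory=list)  # e.g. ["pii_mask", "csv_to_parquet"]

# Example row; the full matrix would live in a DB/lookup table, not in code.
RULES = [
    IngestionRule(
        source_id="erp-01",
        directory="/outbound/invoices",
        filename_pattern=r"INV_\d{8}\.csv",
        file_type="invoice",
        schedule="0 0/15 * * * ?",
        post_download=PostDownloadAction.DELETE,
        destinations=["warehouse", "partner-b"],
        transforms=["csv_to_parquet"],
    ),
]
```

The open question is then how best to feed NiFi from a table like that: LookupRecord/LookupService, parameter contexts, or generating flows through the REST API all seem plausible.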