Hi folks,

I’m designing a large-scale SFTP ingestion + distribution platform on Apache NiFi (3-node cluster) and I’m looking for guidance on how to structure the flow end-to-end.

Context
We operate a data exchange platform where hundreds of external systems communicate exclusively via SFTP.

Each system can:

  • produce files (sources)
  • consume files (destinations)

Each source may:

  • expose multiple directories
  • produce multiple file types
  • use different filename conventions (often embedding dates in inconsistent formats)
  • require different ingestion schedules per directory/file type

What I need to build
A scalable, modular NiFi flow that handles:

Ingestion

  • Poll hundreds of SFTP sources
  • Support multiple directories and patterns per source
  • Support different schedules per file type

Metadata enrichment
Each file must be enriched with:

  • extracted file date (multiple filename patterns, fallback to ingestion time; see the sketch after this list)
  • source system identifier
  • file type classification
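
To make the date-extraction point concrete, the logic I have in mind is roughly this (a sketch only; the patterns are invented examples, and in NiFi it would probably live in an ExecuteScript processor or an UpdateAttribute chain):

```python
import re
from datetime import datetime

# Invented example conventions; the real table would cover every source's formats.
FILENAME_DATE_PATTERNS = [
    (re.compile(r"(\d{4}-\d{2}-\d{2})"), "%Y-%m-%d"),  # report_2024-05-31.csv
    (re.compile(r"(\d{8})"), "%Y%m%d"),                # export_20240531.txt
    (re.compile(r"(\d{2}-\d{2}-\d{4})"), "%d-%m-%Y"),  # inv_31-05-2024.xml
]

def extract_file_date(filename: str) -> datetime:
    """Try each known pattern in order; fall back to ingestion time."""
    for pattern, fmt in FILENAME_DATE_PATTERNS:
        match = pattern.search(filename)
        if match:
            try:
                return datetime.strptime(match.group(1), fmt)
            except ValueError:
                continue  # digits matched but not a valid date
    return datetime.now()  # fallback: treat ingestion time as the file date
```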

Backup & Archival
Immediately after download:

  • compress file (.gz)
  • store in partitioned path: /backup/{file_type}/year=YYYY/month=MM/day=DD/
  • storage must be immutable
  • support replay from archive

Important constraint: post-download actions on the source (delete/move/etc.) must only happen AFTER the file is successfully stored in backup.
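
Spelled out, the archive step I'm describing looks like this (layout per the spec above; in the real flow the source-side delete/move would be wired strictly after the success relationship of this step):

```python
import gzip
import shutil
from datetime import datetime
from pathlib import Path

def archive_file(local_path: Path, file_type: str, file_date: datetime,
                 backup_root: Path = Path("/backup")) -> Path:
    """Gzip a downloaded file into the partitioned, write-once layout.
    Source-side cleanup must only run after this returns successfully."""
    partition = backup_root / file_type / (
        f"year={file_date:%Y}/month={file_date:%m}/day={file_date:%d}"
    )
    partition.mkdir(parents=True, exist_ok=True)
    archive_path = partition / (local_path.name + ".gz")
    with open(local_path, "rb") as src, gzip.open(archive_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return archive_path
```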

Post-download source actions (configurable per flow)

  • delete file
  • move file
  • rename file
  • or do nothing

Transformation layer

  • Optional transformations per destination:
    • PII masking
    • format conversion (e.g. CSV → Parquet; see the sketch after this list)
  • One input file may produce multiple outputs:
    • raw → destination A
    • transformed → destination B
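
For the conversion piece, I believe NiFi's ConvertRecord (CSVReader → ParquetRecordSetWriter) covers this natively; the minimal standalone equivalent I have in mind:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Read a CSV with schema inference and rewrite it as Parquet."""
    table = pv.read_csv(csv_path)        # Arrow infers column types
    pq.write_table(table, parquet_path)
```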

Distribution

  • Deliver to multiple destination SFTP servers
  • Support parallel delivery
  • Different formats per destination
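
Conceptually the delivery side is a fan-out with bounded parallelism. The hosts and credentials below are placeholders, and inside NiFi this would be parameterized PutSFTP processors rather than hand-rolled code:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
import paramiko

# Hypothetical destination registry; in practice this comes from config.
DESTINATIONS = [
    {"host": "dest-a.example.com", "user": "exchange", "password": "***", "remote_dir": "/inbox"},
    {"host": "dest-b.example.com", "user": "exchange", "password": "***", "remote_dir": "/drop"},
]

def deliver(local_path: str, dest: dict) -> str:
    """Upload one file to one destination over SFTP."""
    transport = paramiko.Transport((dest["host"], 22))
    try:
        transport.connect(username=dest["user"], password=dest["password"])
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.put(local_path, f"{dest['remote_dir']}/{os.path.basename(local_path)}")
        return dest["host"]
    finally:
        transport.close()

def deliver_all(local_path: str) -> None:
    """Fan out to every destination in parallel; surface per-destination failures."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(deliver, local_path, d): d for d in DESTINATIONS}
        for fut in as_completed(futures):
            fut.result()  # raises here if that destination failed
```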

Operational requirements

  • Replay from archive (single file, batch, time range)
  • Full observability at file level (ingested, transformed, delivered, failures)
  • Clear failure isolation (source vs transform vs destination)
  • Retry + dead-letter handling
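
The partition layout above should make replay cheap to implement; the enumeration I picture for a single file type over a date range:

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Iterator

def archived_files(file_type: str, start: date, end: date,
                   backup_root: Path = Path("/backup")) -> Iterator[Path]:
    """Yield archives for one file type over an inclusive date range,
    walking only the matching year=/month=/day= partitions."""
    day = start
    while day <= end:
        partition = backup_root / file_type / (
            f"year={day:%Y}/month={day:%m}/day={day:%d}"
        )
        if partition.is_dir():
            yield from sorted(partition.glob("*.gz"))
        day += timedelta(days=1)
```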

Current approach
Right now, I’ve started by creating end-to-end flows per file type, where each Process Group contains:

  • ingestion
  • enrichment
  • archive
  • transformations
  • delivery

So essentially: one Process Group = one file type pipeline

The problem
I can already see that this approach won’t scale:

  • number of Process Groups grows quickly (file types × sources × directories)
  • a lot of duplicated logic across flows
  • hard to keep everything consistent and maintainable
  • unclear how to make this truly config-driven

Also, I’m not sure where the right boundary is between reusable/shared components and per-file-type logic.

What I’m looking for
For those who’ve built something similar at scale:

  • How would you structure this architecture in NiFi?
  • Is “one PG per file type” a bad pattern long-term?
  • How do you avoid flow duplication while still supporting different transformations per destination?
  • How do you model {source + directory + schedule + file type} without exploding the number of processors?
  • Any proven patterns for keeping this config-driven and maintainable?
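
To make that last question concrete: what I’d like is for the whole {source, directory, schedule, file type} matrix to be plain data consumed by a small number of generic flows, roughly this shape (all names hypothetical):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PostDownloadAction(Enum):
    DELETE = "delete"
    MOVE = "move"
    RENAME = "rename"
    NONE = "none"

@dataclass
class IngestionRule:
    source_id: str
    directory: str
    filename_pattern: str                  # regex for the listing filter
    file_type: str
    schedule: str                          # Quartz cron driving the poll
    post_download: PostDownloadAction = PostDownloadAction.NONE
    destinations: List[str] = field(default_factory=list)
    transforms: List[str] = field(default_factory=list)  # e.g. ["pii_mask", "csv_to_parquet"]

# Example row; the full matrix would live in a DB/lookup table, not in code.
RULES = [
    IngestionRule(
        source_id="erp-01",
        directory="/outbound/invoices",
        filename_pattern=r"INV_\d{8}\.csv",
        file_type="invoice",
        schedule="0 0/15 * * * ?",
        post_download=PostDownloadAction.DELETE,
        destinations=["warehouse", "partner-b"],
        transforms=["csv_to_parquet"],
    ),
]
```

The open question is then how best to feed NiFi from a table like that: LookupRecord/LookupService, parameter contexts, or generating flows through the REST API all seem plausible.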