u/Full_Employment_4289

I'm working on a system that collects and normalizes flight pricing data at scale, and I'm running into real-world issues with the data sources.

The goal is to gather prices across routes and future dates (~12 months) to build pricing trends and estimates (not a booking engine).

Current architecture:

- FastAPI backend

- Scheduled collection jobs (batch-based)

- Data stored and reused for trend analysis

- Supports one-way, round-trip, and multi-city queries
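
To make the stored data concrete, here is a rough sketch of what one normalized record could look like. The class and field names (`Leg`, `PriceObservation`, `query_key`) are illustrative assumptions for this post, not the actual schema; the point is only that one-way, round-trip, and multi-city queries can all be expressed as an ordered list of legs.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from decimal import Decimal
from typing import Tuple


@dataclass(frozen=True)
class Leg:
    origin: str        # airport code, e.g. "LHR"
    destination: str   # airport code, e.g. "JFK"
    depart_on: date


@dataclass(frozen=True)
class PriceObservation:
    # One leg is a one-way trip, two mirrored legs are a round-trip,
    # and any other sequence of legs is a multi-city itinerary.
    legs: Tuple[Leg, ...]
    source: str            # hypothetical provider label
    amount: Decimal
    currency: str
    observed_at: datetime

    def query_key(self) -> str:
        """Stable key for grouping repeated observations of the same itinerary."""
        return "|".join(
            f"{leg.origin}-{leg.destination}@{leg.depart_on.isoformat()}"
            for leg in self.legs
        )


obs = PriceObservation(
    legs=(
        Leg("LHR", "JFK", date(2025, 6, 1)),
        Leg("JFK", "LHR", date(2025, 6, 10)),
    ),
    source="provider_a",
    amount=Decimal("412.50"),
    currency="GBP",
    observed_at=datetime(2025, 1, 15, tzinfo=timezone.utc),
)
print(obs.query_key())
```

Keeping the key independent of `source` and `observed_at` is what lets repeated queries for the same itinerary be compared against each other later.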

Issues encountered:

  1. Data inconsistency

    Prices vary significantly across sources and even across repeated queries (same route/date returning different values).

  2. API limitations

- Some APIs (e.g. metasearch) require strict session tracking (user IDs, headers, IP forwarding)

- Production access is gated, and it's unclear how well it would scale

  3. Scraping challenges

- Works initially, but:

  - frequent breakage

  - anti-bot protection

  - cost increases with JS rendering

- Not confident in long-term stability
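
For the data inconsistency issue in particular, one thing I have been considering is storing every observation and reporting a median plus a spread instead of a single price. A minimal sketch, assuming observations arrive as `(key, source, price)` tuples in one currency; the function name `summarize` and the output fields are made up for illustration:

```python
from collections import defaultdict
from statistics import median


def summarize(observations):
    """Group (key, source, price) tuples by key; return median, spread and sample count."""
    by_key = defaultdict(list)
    for key, _source, price in observations:
        by_key[key].append(price)

    summary = {}
    for key, prices in by_key.items():
        mid = median(prices)  # assumes prices are positive, so mid is never zero
        summary[key] = {
            "median": mid,
            # Range between highest and lowest quote, as a percentage of the median.
            "spread_pct": round((max(prices) - min(prices)) / mid * 100, 1),
            "samples": len(prices),
        }
    return summary


observations = [
    ("LHR-JFK@2025-06-01", "provider_a", 400.0),
    ("LHR-JFK@2025-06-01", "provider_b", 420.0),
    ("LHR-JFK@2025-06-01", "provider_a", 500.0),
]
result = summarize(observations)
print(result)
```

A wide spread on a key would then be a signal to re-query or to down-weight that estimate, though I am not sure this is how people handle it in practice.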

Constraints:

- High volume (10k–50k+ queries/month)

- Future date coverage

- Reasonable accuracy (not exact booking prices, but close)

- Budget-sensitive (GDS solutions likely too expensive)
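
To put the volume constraint in perspective, here is a back-of-the-envelope calculation of how often each route/date pair could be refreshed. The 100 routes and the weekly date sampling (52 departure dates over ~12 months) are assumptions picked for illustration only; the 10k and 50k figures are the monthly volumes above.

```python
def refreshes_per_month(monthly_queries: int, routes: int, dates_per_route: int) -> float:
    """How many times each (route, date) pair can be re-queried per month."""
    return monthly_queries / (routes * dates_per_route)


# Assumed: 100 routes, one sampled departure date per week for 12 months.
low = refreshes_per_month(10_000, routes=100, dates_per_route=52)
high = refreshes_per_month(50_000, routes=100, dates_per_route=52)
print(round(low, 2), round(high, 2))
```

Under those assumptions each pair gets refreshed roughly 2 to 10 times a month, which is part of why a batch pipeline seems more realistic to me than live queries.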

Main questions:

- What architecture works best for this type of system?

- Is scraping + caching a viable long-term approach?

- Do people typically combine multiple providers instead of relying on one?

- How do you deal with constantly changing pricing in downstream systems?

- Is it better to treat this as a data pipeline problem rather than a live query system?

Would appreciate insights from anyone who has worked on large-scale data collection systems or travel-related pricing infrastructure.

u/Full_Employment_4289 — 15 days ago