Working on a system that collects and normalizes flight pricing data at scale, and running into real-world issues with data sources.
The goal is to gather prices across routes and future dates (~12 months) to build pricing trends and estimates (not a booking engine).
Current architecture:
- FastAPI backend
- Scheduled collection jobs (batch-based)
- Data stored and reused for trend analysis
- Supports one-way, round-trip, and multi-city queries
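For context, the batch collection piece looks roughly like the sketch below: a job that fans out route × future-date queries to one provider and keeps the successful quotes. This is a simplified illustration, not the real code; `FareQuote`, `collect_batch`, and the `fetch` callable are made-up names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Optional

# Illustrative record shape; field names are assumptions, not the real schema.
@dataclass
class FareQuote:
    origin: str
    destination: str
    depart: date
    price: float
    source: str

def collect_batch(
    routes: list[tuple[str, str]],
    horizon_days: int,
    fetch: Callable[[str, str, date], Optional[float]],
    source: str,
) -> list[FareQuote]:
    """Fan out route x date queries for one provider; keep successful quotes."""
    today = date.today()
    quotes: list[FareQuote] = []
    for origin, dest in routes:
        for offset in range(1, horizon_days + 1):
            depart = today + timedelta(days=offset)
            price = fetch(origin, dest, depart)  # provider call; may return None
            if price is not None:
                quotes.append(FareQuote(origin, dest, depart, price, source))
    return quotes

# Usage with a stub provider:
quotes = collect_batch([("JFK", "LHR")], 3, lambda o, d, dt: 420.0, "stub")
print(len(quotes))  # 3
```

Each run appends to the stored dataset, so trend analysis happens over accumulated batches rather than live queries.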
Issues encountered:
Data inconsistency
Prices vary significantly across sources and even across repeated queries (same route/date returning different values).
API limitations
- Some APIs (e.g. metasearch) require strict session tracking (user IDs, headers, IP forwarding)
- Production access is gated, and it's unclear whether it scales to our query volume
Scraping challenges
- Works initially, but:
  - frequent breakage
  - anti-bot protection
  - cost increases with JS rendering
- Not confident in long-term stability
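The breakage above is partly mitigated with retries. A sketch of the kind of wrapper I mean, exponential backoff with jitter around any flaky fetch (function name and defaults are illustrative):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def fetch_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 4,
    base: float = 1.0,
    cap: float = 30.0,
) -> T:
    """Retry a flaky scrape/API call with exponential backoff and jitter.

    `fetch` is any zero-arg callable; parameters are illustrative defaults.
    Re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 2**attempt growth, capped, with jitter to avoid request bursts
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
            time.sleep(delay)
    raise RuntimeError("unreachable")
```

Retries only smooth over transient failures, though; they don't address selector breakage or sustained anti-bot blocking, which is the core stability worry.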
Constraints:
- High volume (10k–50k+ queries/month)
- Future date coverage
- Reasonable accuracy (not exact booking prices, but close)
- Budget-sensitive (GDS solutions likely too expensive)
Main questions:
- What architecture works best for this type of system?
- Is scraping + caching a viable long-term approach?
- Do people typically combine multiple providers instead of relying on one?
- How do you deal with constantly changing pricing in downstream systems?
- Is it better to treat this as a data pipeline problem rather than a live query system?
Would appreciate insights from anyone who has worked on large-scale data collection systems or travel-related pricing infrastructure.