Working with massive public datasets on the cloud — avoiding S3 egress hell
Hey folks,
I’ve been tinkering with ways to make large public datasets easier to work with in the cloud. My background’s in bioinformatics (Broad Institute / MIT), and one thing I keep running into is the time and cost of just moving data around — especially S3 → EC2 across regions, or when you want to do quick exploratory analysis without spinning up a full pipeline.
Curious how people here handle it:
- Do you mostly download data locally first, or work directly in the cloud?
- Any tricks for minimizing transfer costs or friction?
- How do you handle ad-hoc, exploratory work without building a full pipeline every time?
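For context on what I mean by "in-place": S3 supports HTTP Range requests, so for exploratory work you can pull just the bytes you need (a Parquet footer, a BAM index region) instead of downloading whole objects. A minimal sketch of the range planning below; the bucket/key names and the boto3 call in the comment are placeholders, not a specific setup I'm endorsing:

```python
def byte_ranges(object_size: int, chunk_size: int):
    """Yield HTTP Range header values covering an object of object_size bytes.

    Each yielded string is a valid value for the Range header of an
    S3 GetObject request (inclusive byte offsets).
    """
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1
        yield f"bytes={start}-{end}"

# With boto3 (not imported here), each range becomes one partial GET, e.g.:
#   s3.get_object(Bucket="my-bucket", Key="big.parquet", Range=rng)
# ("my-bucket" / "big.parquet" are hypothetical names for illustration.)
```

Ranged reads are also what lets format-aware tools (Parquet readers, samtools against an indexed BAM) skip most of an object entirely, which is where the real transfer savings show up.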
I’ve been experimenting with ways to work in-place on cloud datasets, and would love to hear what’s actually working for others.