u/Acceptable-Ad-2904

Working with massive public datasets in the cloud — avoiding S3 egress hell

Hey folks,

I’ve been tinkering with ways to make large public datasets easier to work with in the cloud. My background’s in bioinformatics (Broad Institute / MIT), and one thing I keep running into is the time and cost of just moving data around — cross-region S3 → EC2 transfers, or pulling a whole dataset down just to do some quick exploratory analysis that doesn’t warrant a full pipeline.

Curious how people here handle it:

  • Do you mostly download data locally first, or work directly in the cloud?
  • Any tricks for minimizing transfer costs or friction?
  • How do you handle ad-hoc, exploratory work without building a full pipeline every time?
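For scale, here’s a back-of-envelope sketch of why egress adds up. The ~$0.09/GB figure is the long-standing first-tier list price for S3-to-internet transfer; rates vary by region and tier and do change, so treat it as ballpark:

```python
def egress_cost_usd(gb, rate_per_gb=0.09):
    """Rough S3 -> internet egress cost. 0.09 USD/GB is a ballpark
    first-tier list price; check current AWS pricing for your region."""
    return gb * rate_per_gb

# One full download of a 500 GB dataset:
print(f"${egress_cost_usd(500):.2f}")  # -> $45.00
```

Cross-region S3 → EC2 transfer is cheaper per GB, but the same logic applies: every re-download multiplies it.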

I’ve been experimenting with ways to work in-place on cloud datasets, and would love to hear what’s actually working for others.
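To make “in-place” concrete: objects in public S3 buckets are served over plain HTTPS, so you can issue a ranged GET and pull only the bytes you need (a header, an index, the first few records) instead of the whole multi-GB object. A minimal stdlib-only sketch; the URL in the demo is a placeholder, not a real endpoint:

```python
from urllib.request import Request, urlopen

def byte_range(start, end):
    """HTTP Range header value for an inclusive byte span."""
    return f"bytes={start}-{end}"

def peek(url, n=1024):
    """Fetch only the first n bytes of a remote object via a ranged GET.
    Servers that support ranges reply 206 Partial Content."""
    req = Request(url, headers={"Range": byte_range(0, n - 1)})
    with urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Placeholder: objects in public S3 buckets have HTTPS URLs of the
    # form https://<bucket>.s3.amazonaws.com/<key> -- fill in a real one.
    head = peek("https://<bucket>.s3.amazonaws.com/<key>", n=256)
    print(head.decode("utf-8", "replace"))
```

You still pay egress on the bytes you pull, just kilobytes instead of gigabytes. The same trick works through boto3’s `get_object(Range=...)` if the bucket needs signed access.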

reddit.com
u/Acceptable-Ad-2904 — 11 hours ago

Exploring ways to reduce bioinformatics cloud costs + friction — would love input

Hi all — I used to work in bioinformatics at the Broad Institute and MIT, and recently started working on a project around improving access to large public datasets.

One thing I kept running into was how much time and money go into just getting the data onto a local machine (especially S3 egress fees) before you can even start analyzing.

I’ve been experimenting with ways to access and work with these datasets in-place (without downloading), and would love to sanity-check whether this is actually a pain point for others here.
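One concrete flavor of “in-place,” in case it helps the discussion: most genomics formats ship gzipped, and you can stream-decompress just the head of a remote .gz file to eyeball the header and first records without a full download. A stdlib-only sketch (the URL is a placeholder, not a real endpoint):

```python
import zlib
from urllib.request import Request, urlopen

def gz_head(stream, n_bytes=4096, chunk=8192):
    """Incrementally decompress a gzip stream until n_bytes of plain
    text are available, then stop reading. 16 + MAX_WBITS tells zlib
    to expect a gzip wrapper."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = b""
    while len(out) < n_bytes:
        raw = stream.read(chunk)
        if not raw:
            break
        out += d.decompress(raw)
    return out[:n_bytes]

if __name__ == "__main__":
    # Placeholder URL: point this at any public *.gz object.
    url = "https://<bucket>.s3.amazonaws.com/<key>.vcf.gz"
    with urlopen(Request(url)) as resp:
        print(gz_head(resp, 2048).decode("utf-8", "replace"))
```

Combined with a ranged GET, this is usually enough for the “is this dataset even what I think it is?” stage of exploratory work.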

Curious:

  • how are people currently handling large public datasets?
  • are you mostly downloading locally, or working directly in the cloud?
  • any workflows you’ve found that reduce friction/cost?

Happy to share more about what I’ve been building if useful — mainly just trying to learn from how others are approaching this.

u/Acceptable-Ad-2904 — 14 hours ago