u/Acceptable-Ad-2904

Working with massive public datasets in the cloud — avoiding S3 egress hell

Hey folks,

I’ve been tinkering with ways to make large public datasets easier to work with in the cloud. My background’s in bioinformatics (Broad Institute / MIT), and one thing I keep running into is the time and cost of just moving data around — cross-region S3 → EC2 transfers, or pulling a whole dataset down just to do some quick exploratory analysis that doesn’t warrant a full pipeline.

Curious how people here handle it:

  • Do you mostly download data locally first, or work directly in the cloud?
  • Any tricks for minimizing transfer costs or friction?
  • How do you handle ad-hoc, exploratory work without building a full pipeline every time?
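For scale, here’s a back-of-envelope sketch of why egress adds up. The ~$0.09/GB figure is the long-standing first-tier list price for S3-to-internet transfer; rates vary by region and tier and do change, so treat it as ballpark:

```python
def egress_cost_usd(gb, rate_per_gb=0.09):
    """Rough S3 -> internet egress cost. 0.09 USD/GB is a ballpark
    first-tier list price; check current AWS pricing for your region."""
    return gb * rate_per_gb

# One full download of a 500 GB dataset:
print(f"${egress_cost_usd(500):.2f}")  # -> $45.00
```

Cross-region S3 → EC2 transfer is cheaper per GB, but the same logic applies: every re-download multiplies it.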

I’ve been experimenting with ways to work in-place on cloud datasets, and would love to hear what’s actually working for others.
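To make “in-place” concrete: objects in public S3 buckets are served over plain HTTPS, so you can issue a ranged GET and pull only the bytes you need (a header, an index, the first few records) instead of the whole multi-GB object. A minimal stdlib-only sketch; the URL in the demo is a placeholder, not a real endpoint:

```python
from urllib.request import Request, urlopen

def byte_range(start, end):
    """HTTP Range header value for an inclusive byte span."""
    return f"bytes={start}-{end}"

def peek(url, n=1024):
    """Fetch only the first n bytes of a remote object via a ranged GET.
    Servers that support ranges reply 206 Partial Content."""
    req = Request(url, headers={"Range": byte_range(0, n - 1)})
    with urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Placeholder: objects in public S3 buckets have HTTPS URLs of the
    # form https://<bucket>.s3.amazonaws.com/<key> -- fill in a real one.
    head = peek("https://<bucket>.s3.amazonaws.com/<key>", n=256)
    print(head.decode("utf-8", "replace"))
```

You still pay egress on the bytes you pull, just kilobytes instead of gigabytes. The same trick works through boto3’s `get_object(Range=...)` if the bucket needs signed access.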

reddit.com
u/Acceptable-Ad-2904 — 11 hours ago

Exploring ways to reduce bioinformatics cloud costs + friction — would love input

Hi all — I used to work in bioinformatics at the Broad Institute and MIT, and recently started working on a project around improving access to large public datasets.

One thing I kept running into was how much time and money go into just getting the data onto a local machine (especially S3 egress fees) before you can even start analyzing.

I’ve been experimenting with ways to access and work with these datasets in-place (without downloading), and would love to sanity-check whether this is actually a pain point for others here.
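One concrete flavor of “in-place,” in case it helps the discussion: most genomics formats ship gzipped, and you can stream-decompress just the head of a remote .gz file to eyeball the header and first records without a full download. A stdlib-only sketch (the URL is a placeholder, not a real endpoint):

```python
import zlib
from urllib.request import Request, urlopen

def gz_head(stream, n_bytes=4096, chunk=8192):
    """Incrementally decompress a gzip stream until n_bytes of plain
    text are available, then stop reading. 16 + MAX_WBITS tells zlib
    to expect a gzip wrapper."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = b""
    while len(out) < n_bytes:
        raw = stream.read(chunk)
        if not raw:
            break
        out += d.decompress(raw)
    return out[:n_bytes]

if __name__ == "__main__":
    # Placeholder URL: point this at any public *.gz object.
    url = "https://<bucket>.s3.amazonaws.com/<key>.vcf.gz"
    with urlopen(Request(url)) as resp:
        print(gz_head(resp, 2048).decode("utf-8", "replace"))
```

Combined with a ranged GET, this is usually enough for the “is this dataset even what I think it is?” stage of exploratory work.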

Curious:

  • how are people currently handling large public datasets?
  • are you mostly downloading locally, or working directly in the cloud?
  • any workflows you’ve found that reduce friction/cost?

Happy to share more about what I’ve been building if useful — mainly just trying to learn from how others are approaching this.

u/Acceptable-Ad-2904 — 14 hours ago