r/dataengineering

Dagster vs Airflow 3: which to pick?

Hey guys, I manage tech for a startup, and I have not used an orchestrator before, just cron mostly. As we are scaling, I want to make things more reliable. Which orchestrator should I pick? It will be batch jobs that run at different intervals, do some ETL, refresh data, etc. Since everything ran in cron, the dependency logic was all handled in the code itself.

Also, both eat an equal amount of resources, right? I hear Airflow is RAM heavy, but I'm not sure if that's entirely true. Let me know what you guys think. Thanks.
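
A minimal sketch of what "dependency logic handled in the code" boils down to, which is the part an orchestrator takes over. It uses only the Python standard library (`graphlib`, Python 3.9+), not Dagster or Airflow, and the task names (`extract`, `transform`, `refresh_report`) are hypothetical stand-ins for real jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical batch steps; real ones would call your ETL code.
def extract():
    return "extracted"

def transform():
    return "transformed"

def refresh_report():
    return "refreshed"

TASKS = {"extract": extract, "transform": transform, "refresh_report": refresh_report}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDS_ON = {
    "extract": set(),
    "transform": {"extract"},
    "refresh_report": {"transform"},
}

def run_all(tasks=TASKS, depends_on=DEPENDS_ON):
    """Run every task once, upstream tasks first, and return the run order."""
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        tasks[name]()  # an orchestrator would wrap this call with retries and logging
    return order
```

With either tool you would declare roughly the same graph plus a schedule, and the tool would take care of retries, run history, and a UI instead of a hand-rolled loop like `run_all`.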

reddit.com
u/Consistent_Tutor_597 — 3 hours ago
🔥 Hot ▲ 184 r/dataengineering

How I landed a $392k offer at FAANG after getting laid off from LinkedIn

I wrote a post here a couple years ago about landing a $287k offer at FAANG+. A lot has happened since then, and I wanted to share my wins (and losses) for anyone going through it right now.

I got laid off from LinkedIn. No warning, no performance issue. Just a mass shitcanning. I had relocated across the country for that job. So that was fun.

I gave myself a week to feel sorry for myself (and move BACK across the country), then got back to grinding. I applied broadly and tried to be strategic about it. Over the course of about two months, I did somewhere around 20 interviews. Some went well. Some went laughably poorly.

Netflix rejected me after the first half of the onsite. That hurt. I had spent a lot of time preparing specifically for their Spark round, and I was dead in the first 5 minutes. Something about executor retry behavior.

I made it deep into loops at FAANG, OpenAI, and Airbnb. All three came back with offers:

- FAANG: E5, 392k ($230k base + $150k stock/yr + $12.5k signing ($50k amortized))

- OpenAI: 290k - the leveling and equity structure made it less competitive than it looked on paper

- Airbnb: 320k - competitive offer, great team, but the TC gap was significant (layoff hurt)

I almost got downleveled at FAANG. The initial signal from my system design round came back mixed, and my recruiter told me the hiring committee was debating E4 vs E5. I asked my recruiter if I could strengthen the E5 case, and ended up in a follow-up data modeling round. Four days later they came back at E5.

If I had to distill the biggest difference between interviewing at this level vs. where I was a few years ago: behavioral/architecture matters so much more. At E5, they pushed hard on ambiguity, tradeoffs, and how I influenced decisions when I didn't have authority. I leaned heavily into real examples from LI where I had to untangle bad architecture with unhelpful information.

Getting laid off was humbling. Moving across the country for a job and then losing it was humbling. Getting rejected by Netflix was depressing. Almost getting downleveled was scary. But I kept blanketing resumes, grinding questions, diving deeper than anyone should ever have to into Spark executors, and it all worked out in the end.

Now I'm strapped in and ready for the next round of layoffs (it never ends)

u/Flat_Shower — 20 hours ago

Are Apache Spark skills absolutely essential to crack a data engineering role?

I have experience working with technologies such as Apache Airflow, BigQuery, SQL, and Python, which I believe are more aligned with data pipeline development rather than core data engineering. I am currently preparing to transition into a core data engineering role. As a Lead Software Developer, I would appreciate your guidance on the key topics and areas I should focus on to successfully crack interviews for such positions.

u/Far-Journalist-821 — 4 hours ago
🔥 Hot ▲ 70 r/dataengineering

How to remove duplicates from a very large txt file (200GB+)

Hi everyone,

I want to know the best tool or app for removing duplicates from a huge data file (200GB+) as fast as possible and without hanging the laptop (i.e. without using much memory).
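
One way to approach this with bounded memory is a two-pass, hash-partitioned dedupe, sketched below. The assumptions here are not from the post: one record per line, duplicates are exact line matches, output order does not matter, and there is spare disk roughly the size of the input. The function name `dedupe_large_file` and the `num_buckets` parameter are made up for illustration.

```python
import os
import tempfile
import zlib

def dedupe_large_file(src_path, dst_path, num_buckets=128):
    """Write the unique lines of src_path to dst_path using bounded memory.

    Pass 1 scatters lines into bucket files by hash, so every copy of a line
    lands in the same bucket. Pass 2 dedupes one bucket at a time with an
    in-memory set. Output order is NOT the original order.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}") for i in range(num_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            with open(src_path, "rb") as src:
                for line in src:
                    line = line.rstrip(b"\r\n")  # normalize line endings
                    buckets[zlib.crc32(line) % num_buckets].write(line + b"\n")
        finally:
            for bucket in buckets:
                bucket.close()

        with open(dst_path, "wb") as dst:
            for p in paths:
                seen = set()  # only one bucket's unique lines are held at once
                with open(p, "rb") as bucket:
                    for line in bucket:
                        if line not in seen:
                            seen.add(line)
                            dst.write(line)
```

`num_buckets` needs to be large enough that one bucket's unique lines fit in RAM, and small enough to stay under the OS limit on open files. If sorted output is acceptable, GNU `sort -u` is another option, since it spills to temporary files instead of holding everything in memory.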

u/Head_Capital_7772 — 23 hours ago

Why is everything in Java & Scala?

I have been wondering why most tools & services for DE are in Java & Scala. Why not C/C++, Go, or Rust? I hate Java, but I will have to learn it now as it's in my curriculum. Just trying to find some motivation lol

u/gorovaa — 9 hours ago

Thinking about starting a YT channel

I've been working in the data field for four years now, touching a little bit of everything in the Microsoft stack (Azure, Fabric, Power BI, Databricks...). I'm thinking about starting a YT channel talking about a little bit of everything in the field: tutorials, news...

It would help me to learn new things and strengthen my current knowledge. Also, it would force me to speak English, which is not my first language. However, I'm not sure what kind of content can be more interesting to the audience.

What are your thoughts? What kind of videos would you find useful / interesting? Do you think it would have an audience, or would it be something just my friends would watch?

u/Affectionate-Tie1005 — 2 hours ago

I analyzed 1,800+ DE technical screens from companies like Netflix and Uber. Here are the 6 patterns that actually matter in 2026.

I’ve spent the last few months aggregating technical screening data from about 97 different companies. After looking at 1,800+ specific technical "evaluations," I noticed that despite the hype around new tools, the "failure points" for most candidates fall into 6 specific buckets.

I’ve compiled these into a deep-dive guide, but I wanted to share the high-level technical trends I'm seeing:

Spark Shuffles are the new LeetCode: Almost every top-tier firm is moving away from generic DSA and toward deep-dive internals. If you can't explain the cost of a wide transformation, it's an immediate red flag.

The "Hybrid" SQL/Python script: Many companies are now testing if you can move logic between engines efficiently rather than just writing a query.

System Design is getting "Cloud Native": It’s no longer just "design a URL shortener." It’s "how do you handle late-arriving data in a multi-region S3 setup?"
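
To make "the cost of a wide transformation" concrete, here is a toy model in plain Python, not Spark. It counts how many records have to leave their current partition when grouping by key, with and without pre-aggregating inside each partition first (the idea behind Spark's `reduceByKey` compared with `groupByKey`). The function names and the use of `crc32` as the partitioner are assumptions made for this sketch.

```python
import zlib

def target_partition(key, num_partitions):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def records_shuffled(partitions, combine_first):
    """Count records that must move to another partition to group by key.

    partitions: list of lists of (key, value) pairs, one list per partition.
    combine_first: if True, pre-aggregate per key inside each partition
    (map-side combine), so each partition sends one record per distinct key.
    """
    n = len(partitions)
    moved = 0
    for idx, records in enumerate(partitions):
        keys = [key for key, _ in records]
        outgoing = set(keys) if combine_first else keys
        moved += sum(1 for key in outgoing if target_partition(key, n) != idx)
    return moved
```

With two partitions that each hold 100 records for the same key, grouping without a combine step moves 100 records across partitions, while combining first moves a single pre-aggregated record.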

I put the full breakdown of these 74 core technical concepts (with code and theory) into a book on Kindle here: amazon.com/dp/B0G6ZCP986

I’m around all day - if anyone wants to know what specific technical patterns I saw for certain sectors (FinTech vs. Big Tech), ask away.

u/codeshukla — 5 hours ago

How do you safely share production data with dev/QA teams?

I’ve been running into this problem where I need to share production CSV data with dev/QA teams, but obviously can’t expose PII.

So far I’ve tried:

  • manually masking columns
  • writing small scripts

But it’s still a bit tedious and error-prone, especially when relationships between fields need to be preserved.

Curious how others are handling this in real workflows?

Are you using internal tools, scripts, or something else?
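
One common scripted approach is deterministic pseudonymization: replace each PII value with a keyed hash, so the same input always maps to the same token and joins between fields or files still line up. The sketch below is a minimal version using only the standard library. The column names, the `mask_csv` and `pseudonymize` helpers, and the 16-character token length are all assumptions, and whether keyed hashing is enough for a given compliance regime is a separate question.

```python
import csv
import hashlib
import hmac
import io

def pseudonymize(value, secret_key):
    """Return a deterministic keyed token: equal inputs give equal tokens."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; keep more characters if collisions matter

def mask_csv(src, dst, pii_columns, secret_key):
    """Copy CSV rows from src to dst, replacing values in pii_columns with tokens."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in pii_columns:
            if row.get(col):  # leave empty cells empty
                row[col] = pseudonymize(row[col], secret_key)
        writer.writerow(row)

# Example with in-memory data (hypothetical columns).
sample = io.StringIO("id,email,plan\n1,a@x.com,free\n2,b@x.com,pro\n3,a@x.com,pro\n")
masked = io.StringIO()
mask_csv(sample, masked, ["email"], b"keep-this-key-out-of-dev")
```

Rows 1 and 3 end up with the same token in `email`, so relationships survive, while the key stays with whoever produces the export.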

u/Lower-Candle3471 — 17 hours ago

Salary - Data Engineering Manager in Paris

I’m looking for a relocation to France (Paris area) and I’m applying for Data Engineering Manager positions. I’ve had a couple of interviews already, but I’m wondering about the salary range.

I’m asking for around €85,000 to €90,000 gross. A few questions, if you guys could help me out, please:

- Looking online this seems to be an accurate average, but I’m wondering if it’s too far off. Should I be asking more or less?

- I’d be going with my spouse, who would not be working for a while (possibly a few years). Would that salary be good for a couple living comfortably in the suburbs of Paris?

Thank you so much!

u/Neat_Armadillo_70 — 22 hours ago

Keep fact tables at grain or pre-aggregate before the BI layer?

When you create your star schema, do you typically aggregate the data beforehand, or do you keep the fact table at the defined grain and let the BI tool handle aggregation? The general consensus seems to be to aggregate at the BI level, but with tools like dbt, is it more common to aggregate before the data reaches the BI tool?
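
The two options are not mutually exclusive. A small sketch with SQLite (standard library only) keeps the fact table at its grain and derives a pre-aggregated object from it, which is roughly what a downstream dbt model would do. The table, view, and column names are hypothetical.

```python
import sqlite3

def build_demo():
    """Create a grain-level fact table plus a derived daily aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Grain: one row per order line.
        CREATE TABLE fact_sales (
            order_line_id INTEGER PRIMARY KEY,
            order_date    TEXT,
            product_id    INTEGER,
            quantity      INTEGER,
            amount        REAL
        );
        INSERT INTO fact_sales VALUES
            (1, '2024-01-01', 10, 2, 20.0),
            (2, '2024-01-01', 10, 1, 10.0),
            (3, '2024-01-01', 11, 1, 5.0),
            (4, '2024-01-02', 10, 3, 30.0);

        -- Derived aggregate: one row per day and product.
        CREATE VIEW agg_daily_product_sales AS
            SELECT order_date, product_id,
                   SUM(quantity) AS quantity, SUM(amount) AS amount
            FROM fact_sales
            GROUP BY order_date, product_id;
    """)
    return conn

conn = build_demo()
daily = conn.execute(
    "SELECT * FROM agg_daily_product_sales ORDER BY order_date, product_id"
).fetchall()
```

Because the aggregate is built from the grain-level table, the BI tool can read whichever one fits the report, and the detail is still there when someone needs to drill down.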

u/Sheeesh7102 — 21 hours ago

Data engineering and AI in orgs - how did you start?

Hi all

So I am a data engineer in a Fortune 50 company. Our company and org have had a pretty big push into the AI landscape, and our team is trying to come up with solutions that would be meaningful and provide actual business value.

Currently, as at many other companies, our leadership is simply saying ‘Use AI, create something’, etc., without any direction on what to do.

I would like to hear from the fellow data engineers here: how did you and/or your team come up with an AI solution?

Was it a top-down request or did the engineers find a friction point in the data?

How did you narrow down the pain point which you figured could use AI implementation?

Feels like a lot of things are possible, but scaling them and delivering actual business value is always challenging.

Please share your thoughts!

u/PizzaLad16 — 8 hours ago

Better models for Audio than Whisper?

I have been handed a data pipeline side-quest: I need to create a reliable pipeline that transcribes short (<10min) audio .m4a files.
I work with structured data, and audio processing with async queue-based processing is new to me.
The team who sandboxed this worked on Whisper, but it's pretty resource hungry and I am looking for something of similar quality, hopefully faster, that we can host ourselves.
The pipeline is not time sensitive: it runs daily and is used for summarization of customer issues. ~100 to 200 audio files a day.
AI is suggesting exploring:

  • faster-whisper
  • whisper.cpp
  • WhisperX
  • Insanely Fast Whisper

Any advice on which model might be best would be welcome. No budget for external APIs sadly. We run on AWS EKS. I looked at Amazon Transcribe but at first glance, it does not support .m4a
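
Whichever model ends up being used, the queue-based part of the pipeline can stay small. Below is a minimal sketch with `asyncio` (Python 3.9+): a fixed pool of workers pulls file paths from a queue and runs a blocking transcription call in a thread. `transcribe_file` is a hypothetical placeholder, not the API of Whisper or any of the libraries listed above.

```python
import asyncio

def transcribe_file(path):
    """Hypothetical stand-in for the real, blocking model call."""
    return f"transcript of {path}"

async def worker(queue, results):
    while True:
        path = await queue.get()
        try:
            # Run the blocking call in a thread so the event loop stays responsive.
            results[path] = await asyncio.to_thread(transcribe_file, path)
        except Exception as exc:
            results[path] = exc  # record the failure and keep going
        finally:
            queue.task_done()

async def run_batch(paths, num_workers=2):
    """Transcribe all paths with a fixed number of concurrent workers."""
    queue = asyncio.Queue()
    for path in paths:
        queue.put_nowait(path)
    results = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(num_workers)]
    await queue.join()  # wait until every queued file has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results

results = asyncio.run(run_batch(["a.m4a", "b.m4a", "c.m4a"]))
```

At ~100 to 200 short files a day, `num_workers` is mostly a cap on how many model calls run at once, so it can be set to whatever the pod's CPU or GPU can handle.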

u/grahamdietz — 15 hours ago
▲ 5 r/dataengineering+2 crossposts

S3 Table buckets daily and monthly backups for compliance reasons

hello everyone,

Are there any alternatives for backing up S3 Table buckets (a proper, watertight backup), not just S3 replication?

We require daily and monthly backups (like a cron job that runs at a specific time), but AWS Backup doesn't support it.

u/Kindly-Store4318 — 2 days ago

Best free visual data modeling tool

Hey guys. What is the best free tool for visual data modeling? I know I can use Power BI, but I don’t use it very often, so I don’t want to open it just for this and do the rest of my job with other tools. Is there any other good option that is free? Preferably not one that is free but has very limited features. Thanks

u/random-soul-feeder — 14 hours ago

Looking for people in NY to build with

Hi Everyone!

I am hoping to connect and build with people passionate about agent engineering.

I have built an open-source text-to-SQL framework, and plan to build an agent observability platform on top of it.

If you are technical, interested in agent observability/engineering, and based in NY, please DM me!

u/Ok-Freedom3695 — 15 hours ago