r/ProxyEngineering


Walmart scrapers in production

Heyo, story time: I've spent the last year running Walmart scrapers in production. Headless browsers (Playwright specifically) are almost always recommended over plain "requests" + BeautifulSoup for JS-heavy sites like Walmart, and that's true, but "use a headless browser" isn't the whole story. Here's what I learned that actually works in practice.

You may ask, why depend on a headless browser at all? Walmart's product pages are JavaScript-rendered. A raw HTTP request returns an HTML shell; prices, titles, and availability are injected by JS after load, so BeautifulSoup never sees that data. A headless browser runs a Chromium engine, executes the JS, and lets you query the fully rendered DOM. That part works well.

Even with a headless browser, you'll hit blocks. It's not the holy grail some people on Reddit make it out to be. Walmart fingerprints more than just your IP: browser canvas signatures, WebGL data, timing patterns, and TLS handshake characteristics are all signals. Vanilla Playwright out of the box is detectable. You need "playwright-stealth" or equivalent patches to mask the most obvious headless tells.
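A minimal sketch of what patching those tells looks like, assuming Playwright and the playwright-stealth package are installed (the function name and the `h1` selector here are mine, purely illustrative):

```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # pip install playwright-stealth

def fetch_rendered_html(url: str) -> str:
    """Load a JS-rendered page with the obvious headless tells patched."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        stealth_sync(page)  # patches navigator.webdriver, plugins, etc.
        page.goto(url, wait_until="domcontentloaded")
        # Don't assume the data is there yet -- wait with a timeout.
        page.wait_for_selector("h1", timeout=10_000)
        html = page.content()
        browser.close()
        return html
```

This only masks the browser-side signals; IP reputation and TLS fingerprinting are separate problems.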

Walmart A/B tests constantly. The "<h1>" for the product title and "<span itemprop="price">" for pricing, the selectors everyone uses, can and do shift. A scraper that worked Monday can silently return empty strings by Wednesday. You need selector fallbacks and output validation, not just "element.inner_text()".

As for resources: each Chromium instance eats ~150–300MB of RAM, and if you're running concurrent scrapers, that adds up fast. For small datasets it's fine; at scale, you either need careful concurrency limits or a distributed setup.

Rotating proxies help with IP bans but don't solve fingerprinting. Worse, misconfigured proxies inside a browser context can cause silent failures: the request goes through but returns a CAPTCHA page that your parser doesn't catch. Always validate that your response actually contains product data before storing it.

Honest suggestions, people:

- ALWAYS USE "playwright-stealth" to patch headless fingerprints

- Add "wait_for_selector()" with a timeout before extracting, don't assume the element is there

- Build in retry logic with exponential backoff on failures

- VALIDATE YOUR OUTPUT: if price is empty string, treat it as a failed scrape and retry

- Rotate User-Agents per session, not per request

- Use residential proxies, not datacenter. Walmart's filters are tuned to spot datacenter ranges. (I started out running datacenter backed by residential proxies, but ditched datacenter after a while.)
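The retry-with-backoff and validate-before-storing points above can be combined into one small loop. This is a sketch under my own naming; `scrape_once` is whatever function does a single browser fetch:

```python
import random
import time

def scrape_with_retry(scrape_once, max_attempts=4, base_delay=1.0):
    """Retry a scrape with exponential backoff plus jitter.

    scrape_once() should return a dict of fields, or something falsy
    on an outright failure. An empty price counts as a failure too.
    """
    for attempt in range(max_attempts):
        result = scrape_once()
        if result and result.get("price"):
            return result
        if attempt < max_attempts - 1:
            # 1s, 2s, 4s, ... plus jitter so concurrent workers desync
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
    return None  # caller decides whether to queue for later
```

Treating an empty price as a failure (rather than storing it) is what catches the silent CAPTCHA-page case.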

Headless browsers are the right tool for Walmart, but they're not the reliability silver bullet some of you make them out to be. With a well-tuned setup, ~85–90% success rate was the most I got, dropping toward 60–70% if you skip stealth patches and output validation. The remaining failures are mostly CAPTCHAs and transient blocks that retries will catch. For anything production-scale, budget time for maintenance: Walmart's defenses update, and your selectors will break. That's just the reality of scraping a site this sophisticated.

u/kamililbird — 2 days ago

Headless browsers are destroying the open web and I'm tired of pretending they're not

u/dozerjones — 8 days ago

Amidst talks of mobile proxies being obsolete

What's up y'all. I'll probably end up at the bottom of the board here, or against everyone's opinion, but here goes. Why mobile proxies might be worth the extra cost:

I've been using mobile proxies for about 3 years now and wanted to share some thoughts, since I've noticed a trend lately of "mobile proxies aren't worth it, residentials will replace them," etc. The short answer: it depends on the situation. Sometimes yes, sometimes absolutely not. Mobile IPs make a real difference when you're dealing with platforms that are paranoid about bots: Instagram, TikTok, Snapchat, WhatsApp, you name it. These apps were built for phones, so when you come at them with a residential IP and a desktop user agent, you already look suspicious.

I was burning through residential proxies trying to manage multiple social accounts until I switched to mobile. The difference was night and day. Bans dropped significantly because the traffic pattern actually makes sense: mobile IP + mobile device fingerprint = platform is happy. Mind you, I didn't know much about fingerprinting at the time; that came up way later, though I should have researched it before jumping on the proxies. Also, mobile IPs get shared among tons of real users on the same carrier, so even if one gets flagged, it's usually temporary. Carriers use CGNAT, meaning hundreds of people might share the same IP, so platforms can't just blacklist an entire carrier's IP range.

Why I think scraping with mobile proxies is a waste of money: if you're just scraping public data from sites that don't have aggressive bot detection, you're overpaying. A decent residential or datacenter proxy will do fine for most web scraping tasks. (I converted to this a while back, combining both resis and DC to scrape, and it hasn't let me down.)

My current setup: mobile proxies for managing social media stuff, residential for everything else; it saves money this way. For scraping, residential and datacenter proxies combined. Yes, I'm aware there are plenty of one-stop solutions out there (web scraper APIs, web unblockers, hosted headless browsers, whatever), but my setup works for me and I'm running it until it no longer works or becomes obsolete (as some of you are already pitching for mobile proxies lol).
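That routing logic is simple enough to sketch. Pool names and endpoints below are placeholders I made up, not real providers:

```python
# Hypothetical sketch of routing jobs to proxy pools by task type,
# so you only pay mobile rates where mobile IPs actually matter.
MOBILE_POOL = ["mobile-1.example:8000"]        # placeholder endpoints
RESIDENTIAL_POOL = ["resi-1.example:8000"]
DATACENTER_POOL = ["dc-1.example:8000"]

def pick_pool(task: str) -> list:
    """Social-account work gets mobile IPs; scraping mixes resi + DC."""
    if task == "social":
        return MOBILE_POOL
    if task == "scrape":
        return RESIDENTIAL_POOL + DATACENTER_POOL
    return RESIDENTIAL_POOL  # everything else falls back to residential
```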

u/itsamaan26 — 9 days ago