u/crazy596

I'm getting too old for this ....

Maybe just a little rant. I've already begun the search and it's a matter of finding the right job. I'm one of dem dere "Fancy AI doods" and I built a team over the last 5 years that started from "me" and we're at 5 data scientists, total team of 10. Boy, for the first few years it was a dream. Boss leaves me alone, IT ignores me, so I have free rein on my servers and I am just killin it.

But I missed the clues in the meantime--new program manager came in--very aggressive, and maneuvered her way in to replace my boss when they were promoted. Now we "co-lead" but officially I report to her. Strike one. Not huge, but ok, that's not the reward I was looking for.
Clue #2: My (old) boss is/was too busy to meet with me. Because I am doing all the AI/development, she runs around, says "Build me X," and I go do it, but I never get insight into what we need next for the business. I never learn their job or what it takes to get to their level. Lots of glowing reviews. Big raises, but now I basically have the same skills I had 5 years ago and no mentor.

Clue #3: IT suddenly discovered we have money. We ordered a big fat GPU/Deep Learning Machine and IT "helped" us install it by hijacking it, putting it in a cluster they manage and restrict access to, which breaks most of our (money generating) workflows. I'd been doing this solo for 4 years before they came along, but they still treat me like a chump, including stating things like "Well this isn't just a desktop computer you order" as if the previous machines were purchased from Office Depot or Best Buy.

Clue #4: This computer install has totally blown up. VPs are fighting blah blah blah. I try and work a solution: "I'll go test X, Y, and Z." Rather than accusing everyone, we'll have data that anyone can duplicate. Boss says "They'll just accuse you of sabotaging the tests."

Clue #5: My therapist offered to put me on short term disability due to stress and said GET A NEW JOB ALREADY YOU FOOL! You're underpaid, a freakin unicorn, and these guys treat you like dirt!

Clue #6: Three big shots all had great presentations at a recent conference on "their" work. It's actually my work. Boss would come to me and say "We need an AI model that can measure X" and 4 weeks later a deployed software package with a fancy trained AI model shows up courtesy of me. I didn't even get a thank you in the presentation. I had a couple of very cool research ideas that I pitched, researched, and developed the models for, and boss took first author.

SOB I am an idiot.

Apologies for mild rule violation--I do sing to my cats during the day, so there is that...

reddit.com
u/crazy596 — 22 hours ago

Requirements for GPU hosting

I am sort of out of my depth and have lost faith in my IT team. I need some guidance on what server rooms can handle.

Let me try and get everything out. I lead an AI team that is in a small division of a much larger company. When I first started, IT wanted nothing to do with my Linux GPU "Servers" as they didn't know Linux and also insisted it wasn't a "server," it was just a machine (not here to argue that, tryin to get people to understand what I have been working with--call it Macaroni for all I care). Our first device was an oversized tower with a GPU. Not much, but got us rolling.

We've grown and got a second GPU "machine." Modest, but now we have dual GPUs and a rack mounted device (2U or 4U form factor), so rather than sitting in someone's office, this is more of a dedicated rack "Server" and sits in a dedicated server room that I am not allowed access to. Not a huge deal, I have iDRAC and all that so I don't need physical access. When "IT" was installing it, they couldn't get it to stay up for more than 24 hours. They recommended we send it back, eat the money loss, and buy their recommended Windows machine. I worked with the vendor, found a config error, and it's been rolling for 2 years straight with no hiccups for about 6 users.

Now we are going big time: an 8xH200 machine. Company gets word and offers to "help" us order. Now our machine is suddenly placed in the "Company" server room because our server room can't handle it. And wouldn't you know, in their server room we have to use their access and their methods (SLURM/restricted access from THEIR drives, not ours--we are all locked down like Fort Knox).

So the gist of this long rambling story is: What is required for these big GPU machines? Is it a valid reason that some "rooms" can't handle them? I used to work in big semiconductor fabs, so I have just enough EE experience to be dangerous, but I can't for the life of me figure out why we have to go to a different server room unless ours is already dangerously overloaded or at capacity, in which case, that's a whole OTHER problem. Can anyone provide some guidance? I'm pushing back pretty hard, and this is basically a $300k machine commandeered by another division. Needless to say people ARE NOT happy and I need to work from the facts. **@#(%^@#)%* Office politics. The reason I've lost trust with the IT team is that they can't even give me a straight answer on connections (up/down), let alone install requirements. It's a FUBAR x 11.

My Machines:
1st: 16GB RTX 5000 (from memory right now), Xeon tower (950W)

2nd: 2 x 24GB RTX A5000 (again from memory), dual Xeon, 2U form factor, dual non-redundant power supply, 1600W @ 220V

New: 8 x H200, PSU as Octo, Fully Redundant (4+4), Hot-Plug MHS PSU, 3200W MM HLAC (ONLY FOR 230-240 Vac)

Am I being sold a bill of goods here, or what is the problem? It looks like the connections for my current and new machines would be the same (both 220 VAC--or is there something new with the 230/240? My understanding was that it's the same thing).
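
For a rough sense of scale, here is a minimal sketch of the nameplate arithmetic for the 2U box versus the new one. It rests on assumptions that are not confirmed anywhere above: that the 1600W figure is the total for the 2U machine, that "4+4" means four 3200W supplies carry the load while four stand by, and that the new machine is fed at 230 V. Nameplate ratings are an upper bound, so real draw should be lower, and the function name `nameplate_load` is just made up for this example.

```python
# Worst-case nameplate comparison. These are NOT measured draws.
BTU_PER_HR_PER_WATT = 3.412  # 1 W of continuous load is about 3.412 BTU/hr of heat

def nameplate_load(psu_watts, active_psus, volts):
    """Return (watts, amps, btu_per_hr) for the PSUs assumed to carry the load."""
    watts = psu_watts * active_psus
    amps = watts / volts
    btu_per_hr = watts * BTU_PER_HR_PER_WATT
    return watts, amps, btu_per_hr

# (PSU rating in W, PSUs carrying load, supply voltage); all values are assumptions
machines = {
    "2U dual A5000 (1600 W total assumed, 220 V)": (1600, 1, 220),
    "8 x H200 (4 active x 3200 W assumed, 230 V)": (3200, 4, 230),
}

for name, (psu_watts, active, volts) in machines.items():
    watts, amps, btu = nameplate_load(psu_watts, active, volts)
    print(f"{name}: {watts} W, {amps:.1f} A, {btu:.0f} BTU/hr")
```

If those assumptions hold, the new machine's ceiling is 12,800 W against 1,600 W, so the question for the room is less about the voltage and more about whether its circuits and cooling have that much headroom. The vendor spec sheet and whoever runs facilities would have the real numbers.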

I stand by to be educated.

u/crazy596 — 7 days ago