10 years doing FPGAs and a grey code CDC gotcha got me today
So I've been working on this design for the past six or seven years, and almost 10 years total in FPGAs, and somehow I still ran into something today that I'd never hit before. Figured it's kind of funny in hindsight and worth posting as a case study.
Our design has a lot of clock domains and we're constantly passing signals across them, single bit, multi bit, counters, the whole mix. We try to do things the right way: flip flops where appropriate, double flip flops for single bit synchronizers, handshakes for multi bit, and grey code for counters. Standard stuff.
Today something was glitching out and it was screaming "CDC problem" at me. We had a counter, we were converting it to grey code, and passing it across the domain. On paper, all good.
What I was missing: this counter wasn't free running, counting up to its maximum and rolling over. It was getting reset back to zero somewhere in the middle of its range.
And that's the gotcha. Grey code only works for CDC because consecutive values differ by exactly one bit, so if the receiving domain samples during a transition it either latches the old value or the new value, never garbage. That assumption only holds when the counter moves by exactly one step at a time. If you're sitting at 5 (0111 in grey) and you reset to 0 (0000), you've got three bits changing at the "same" time, and the receiver can latch any mix of old and new bits, including codes for counts the counter never actually held.
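To make that concrete, here's a rough Python sketch (not our RTL; the function names are made up for illustration). It assumes the worst case, where each changing bit can independently show up as old or new at the receiver, and lists every grey value that could get latched during a transition.

```python
from itertools import combinations

def bin_to_gray(n):
    """Standard reflected binary (grey) encoding."""
    return n ^ (n >> 1)

def gray_to_bin(g):
    """Inverse of bin_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def possible_samples(old, new):
    """Every value the receiver could latch mid-transition, assuming each
    changing bit independently shows either its old or its new level."""
    diff = old ^ new
    changing = [b for b in range(diff.bit_length()) if (diff >> b) & 1]
    results = set()
    for k in range(len(changing) + 1):
        for subset in combinations(changing, k):
            value = old
            for b in subset:
                value ^= 1 << b  # this bit has already flipped
            results.add(value)
    return results

# A normal increment, 5 -> 6: one bit changes, so only old or new is possible.
step = possible_samples(bin_to_gray(5), bin_to_gray(6))

# A reset, 5 -> 0: three bits change, so any mix of them is possible.
reset = possible_samples(bin_to_gray(5), bin_to_gray(0))
```

Decoding with `gray_to_bin`, `step` only ever gives 5 or 6, while `reset` covers every count from 0 to 7, including 6 and 7, which the counter never reached.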
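Grey code on a counter is only safe if every change moves the count by exactly one step, and any rollover happens at a full power of two so the wrap is also a single bit change. A counter that wraps at 10 instead of 16 has the same problem as one that gets reset. The moment something can yank it back to zero, or jump it by more than one step, you've broken the invariant that makes grey code work, and you're right back to the multi-bit CDC problem you thought you'd solved.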
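If you want to sanity check a counter's behaviour against that invariant, something like the following would do it. This is just a sketch in Python with a hypothetical helper name, assuming you can dump the sequence of values the counter takes (from a sim trace, say). It flags every transition where the grey encodings differ in more than one bit.

```python
def bin_to_gray(n):
    """Standard reflected binary (grey) encoding."""
    return n ^ (n >> 1)

def unsafe_transitions(counts):
    """Return the (prev, next) pairs whose grey codes differ in more than
    one bit, i.e. the transitions that are not safe to sample across a CDC."""
    bad = []
    for prev, nxt in zip(counts, counts[1:]):
        flipped = bin(bin_to_gray(prev) ^ bin_to_gray(nxt)).count("1")
        if flipped > 1:
            bad.append((prev, nxt))
    return bad

# Free running 4-bit counter, including the 15 -> 0 wrap: nothing flagged.
free_running = unsafe_transitions(list(range(16)) + [0])

# Counter that gets reset at 5: the 5 -> 0 transition is flagged.
mid_reset = unsafe_transitions([3, 4, 5, 0, 1])

# Counter that wraps at 10 instead of 16: the 9 -> 0 transition is flagged.
mod_ten = unsafe_transitions([8, 9, 0])
```

Holding the value or stepping down by one passes the check too, since both change at most one bit.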
The fix in my case was just adding a handshake on the CDC after the grey coding, which works. Though honestly, at that point the grey code isn't really doing anything for you anymore; the handshake is what's making it safe. I left it in because it's not hurting anything.
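For anyone who hasn't built one, here's a toy cycle-level model of why a handshake makes this safe. It's a Python sketch of a generic two-phase (toggle) req/ack handshake, not my actual implementation, and it makes some loud simplifications: both domains step in lockstep, the bus is 4 bits wide, and "skew" is modelled by letting each bit randomly show old or new data during the cycle the bus changes. All names are made up.

```python
import random

WIDTH = 4  # bus width in bits (assumed for this sketch)

def skewed_view(old, new, rng):
    """Bus as the receiver might see it: each bit independently shows old or new.
    When old == new the bus is stable and this simply returns that value."""
    value = 0
    for b in range(WIDTH):
        source = new if rng.random() < 0.5 else old
        value |= source & (1 << b)
    return value

def transfer(values, seed=0):
    """Send each value across with a toggle handshake; return what was captured."""
    rng = random.Random(seed)
    pending = list(values)
    data = prev_data = 0   # source data register, and its value one cycle earlier
    req = ack = 0          # source toggles req; destination toggles ack
    req_sync = [0, 0]      # two-flop synchronizer for req in the destination
    ack_sync = [0, 0]      # two-flop synchronizer for ack in the source
    captured = []
    for _ in range(10 * len(values) + 10):
        bus = skewed_view(prev_data, data, rng)

        # Destination: capture only when the synchronized req has toggled.
        next_ack = ack
        if req_sync[1] != ack:
            captured.append(bus)
            next_ack = req_sync[1]

        # Source: change data only once the previous transfer was acknowledged.
        next_data, next_req = data, req
        if ack_sync[1] == req and pending:
            next_data = pending.pop(0)
            next_req = req ^ 1

        # Clock edge: commit all registers together.
        req_sync = [req, req_sync[0]]
        ack_sync = [ack, ack_sync[0]]
        prev_data, data, req, ack = data, next_data, next_req, next_ack
    return captured
```

The point is that the data bus is held still for the whole time it takes req to get through the synchronizer, so by the time the destination looks at it, no bits are in flight. That holds no matter how many bits changed, which is why the encoding stops mattering.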
Moral of the story is more the gotcha than the fix. Anyone else been burned by this one?