I'm thinking about building a tool to discover backronyms for initialisms, like "Married But Available" for MBA. Because the search space of word combinations grows as V^n, where V is the vocabulary size and n is the number of letters in the initialism, finding funny sequences is a challenge.
I've mapped out a workflow:
1. Seeding. Extract over 10,000 English initialisms from Wiktionary.
2. Filtering. Use a recognizability dataset to reduce the list to a subset that most people would know.
3. Mining. Match these seeds against the Google Ngram dataset for 2- to 5-gram sequences.
4. Ranking. Categorize the resulting phrases by their initialism and sort them by frequency, capping the count per bucket to keep the volume manageable.
5. Judging. Use a large language model as a judge to scan the lists for funny expansions.
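To make the Mining and Ranking steps concrete, here is a minimal sketch of what I have in mind. The function names, the `(phrase, count)` input shape, and the toy counts are all placeholders I made up for illustration; the real Google Ngram files would need their own parsing and case handling.

```python
from collections import defaultdict
import heapq


def initialism(phrase):
    """Return the uppercase first letters of each whitespace-separated word."""
    return "".join(word[0].upper() for word in phrase.split())


def bucket_top_k(ngram_counts, seeds, k):
    """Group phrases by initialism and keep the k most frequent per seed.

    ngram_counts: iterable of (phrase, count) pairs (hypothetical input shape).
    seeds: set of initialisms that survived the Filtering step.
    """
    buckets = defaultdict(list)
    for phrase, count in ngram_counts:
        key = initialism(phrase)
        if key in seeds:
            buckets[key].append((count, phrase))
    # nlargest sorts by count first, so the cap keeps the most frequent phrases.
    return {
        key: [phrase for _, phrase in heapq.nlargest(k, items)]
        for key, items in buckets.items()
    }


# Toy counts, invented to mimic a million-fold frequency gap.
toy_counts = [
    ("may be a", 1_000_000),
    ("must be able", 500_000),
    ("married but available", 1),
    ("as soon as", 900_000),
]
top = bucket_top_k(toy_counts, seeds={"MBA"}, k=2)
print(top)
```

With a cap of 2, the MBA bucket in this toy example keeps only the two high-frequency phrases, so the judge never sees "married but available".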
My biggest concern with this approach is the frequency distribution. "Married But Available" does appear in the Google Ngram dataset, but it is roughly a million times rarer than a sequence like "May Be A". If the funny candidates sit that deep in the tail, the per-bucket cap might drop them before the model ever sees them.
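One mitigation I'm considering, purely as an untested idea, is to cap per frequency band instead of per bucket, so that some rare phrases always reach the judge. The sketch below bands phrases by the order of magnitude of their count; the function name, the banding rule, and the toy counts are my own assumptions and not an established method.

```python
from collections import defaultdict


def stratified_candidates(items, per_band):
    """Keep up to per_band phrases from each order-of-magnitude frequency band.

    items: list of (phrase, count) pairs for a single initialism, with integer counts.
    Returns phrases ordered from the most frequent band to the rarest.
    """
    bands = defaultdict(list)
    for phrase, count in items:
        if count > 0:
            # Number of digits minus one is the integer order of magnitude.
            magnitude = len(str(count)) - 1
            bands[magnitude].append((count, phrase))
    kept = []
    for magnitude in sorted(bands, reverse=True):
        top = sorted(bands[magnitude], reverse=True)[:per_band]
        kept.extend(phrase for _, phrase in top)
    return kept


# Toy counts for the MBA bucket, invented for illustration.
mba_items = [
    ("may be a", 1_000_000),
    ("must be able", 500_000),
    ("might be a", 400_000),
    ("married but available", 1),
]
sampled = stratified_candidates(mba_items, per_band=1)
print(sampled)
```

Here one phrase survives from each band, so the rare expansion stays in the list. The trade-off is that the tail of real n-gram data is mostly noise, so the judge's workload grows with the number of bands.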
Does any systematic solution or dataset for this problem already exist? Any other feedback is welcome.