u/8ta4 — reddlx

▲ 3 r/LanguageTechnology

I'm thinking about building a tool to discover backronyms for initialisms, like "Married But Available" for MBA. The search space for these word combinations grows as V^n, where V is the vocabulary size and n is the number of letters in the initialism, so finding the funny sequences is a challenge.

I've mapped out a workflow:

Seeding. Extract over 10,000 English initialisms from Wiktionary.
Filtering. Use a recognizability dataset to reduce the list to a subset that most people would know.
Mining. Match these seeds against the Google Ngram dataset for 2- to 5-gram sequences.
Ranking. Categorize the resulting phrases by their initialism and sort them by frequency, capping the count per bucket to keep the volume manageable.
Judging. Use a large language model as a judge to scan the lists for funny expansions.

My biggest concern with this approach is the frequency distribution. "Married But Available" does appear in the Google Ngram dataset, but it's roughly a million times rarer than a sequence like "May Be A". If the funny candidates are buried too deep in the tail, the per-bucket cap might drop them before the model sees them.
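To make the Mining and Ranking steps and the tail concern concrete, here is a minimal Python sketch. The function names (`initialism_of`, `rank_candidates`), the `(phrase, count)` input shape, and the toy counts are all assumptions for illustration; the real Google Ngram files have their own format and would need a loader.

```python
from collections import defaultdict
import heapq


def initialism_of(phrase: str) -> str:
    """Return the uppercase first letter of each whitespace-separated word."""
    return "".join(word[0].upper() for word in phrase.split())


def rank_candidates(ngram_counts, seeds, cap=3):
    """Bucket n-grams by initialism and keep the `cap` most frequent per bucket.

    ngram_counts: iterable of (phrase, count) pairs (hypothetical input shape).
    seeds: set of initialisms that survived the Filtering step.
    """
    buckets = defaultdict(list)
    for phrase, count in ngram_counts:
        key = initialism_of(phrase)
        if key in seeds:
            buckets[key].append((count, phrase))
    # nlargest sorts by count first, so rare phrases fall off the end of the list.
    return {
        key: [phrase for _count, phrase in heapq.nlargest(cap, items)]
        for key, items in buckets.items()
    }


# Toy counts, invented for illustration only.
toy_counts = [
    ("may be a", 1_000_000),
    ("must be able", 500),
    ("married but available", 1),
    ("as soon as", 900),
]
top = rank_candidates(toy_counts, {"MBA"}, cap=2)
```

With a cap of 2, the MBA bucket keeps "may be a" and "must be able", and "married but available" never reaches the judge, which is exactly the failure mode described above.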

Does any systematic solution or dataset for this problem already exist? Any other feedback is welcome.

reddit.com

u/8ta4 — 9 days ago

▲ 1 r/neovim

I'm looking for a way to filter a buffer in place while keeping the data editable. It's like filtering an Excel spreadsheet: you hide the noisy rows and continue to work on the data within the sheet itself.

Telescope doesn't work for this. I want to navigate and edit within the filtered results in the main window.

Here are the features I'm looking for:

Regex search: The input must support regular expressions.
Conceal-based masking: Non-matching lines are hidden using extmarks and the conceal attribute.
Async interruption logic: Typing triggers a background regex scan. The logic is async so the editor doesn't freeze while processing 1M+ lines. Too big? That's what Sheet said. If I type a character before the previous scan finishes, the old job must be killed.
1-row input bar: The UI is a 1-row floating bar that pops up when a filter is active. It stays open while I swap focus between the bar and the buffer. If I clear the filter, the bar goes away.
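The async interruption requirement can be sketched independently of Neovim. The following is a plain Python `asyncio` illustration, not plugin code: the `FilterSession` class and its fields are hypothetical, and the extmark and conceal handling is left out entirely. It only shows the "newer keystroke cancels the older scan" pattern.

```python
import asyncio
import re


class FilterSession:
    """Hypothetical helper that keeps at most one regex scan running at a time."""

    def __init__(self, lines, chunk_size=10_000):
        self.lines = lines
        self.chunk_size = chunk_size
        self.task = None
        self.visible = None  # indices of matching lines from the last finished scan

    async def _scan(self, pattern):
        regex = re.compile(pattern)
        matches = []
        for start in range(0, len(self.lines), self.chunk_size):
            chunk = self.lines[start:start + self.chunk_size]
            matches.extend(
                start + i for i, line in enumerate(chunk) if regex.search(line)
            )
            # Yield to the event loop so a newer keystroke can cancel this scan.
            await asyncio.sleep(0)
        self.visible = matches

    def on_input(self, pattern):
        """Call on every keystroke (from inside a running event loop)."""
        if self.task is not None and not self.task.done():
            self.task.cancel()  # kill the stale scan
        self.task = asyncio.create_task(self._scan(pattern))
        return self.task


async def demo():
    session = FilterSession([f"line {i}" for i in range(50)], chunk_size=10)
    first = session.on_input(r"1")           # superseded immediately
    second = session.on_input(r"^line 4\d$")
    await asyncio.gather(first, second, return_exceptions=True)
    return first.cancelled(), session.visible


was_cancelled, visible = asyncio.run(demo())
```

Only the scan for the latest pattern writes to `visible`; the chunk size is the knob that trades scan throughput against how quickly a stale scan can be interrupted.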

Does anyone know of an existing plugin that handles this type of masking?


u/8ta4 — 11 days ago

▲ 1 r/datasets

I checked out the word prevalence dataset of 62,000 lemmas. But it has some limitations:

It hasn't been updated since 2019.
It misses modern terms like TikTok.
It doesn't cover phrases.

I've scored about a million English entries from Wiktionary for recognizability. I built this for a pun tool. But I want to use the data for a new language project.

The dataset is too bloated because it's full of inflected forms. Even if I set the recognizability threshold at 50 percent, I'm still looking at 100K words and 100K phrases. Going through a list that size is a waste of time. I need to filter the data through the English lemmas category from Wiktionary and split the single words from the multi-word phrases into separate lists.
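The filtering and splitting step described above might look like the sketch below. Everything about the data shape is an assumption: `scores` is taken to be a mapping from entry to a recognizability score in the range 0 to 1, `lemmas` a set of titles from the English lemmas category, and a "phrase" any entry containing a space (so hyphenated entries count as single words).

```python
def split_recognizable_lemmas(scores, lemmas, threshold=0.5):
    """Keep recognizable lemma entries and split them into words and phrases.

    scores: dict mapping entry -> recognizability in [0, 1] (assumed shape).
    lemmas: set of entries listed in the English lemmas category (assumed shape).
    """
    words, phrases = [], []
    for entry, score in scores.items():
        if score < threshold or entry not in lemmas:
            continue  # drops obscure entries and inflected forms
        if " " in entry:
            phrases.append(entry)
        else:
            words.append(entry)
    return sorted(words), sorted(phrases)


# Toy scores, invented for illustration only.
toy_scores = {
    "run": 0.99,
    "running": 0.98,           # inflected form, not in the lemma set
    "kick the bucket": 0.80,
    "TikTok": 0.90,
    "floccinaucinihilipilification": 0.10,
}
toy_lemmas = {"run", "kick the bucket", "TikTok", "floccinaucinihilipilification"}
words, phrases = split_recognizable_lemmas(toy_scores, toy_lemmas)
```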

Since the hard part of scoring is done, the rest should be easy peasy lemma squeezy. I just want to avoid reinventing the wheel if I can.

Before I spin up a separate repository to handle this, I'm checking if a similar dataset already exists. Has anyone seen a project that offers this?


u/8ta4 — 11 days ago

▲ 1 r/LLMDevs

Cerebras hosts gpt-oss-120b at ~3000 tokens/s. But things can change once the buffer hits the load. Is there another production-ready model and provider combination that beats this setup for end-to-end response time while maintaining a similar level of reasoning?

I'm building an in-place, sentence-by-sentence rephraser and need the full response back in the buffer in under one second.
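As a rough back-of-the-envelope check on the one-second target, the sketch below treats end-to-end time as fixed overhead (network round trip plus time to first token) plus generation time at a steady throughput. The 300 ms overhead is a made-up placeholder, not a measurement, and `max_output_tokens` is a hypothetical helper.

```python
def max_output_tokens(budget_ms: int, overhead_ms: int, tokens_per_s: int) -> int:
    """Tokens that fit in the budget after subtracting fixed overhead.

    Uses integer milliseconds to avoid floating-point rounding surprises.
    """
    remaining_ms = max(0, budget_ms - overhead_ms)
    return remaining_ms * tokens_per_s // 1000


# Assumed numbers: 1 s budget, 300 ms overhead (placeholder), ~3000 tokens/s.
token_budget = max_output_tokens(budget_ms=1000, overhead_ms=300, tokens_per_s=3000)
```

Under these assumptions the budget is about 2,100 generated tokens per request, and any reasoning tokens the model emits before the rephrased sentence would count against that number.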

Any other feedback on the design is also welcome.

u/8ta4 — 12 days ago

▲ 28 r/neovim + 4 crossposts

I built a Neovim plugin that provides text motions designed for prose. My goal was to bypass blank lines and handle punctuation edge cases. Because it's Neovim, the default Lisp path is Fennel. I started there, but rewrote it as a ClojureScript Node.js remote plugin using shadow-cljs.

To be fair, Fennel has upsides:

There's almost zero ceremony between Fennel and Lua.
You avoid the async overhead of remote plugins.

But the friction started with the standard library. I was using nfnl, which comes with its own Clojure-inspired functions. The problem is they are not comprehensive enough. I found myself manually implementing things like difference just to process text bounds. Since Fennel's fn doesn't support multi-arity functions the way Clojure does, I decided to write some macros to implement it myself. Naturally, this devolved into yak shaving. The dealbreaker hit when I ran into a bug in Conjure. When the REPL failed on the macros I was trying to build just to make the language usable, that was the straw that broke the yak's back.

So I switched to a ClojureScript remote plugin. But I traded the macro REPL issues of Fennel for a different kind of REPL headache in ClojureScript.

Specifically, I'm having a bizarre issue where println fails inside my async code. It feels nondeterministic. Sometimes the output prints perfectly fine. But other times, println disappears into the void. To see the value, I resort to a hack: I create a fake atom and reset! the value into it. That works. But if I try to add a watch to that atom to print the updated value, that doesn't print either!

Does anyone have any idea why println is getting swallowed in this async Neovim context?

If anyone has any other feedback, I'd be happy to hear it.

u/8ta4 — 14 days ago