u/Chemical-Wall9026

[NLP/ML] Classifying short meeting subjects into 90+ task categories — accuracy stuck at 48%, looking for advice

Hey everyone, I'm working on an internal productivity tool that automatically tags calendar meetings with the correct project and task category. The app pulls meeting data from the calendar API and I want the ML model to predict: which client, which project, and which task purely from the meeting metadata.

The data looks roughly like this:

| Meeting Subject | Day | Duration | Organiser Role | Task Label |
|---|---|---|---|---|
| Team daily sync | Monday | 0.25h | QA Lead | QA Standup |
| Weekly checkpoint | Wednesday | 1h | Infra Lead | Infra Weekly Call |
| Tech review session | Thursday | 1.5h | QA Lead | QA Internal Meeting |
| Daily standup | Monday | 0.25h | Client PM | Client Standup |
| Automation framework setup | Friday | 2h | QA Engineer | Mobile Automation |
~1,500 records total.

The problem:

I have ~90 unique task labels in the raw data, and most of them have only 1–5 examples. The straightforward approach is to drop rare classes (< 15 samples), but that means losing real data. Instead, I want to *group similar tasks* and retain everything.

But grouping is tricky:

- "Client daily standup" and "Internal daily standup" sound identical in the subject line but are completely different tasks (different billing, different project)

- "AI assistant testing" and "AI POC work" sound similar and probably should be grouped

- Some tasks are person-specific (e.g. "Remediation task - engineer A" vs "Remediation task - engineer B") — same type of work, different person assigned
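For the person-specific case, one way to start grouping without dropping data is to normalise the label strings first (strip person-specific suffixes, collapse case) and only then apply a frequency threshold, sending rare leftovers to a catch-all bucket. A minimal stdlib-only sketch — the helper names and the `" - engineer X"` suffix pattern are assumptions based on the examples above, not from any library:

```python
import re
from collections import Counter

def normalise_label(label: str) -> str:
    """Strip person-specific suffixes like ' - engineer A' and collapse case."""
    label = re.sub(r"\s*-\s*engineer\s+\w+$", "", label, flags=re.IGNORECASE)
    return label.strip().lower()

def group_labels(labels, min_count=15, fallback="other"):
    """Map each raw label to its normalised bucket; buckets still below
    min_count after merging fall back to a catch-all class."""
    normalised = [normalise_label(l) for l in labels]
    counts = Counter(normalised)
    return [l if counts[l] >= min_count else fallback for l in normalised]

labels = (["Remediation task - engineer A"] * 10
          + ["Remediation task - engineer B"] * 8
          + ["Ad-hoc migration"] * 2)
grouped = group_labels(labels, min_count=15)
```

Note that the two engineer variants merge into a single class of 18 examples, which now clears the 15-sample threshold on its own — so normalising before thresholding rescues data that per-raw-label counting would have dropped.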

What I've tried:

- Logistic Regression + TF-IDF: ~44% on tasks

- SVM: ~44%

- DistilBERT fine-tuned on subject only: ~46%

- DistilBERT on subject + body_preview + organiser: ~48%

The training loss converges fine but validation loss plateaus early, suggesting the signal just isn't strong enough in the text alone.

My questions:

  1. Is there a smarter way to group ~90 classes into meaningful buckets beyond manual rules? I tried clustering sentence embeddings, but I'm struggling to validate whether the clusters actually make business sense.

  2. Should I be doing hierarchical classification (predict client first → use that as a feature → predict task)? It feels like the right architecture, but I haven't implemented it yet.

  3. Is 1,500 records just fundamentally too small for this many classes even after grouping?

  4. Any features I might be missing? I currently have: subject, body preview, organiser name, duration, day of week, attendee count.
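On validating embedding clusters (question 1): one cheap sanity check is to measure how "pure" each cluster is against a business field you already trust, such as the client column. A cluster that mixes several clients probably crosses a billing boundary, exactly like the standup example above. A sketch assuming you already have a cluster id per meeting (all names here are illustrative):

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, business_labels):
    """For each cluster, the fraction of members sharing the majority
    business label. Low purity flags clusters that cross client or
    project boundaries and should not be merged into one task bucket."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, business_labels):
        members[cid].append(label)
    return {cid: Counter(labels).most_common(1)[0][1] / len(labels)
            for cid, labels in members.items()}

# Toy example: cluster 0 mixes client and internal meetings (suspicious),
# cluster 1 is clean.
clusters = [0, 0, 0, 0, 1, 1]
clients  = ["ClientA", "ClientA", "Internal", "Internal", "ClientB", "ClientB"]
purity = cluster_purity(clusters, clients)
```

A practical rule of thumb under this setup: only auto-merge clusters above some purity cutoff (say 0.9) and route the rest to manual review — that keeps the manual effort proportional to the genuinely ambiguous cases.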

Any advice appreciated — especially from people who've tackled short-text multi-class classification with heavily imbalanced labels.

reddit.com
u/Chemical-Wall9026 — 2 days ago

Hi everyone,

I’m working on an NLP problem and would really appreciate some guidance on what to do next.

Objective:
I’m building a model that takes a meeting subject (e.g., “weekly sync”, “client call”, “testing discussion”) and predicts:

  • Project
  • Client
  • Task

Important point:
Not every meeting subject clearly contains all three.
Sometimes it may indicate only one or two, or be vague like “discussion” or “sync”.

Dataset:
The data comes from real meeting logs. Most fields are either missing or not useful, so I’m mainly relying on:

  • meeting_subject (primary input)

Challenges:

  • Short and ambiguous text
  • Many similar subjects across different projects/tasks
  • Task labels are very granular (~95 unique tasks)
  • Class imbalance (some tasks appear very rarely)

Models I tried:

  1. Logistic Regression (TF-IDF on subject)
  • Project accuracy: 66%
  • Client accuracy: 78%
  • Task accuracy: 37%
  2. SVM
  • Project accuracy: 67%
  • Client accuracy: 80%
  • Task accuracy: 44%
  3. DistilBERT (separate models for each target)
  • Project accuracy: 79.5%
  • Client accuracy: 93.5%
  • Task accuracy: 46%
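One caveat worth checking before comparing these numbers further: plain accuracy over ~95 imbalanced classes is dominated by the frequent tasks, so a 46% score can hide rare classes that are never predicted at all. Macro-F1 (an unweighted per-class average) makes that visible; a stdlib-only sketch, since the metric itself is standard:

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so rare task classes
    count as much as frequent ones."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# A model that always predicts the majority class scores 75% accuracy
# here, but its macro-F1 exposes the ignored minority class.
score = macro_f1(["a", "a", "a", "b"], ["a", "a", "a", "a"])
```

If accuracy and macro-F1 diverge sharply on the task target, the problem is less "the model is weak" and more "the model has given up on the tail" — which changes which of the remedies below is likely to help.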

Experiments:

  • Using only meeting subject → best performance
  • Adding other fields → reduced accuracy due to noise

Current system:

I’ve built a pipeline where:
meeting_subject → predicts Project + Client + Task using separate models

Problem:

  • Project and Client predictions are strong
  • Task prediction is weak

Likely reasons:

  • Too many task classes (~95)
  • Tasks are too specific and overlapping
  • Limited signal in short subject text

What I need help with:

  1. How should I improve task prediction?
    • Should I group tasks into broader categories?
    • Or use hierarchical prediction (project → task)?
  2. Should I keep 3 separate models or try a single multi-output model?
  3. Is DistilBERT enough, or should I try something like RoBERTa?
  4. Any best practices for handling short-text + high-class-count classification?
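On the hierarchical option in question 1: since the project and client models are already strong, one pragmatic variant is to use the predicted project to restrict the task label space — at inference time, only score tasks that co-occurred with that project in training. A stub sketch (the score dict stands in for any task model's output; none of this is a real model API):

```python
from collections import defaultdict

def build_task_space(records):
    """Map each project to the set of tasks seen with it in training data."""
    space = defaultdict(set)
    for project, task in records:
        space[project].add(task)
    return space

def predict_task(project_pred, task_scores, task_space):
    """Pick the highest-scoring task that is valid for the predicted
    project; fall back to the global argmax for unseen projects."""
    valid = task_space.get(project_pred)
    candidates = {t: s for t, s in task_scores.items()
                  if not valid or t in valid}
    return max(candidates, key=candidates.get)

train = [("Infra", "Infra Weekly Call"), ("Infra", "Infra Standup"),
         ("Mobile", "Mobile Automation")]
space = build_task_space(train)
scores = {"Mobile Automation": 0.5, "Infra Weekly Call": 0.4,
          "Infra Standup": 0.1}
```

The appeal of masking over a full two-stage model is that it needs no retraining: it layers the strong project signal on top of the existing task model, and an 80%-accurate project prediction shrinks the effective task space from ~95 classes to however many tasks that project actually has.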

Goal:

I want to build a practical and usable system, not just optimize metrics.

Would really appreciate suggestions.

Thanks!

u/Chemical-Wall9026 — 14 days ago