u/Dry-Opportunity-1987 — reddlx

▲ 1 r/LearnDataAnalytics

Hi. I have posts that I collected from a keyword search on Reddit. These posts may refer to a brand, or to a person's name, a film, or other unrelated content. I tried a knowledge base (KB) and supervised learning, but I still can't capture all the meanings in my dataset. My main objective is to find out what people are saying about one of these meanings, in this case the brand. Should I:

(1) run clustering/topic modelling to understand the meanings, select the one I want, and then run a second round of topic modelling/clustering on that subset?

(2) run BERTopic and keep only the topics that have the meaning I want?

(3) build a company "universe" list containing the brand's products, important keywords, and negative (non-brand) meanings taken from the KB, accepting the limitation that I won't have every context? Then use a bi-encoder for similarity, and maybe active learning or a cross-encoder for the posts the model is unsure about?
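For illustration, option (3) could be sketched roughly like this. It is only a minimal, standard-library example: a bag-of-words cosine similarity stands in for real bi-encoder embeddings, and the seed lists, the `margin` threshold, and the function names are all hypothetical. Posts whose brand and non-brand scores are too close are sent to "review", which is where a cross-encoder or active-learning labels would be used.

```python
import math
import re
from collections import Counter

# Hypothetical seed lists; in practice these would come from the KB and manual curation.
BRAND_SEEDS = ["handbag leather collection", "fashion show runway", "sunglasses price store"]
OTHER_SEEDS = ["movie film actress watch", "my friend said hello", "book novel character"]

def bow(text):
    """Lowercased bag-of-words vector; a stand-in for a bi-encoder embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def route(post, margin=0.1):
    """Return 'brand', 'other', or 'review' when the two best scores are too close."""
    vec = bow(post)
    brand = max(cosine(vec, bow(s)) for s in BRAND_SEEDS)
    other = max(cosine(vec, bow(s)) for s in OTHER_SEEDS)
    if abs(brand - other) < margin:
        return "review"  # uncertain: send to a human or a cross-encoder
    return "brand" if brand > other else "other"

for post in ["new leather handbag from the collection", "prada me for asking"]:
    print(post, "->", route(post))
```

With real embeddings the structure would stay the same; only `bow` and `cosine` would be swapped for the encoder's vectors, and `margin` would need tuning on a small labeled sample.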

Thank you for your help.

reddit.com

u/Dry-Opportunity-1987 — 12 days ago

▲ 1 r/learnmachinelearning

reddit.com

u/Dry-Opportunity-1987 — 12 days ago

▲ 4 r/MLQuestions+1 crossposts

reddit.com

u/Dry-Opportunity-1987 — 12 days ago

▲ 1 r/askdatascience

Hi,

I am currently working on the Prada entity filtering task for my school thesis and have encountered a few challenges.

Since all the data was collected using the query “Prada”, the problem is not entity detection but context disambiguation. I explored knowledge bases (e.g., Wikidata, YAGO), but they do not cover many social media uses of the word, such as a person's name or “prada me” (used instead of “pardon me”). A fully supervised approach is also difficult due to the lack of labeled data, and standard NER models fail to capture key expressions such as “The Devil Wears Prada” in the Reddit submissions retrieved with the query. I have asked my teachers, and they are looking for an alternative method that does not rely on selecting specific subreddits.

I am now considering starting with curated lists of positive labels (e.g., products, competitors) and negative labels (e.g., The Devil Wears Prada), and then applying semi-supervised approaches such as bootstrapping or active learning. However, I am unsure how to properly justify the approach, or the quality of its results, without a fully labeled dataset. Is there an easier or more appropriate way to solve this problem?

Could you advise on how best to validate and justify this approach?
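One common way to approach this kind of validation without a full gold set is to hand-label a simple random sample of the classifier's output and report a confidence interval instead of a single number. The sketch below assumes that setup; the function names are hypothetical, and the Wilson score interval is just one reasonable choice of interval.

```python
import math
import random

def sample_for_labeling(post_ids, k, seed=0):
    """Draw a reproducible simple random sample of post ids to label by hand."""
    rng = random.Random(seed)
    ids = list(post_ids)
    return rng.sample(ids, min(k, len(ids)))

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one labeled example")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical numbers: 100 posts predicted "brand" were labeled, 90 were correct.
low, high = wilson_interval(90, 100)
print(f"precision is plausibly between {low:.2f} and {high:.2f}")
```

Sampling from the posts predicted as brand gives a precision estimate; repeating the same procedure on the rejected posts gives an idea of how many brand posts were missed, which bounds recall.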

reddit.com

u/Dry-Opportunity-1987 — 12 days ago