Building a Narcolepsy sentiment analysis tool using Reddit's public API; full details on anonymization, ToS compliance, and how this can help all of us
I'm building a tool that aggregates public Reddit posts and comments mentioning narcolepsy medications to identify common side effects, treatment patterns, and medication efficacy. I have recently made several posts (links below) on evidence-based approaches to the most effective treatments, the best supplements, potential causes of narcolepsy, and more.
This is NOT for commercial use.
I'm one person with narcolepsy trying to fill a research gap that affects me and everyone here. I know what it's like to suffer with this condition; it has derailed my whole life. Medication research for narcolepsy is scarce (mostly funded by big pharma, with small sample sizes), and real patient experiences with drugs like Xywav, modafinil, pitolisant, and Sunosi are scattered across thousands of Reddit posts. I believe this wealth of data could change people's lives.
I have also cited the Reddit terms that allow me to do this. The academic papers cited below did not ask for permission or give you the ability to opt out, but I am offering an opt-out.
What the tool does
It analyzes public posts/comments that mention narcolepsy medications, extracts mentions of side effects, and produces aggregate-level output like:
- "Among 200 posts mentioning Xywav, nausea appeared in ~30%, anxiety in ~20%"
- "Modafinil sentiment breakdown: 60% positive, 25% neutral, 15% negative"
It does NOT identify individuals, publish quotes, or profile users. It reports patterns at the medication level.
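As a rough illustration of what "aggregate-level" means here, this is a minimal sketch of a sentiment breakdown, not the actual tool. The function name `sentiment_breakdown` and the label strings are assumptions for the example, and the input data is made up.

```python
from collections import Counter

def sentiment_breakdown(labels):
    """Return the percentage of each sentiment label, rounded to whole numbers."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total) for label, n in counts.items()}

# Made-up per-post labels for one medication (illustration only).
labels = ["positive"] * 12 + ["neutral"] * 5 + ["negative"] * 3
print(sentiment_breakdown(labels))
```

Only the percentages leave the function; the individual posts behind each label are not part of the output.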
Reddit ToS compliance
Reddit's Public Content Policy states: "you can use Reddit content for non-commercial uses, such as learning and community." This project is non-commercial. I'm also using the official API within rate limits, accessing only public subreddits, and not touching private, quarantined, or deleted content. Reddit's Data API Terms are also clear that user content may not be used for ML/AI training or any commercial purpose, and I am doing neither.
Reddit ToS links:
• Public Content Policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
• Data API Terms: https://redditinc.com/policies/data-api-terms
This has been done before, by actual researchers
Reddit-based health research is published in peer-reviewed journals regularly. A few examples:
- Pleasants E, et al. (2023). "Exploring Language Used in Posts on r/birthcontrol." J Med Internet Res, 25, e46342.
- Johnson AK, et al. (2022). "STD-Related Reddit Posts During the COVID-19 Pandemic." J Med Internet Res, 24(10), e37258. PMID: 36219757.
- Heidari O, et al. (2024). "Personal Experiences With Xylazine and Behavior Change." J Addict Med, 19(2), 135-142. PMID: 39329377.
- Adams NN. (2024). "'Scraping' Reddit posts for academic research?" Int J Soc Res Methodol, 27(1), 47-62.
- Turcan E, McKeown K. (2019). "Dreaddit: A Reddit Dataset for Stress Analysis in Social Media." LOUHI 2019, 97-107.
I'm following the same methodology and ethical framework these papers used. The Adams (2024) paper specifically outlines the ethical considerations for Reddit-based research, including consent, anonymization, and data handling, and my tool follows those recommendations.
How anonymization works
At collection: I only pull post text, subreddit name, month (not day), and score. Usernames, permalinks, profile URLs, avatars, flair, and user IDs are never stored. The raw API response is processed in memory and discarded immediately, so nothing identifiable is stored.
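The collection step could look something like the sketch below. It is a simplified illustration, not the real pipeline: the helper name `minimal_record` is hypothetical, and it assumes each raw API item is a dict with `selftext` (posts) or `body` (comments), `subreddit`, `created_utc`, and `score` keys. The sample item is made up.

```python
from datetime import datetime, timezone

def minimal_record(raw):
    """Copy only the non-identifying fields out of one raw API item."""
    created = datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc)
    return {
        "text": raw.get("selftext") or raw.get("body") or "",
        "subreddit": raw["subreddit"],
        "month": created.strftime("%Y-%m"),  # month only, the day is dropped
        "score": raw["score"],
    }

# Made-up raw item (illustration only).
raw_item = {
    "author": "example_user",
    "permalink": "/r/Narcolepsy/comments/abc123/",
    "selftext": "Started modafinil last week.",
    "subreddit": "Narcolepsy",
    "created_utc": 1700000000,
    "score": 5,
}
record = minimal_record(raw_item)
print(record)
```

Because the function builds a new dict from an explicit list of fields, anything not named (author, permalink, flair, and so on) never reaches the next step.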
Text scrubbing: A preprocessing pass strips direct identifiers (u/ handles, emails, phone numbers, URLs, clinic names, street addresses, exact dates). Quasi-identifiers like age, location, and employer are generalized (e.g. "29-year-old nurse in Boise" becomes "adult healthcare worker in western US").
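Here is a minimal sketch of the direct-identifier part of that pass, using regular expressions. The patterns and placeholder tokens are assumptions for illustration; the phone pattern only covers US-style numbers, and clinic names, street addresses, dates, and the generalization of quasi-identifiers are not handled by simple patterns like these.

```python
import re

# Order matters: URLs go first so a handle inside a link is removed with the link.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+"), "[USER]"),
    (re.compile(r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"), "[PHONE]"),
]

def scrub(text):
    """Replace URLs, emails, u/ handles, and US-style phone numbers with placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sample = "Ask u/sleepy_sam or mail sam@example.com, call 555-123-4567, see https://example.com/x"
print(scrub(sample))
```

Regexes like these are a first filter only. Free-text identifiers such as a clinic name need a separate approach and a manual spot check.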
Storage: Only the cleaned, de-identified analytic table is stored: subreddit, medication mention, side effect flag, month, and a sentiment score. No raw text. No way to trace anything back to a person.
Output: Only aggregate statistics. Minimum threshold of 10 posts per subgroup. No individual quotes, no verbatim text, no cross-tabulations that could isolate a single person.
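The minimum-threshold rule can be sketched as follows. This assumes a "subgroup" means all posts for one medication, and the function name `side_effect_rates` and the input shape are hypothetical. The rows are made-up data.

```python
from collections import defaultdict

MIN_POSTS = 10  # minimum subgroup size stated above

def side_effect_rates(rows, min_posts=MIN_POSTS):
    """rows: iterable of (medication, set_of_side_effects), one entry per post.
    Returns {medication: {side_effect: percent}} for large-enough groups only."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for medication, effects in rows:
        totals[medication] += 1
        for effect in effects:
            hits[medication][effect] += 1
    report = {}
    for medication, n in totals.items():
        if n < min_posts:
            continue  # subgroup too small: suppress it entirely
        report[medication] = {e: round(100 * c / n) for e, c in hits[medication].items()}
    return report

# Made-up rows (illustration only): 10 Xywav posts, 4 pitolisant posts.
rows = [("Xywav", {"nausea"})] * 3 + [("Xywav", set())] * 7 + [("pitolisant", {"headache"})] * 4
print(side_effect_rates(rows))
```

The pitolisant group has only 4 posts, so it is left out of the report rather than shown with a small count.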
Opt-in vs opt-out
Some people have asked about consent. Here's why I'm doing opt-out rather than opt-in:
Opt-in is not standard practice for secondary analysis of public social media data. None of the published studies I cited used opt-in consent, because Reddit posts are public by default, ethics boards routinely classify this as exempt research, and requiring opt-in would create massive selection bias (only the most motivated people respond, skewing the data). Once data is properly anonymized, it is structurally impossible to apply informed consent retroactively, and that's accepted practice in health informatics research.
I'm offering opt-out because I think people deserve the option, even though it's not required by ToS or standard research practice. I'm trying to do this right where the published studies did not, because I am part of this community, not just a researcher.
How opt-out will work
Because all data is anonymized at the point of collection (usernames are never stored long-term), opt-out requires a specific pipeline. Usernames collected during the scrape will be hashed and checked against the opt-out list, matching posts will be excluded, and the hashing salt will then be destroyed. This means the exclusion is applied and can't be reversed or abused. I will update everyone once the mechanism is finalized, including a submission form.
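Since the mechanism isn't finalized, the following is only one possible sketch of the hash-check-discard idea. It assumes a keyed hash (HMAC-SHA256) with a random per-run salt, treats usernames as case-insensitive, and uses hypothetical names (`keyed_hash`, `apply_opt_out`). The usernames and texts are made up.

```python
import hashlib
import hmac
import secrets

def keyed_hash(username, salt):
    """HMAC-SHA256 of the lowercased username under a one-time secret salt."""
    return hmac.new(salt, username.lower().encode("utf-8"), hashlib.sha256).hexdigest()

def apply_opt_out(items, opted_out_usernames):
    """items: list of (username, text). Returns texts whose author has not opted out."""
    salt = secrets.token_bytes(32)  # generated fresh and used for this run only
    blocked = {keyed_hash(name, salt) for name in opted_out_usernames}
    kept = [text for username, text in items if keyed_hash(username, salt) not in blocked]
    # The salt is never written anywhere; dropping the reference here does not
    # securely wipe memory, it only ensures nothing downstream can reuse it.
    del salt
    return kept

# Made-up data (illustration only).
items = [("Alice", "post A"), ("bob", "post B")]
print(apply_opt_out(items, ["alice"]))
```

Only the kept texts are returned, so neither usernames nor hashes travel further down the pipeline.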
Links to my recent posts:
https://www.reddit.com/r/Narcolepsy/s/mqegrd6GpU
https://www.reddit.com/r/Narcolepsy/s/c5XsZo9Ol5
https://www.reddit.com/r/Narcolepsy/s/jrO1OPfQbl
https://www.reddit.com/r/Narcolepsy/s/egbb28hTJA