r/data

▲ 1 r/data

Looking for personal injury data

Live data needed for the campaigns below:

Roundup

Depo Provera

Talcum

Hair relaxer

Rideshare

Motor Vehicle Accident

Interested in long term partnership. DM me.

reddit.com
u/FilmZealousideal6880 — 2 days ago
▲ 1 r/data

Dating Compatibility Scoring Matrix

Hey! I’m a data analyst and I implement data into all aspects of my life. I’ve had an idea and can’t find anyone who has done anything similar.

Most aspects of life have assessments and qualifying criteria, but not relationships. I want to create a matrix to score potential partners - the aim of this is to weed out incompatibility early.

It would be in a spreadsheet and all preferences would have a point attached to them, simplified example:

Has a hobby: +2 points

Cat person: +1 point

Has a cat/wants a cat: +2 points

Feminist (and enforces it): +3 points

Good fashion sense: +1 point

Unemployed (with caveats on this): -2 points

Drinks alcohol excessively: -4 points

Disparaging past partners: -10 points

Has anyone done this? All I can find is compatibility charts based on zodiac signs or personality types.

I’m aware that this could be an unhealthy approach to dating. On the other hand, it could allow people to have a clear, objective viewpoint.

With the example above, red flags cause the person to lose many points so it’s harder to overlook things that could become an issue later down the line.
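For anyone curious, the arithmetic is trivial to prototype outside a spreadsheet too. A minimal Python sketch, using only the illustrative criteria and weights from the examples above:

```python
# The whole matrix fits in a few lines: every criterion carries a weight, and
# a candidate's score is the sum of weights for criteria they meet. Criteria
# and point values are just the illustrative examples from the post.

WEIGHTS = {
    "has_hobby": 2,
    "cat_person": 1,
    "has_or_wants_cat": 2,
    "feminist": 3,
    "good_fashion_sense": 1,
    "unemployed": -2,
    "drinks_excessively": -4,
    "disparages_past_partners": -10,
}

def score(candidate):
    """Sum the weights of every criterion the candidate satisfies."""
    return sum(w for key, w in WEIGHTS.items() if candidate.get(key))

candidate = {"has_hobby": True, "cat_person": True, "disparages_past_partners": True}
print(score(candidate))  # → -7: one big red flag outweighs several positives
```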

Let me know your thoughts, thank you!

reddit.com
u/y2kateee — 1 day ago
▲ 2 r/data

Fixing data governance?

Has anyone been able to 'fully' fix the data governance problem within an organization?

Even as a data engineer for the past 5-6 years, I was never fully grounded in data governance until I had to do it.

It feels like a never-ending problem: most orgs are just trying to keep things up and running with band-aids, the data is never fully trusted, and badly formatted or just plain bad data keeps slipping through.

Saying "make sure your data is under a single governance model" is easier said than done.

So is everyone facing the same issue here?

reddit.com
u/al_tanwir — 3 days ago
▲ 2 r/data

Anyone here using structured datasets for outreach? Curious what's working.

Been experimenting a bit with structured datasets recently (mainly around property owners in Dubai) and trying to see what actually works vs what people claim works.

Nothing crazy: just cleaning the data properly, filtering by specific communities, and testing simple outreach (mostly WhatsApp, plus occasional calls).

One thing I noticed:

Raw data is almost useless unless you spend time structuring it properly. Once it’s cleaned and segmented, the response rate improves quite a bit.

Also feels like timing and how you approach the first message matters way more than the size of the dataset itself.
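Concretely, the "structuring" step is nothing fancy. A rough sketch (field names like "phone" and "community" are placeholders for whatever the raw export contains):

```python
# Clean-then-segment: normalize the contact field, drop empties and
# duplicates, and keep only rows in the target community. Field names
# ("phone", "community") are hypothetical placeholders.

def clean_and_segment(rows, community):
    seen = set()
    out = []
    for row in rows:
        phone = (row.get("phone") or "").replace(" ", "").replace("-", "")
        if not phone or phone in seen:
            continue  # drop rows with no contact info or duplicate numbers
        if row.get("community", "").strip().lower() != community.lower():
            continue  # keep only the target community segment
        seen.add(phone)
        out.append({**row, "phone": phone})
    return out
```

Everything after that (timing, first-message wording) sits on top of a list like this.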

Still figuring things out, but curious —

Are people here using datasets for lead gen / outreach?

What’s actually working for you right now?

Would be interesting to compare notes.

reddit.com
u/BatPlastic5066 — 5 days ago
▲ 3 r/data

Esports data vs. odds: a conversation we should start having

Something worth talking about on the trading/data side is the latest shift observed in esports lobbies.

When you model traditional sports, physical fatigue is manageable: you have rest days, fixture congestion, travel logs, injury reports, etc., so the degradation curve is relatively predictable (sportsbooks have been pricing tired legs for decades).

Esports players don't get tired legs; they get "tilt". For example:

A player on tilt in a CS2 or Dota 2 lobby isn't showing up in a physio report. It's showing up in their flash accuracy at round 18, their gold efficiency dropping 15% off baseline, their team's timeout clustering. By the time a casual bettor watching the stream thinks "they look shaky," the market should already have moved, but in a lot of live esports products, it hasn't.

That gap between what the data sees and what the odds reflect is the real conversation operators need to be having. If your live esports repricing is running on the same cadence as a pre-match football market, you probably have a mismatch worth fixing.
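As a toy illustration of what live repricing could key off, here is a sketch that flags when a per-round performance metric drops more than 15% below a rolling baseline. The metric values and the 15% threshold are invented for illustration, not a calibrated model:

```python
# Rolling-baseline tilt signal: ingest a per-round metric (e.g. gold
# efficiency) and flag rounds where the value falls more than a threshold
# fraction below the player's recent baseline. Numbers are illustrative.
from collections import deque

def tilt_signal(history_len=10, drop_threshold=0.15):
    """Returns an update function that yields True when the latest value
    falls more than drop_threshold below the rolling baseline."""
    window = deque(maxlen=history_len)

    def update(value):
        baseline = sum(window) / len(window) if window else value
        window.append(value)
        return baseline > 0 and (baseline - value) / baseline > drop_threshold

    return update

detect = tilt_signal()
rounds = [1.0, 1.02, 0.98, 1.01, 0.99, 0.80]  # normalized efficiency per round
flags = [detect(v) for v in rounds]
print(flags)  # → [False, False, False, False, False, True]
```

The point is that a signal like this fires from the data feed well before "they look shaky" is visible on stream.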

Any thoughts on this?

reddit.com
u/Altenar_b2b — 2 days ago
▲ 0 r/data

How would you monetize a dataset-generation tool for LLM training?

I’ve built a tool that generates structured datasets for LLM training (synthetic data, task-specific datasets, etc.), and I’m trying to figure out where real value exists from a monetization standpoint.

From your experience:

  • Do teams actually pay more for datasets, APIs/tools, or end outcomes (better model performance)?
  • Where is the strongest demand right now in the LLM training stack?
  • Any good examples of companies doing this well?

Not promoting anything — just trying to understand how people here think about value in this space.

Would appreciate any insights. Also, are there any subreddits, Discord servers, or marketplaces where I could pitch it?

reddit.com
u/JayPatel24_ — 7 days ago
▲ 8 r/data

How business process automation is quietly reshaping data pipelines

Something I’ve been noticing in data workflows lately is how much business process automation is influencing how pipelines are built and maintained.

Traditionally, data pipelines were owned by engineering or data teams. But now, with more automation tools available, non-technical teams are starting to build and manage parts of these workflows themselves.

On one hand, this democratization is great: it reduces bottlenecks and speeds up decision-making. On the other hand, it introduces new challenges around data quality, consistency, and governance.

I’ve seen cases where multiple automations are writing to the same dataset, leading to discrepancies that are hard to trace.

reddit.com
u/Sirwanga — 12 days ago
▲ 2 r/data

Best way to extract iPhone Screen Time data from screenshots into Excel (for university project)?

Hey everyone,

I’m currently working on a university art/research project where I’m collecting and analyzing personal data (e.g. screen time, app usage, notifications, etc.) and transforming it into structured datasets.

The issue:

I have around 30+ iPhone Screen Time screenshots (one per day), and I need to convert all of that into a clean Excel table (e.g. per app, per day, usage time, notifications, etc.).

I’ve already tried using ChatGPT and basic OCR approaches, but they start making errors pretty quickly (especially after a few days), and the structure breaks down. Since the data needs to be quite precise, that’s a problem.

Manually typing everything is not an option — it would take way too long.

I’ve attached an example screenshot so you can see what kind of data I’m working with.

So my questions:

- Are there better OCR tools for this kind of structured UI data?

- Is there a way to automate this properly (batch processing)?

- Would a different prompting approach improve results?

- Or is there maybe a completely different workflow I’m missing?
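To make the batch-processing question concrete: one direction would be to keep OCR dumb (raw text only) and do the structuring in a strict parser, so malformed lines get flagged for manual review instead of silently corrupting the table. A sketch, assuming English iOS "Xh Ym" labels (untested against real screenshots):

```python
# Strict post-OCR parsing: accept only lines matching an expected
# "App [Nh] [Nm|N min]" shape; everything else goes to an error list for
# manual review. The pattern assumes English iOS Screen Time labels.
import re

LINE = re.compile(
    r"^(?P<app>.+?)\s+"                            # app name (lazy match)
    r"(?:(?P<h>\d+)\s*h)?\s*"                      # optional hours, e.g. "2h"
    r"(?:(?P<m>\d+)\s*min|(?P<m2>\d+)\s*m)?\s*$"   # optional minutes
)

def parse_usage(ocr_text):
    rows, errors = [], []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE.match(line)
        if match and (match["h"] or match["m"] or match["m2"]):
            minutes = int(match["h"] or 0) * 60 + int(match["m"] or match["m2"] or 0)
            rows.append((match["app"], minutes))
        else:
            errors.append(line)  # flag for manual review
    return rows, errors

rows, errors = parse_usage("Instagram 2h 15m\nSafari 45m\nScreen Time")
print(rows, errors)
```

Run per screenshot in a loop, tag each result with the date, and the per-app/per-day Excel table falls out of the parsed rows.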

Would really appreciate any suggestions — especially from people who’ve dealt with similar data extraction problems.

Thanks!

u/GreenySeinVater — 11 days ago
▲ 11 r/data+1 crossposts

Beyond CSV & Parquet: What Real Data Ingestion in Spark Actually Looks Like

Most Spark tutorials focus on clean CSVs and Parquet files, but real-world data is rarely that simple. In this post, I share practical ingestion patterns and lessons learned from working with messy, unpredictable data in production.

medium.com
u/Expensive-Insect-317 — 15 days ago
▲ 2 r/data

Need advice on datasets and models for multi-task song classification (genre, mood, gender)

Hi,

I’m working on a song classification project and I need some guidance.

The goal is to build a system that takes a song as input and predicts multiple things like genre, mood, and singer gender. Eventually I want to either combine everything into one model or design a good pipeline for it.

So far, I’ve used the FMA dataset for genre classification and the DEAM dataset for mood. For gender classification, I manually collected around 1200 songs and labeled them. The problem is that all these datasets are separate and don’t overlap, so the same song doesn’t have all labels.
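One workaround for the non-overlap (a sketch, with hypothetical song IDs): merge the sources into a single table where missing labels are None, then mask each head's loss during multi-task training:

```python
# Merge non-overlapping label sources into one table keyed by song, with
# None wherever a source has no label for that song. During multi-task
# training, each head's loss is computed only on rows where its label is
# not None. Song IDs here are hypothetical placeholders.

def merge_label_sources(genre_ds, mood_ds, gender_ds):
    """Each input maps song_id -> label; returns song_id -> full label dict."""
    merged = {}
    for head, ds in (("genre", genre_ds), ("mood", mood_ds), ("gender", gender_ds)):
        for song_id, label in ds.items():
            row = merged.setdefault(song_id, {"genre": None, "mood": None, "gender": None})
            row[head] = label
    return merged

table = merge_label_sources({"song_a": "pop"}, {"song_b": "sad"}, {"song_a": "female"})
print(table["song_a"])  # → {'genre': 'pop', 'mood': None, 'gender': 'female'}
```

That way FMA, DEAM, and the hand-labeled gender set can live in one pipeline without needing every song to carry all three labels.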

Even though I trained the models separately (I used a CNN) and checked them, they give wrong answers. I also tried combining the three separate models into one and training that, and the results are the same: sometimes the gender is correct, but the other predictions are wrong.

When I tested with "Shape of You" by Ed Sheeran, the gender was predicted as female and the other two labels were also wrong. I face the same issue with regional (Indian-origin) songs: none of the three classifications come out right. My project needs to handle both Western and regional songs.

So, are there any datasets where songs already have multiple labels (genre, mood, and gender) together?

Also, can you suggest an LLM for this project? I've been using Claude Sonnet, but the free limit is getting on my nerves, and as a student I can't afford Claude Code even with the student discount.

Any advice or resources would be really helpful. Thanks.

reddit.com
u/Abhiram_L — 14 days ago
▲ 0 r/data

I've tested most AI data analysis tools, here's how they actually compare

I'm a statistician and I've been testing AI tools for data analysis pretty heavily over the past few months. Figured I'd share what I've found since most comparison posts online are just SEO content that never actually used the tools.

| Tool | What it does well | Limitations |
| --- | --- | --- |
| Claude | Surprisingly good statistical reasoning. Understands methodology, picks appropriate tests, explains its thinking. | Black box — you can't see the code it runs or audit the methodology. Can't reproduce or defend the output. |
| Julius AI | Solid UI, easy to use. Good for quick looks at data. | Surface-level analysis. English → pandas → chart → summary paragraph. Not much depth beyond that. |
| Hex | Great collaborative notebook if you already know Python/SQL. | It's a notebook, not an analyst. You're still writing the code yourself. Different category. |
| Plotly Dash / Tableau / Power BI | Good for building dashboards and visualizing data you've already analyzed. | Dashboarding tools, not analysis tools. No statistical tests, no interpretation, no findings. People conflate dashboards with analysis. |
| PlotStudio AI | 4 AI agents in a pipeline — plans the approach, writes Python, executes, interprets. Full analysis pages with charts, stats, key findings, implications, and actionable takeaways. Shows all generated code so you can audit the methodology. Write-ups are measured and careful — calls out limitations and gaps in its own analysis. Closest to what a real statistician would produce. | One dataset upload at a time. No dashboarding yet. Desktop app, so you have to download it (upside: data never leaves your machine). |

Curious what others are using. Anyone found something I'm missing?

reddit.com
u/PlateApprehensive103 — 17 days ago
▲ 2 r/data

I tracked every outbound email we sent for 30 days

I recently decided to track every outbound email we sent over a 30-day period. Not just the number of emails, but timing, follow-ups, and outcomes.

What I found was uncomfortable. We weren’t as consistent as we thought. Some days we sent a lot of emails, other days barely any. Follow-ups were even worse—many prospects never received a second or third touch.

The biggest realization was that our results were directly tied to this inconsistency. It wasn’t random, it was predictable based on our activity patterns.

Seeing it laid out in data made it impossible to ignore.

Now we’re focused on building a more structured and consistent approach, rather than relying on bursts of effort.

reddit.com
u/ExtremeAstronomer933 — 21 days ago
▲ 1 r/data

Cleaned Indian Liver Patient Dataset (ML Ready)

🔥 The Dataset:

https://www.kaggle.com/datasets/shauryasrivastava01/liver-patient-dataset

• 583 patient records with real clinical biomarkers

• Binary classification (Liver Disease vs Healthy)

• Fully cleaned + preprocessed (no messy columns)

• Includes enzymes, bilirubin, proteins & demographic data

• Perfect for ML projects, EDA, and healthcare modeling

💡 Great for:

- Beginners learning classification

- Feature importance & SHAP analysis

- Bias & fairness studies in healthcare

🚀 Ready to plug into your ML pipeline!

reddit.com
u/Direct-Jicama-4051 — 14 days ago
▲ 1 r/data

Private set intersection, how do you do it?

I work with a company that sells data. As an example, let's say we are selling email addresses. A frequent request we'll get is, "Well, we already have a lot of emails; we only want to purchase the ones you have that we don't."

We need a way that we can figure out what data we have that they don’t, without us giving them all our data or them giving us all their data.

This is a classic case of private set intersection, but I cannot find an easy-to-use solution that isn't insanely expensive.

Usually we're dealing with small counts, like 30k-100k. We typically resort to the company agreeing to send us hashed versions of their data, and then hope nobody brute-forces the hashes. This is obviously unsafe. What do you guys do?
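For reference, the construction this problem calls for is the classic Diffie-Hellman-style PSI. A toy sketch with illustrative parameters (a real deployment should use a vetted PSI library with elliptic-curve groups, not this):

```python
# Toy sketch of Diffie-Hellman-style PSI: each side blinds its hashed items
# with a private exponent; since exponentiation commutes, doubly-blinded
# values match exactly when the underlying emails match, and neither side
# ever sees the other's raw or singly-hashed data. Parameters here are
# illustrative; real PSI uses elliptic-curve groups and vetted libraries.
import hashlib
import secrets

P = 2**256 - 189  # a prime modulus (illustrative only)

def h2g(item: str) -> int:
    """Hash an item to a group element (squared to land in a subgroup)."""
    digest = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return pow(digest, 2, P)

def blind(items, secret):
    """First blinding pass: {H(x)^secret mod P}."""
    return {pow(h2g(x), secret, P) for x in items}

def reblind(blinded, secret):
    """Second blinding pass over the other party's values."""
    return {pow(v, secret, P) for v in blinded}

# Each party picks a private exponent it never shares.
a = secrets.randbelow(P - 2) + 1  # seller's secret
b = secrets.randbelow(P - 2) + 1  # buyer's secret

seller = {"x@x.com", "y@y.com", "z@z.com"}
buyer = {"y@y.com", "q@q.com"}

# Parties exchange singly-blinded sets and re-blind what they receive;
# comparing the doubly-blinded sets reveals only the overlap.
overlap = reblind(blind(seller, a), b) & reblind(blind(buyer, b), a)
print(len(overlap))  # → 1 (only y@y.com is in both sets)
```

The seller then delivers only the records outside the overlap, which is exactly the "what we have that they don't" query, without either side handing over its full list.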

reddit.com
u/EducationalTackle819 — 22 days ago
▲ 3 r/data

At what point did your data start failing you in production?

One pattern I’ve been noticing across different AI/ML systems we’ve been building and deploying:

Things work fine early on with:

- curated datasets

- synthetic data

- small controlled test sets

But once systems hit real-world usage, a different class of problems shows up:

- edge cases that weren’t in the original data

- distribution shifts that quietly degrade performance

- workflows behaving differently than expected

- gaps in eval coverage that only show up over time

What’s interesting is that we often hit a point where everything looks fine structurally, but performance just isn’t reliable anymore.
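As one concrete way to catch this kind of quiet degradation, a Population Stability Index over binned feature distributions is a cheap first check (the histograms and the ~0.2 threshold below are common conventions, not rules):

```python
# Population Stability Index (PSI): compare a production feature histogram
# against the training-time histogram over the same bin edges. Values above
# roughly 0.2 are commonly read as significant drift. Bin counts are made up.
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms that share the same bin edges."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # floor to avoid log(0)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total

train_bins = [100, 300, 400, 200]  # training-set histogram
prod_bins = [300, 300, 250, 150]   # production histogram, shifted low
print(round(psi(train_bins, prod_bins), 3))  # → 0.305, above the ~0.2 line
```

Run per feature on a schedule and alert on the score; it catches the "structurally fine but quietly shifted" case before accuracy metrics do.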

For those who’ve run into this:

When did you realize your existing data wasn’t enough?

More importantly:

- what didn’t work when you tried to fix it?

- where did your data still fall short even after expanding it?

Trying to understand where this actually breaks down in practice.

reddit.com
u/EveningWhile6688 — 10 days ago
▲ 3 r/data

Snowflake PII Classification & Auto Policy Setup - Help

What real-world use cases or extensions can I build on top of Sensitive Data Classification & Policy Enforcement in Snowflake? I want to experiment and build something impactful.

The idea: run SYSTEM$CLASSIFY across schemas to detect PII (emails, SSNs, phone numbers), then auto-generate and apply masking and row access policies based on the results. Policies are tied to tags, so new columns are automatically protected, building a governance-as-code layer for GDPR/CCPA compliance.
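As a sketch of the policy-generation half (the category names, tag name, and the exact output shape of SYSTEM$CLASSIFY are assumptions to verify against Snowflake's docs before building on this):

```python
# Governance-as-code sketch: take classification results as
# (column, detected_category) pairs and emit tag-setting DDL; a masking
# policy attached to the tag then does the actual protection. Table, tag,
# and category names below are hypothetical.

MASKED_CATEGORIES = {"EMAIL", "NATIONAL_IDENTIFIER", "PHONE_NUMBER"}

def masking_ddl(table, classified_columns):
    """classified_columns: iterable of (column_name, semantic_category)."""
    statements = []
    for column, category in classified_columns:
        if category in MASKED_CATEGORIES:
            statements.append(
                f"ALTER TABLE {table} MODIFY COLUMN {column} "
                f"SET TAG governance.pii_category = '{category}';"
            )
    return statements

for stmt in masking_ddl("crm.public.users", [("email", "EMAIL"), ("signup_ts", "NONE")]):
    print(stmt)
```

With a masking policy attached to the tag (via ALTER TAG ... SET MASKING POLICY), any newly tagged column inherits protection automatically, which is the loop described above.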

I’m still in the exploration/ideation phase, so open to experimenting and building something impactful in Snowflake.

Would really appreciate your inputs 🙌

Thanks in advance!

reddit.com
u/Key_Card7466 — 22 days ago