r/statistics

[Q] Linear regression normality test, teachers keep telling me to do it on variables instead of residuals.

Hello,
I have a dataset from my Likert-scale questionnaire (16 questions for the IV, 14 for the DV), with n = 66, and I need to study the relationship between the variables. I thought linear regression was the best choice for this situation, since it's common and used in most previous dissertations. I tested normality on the residuals and got sig. above 0.05, but the teachers at the uni keep telling me to test the variables instead, which makes my normality test fail with values under 0.05. What do I do? How do I convince them? And if there is a better way to study the relationship without normality tests, I'm down for it. The Q-Q plot looks fine, all the dots are close to the line, but the teachers still refuse to accept it without the normality test on the variables.
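
For anyone curious what the residual check looks like, here is a minimal sketch in Python; the data is made up and the names (iv_score, dv_score) are placeholders for the real composites:

    import numpy as np
    from scipy import stats

    # Placeholder composites; swap in the real questionnaire means (n = 66).
    rng = np.random.default_rng(0)
    iv_score = rng.normal(3.5, 0.6, 66)                  # mean of the 16 IV items
    dv_score = 0.8 * iv_score + rng.normal(0, 0.4, 66)   # mean of the 14 DV items

    # Fit DV ~ IV and pull out the residuals.
    slope, intercept = np.polyfit(iv_score, dv_score, 1)
    residuals = dv_score - (intercept + slope * iv_score)

    # The regression normality assumption concerns these residuals,
    # not the raw variables themselves.
    w, p = stats.shapiro(residuals)
    print(f"Shapiro-Wilk on residuals: W = {w:.3f}, p = {p:.3f}")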

reddit.com
u/mohdd22 — 12 hours ago

[Question] What kind of statistical test should I use when comparing across different treatment groups?

Hello! Basically I'm trying to figure out what kind of statistical test I should do based on the observations I made.

Essentially, my study looked at 4 different treatments: a control, plus low, medium, and high concentrations of algae. The purpose was to see if Daphnia hopping frequency changes as concentration increases; more specifically, whether it slows down at higher concentrations. Each treatment had 10 individuals measured.

I'm kind of at a loss in terms of where I should even start. In my head it doesn't make sense to do an ANOVA (I think), because from what I understand that's like comparing each treatment against a baseline. But what I want is a statistical test that tells me whether or not there's significant slowing as concentration increases. So I think that would be a linear regression...?

Sorry if this question is easy to answer, I genuinely have forgotten any stats I took in the lower years so choosing a test is like digging through a bucket of marbles. Thank you!
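
A hedged sketch of both options side by side, with invented hop counts standing in for the real measurements (and concentration coded 0-3 purely for illustration):

    import numpy as np
    from scipy import stats

    # Invented hop-frequency data; swap in the real 10 measurements per group.
    rng = np.random.default_rng(1)
    groups = {0.0: rng.normal(60, 5, 10),   # control
              1.0: rng.normal(57, 5, 10),   # low
              2.0: rng.normal(54, 5, 10),   # medium
              3.0: rng.normal(50, 5, 10)}   # high

    # One-way ANOVA asks "do the four group means differ at all?"
    f, p_anova = stats.f_oneway(*groups.values())

    # A regression on concentration level asks the directional question:
    # "does hopping frequency decline as concentration increases?"
    x = np.repeat(list(groups.keys()), 10)
    y = np.concatenate(list(groups.values()))
    trend = stats.linregress(x, y)
    print(f"ANOVA p = {p_anova:.4f}; slope = {trend.slope:.2f}, p = {trend.pvalue:.4f}")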

reddit.com
u/myCabbagesssssss — 9 hours ago

[Q] What statistical analysis would fit my dissertation?

I'm currently writing my politics dissertation, using data from Hansard to see if/how politicians have changed their framing on a certain issue. I am coding statements using Mary Douglas’ cultural theory categories: fatalist, egalitarian, hierarchical, and individualist. Some statements also have a primary and secondary frame, for example primarily hierarchical but secondarily egalitarian. With my data, I have split it into two time periods, pre and post-event. My instinct is to compare the percentage distribution of frames across the two periods, then use that as the basis for qualitative analysis of what the shift means. Unfortunately I have little statistical training, so I'm trying to work out if there is something that would be methodologically appropriate and realistic. I don’t want to force an overly complicated model onto a project that is mainly qualitative, but I also want the analysis to feel rigorous. Do you think that percentages would be enough here? Or are there other techniques which could strengthen my analysis? Thanks.
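
For what it's worth, the textbook tool for checking whether a categorical distribution differs across two periods is a chi-square test of independence on the raw counts. This is only a sketch with invented numbers, not something the post names, and the test needs counts rather than percentages, which is worth keeping in mind when tabulating:

    import numpy as np
    from scipy import stats

    # Invented counts of primary frames, pre- vs post-event.
    # Columns: fatalist, egalitarian, hierarchical, individualist
    counts = np.array([[12, 30, 45, 13],    # pre-event statements
                       [10, 48, 30, 12]])   # post-event statements

    chi2, p, dof, expected = stats.chi2_contingency(counts)
    print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")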

reddit.com
u/Peggylizzie — 5 hours ago
▲ 40 r/statistics+1 crossposts

difference between econometrics and (applied) statistics

Finishing my MSc in economics, and the more I dive into econometrics (in 2026, of course), the harder I find it to distinguish between statistics and econometrics. Heckman & Pinto's argument aside (and ignoring structural econometrics), after the "credibility revolution" much of the working toolkit looks less like a separate science than like applied statistical inference on economic data.

Reading some papers from the QJE, one could easily see them fitting perfectly in an ASA journal, and vice versa (Wager & Athey 2018, for example).

Theoretical econometrics is even harder to distinguish from pure statistics. I'm not that interested in a historical account (Morgan's 1990 book is amazing), but rather in how you all see the current state of affairs.

At least reduced-form econometrics seems to me like economics-branded applied statistics. Of course a traditional applied micro paper would probably not be a fit for a stats journal, but I cannot see it as more than literally applied stats. What do you think?

reddit.com
u/virgil_eremita — 18 hours ago

[Q] What does the job market look like right now for Biostatistics PhD students in 2026, and any tips?

I am currently a Biostatistics PhD student, and my advisors want me to graduate next year (2027).

Originally, my first advisor wanted me to graduate in 2028, but there were funding issues, so it looks like I have next year to prepare for the job search.

NGL, I am super worried, as I don't have any internships and my research is mostly computational (not theoretical).

I am wondering if research direction is important. I know that I probably would not get into top research labs or become a top quantitative researcher. I am just hoping I have a good chance of becoming a data scientist at a tech company or working in pharma.

I am a little clueless about how to do a job search. I am super worried. I do have a paper or two published, but they are applied/collaboration work (large-scale data analysis).

reddit.com
u/edsmart123 — 7 hours ago

[Q] What are the most important distributions to know beyond the Normal and Binomial/Multinomial?

Beyond just the Normal and Multinomial distributions, what are some extremely important distributions that you frequently come across / work with when doing statistics? Any that are just really cool?

In my coursework I've run across the Beta and Gamma distributions and they seem quite important, but I'm not sure which distributions I should really get to know for when I will conduct actual analyses/make models.

Any input is appreciated!

reddit.com
u/JerryChen06 — 6 hours ago
▲ 1 r/statistics+1 crossposts

[E] Advice regarding course selection for an MS Applied Stats program with a focus in geospatial data

I am in the process of registering for another semester of classes for my MS program, which allows us to take courses in an allied field in addition to purely statistics courses.

The allied field I am considering is geographic information systems. I've always been interested in physical geography/cartography so it feels like a natural fit. Also, my university's geosciences department is very tech oriented, so I am kind of spoiled for choice in that regard.

My question is directed at those who work on the stats side of GIS in their careers: what are some particularly important topics in statistics/GIS I should incorporate into my curriculum if I aim to work in geospatial data analytics after graduating?

Also, our stats department essentially runs on R and I am wondering if it would be worthwhile to also learn Python on my own time while in school.

Please feel free to offer career advice of any kind, as well. I am only beginning my second semester and my undergrad is in pure mathematics, so all of this is very new to me.

Thanks in advance!

reddit.com
u/Ready-Community-4459 — 7 hours ago

[Q] A question on the estimation of reliability in longitudinal data

I’ve been researching the problem of test-retest reliability for a while now and I’m curious how others are handling the identifiability issues that come with longitudinal data.

In psychology we are usually taught that retest reliability is a simple correlation between two time points. The problem is that this assumes the underlying trait is perfectly stable and the measurement error is completely random. In my opinion these assumptions are basically impossible for real world data because even the most stable traits usually only correlate at about 0.6 to 0.8 over time.

I recently published a paper in Applied Psychological Measurement where I demonstrated that when these assumptions are not exactly met, the resulting retest coefficient is entirely uninterpretable. Moreover, these assumptions are also not testable, since the framework is essentially a black box. A simple correlation cannot tell you if a low score means your scale is noisy or if your participants actually changed, because you only ever observe two knowns but have more than two unknowns.

I am definitely not alone in this critique. A paper that came out earlier this year by Tufiş, Alwin, and Ramírez in the Journal of Survey Statistics and Methodology reaches a similar conclusion using GSS survey data. They argue it is a bit of a Catch-22 where we rely on these coefficients because they are easy to calculate even though the math is often fundamentally uninterpretable for most psychological and sociological constructs.

The classic fix for this is the Heise 1969 framework. If you have three waves of data Heise showed you can algebraically separate reliability from stability using the three observed correlations. It is a neat trick but as I’ve dug into it the limitations are pretty glaring. It requires constant measurement precision across waves and a strict Markovian process for trait change. More importantly with only three waves these assumptions are mathematically untestable so you are basically just trading one set of blind assumptions for another.
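
The three-wave algebra is compact enough to sketch. Under the model, r12 = rel*s12, r23 = rel*s23, and r13 = rel*s12*s23, which inverts to the ratios below; the correlations here are made up for illustration:

    def heise_decomposition(r12, r23, r13):
        """Heise (1969): separate reliability from stability in 3-wave data.

        Assumes equal measurement reliability at each wave and a Markov
        (lag-1) process for true change, assumptions that are untestable
        with only three waves, as noted above.
        """
        reliability = r12 * r23 / r13   # shared measurement precision
        s12 = r13 / r23                 # true-score stability, wave 1 -> 2
        s23 = r13 / r12                 # true-score stability, wave 2 -> 3
        return reliability, s12, s23

    # Made-up correlations, for illustration only:
    print(heise_decomposition(r12=0.70, r23=0.68, r13=0.60))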

I am looking to move past the 1960s-era CTT math on this. I am wondering if anyone here has found success using more modern latent trait models or SEM-based approaches to reliably differentiate trait stability from measurement error. Specifically, I want to know how people are actually implementing Latent State-Trait models when they don't have massive multi-indicator datasets. Are there Bayesian or Dynamic SEM approaches that allow us to identify these components without needing a ridiculous number of waves? I would love to hear if there is a better modern standard I should be looking at that moves beyond the Heise framework.

My paper: https://journals.sagepub.com/doi/full/10.1177/01466216251401213

The Tufiş et al. 2024 paper: https://academic.oup.com/jssam/article/12/4/1011/7484622

reddit.com
u/CogitoErgoOverthink — 10 hours ago

Is a 1.5-year Master of Data Science worth it for someone with an Econometrics bachelor's? [E][C]

I have taken some CS electives in my undergrad (intro Python programming, databases, deep learning), and my econometrics major was more of an applied statistics major heavily focused on causality and time series.

The problem is that it was focused on the wrong things. I have basically never done a data science project in Python, only R for nearly every single unit (some SAS and Stata sprinkled in...). I know a lot about causality, non-stationary time series, and panel data, but have never used XGBoost in my life.

I do have an applied time series paper published, and worked as a research assistant for 1 year, also in time series.

I feel like the data science master's would fill in these technical gaps while also giving me further formal training. But I'm also scared I might be wasting my time (money isn't an issue).

Do you think it would be worth it for me?

reddit.com
u/GayTwink-69 — 21 hours ago
▲ 1 r/statistics+1 crossposts

PG in Agricultural Statistics

Hello everyone, I am in my first year of a BSc in Horticulture, and I am thinking of preparing for a PG in statistical sciences in agriculture. Can anyone guide me on whether it's a good idea and whether it has proper scope? I am very stressed about this; if someone can help me, please do...

reddit.com
u/Ayush___001 — 11 hours ago

[Question] Do you think homoscedasticity has been violated based on the attached output?

All other assumptions and tests have passed. It's for an RCBD with 6 treatments and 5 blocks. Water shape (the response) is a measurement of the "stability and performance" of water from a firefighting engine. [Residuals vs. fitted plot image]

u/centaurineb — 1 day ago

Q: [discussion] Was I being mansplained to? statistic math vs statistic theory?

hi, I got into a drunk argument last night. 21F (me) vs 21M. I need to know if I am secretly incompetent.

I am in an undergraduate engineering degree and have taken data science/statistics.

He is a self-appointed nerd who did take some relevant high school classes and has done outside reading.

he was claiming statistic science is different from statistic math, that science and math can exist without the other, etc. he kept throwing around words like “hypothesis, theory, statistics, scientific method, statistic method, etc.” in places where I have never heard them used, and where I am not sure they should be used (def non-academic verbiage)

I was arguing that math and science are 🤞, tightly woven together. Stats is literally woven into the scientific method. Science can’t have much validity without math. I guess my point was almost that science can’t exist without math.

I didn’t even consider statistic math and statistic science being different. Hell, it just sounds like different marketing to me.

I probably developed a mental block to this conversation early on and started arguing just to argue. I asked if he knew what R^2 was, and he said “that’s geometry.” and I said that it’s to compare groups in stats. and he said “well you have to know R to know what R squared is” … we were drunk.

reddit.com
u/ObjectBubbly3216 — 2 days ago
▲ 5 r/statistics+1 crossposts

[Education] Any good online courses to brush up on the US college equivalent of Statistics 1 or 2?

As the title says, I'm looking for online recommendations to brush up on some skills. I'm currently going into the third year of my Business Tech Analytics degree, and due to my AP credits from high school (Calculus BC), I never had to take Stats 1 or 2 and have admittedly touched statistics very little. I want to brush up, but I don't know where to start, and I believe an online course I could take over the summer would be a good way to kick things off before I begin my major concentration classes!

Additionally, any online courses to get a certification or start building a portfolio would be great! I've never started a portfolio, and believe that it would be good to start one now. Any recommendations for videos to watch on building a portfolio would be much appreciated.

(I happen to also be a musician, and seeing my creative portfolio bring me more gigs than my actual major is rather... concerning, as I don't have a portfolio started for my degree. Some of my business-analytics cohort have begun to use jargon that I'm not aware of, which is making me feel very behind!)

Please don't recommend taking a summer school course; I'm already looking into it, but with my financial situation, I'm not entirely sure whether that's possible!

Much appreciated, internet strangers!

reddit.com
u/SchemeDifferent3237 — 1 day ago

[C] How do I learn the technical skills required to use the degree I have?

I have an applied statistics and mathematics degree and have begun looking for a job. I'm realizing very quickly that most jobs require extensive knowledge of many programs and pieces of software I have never had to touch for any reason whatsoever.

I'm aware that maybe this makes me look like an idiot and that I should have been doing this throughout my undergrad, but that's why I'm here asking for advice. What is the best way to go about learning all of these pieces of software to the degree required to find a job? Or should I just start applying and expect that I'll be accepted as a new grad?

Software examples I need to learn: SQL, Tableau, Power BI, Pandas, SAS, Excel dashboards, etc.

I have some cursory knowledge of a few of these, like Python and Excel, but I would certainly not call myself an expert in any of them.

reddit.com
u/OkEntertainment9557 — 2 days ago

[Q] Normality assumption violated in Shapiro-Wilk — can I proceed with parametric tests? (Master's thesis, n=67)

Hey everyone,

I am working on my master's thesis and running into an issue with assumption testing before Pearson correlation and simple linear regression.

My sample is n=67, and I ran the Shapiro-Wilk test on my two composite variables (independent and dependent). Both came back significant:

Variable 1: W = 0.916, p < .001

Variable 2: W = 0.961, p = .034

So technically normality is violated according to the test. The Q-Q plots look fairly clean with only minor deviation at the lower tail.
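
For reference, here is roughly what those checks look like in Python; the vectors are placeholders, not the real data:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    # Placeholder composites; the real IV and DV scores (n = 67) go here.
    rng = np.random.default_rng(42)
    var1 = rng.normal(3.8, 0.5, 67)
    var2 = rng.normal(3.2, 0.7, 67)

    for name, v in [("Variable 1", var1), ("Variable 2", var2)]:
        w, p = stats.shapiro(v)                    # the test reported above
        print(f"{name}: W = {w:.3f}, p = {p:.3f}")
        stats.probplot(v, dist="norm", plot=plt)   # the corresponding Q-Q plot
        plt.title(f"Q-Q plot: {name}")
        plt.show()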

My questions are:

Is it acceptable to proceed with parametric tests given that Shapiro-Wilk is significant but the Q-Q plots look reasonable?

Is the Central Limit Theorem argument valid here given n=67?

Should I be switching to non-parametric alternatives instead?

This is a survey-based study using Likert scales in a management/organisational context if that matters. Any advice appreciated.

reddit.com
u/mohdd22 — 2 days ago
🔥 Hot ▲ 62 r/statistics

[D] feels like we abandoned proper joint probability modeling just because next-token prediction is easier to compute

Been thinking about the probabilistic foundations of the current ML meta and it feels kinda... backwards? We have this massive industry-wide fixation on autoregressive models right now, where we're just hammering conditional probabilities P(x_t | x_<t) to death.

But mathematically, if you want to capture the actual underlying distribution of complex, structured data, building a joint probability model makes way more sense. I was going over some literature on EBMs recently and it reminded me how elegant it is to model the unnormalized density directly: you define a scalar energy function, and lower energy simply equals higher probability. It maps so beautifully onto actual statistical mechanics and thermodynamics.

Obviously the partition function is a nightmare to compute in practice, and MCMC sampling is notoriously painful to scale compared to just running a simple forward pass in a transformer. But it honestly feels like we just threw our hands up and accepted greedy left-to-right sampling purely because it's easier to parallelize on current GPU architectures. Statistically speaking, it's such a brittle way to model global structure.
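
To make both the elegance and the pain concrete, a toy 1-D sketch (my own illustration, not from any particular paper):

    import numpy as np

    # EBM in one line: p(x) = exp(-E(x)) / Z, lower energy <=> higher probability.
    def energy(x):
        return (x**2 - 1.0)**2        # double well: modes at x = -1 and x = +1

    xs = np.linspace(-3, 3, 10_001)
    unnorm = np.exp(-energy(xs))

    # In 1-D the partition function Z is a cheap quadrature...
    Z = np.trapz(unnorm, xs)
    density = unnorm / Z
    print(f"Z = {Z:.4f}, density integrates to {np.trapz(density, xs):.4f}")

    # ...but the same integral over a high-dimensional x is intractable,
    # which is exactly why EBMs lean on MCMC and contrastive tricks at scale.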

Is anyone here actually doing research or applied stats with non-autoregressive probabilistic models lately? Or did the whole field just permanently capitulate to the genAI hype?

u/Crystallover1991 — 3 days ago

Teenage smoking [discussion]

Recently I saw a video on the most common teenage addictions in the USA, and I'd say smoking is at the top. But I did some research, and the percentage of teens who smoke or use a nicotine product [vape, pouches, cigarettes, etc.] is wayyy lower than I expected. That makes me think these statistics only count the teens who have been caught smoking, and that the percentage who smoke is way, way higher.

What are your guesses on the actual percentage of students? I think it could easily be 45-50%.

reddit.com
u/BBgunsandaviation — 2 days ago
▲ 2 r/statistics+1 crossposts

[Q] What effects to include in meta-analysis for papers with multiple estimates of same outcome?

I am a PhD student conducting a meta-analysis. I have already extracted the data from each paper and calculated standardized mean differences (SMDs) for all my outcomes. The next step is the actual meta-analysis, which will be a robust, random effects meta-analysis (robumeta in Stata) to account for differences in study settings and multiple outcomes per paper. Here lies my question. I am unsure which effect sizes to include in the analysis. Most papers measure the outcome variable as an overall index as well as separated into sub-indices. Some papers use a different estimation method as robustness checks or carry out sub-group/heterogeneity analysis. Others create the index in two different ways and calculate effects for both. Lastly, some papers estimate the effect on the main outcome for multiple follow-up periods separately as well as pooled over all periods.
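
As an aside for readers new to the extraction step mentioned above, a common SMD variant is Hedges' g with the usual small-sample correction; a sketch with invented summary statistics:

    import math

    def hedges_g(m1, sd1, n1, m2, sd2, n2):
        """Standardized mean difference with Hedges' small-sample correction."""
        s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
        d = (m1 - m2) / s_pooled
        j = 1 - 3 / (4 * (n1 + n2) - 9)   # correction factor J
        return d * j

    # Invented treatment-vs-control summary stats from one hypothetical paper:
    print(f"g = {hedges_g(52.1, 9.8, 120, 48.3, 10.4, 115):.3f}")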


Would I include all potential estimates from a paper in the meta-analysis? For example, the overall index and the sub-indices, the outcomes for the overall sample as well as for the sub-samples, and so on. (My analysis includes 28 papers, if that is relevant.)

Thank you very much for your help! If you can refer me to further reading about this, I would also be very happy!

reddit.com
u/Luuk0417 — 1 day ago
▲ 1 r/statistics+1 crossposts

[D] The Star-Rating Dilemma: A simple mathematical model for when "more stars" collides with "fewer votes"

Picture a familiar choice when looking at reviews:

  • Option A: 4.0★, 100 ratings
  • Option B: 4.5★, 10 ratings

Option B has a higher average, but the lower number of ratings makes it less trustworthy. We all intuitively discount ratings with small sample sizes, but I wanted to make this intuition explicit and tunable without relying on complex Bayesian priors right out of the gate.

I wrote up a tiny article using three short functions to formalize this trade-off. Here is a small excerpt of the approach:

1. Normalise the rating. First, map any rating scale (like 1 to 5 stars) linearly onto a [0,1] scale.

2. Calculate confidence from the vote count. We need a function c(n) that takes a vote count n and returns a confidence value in [0,1). It should have diminishing returns. While an exponential or arctangent function works, a simple rational function is the most conservative:

c(n) = n / (n + c_h)

Here, c_h is the "half-confidence point" — the number of reviews at which you consider the rating exactly 50% trustworthy. (For Google Maps, maybe c_h = 50).

3. Merge both via a risk-aversion parameter (ρ). Instead of just multiplying rating by confidence, we can weight them based on a risk-aversion parameter ρ:

V = (r + ρ * c(n)) / (1 + ρ)

  • ρ = 0: Pure star-gazing (risk-seeking). Vote count is ignored.
  • ρ → ∞: Maximum caution. Only sample size matters.

The Tipping Point: When you map this out, even mild risk aversion (weighting sample size at roughly a fifth of star quality) is enough to flip the lead to Option A. The 0.5-star advantage of Option B simply cannot overcome its confidence deficit.
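
A compact version of the three functions in Python (the crossover ρ depends on both c_h and the normalisation, so the printed numbers are illustrative only):

    def normalise(stars, lo=1.0, hi=5.0):
        """Map a rating linearly onto [0, 1]."""
        return (stars - lo) / (hi - lo)

    def confidence(n, c_h=50):
        """c(n) = n / (n + c_h): 0 votes -> 0, c_h votes -> 0.5, asymptote 1."""
        return n / (n + c_h)

    def value(stars, n, rho, c_h=50):
        """V = (r + rho * c(n)) / (1 + rho)."""
        r = normalise(stars)
        return (r + rho * confidence(n, c_h)) / (1 + rho)

    # Option A: 4.0 stars, 100 ratings vs Option B: 4.5 stars, 10 ratings.
    for rho in (0.0, 0.25, 1.0):
        print(rho, round(value(4.0, 100, rho), 3), round(value(4.5, 10, rho), 3))
    # With this normalisation and c_h = 50, the lead flips to A just past rho = 0.25.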

AI chatbots tell me the classic statistical approach to this is a Bayesian average (shrinking low-sample averages toward a global mean), but I am not familiar with the concept, and I liked the transparency of keeping the critical review count and the risk aversion as distinct, tuneable pieces.

I'd love to hear this community's critiques on this model. How would you improve it? Do you see any specific flaws?

If you want to see the full story, the visual graphs for the tipping points, the sensitivity analysis of c_h, and a workaround for handling vote counts spanning several orders of magnitude, you can read my full article here.

https://frequently-asking-questions.com/2026/04/18/the-star-rating-dilemma/

reddit.com
u/Meduty — 1 day ago
▲ 6 r/statistics+3 crossposts

TSEDA, a tool for exploring time series data

The following is a tool I created for analyzing regularly sampled time series data. It uses a technique called Singular Spectrum Analysis (SSA): it slides a window through the data and then uses the SVD to analyze the patterns.
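
For readers who haven't met SSA, here is a bare-bones numpy sketch of the idea (a generic illustration only, not tseda's API):

    import numpy as np

    def ssa_components(x, window, k=3):
        """Decompose series x into its k leading SSA components.

        Builds the trajectory (Hankel) matrix from a sliding window,
        takes its SVD, and maps each rank-1 piece back to a series
        by averaging the anti-diagonals.
        """
        n = len(x)
        cols = n - window + 1
        traj = np.column_stack([x[i:i + window] for i in range(cols)])
        u, s, vt = np.linalg.svd(traj, full_matrices=False)
        comps = []
        for j in range(k):
            rank1 = s[j] * np.outer(u[:, j], vt[j])
            # Average each anti-diagonal to recover a series of length n.
            comp = np.array([np.mean(rank1[::-1].diagonal(i - window + 1))
                             for i in range(n)])
            comps.append(comp)
        return comps

    # Toy series: trend + seasonality + noise.
    t = np.arange(200)
    x = 0.02 * t + np.sin(2 * np.pi * t / 12) \
        + 0.3 * np.random.default_rng(0).normal(size=200)
    trend_like, *rest = ssa_components(x, window=24)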

The package is here:
https://github.com/rajivsam/tseda

A brief SSA primer is here:

https://rajivsam.github.io/r2ds-blog/posts/markov_analysis_coffee_prices/

A note about using the tool is here:

https://rajivsam.github.io/r2ds-blog/posts/tseda%20announcement/

This is a fairly common data type; if you have data like this and would like to try the tool to see if it helps you, I would appreciate any feedback.

Thanks

u/rsambasivan — 2 days ago