u/Mathyato_

Hello,

I am currently performing a GWAS and am at the quality control stage, more precisely at the "ancestry" analysis. My goal is to select a homogeneous subpopulation to prevent population stratification during the subsequent statistical analysis.

To achieve this, I followed the plinkQC tutorial titled "Training a Random Forest Classifier for Population Structure Identification", using the HapMap Phase III dataset (as suggested in the tutorial).

https://meyer-lab-cshl.github.io/plinkQC/articles/AncestryCheck.html

I trained my model using 77 individuals per subpopulation, which corresponds to the size of the least represented group (MXL).
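For reference, the balanced downsampling described above can be sketched roughly like this. It is plain Python, not the tutorial's code. The function name `balanced_sample` is made up for illustration, and the group sizes are toy numbers: only the MXL count of 77 comes from my data.

```python
import random
from collections import defaultdict


def balanced_sample(labels, seed=0):
    """Downsample every population to the size of the smallest one.

    labels: dict mapping sample ID -> population label.
    Returns a dict mapping population -> sorted list of sampled IDs.
    """
    by_pop = defaultdict(list)
    for sample_id, pop in labels.items():
        by_pop[pop].append(sample_id)

    # Size of the least represented group (MXL in my case).
    n_min = min(len(ids) for ids in by_pop.values())

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {pop: sorted(rng.sample(ids, n_min)) for pop, ids in by_pop.items()}


# Toy group sizes: only MXL = 77 is from the post, the rest are placeholders.
toy_counts = {"CEU": 165, "YRI": 150, "MXL": 77}
toy_labels = {f"{pop}_{i}": pop for pop, n in toy_counts.items() for i in range(n)}

balanced = balanced_sample(toy_labels)
print({pop: len(ids) for pop, ids in balanced.items()})
```

Every population ends up with the same number of training individuals, at the cost of discarding samples from the larger groups.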

https://preview.redd.it/f6ved33thl0h1.png?width=564&format=png&auto=webp&s=d815f571391c0ddcc3fcc7cc47d7e2ae5e0bc18d

I chose this approach to avoid class imbalance, which could bias the classifier. However, the estimated OOB (out-of-bag) error rate after training is 22.67%, which seems too high for my purpose (I intend to select the CEU subpopulation).
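Since I only intend to keep CEU, the error for that class alone may matter more than the overall OOB figure. Below is a minimal sketch of how per-class error can be read off a confusion matrix. The function names are hypothetical and the counts are invented toy values, not the numbers from my model.

```python
def per_class_error(confusion):
    """confusion[true_label][predicted_label] = count.

    Returns a dict mapping each true label to its misclassification rate.
    """
    errors = {}
    for true_label, row in confusion.items():
        total = sum(row.values())
        correct = row.get(true_label, 0)
        errors[true_label] = 1 - correct / total if total else float("nan")
    return errors


def overall_error(confusion):
    """Fraction of all samples that were assigned to the wrong class."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    return 1 - correct / total


# Invented toy counts (77 individuals per class), NOT my real results.
toy_confusion = {
    "CEU": {"CEU": 70, "TSI": 7},
    "TSI": {"CEU": 10, "TSI": 67},
    "YRI": {"YRI": 77},
}

class_errors = per_class_error(toy_confusion)
total_error = overall_error(toy_confusion)
print(class_errors, total_error)
```

In this toy case most of the error comes from two closely related groups being confused with each other, which is the kind of pattern I would want to check for in the real confusion matrix.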

https://preview.redd.it/ptdx80mvhl0h1.png?width=652&format=png&auto=webp&s=50d63b8bcc84d1053e0f22c76e0aeb9096b1a5c3

To improve accuracy, I have explored several approaches:

- Principal Component Analysis: I observed that the accuracy of my model increases as I include more PCs.

https://preview.redd.it/meb314rmhl0h1.png?width=2880&format=png&auto=webp&s=d7f840f96358c75b62a9276d75d4a2c1b4aa2dd9

- Sampling Strategy: Using an equivalent proportion per subpopulation rather than a fixed count to maximize the total number of individuals used for training.

- Reference Panel Upgrade: Replacing HapMap III with 1000 Genomes Project Phase III data, which offers a significantly larger sample size (this is my current focus).
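To make the sampling-strategy idea above concrete, here is a rough sketch of drawing the same proportion from each subpopulation instead of a fixed count. The function name `proportional_sample`, the 0.8 fraction, and the group sizes are all illustrative assumptions (only MXL = 77 is real).

```python
import random
from collections import defaultdict


def proportional_sample(labels, fraction, seed=0):
    """Draw the same fraction of individuals from every population.

    labels: dict mapping sample ID -> population label.
    fraction: share of each population to keep for training (0 < fraction <= 1).
    Returns a dict mapping population -> sorted list of sampled IDs.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    by_pop = defaultdict(list)
    for sample_id, pop in labels.items():
        by_pop[pop].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = {}
    for pop, ids in by_pop.items():
        n = max(1, round(fraction * len(ids)))  # keep at least one individual
        sampled[pop] = sorted(rng.sample(ids, n))
    return sampled


# Toy group sizes: only MXL = 77 is from the post, the rest are placeholders.
panel_counts = {"CEU": 165, "YRI": 150, "MXL": 77}
panel_labels = {f"{pop}_{i}": pop for pop, n in panel_counts.items() for i in range(n)}

train_sets = proportional_sample(panel_labels, fraction=0.8)
print({pop: len(ids) for pop, ids in train_sets.items()})
```

This uses more individuals in total than the fixed count of 77, but the training classes are no longer the same size, so the class imbalance I was trying to avoid comes back to some extent.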

My questions:

1 - Would using 1000 Genomes Phase III data significantly improve the classifier's accuracy compared to HapMap III?

2 - Are there other reference datasets available that might further enhance the model's accuracy?

3 - Is using a proportion of individuals per subpopulation rather than a fixed count considered a valid practice, and does it effectively improve accuracy?

Note: I should clarify that I am not an ML engineer; I am a second-year Master's (M2) student in bioinformatics. My ultimate objective is to identify variants associated with a specific population through statistical analysis, rather than to achieve a perfectly optimized classifier. While I understand that QC is the most critical stage of a GWAS, unfortunately my current deadlines do not allow me to spend excessive time on this specific step. Thank you for taking this into consideration in your response!

Random Forest Classifier Training for population structure identification QC in a GWAS analysis