r/bioinformatics

▲ 69 r/PhD+2 crossposts

My PhD has been unimpressive and I don’t feel confident about my work

I’m at the end of my 4th year in a 5-year bioinformatics program. All I’ve accomplished is one paper that’s still under review at a very low impact journal. I have 3 other projects that are at a weird stage but are also not very novel or impressive, just things I did because my PI made me do them. I have so many projects that went to the project graveyard because collaborators/undergrads working on them ghosted us, or because my PI’s goals changed.

I’m thinking about my next career steps. I want to do a postdoc, but my PhD has been so unimpressive, and I also want to pivot into a different field, so who the hell is going to hire me? Forget about an industry position, there’s no way I’ll be able to get one.

I’m trying to think about my PhD as just the beginning of my career, but I feel like I’ve had such a poor foundation and haven’t built the skill set I wanted to.

u/pickleeater58 — 22 hours ago

▲ 4 r/bioinformatics

Generating a GTDB-based database for EMU classification of microbiota 16S rRNA gene sequencing

Hey everyone.

I work with the microbiota of human samples, primarily feces and urine, but also skin and other biological niches. For this, we are using Nanopore sequencing targeting the 16S rRNA gene (27F - 1391R primers).

To determine the taxonomy of the sequences, we are using EMU. However, the database included in the package seems a bit old, so I am in the process of preparing a new database for the EMU pipeline, using GTDB 226 as a reference.

My steps so far (briefly):

Downloaded and unzipped the ssu_all_r226.fna.gz and bac120_taxonomy_r226.tsv.gz files
Created fasta file from the .fna file.
Filtered short (<1100 bp) and long (>1800 bp) sequences from the fasta file.
Deduplicated sequences using seqkit
Ensured that the taxid of the taxonomy files matched the fasta files
Combined taxa that are difficult to distinguish from each other using 16S rRNA gene sequencing.
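The length filter and deduplication steps above can be sketched in plain Python. This is only an illustration, not the poster's actual seqkit commands: the function names are made up for this example, and it keeps the first header seen for each distinct sequence, which may not be what you want when identical sequences carry different taxonomy.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_and_dedupe(
    records: Iterable[Tuple[str, str]],
    min_len: int = 1100,  # thresholds taken from the post
    max_len: int = 1800,
) -> List[Tuple[str, str]]:
    """Keep sequences with min_len <= length <= max_len, drop exact duplicates."""
    seen = set()
    kept = []
    for header, seq in records:
        seq = seq.upper()  # treat lower/upper case as the same sequence
        if not (min_len <= len(seq) <= max_len):
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept
```

For a real database you would read the lines from the unzipped .fna file, for example `filter_and_dedupe(parse_fasta(open(path)))`, where `path` is wherever you stored it.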

After assigning taxonomy, there will be multiple versions of e.g. E. coli in the database, due to small variations in reported sequences. So after assigning taxonomy, I usually group by species identity.
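Grouping by species identity after classification might look like the sketch below. It assumes you already have a mapping from database entry to relative abundance and a mapping from entry to species name; both input shapes and the function name are hypothetical, not EMU's own output format.

```python
from collections import defaultdict
from typing import Dict


def collapse_to_species(
    abundances: Dict[str, float],
    taxonomy: Dict[str, str],
) -> Dict[str, float]:
    """Sum abundances of all database entries that share a species name."""
    totals: Dict[str, float] = defaultdict(float)
    for entry_id, value in abundances.items():
        # Entries missing from the taxonomy table are pooled, not dropped.
        species = taxonomy.get(entry_id, "unassigned")
        totals[species] += value
    return dict(totals)
```

The same function works for merged groups (for example a combined "Bacillus subtilis group" label) if the taxonomy mapping already points the member species at the group name.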

I have tried using the database for classifying a few mock communities, as well as biological samples that we have previously sequenced. So far it seems okay, although we do seem to get a few more low-abundance species. I expect some of it is related to problems with taxa that should be grouped.

My questions for the rest of you are therefore:

Are there any essential steps that I have missed?
I have tried to ask and look around for which bacterial species are hard to distinguish using 16S rRNA gene sequencing. Some I have found:
- Bacillus subtilis group: Contains B. subtilis, B. spizizenii, B. halotolerans, B. atrophaeus. I can also see this with our mock controls.
- Escherichia / Shigella. I have seen arguments that it can be difficult to distinguish Escherichia species from Shigella species using the 16S rRNA gene, but I have also seen multiple groups that manage to distinguish species from the two genera. What is your experience?
- Bifidobacterium longum vs B. infantis vs B. suis
- Streptococcus mitis vs oralis vs pneumoniae

Thank you!

u/Illustrious_Yard_813 — 5 hours ago

▲ 28 r/bioinformatics

PI wants to create a pipeline app for single cell, help, I’m a lowly undergrad.

Hi, I’m an undergrad learning bioinformatics, specifically single-cell analysis, as part of building a pipeline for my PI. He has no background in it and I’m teaching myself everything.

Part of the project is that he wants to build a UI/app that lets the lab plug in certain parameters and pump out a graph like a UMAP or t-SNE. Essentially, standardizing it for easy use.

The problem is that, from what I’ve learned, the analysis is a bit more complicated than adjusting a few parameters with a dropdown. I don’t know much, but I believe a t-SNE embedding fitted on one data set cannot be applied to a different data set because the method is non-parametric. I brought this up to him and he said that they have set seeds and I can set the seed to be the same.

I kinda know what that means but kinda don’t. I have a vague idea of dimensionality reduction, eigenvectors, etc.

Would making an app/internal pipeline be possible with these kinds of things? Wouldn’t it require a person to actually handle the data, or code to specify it per data set?
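One way to picture the "parameters in a dropdown" idea: a fixed seed makes a run repeatable on the same input, but the settings still have to be checked against each data set. The sketch below is a hypothetical parameter object plus a sanity check an app could run before plotting. The class, field names, and default values are illustrative assumptions, not taken from any particular single-cell package.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmbeddingParams:
    method: str = "umap"      # "umap" or "tsne"
    n_pcs: int = 30           # principal components fed to the embedding
    n_neighbors: int = 15     # neighbourhood size, used by UMAP
    perplexity: float = 30.0  # effective neighbourhood size, used by t-SNE
    seed: int = 0             # makes a run repeatable on the SAME data only


def check_params(params: EmbeddingParams, n_cells: int, n_genes: int) -> List[str]:
    """Return human-readable problems; an empty list means the settings fit."""
    problems = []
    if params.method not in ("umap", "tsne"):
        problems.append(f"unknown method: {params.method}")
    if params.n_pcs >= min(n_cells, n_genes):
        problems.append("n_pcs should be smaller than both cell and gene counts")
    if params.method == "tsne" and params.perplexity >= n_cells:
        problems.append("perplexity should be smaller than the number of cells")
    if params.method == "umap" and params.n_neighbors >= n_cells:
        problems.append("n_neighbors should be smaller than the number of cells")
    return problems
```

A check like this covers only the mechanical part. Choices such as QC thresholds or clustering resolution would still need someone to look at each data set.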

u/Pristine_Temporary67 — 20 hours ago

▲ 4 r/bioinformatics

I never heard about this - money for spotting publication errors?

https://www.science.org/content/article/offering-scientists-cash-spot-errors-published-papers-doesn-t-work?utm_source=sfmc&utm_medium=email&utm_content=alert&utm_campaign=DailyLatestNews&et_rid=71602376&et_cid=5922407

Did anyone on here participate? Seems like their new model might be more tantalising, but I'd totally dig the cash

u/AsparagusJam — 9 hours ago

▲ 2 r/bioinformatics

Need help with Discovery Studio analysis of post-docking results

I'm fairly new to molecular docking and I learnt about analysis of receptor-ligand interactions through a YouTube tutorial, but the result I'm getting is quite different from the one I saw in the tutorial. What I got seems to be a "simple" diagram, and the one in the tutorial seems to be a "schematic" diagram.

What I need to know: is the one that I got accurate, or should I try to make it into a schematic diagram? My PI did ask for ligand-receptor interactions, but I don't know if he wanted them in 2D or 3D.

The docking was done with AutoDock 4.2 and the ligand was obtained through IEDB (B-cell epitope prediction).

u/Thick_Weird6363 — 8 hours ago

▲ 3 r/bioinformatics

Exploring ways to reduce bioinformatics cloud costs + friction — would love input

Hi all — I used to work in bioinformatics at the Broad Institute and MIT, and recently started working on a project around improving access to large public datasets.

One thing I kept running into was how much time and cost goes into just getting the data locally (especially with S3/egress), before you can even start analyzing.

I’ve been experimenting with ways to access and work with these datasets in-place (without downloading), and would love to sanity check whether this is actually a pain point for others here.

Curious:

how are people currently handling large public datasets?
are you mostly downloading locally, or working directly in the cloud?
any workflows you’ve found that reduce friction/cost?

Happy to share more about what I’ve been building if useful — mainly just trying to learn from how others are approaching this.

u/Acceptable-Ad-2904 — 12 hours ago

▲ 1 r/bioinformatics

Issues with RNA Velocity Analysis Between Subpopulations of One Cell Type

I am working on an RNA velocity analysis for one cell type which has 4 different subpopulations (based on whether they are high or low expression aka +/- for 2 different genes). My PI believes these genes are important based on wet lab experiments.

I'm following the scVelo tutorial to do this but my trajectories and positions are all over the place.

I tried playing around with the number of highly variable genes (below is 2000), I did basic filtering, and my unspliced counts are within the 10-25% they recommend. I also only have 1000 cells, so perhaps this is an issue, but I can't fix this part as we were given this data. Any other ideas I can try?
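Since the four subpopulations are small, it may be worth checking the unspliced proportion per subpopulation and not only overall. The sketch below assumes you have already summed spliced and unspliced counts per group; the function names and input shape are hypothetical, and the 10-25% window is the one quoted in the post.

```python
from typing import Dict, Tuple


def unspliced_fraction(spliced_total: float, unspliced_total: float) -> float:
    """Fraction of all counts (spliced + unspliced) that are unspliced."""
    total = spliced_total + unspliced_total
    if total <= 0:
        raise ValueError("no counts in this group")
    return unspliced_total / total


def flag_groups(
    counts: Dict[str, Tuple[float, float]],  # group -> (spliced, unspliced)
    low: float = 0.10,
    high: float = 0.25,
) -> Dict[str, float]:
    """Return groups whose unspliced fraction falls outside [low, high]."""
    flagged = {}
    for group, (spliced, unspliced) in counts.items():
        frac = unspliced_fraction(spliced, unspliced)
        if frac < low or frac > high:
            flagged[group] = frac
    return flagged
```

A group that falls well outside the window while the overall number looks fine would be one thing to rule out before tuning other settings.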

Sorry if this is a strange question but I am happy to answer any clarifying questions as well. Thank you guys in advance.

https://preview.redd.it/v433c9tkjptg1.png?width=912&format=png&auto=webp&s=f9056ef974b1dfd3ecb9ce69e9f680c918e26f64

However when I try an RNA velocity tutorial from scVelos

u/MiserableAd5989 — 7 hours ago

▲ 1 r/bioinformatics

How would you build a local PubMed/PMC-style search + QA system over a private local corpus?

I have a large local PMC/PubMed corpus on SSD and want to build a fully local system on my workstation that behaves somewhat like PubMed search, but can also answer questions over the local corpus with grounded references.

Hardware: RTX 5090, Ryzen 9 9950X3D, 96 GB RAM.

I already have the corpus parsed locally and partially indexed.

If you were building this today, what exact local setup would you use for:

retriever
reranker
local LLM
FAISS or something else
framework vs fully custom pipeline
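For the retriever item, a plain lexical baseline is often a useful reference point before adding dense retrieval or a reranker. Below is a minimal BM25 sketch in pure Python. It is an illustrative baseline rather than a recommendation of a specific stack; the class name is made up, the `k1` and `b` defaults are common textbook values, and a real corpus of this size would need an on-disk index.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.doc_ids = list(docs)
        self.tf = [Counter(tokenize(docs[d])) for d in self.doc_ids]
        self.lengths = [sum(tf.values()) for tf in self.tf]
        self.avg_len = sum(self.lengths) / len(self.lengths) if self.lengths else 0.0
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tf for term in tf)
        self.k1, self.b = k1, b

    def idf(self, term: str) -> float:
        n, df = len(self.doc_ids), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        terms = tokenize(query)
        scored = []
        for i, tf in enumerate(self.tf):
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f == 0:
                    continue
                norm = f + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
                score += self.idf(term) * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((self.doc_ids[i], score))
        return sorted(scored, key=lambda pair: -pair[1])[:top_k]
```

Returning document IDs with scores keeps the grounding step simple: whatever the LLM answers can cite the IDs of the passages it was given.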

I’m especially interested in responses from people who have actually built a local biomedical literature search / RAG system.

Thank you

u/snurss — 13 hours ago

▲ 1 r/bioinformatics

AI applied to biology research // Advice for a high school student

Hi everyone,

I’ve been researching this topic quite a bit, asked ChatGPT and talked to a few biologists (though not bioinformaticians), but I still feel like I’m not getting clear answers. So I wanted to ask here.

I’m an 18 year old high school senior and I’ve recently become very interested in work like AlphaFold, RFdiffusion, and scGen, where AI/ML is applied to biological problems. I’m still exploring what exactly this field looks like, but I’m generally interested in the intersection of AI and complex biological systems. So I wanted to ask a few questions:

Do you think this field (AI applied to biology) is truly the future, or is there a significant hype element right now?
I often see different terms used inconsistently. Would the area I’m describing fall under bioinformatics, computational biology or something else entirely?
A biology professor I spoke to mentioned that bioinformatics might become less valuable due to AI automation, and that entering the field could be risky. Do you agree with that perspective?
I’ve been accepted to study Computer Science at ETH Zürich, which I’m currently planning to attend. For this career path, would a CS degree be the right choice or would something like molecular biotechnology, biochemistry, or biology be a better foundation?
Finally, I’d also like to start actively building skills in this area, but I’m not sure where to begin. I already have a background in Python and some machine learning, but I’ve never really worked with biological data and to be honest, I find the biology side a bit difficult to grasp at times. It also seems like an incredibly broad field ranging from single cell genomics to protein structure prediction, systems biology, etc. Because of that, I've been wondering whether it makes more sense to pick up the biology “along the way” rather than trying to learn it all upfront. I don't really know. Where would you recommend starting? Would trying to reproduce papers I find interesting be a good approach or is that maybe too advanced at this stage?

No need to answer all of the questions, I’d really appreciate any insights or experiences you can share. Thanks a lot!

u/Low-Relationship6865 — 22 hours ago

▲ 0 r/bioinformatics

Contigs filtering by length in shotgun sequencing data

Hi all!

I was wondering what parameters you use for filtering your contigs after assembly. I have been trying to find some sort of agreement on how much to filter, but it seems it's not really standardised. I have high fragmentation (which I expected, considering my samples come from soil), and QUAST shows my N50 is around 1500 bp, L50 is 400000 contigs, and auN is around 7000. (This is for my MEGAHIT co-assembly.)

I decided to go for 2000bp length filtering as from what I was reading, contigs below 1000bp are likely artifacts/low quality. However, this leaves me with around 4-5% of the total contigs (and about 25-28% of the bases). I am really torn here as I don't know whether these numbers make sense and this is expected/normal, or if I should relax the filtering.
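To see how a cutoff relates to these statistics, here is a small sketch that computes N50 and the share of contigs and bases kept at a given minimum length. N50 is the length at which contigs of that size or longer hold at least half the bases, so any cutoff above the N50 keeps under half the bases of the same contig set. The function names are illustrative, and the input is just a list of contig lengths.

```python
from typing import List, Tuple


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("no contigs")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # not reached for positive lengths


def retained(lengths: List[int], min_len: int) -> Tuple[float, float]:
    """Return (fraction of contigs kept, fraction of bases kept) at min_len."""
    kept = [x for x in lengths if x >= min_len]
    return len(kept) / len(lengths), sum(kept) / sum(lengths)
```

Running `retained` over a few candidate cutoffs (for example 1000, 1500, 2000) on the real contig lengths gives a quick view of what each choice costs in bases.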

Thanks!

u/Asleep_Shoulder_9426 — 3 hours ago

▲ 0 r/bioinformatics+1 crossposts

Familial DOCK4 nonsense mutation (p.Arg896Ter) with dominant inheritance across multiple children - looking for similar cases

Not seeking medical advice, searching for similar cases. I’m hoping to connect with anyone who may have encountered a rare DOCK4 mutation, like the one that has presented in our family.

Our family recently identified the following variant through genetic testing:

Gene: DOCK4

Transcript: NM_001363540.2

Variant: c.2686C>T

Protein change: p.Arg896Ter

Genomic location (GRCh38): chr7:111844813 G>A

Zygosity: Heterozygous

Inheritance pattern observed: Dominant (maternal transmission)

This variant introduces a premature stop codon at amino acid position 896, truncating the DOCK4 protein.

DOCK4 is a large protein (~2000 amino acids), so this mutation removes more than half of the protein, including downstream functional regions such as the DHR2 catalytic domain, which is involved in Rac signaling and cytoskeletal regulation.

Because of this, the mutation likely represents a loss-of-function variant.
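The position arithmetic in this description can be checked with a few lines. This is only a consistency check of the numbers quoted above: the ~2000 amino acid length is the post's approximation, and the function names are made up for this example.

```python
def codon_number(cdna_pos: int) -> int:
    """1-based codon index for a 1-based coding (c.) position."""
    return (cdna_pos - 1) // 3 + 1


def position_in_codon(cdna_pos: int) -> int:
    """1, 2 or 3: which base of its codon the coding position falls on."""
    return (cdna_pos - 1) % 3 + 1


def fraction_lost(stop_codon: int, protein_length: int) -> float:
    """Share of residues missing when translation stops at stop_codon."""
    residues_made = stop_codon - 1
    return (protein_length - residues_made) / protein_length


# c.2686 is the first base of codon 896, matching p.Arg896Ter.
print(codon_number(2686), position_in_codon(2686))
# With the approximate length of 2000 residues used in the post:
print(round(fraction_lost(896, 2000), 4))
```

With those inputs the truncation removes a bit over half of the residues, in line with the "more than half" statement.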

⸻

So far the variant has been identified in multiple members:

• I carry the mutation (maternal carrier)

• My identical twins both tested positive

• My 13-year-old, who has an autism diagnosis, also tested positive, but he also was found to have a paternal AUTS2 mutation.

• Our 5-year-old is currently undergoing testing

This pattern suggests autosomal dominant inheritance with variable expression, possibly?

To someone who is not familiar with us, the facial structure appears normal, but there are specific features that are different. You wouldn’t notice anything in particular unless you knew. Flatter forehead, more straight sides, and specific eye set.

So far my 13 yr old son is 6’ tall, 160lbs, and wears a size 12 shoe. My twins are 2 years old and have remained in higher percentile for height. So now I’m curious because I’m only 5’7 so nothing wild with my height.

The other thing noticed is I have 2 fully formed extra ribs coming off of C7 and connecting fully in the front. No x-rays yet to confirm if my children with this mutation also have it.

⸻

Why I’m posting

Our clinical team at Nationwide Children’s Hospital indicated that this exact variant appears to be extremely rare and never documented. The other DOCK4 mutations in medical journals have been de novo cases and not the same exact mutation. I have submitted the case to a rare-variant registry to help researchers potentially identify additional families.

However, because rare variants often go unrecognized, I’m trying to see whether:

• Other families may have DOCK4 mutations exactly like this one not found in medical journals?

• Anyone has observed familial inheritance patterns involving DOCK4?

• If anyone has this exact mutation and any health conditions?

DOCK4 sits in the chromosome 7q31 region, which has been discussed in literature related to neurodevelopmental conditions, but the clinical significance of individual variants still appears to be evolving.

⸻

Looking to connect with

• Geneticists or researchers studying DOCK family proteins

• Families with this DOCK4 variant identified through genetic testing

• Clinicians who may have encountered similar cases

If anyone has encountered similar variants or is researching DOCK4, I’d really appreciate hearing from you.

#DOCK4 #RareVariant #HumanGenetics #GeneticsResearch #RareDisease #RareMutation #UndiagnosedDisease #Neurogenetics #AutismResearch #GeneMatcher #VariantResearch #PrecisionMedicine

u/HZLIZS — 22 hours ago

▲ 0 r/bioinformatics+1 crossposts

is it actually worth it by 2028? I need advice.

Hi! I'm a 15 year old high schooler from India, currently in Class 11 CBSE board. I have taken Physics, Chemistry, Biology, Mathematics, and Computer Science as my subjects, and I'm considering Bioinformatics as my future career path.

I have researched this multiple times and talked with multiple people about it. Most of them are saying that fields like tech and finance are going to be heavily taken over as AI keeps developing more and more. So basically I want to find the safest career by 2030, and from whatever I have read and heard, Bioinformatics seems like a good option. But I also want to hear from more people and get multiple perspectives on this, so here I am.

Here's where I'm at right now:

Preparing for the SAT, aiming to apply to US colleges

Already talking to a consultant about profile building, research papers, internships, the whole deal

Planning to graduate in 2028 (Class 12)

My main questions:

Is Bioinformatics actually a smart long term bet or is it overhyped right now?

By 2028, will the job market be strong?

What will compensation be, especially if I study in the US and want to work in biotech or pharma?

Is Physics, Chemistry, Biology, Mathematics, and Computer Science a solid foundation for this?

u/zenizzzzz — 22 hours ago

▲ 0 r/bioinformatics

Has anyone tested RStudio and programs like SLiM 3 on MacBook Neo?

After some research, the 8 GB of RAM is definitely disappointing for a student-oriented affordable laptop. I was looking for something optimized and new as I head into a PhD program. My previous MacBook Pro died on me last week, and I was looking for something affordable.

Has anyone tested out the performance of these programs on a Neo by any chance? I’m not very informed on laptops and computer performance, but I heard so many good things about the Neo and feel a bit disappointed that it might not be up to par for bio work. In case it helps, I will probably be working on a Drosophila genomics dissertation.

u/periodt-bitch — 15 hours ago

▲ 0 r/bioinformatics

Workflows for handling large bioinformatics datasets without high cloud/egress costs?

Hi all — I used to work in bioinformatics at the Broad Institute and MIT, and recently started working on a project around improving access to large public datasets.

One thing I kept running into was how much time and cost goes into just getting the data locally (especially with S3/egress), before you can even start analyzing.

I’ve been experimenting with ways to access and work with these datasets in-place (without downloading), and would love to sanity check whether this is actually a pain point for others here.

Curious:

how are people currently handling large public datasets?
are you mostly downloading locally, or working directly in the cloud?
any workflows you’ve found that reduce friction/cost?

Happy to share more about what I’ve been building if useful — mainly just trying to learn from how others are approaching this.

u/Public_Jellyfish2328 — 12 hours ago

▲ 0 r/bioinformatics

scGPT embeddings

What is the difference between the embedding modes 'cls' and 'cell'? Which should I use for cell-type annotation?

u/Economy-Brilliant499 — 1 hour ago

▲ 0 r/bioinformatics

Confused after 2 NEET drops — interested in biology + research, considering bioinformatics. What should I do?

My_qualifications: 12th (PCB), 2-year NEET dropper, interested in research careers

I’ve taken two drops for NEET and it’s unlikely I’ll clear this year. I genuinely like biology, but I’m more interested in research than MBBS. Right now I’m confused about what path to take.

Some context:

- Weak in maths earlier (8th–9th), but improved in 10th with a good teacher

- I perform well with the right guidance/environment

- Middle-class background, so finances matter (especially for abroad plans)

Currently considering:

- BSc Bioinformatics → MSc abroad

- BSc Biotechnology → MSc Bioinformatics

- Any other better alternatives

My questions:

- Is bioinformatics actually worth it in India (job prospects + salary reality)?

- Which path is better: direct bioinformatics vs biotech → bioinformatics?

- Is MSc abroad realistic for a middle-class student? What should I start doing now?

- Are there better career options in biology/research that I should consider?

Looking for practical advice from people in these fields.

u/Honest-Sympathy964 — 1 hour ago
