Generating a GTDB-based database for EMU classification of microbiota 16S rRNA gene sequencing
Hey everyone.
I work with the microbiota of human samples, primarily feces and urine, but also skin and other biological niches. For this, we use Nanopore sequencing targeting the 16S rRNA gene (27F - 1391R primers).
To determine the taxonomy of the sequences, we use EMU. However, the database included in the package seems a bit old, so I am preparing a new database for the EMU pipeline, using GTDB 226 as a reference.
My steps so far (briefly):
- Downloaded and unzipped the ssu_all_r226.fna.gz and bac120_taxonomy_r226.tsv.gz files
- Created fasta file from the .fna file.
- Filtered short (<1100 bp) and long (>1800 bp) sequences from the fasta file.
- Deduplicated sequences using seqkit
- Ensured that the taxid of the taxonomy files matched the fasta files
- Combined taxa that are difficult to distinguish from each other using 16S rRNA gene sequencing.
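
As a rough illustration of the length filter, deduplication and ID check in the steps above, here is a minimal pure-Python sketch. It is not the exact set of commands I ran. The function names are made up for the example, and the header layout (`<genome accession>~<contig>` followed by a description) is an assumption about the GTDB SSU file that should be checked against the actual headers.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_and_dedup(records: Iterable[Record],
                     min_len: int = 1100,
                     max_len: int = 1800) -> List[Record]:
    """Drop sequences outside [min_len, max_len], then keep the first
    record for each exact (case-insensitive) sequence."""
    seen: Set[str] = set()
    kept: List[Record] = []
    for header, seq in records:
        seq = seq.upper()
        if not (min_len <= len(seq) <= max_len):
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


def accession_from_header(header: str) -> str:
    # ASSUMPTION: the header starts with "<genome accession>~<contig>".
    return header.split()[0].split("~")[0]


def missing_from_taxonomy(records: Iterable[Record],
                          taxonomy_ids: Iterable[str]) -> List[str]:
    """Return accessions present in the FASTA but absent from the taxonomy table."""
    fasta_ids = {accession_from_header(h) for h, _ in records}
    return sorted(fasta_ids - set(taxonomy_ids))
```

One thing this makes visible: exact-sequence deduplication keeps only the first header, so if two species share an identical 16S sequence, only one of the labels survives in the database.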
Because reported sequences vary slightly, the database contains multiple versions of e.g. E. coli. So after assigning taxonomy, I usually group by species identity.
I have tried using the database to classify a few mock communities, as well as biological samples that we have previously sequenced. So far it seems okay, although we do seem to get a few more low-abundance species. I expect some of this is related to problems with taxa that should be grouped.
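
A minimal sketch of that grouping, assuming the classification result has already been reduced to (species name, relative abundance) pairs. The function name and the input shape are hypothetical, not EMU's actual output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sum_by_species(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum abundances of all database entries that share a species name."""
    totals: Dict[str, float] = defaultdict(float)
    for species, abundance in rows:
        totals[species] += abundance
    return dict(totals)


# Hypothetical example: two E. coli entries hit by the same sample.
rows = [
    ("Escherichia coli", 0.25),
    ("Escherichia coli", 0.125),
    ("Bacteroides fragilis", 0.5),
]
print(sum_by_species(rows))
```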
My questions for the rest of you are therefore:
Are there any essential steps that I have missed?
I have asked and looked around for which bacterial species are hard to distinguish using 16S rRNA gene sequencing. Some that I have found:
- Bacillus subtilis group: Contains B. subtilis, B. spizizenii, B. halotolerans, B. atrophaeus. I can also see this with our mock controls.
- Escherichia / Shigella. I have seen arguments that it can be difficult to distinguish Escherichia species from Shigella species using the 16S rRNA gene. But I have also seen multiple groups that manage to distinguish species from the two genera. What is your experience?
- Bifidobacterium longum vs B. infantis vs B. suis
- Streptococcus mitis vs oralis vs pneumoniae
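
For anyone who wants to try the same thing, collapsing such species into a shared label can be sketched as below. The group labels and function names are made up for the example, only two of the groups are filled in, and the species strings are an assumption: they have to match the names in the GTDB taxonomy file exactly, which may differ from the ones written here.

```python
from typing import Dict, List

# Hypothetical group labels mapped to the species they should absorb.
SPECIES_GROUPS: Dict[str, List[str]] = {
    "Bacillus subtilis group": [
        "Bacillus subtilis",
        "Bacillus spizizenii",
        "Bacillus halotolerans",
        "Bacillus atrophaeus",
    ],
    "Streptococcus mitis group": [
        "Streptococcus mitis",
        "Streptococcus oralis",
        "Streptococcus pneumoniae",
    ],
}


def build_lookup(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert {label: [species, ...]} into {species: label}."""
    lookup: Dict[str, str] = {}
    for label, members in groups.items():
        for species in members:
            if species in lookup:
                raise ValueError(f"{species} is listed in more than one group")
            lookup[species] = label
    return lookup


def collapse(species: str, lookup: Dict[str, str]) -> str:
    """Return the group label for a species, or the species itself if ungrouped."""
    return lookup.get(species, species)


lookup = build_lookup(SPECIES_GROUPS)
print(collapse("Bacillus halotolerans", lookup))
print(collapse("Escherichia coli", lookup))
```

Applying `collapse` to the species column of the taxonomy table before building the database (or to the results afterwards) should give one entry per group instead of several near-identical ones.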
Thank you!