Improving metagenomic assembly pipeline for microbial protein structure-function prediction
Background:
The human microbiome encodes complex information about health and has significant clinical diagnostic potential. However, the majority of microbial proteins have not been functionally annotated, limiting our ability to understand the mechanistic basis of detected associations. One of the bottlenecks relates to the sequence-structure-function paradigm [1]: despite the recent success of deep learning frameworks such as AlphaFold2 in large-scale protein structure prediction, microbial protein structures remain largely unknown, as most such methods rely on a multiple sequence alignment (MSA) against the target sequence [2,3]. Microbial protein sequences, despite their natural preponderance, are severely underrepresented in the AlphaFold2 database used for MSA generation [2,4].
Project:
Our goal is to improve current assembly pipelines to better define the full-length sequences of open reading frames for downstream microbial protein structure prediction tasks. Our initial analysis shows that contemporary databases are incomplete and contain truncated sequences, resulting in an incomplete and incorrect sequence set that limits downstream tasks such as structural, functional and protein-protein interaction inference. An improved assembly pipeline is thus necessary to recover full-length sequences in a highly scalable manner, and would enable the extension of AlphaFold2 (OpenFold) to structure prediction from metagenomic sequencing data.
References:
[1] Redfern, O. C., Dessailly, B., & Orengo, C. A. (2008). Exploring the structure and function paradigm. Current Opinion in Structural Biology, 18(3), 394-402.
[2] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
[3] Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., … & Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), 871-876.
[4] Mitchell, A. L., Almeida, A., Beracochea, M., Boland, M., Burgin, J., Cochrane, G., … & Finn, R. D. (2020). MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research, 48(D1), D570-D578.
Selected candidate(s) can receive a stipend directly from the faculty advisor. This is not a guarantee of payment, and the total amount is subject to available funding.
Faculty Advisor
- Professor: Tal Korem
- Center/Lab: Systems Biology
- Location: PH18-200
- We develop data analysis methods for multi-omic microbiome data. We focus on integrating clinical, microbiome, lifestyle and environmental data in a way that advances from statistical associations to actionable insights that can be used in clinical practice.
Project Timeline
- Earliest starting date: 10/15/2022
- End date:
- Number of hours per week of research expected during Fall 2022: ~12
Candidate requirements
- Skill sets: Interest in metagenomic data or structural biology, with experience in Python and Unix. Experience in C/C++ preferred.
- Student eligibility: freshman, sophomore, junior, senior, master’s
- International students on F1 or J1 visa: eligible
- Academic Credit Possible: Yes