MSc Thesis - METU Biotechnology

Next-generation Cell Type Annotation:
Integrating NLP and ML Techniques

My MSc thesis combines Natural Language Processing (NLP) and Machine Learning (ML) to improve cell type classification in single-cell RNA sequencing data, using gene text embeddings together with autoencoders.

GPA: 4.0 across all courses

Abstract

This thesis presents a novel approach to cell type annotation in single-cell RNA sequencing (scRNA-seq) data by integrating Natural Language Processing (NLP) techniques with traditional machine learning methods. The research focuses on leveraging gene text embeddings combined with autoencoders to enhance the accuracy and efficiency of cell type classification.

The methodology pairs text embedding techniques from NLP, which capture semantic relationships between genes, with autoencoder architectures that learn compressed representations of single-cell expression profiles. This hybrid approach addresses the high-dimensional nature of scRNA-seq data while incorporating biological knowledge encoded in gene descriptions.

Key contributions include: (1) development of a gene embedding framework specifically tailored for scRNA-seq analysis, (2) integration of these embeddings with autoencoder architectures for dimensionality reduction, and (3) demonstration of improved cell type classification accuracy compared to traditional methods.

Key Research Findings

Gene Text Embeddings

Developed novel gene text embedding techniques that capture biological relationships and functional similarities between genes based on their descriptions and annotations.
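
As a rough illustration of the idea (not the thesis code), the sketch below turns short gene descriptions into vectors and compares them with cosine similarity. It uses a plain bag-of-words vector as a stand-in for a learned text embedding model, and the descriptions, the `embed` helper, and the `cosine` helper are hypothetical and simplified for the example.

```python
import math
from collections import Counter

# Hypothetical, heavily abbreviated gene descriptions (illustrative only).
GENE_DESCRIPTIONS = {
    "CD3E": "t cell receptor complex subunit involved in t cell activation",
    "CD3D": "t cell receptor complex subunit required for t cell development",
    "ALB": "serum albumin protein produced by liver hepatocytes",
}


def embed(text, vocabulary):
    """Map a description to an L2-normalised term-count vector.

    Stand-in for a learned text embedding; a real pipeline would call a
    pretrained language model here instead.
    """
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(a * b for a, b in zip(u, v))


# Shared vocabulary built from every description.
vocabulary = sorted(
    {word for desc in GENE_DESCRIPTIONS.values() for word in desc.lower().split()}
)
gene_vectors = {gene: embed(desc, vocabulary) for gene, desc in GENE_DESCRIPTIONS.items()}

sim_related = cosine(gene_vectors["CD3E"], gene_vectors["CD3D"])
sim_unrelated = cosine(gene_vectors["CD3E"], gene_vectors["ALB"])
print(f"CD3E vs CD3D: {sim_related:.2f}, CD3E vs ALB: {sim_unrelated:.2f}")
```

In this toy setup the two T cell genes score higher than the unrelated pair simply because their descriptions share words; a learned embedding would aim to capture the same kind of relationship even when the wording differs.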

Autoencoder Integration

Successfully integrated autoencoder architectures with gene embeddings to create compressed representations that preserve biological information while reducing dimensionality.
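
The sketch below shows one way such an integration could look, under stated assumptions: each cell is represented as the expression-weighted average of its gene embeddings, and a small autoencoder with a single tanh bottleneck is trained on those vectors. The fusion step, the layer sizes, the synthetic data, and all variable names are assumptions for illustration, and NumPy is used in place of PyTorch only to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, embed_dim, latent_dim = 200, 50, 16, 4

# Synthetic stand-ins: non-negative "expression" counts and random gene embeddings.
expression = rng.poisson(lam=2.0, size=(n_cells, n_genes)).astype(float)
gene_embeddings = rng.normal(size=(n_genes, embed_dim))

# Assumed fusion step: expression-weighted average of gene embeddings per cell.
totals = np.maximum(expression.sum(axis=1, keepdims=True), 1.0)
cell_features = (expression / totals) @ gene_embeddings
cell_features = (cell_features - cell_features.mean(axis=0)) / cell_features.std(axis=0)

# Autoencoder parameters: encoder (embed_dim -> latent_dim), decoder (latent_dim -> embed_dim).
W_enc = rng.normal(scale=0.1, size=(embed_dim, latent_dim))
b_enc = np.zeros(latent_dim)
W_dec = rng.normal(scale=0.1, size=(latent_dim, embed_dim))
b_dec = np.zeros(embed_dim)


def forward(x):
    """Return the latent code and the reconstruction for a batch of cells."""
    latent = np.tanh(x @ W_enc + b_enc)
    recon = latent @ W_dec + b_dec
    return latent, recon


def mse(x, recon):
    """Mean squared reconstruction error over all entries."""
    return float(np.mean((x - recon) ** 2))


initial_loss = mse(cell_features, forward(cell_features)[1])

learning_rate = 0.2
for _ in range(1000):
    latent, recon = forward(cell_features)
    # Backpropagate the mean squared error through decoder and encoder.
    grad_recon = 2.0 * (recon - cell_features) / cell_features.size
    grad_W_dec = latent.T @ grad_recon
    grad_b_dec = grad_recon.sum(axis=0)
    grad_pre = (grad_recon @ W_dec.T) * (1.0 - latent ** 2)  # tanh derivative
    grad_W_enc = cell_features.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)
    # Plain full-batch gradient descent update.
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc

latent_final, recon_final = forward(cell_features)
final_loss = mse(cell_features, recon_final)
print(f"reconstruction loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The rows of `latent_final` are the compressed per-cell representations that a downstream classifier would consume.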

Enhanced Classification

Improved cell type annotation accuracy relative to traditional methods, demonstrating the effectiveness of combining NLP and ML techniques for bioinformatics applications.
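
To make the classification step concrete, here is a minimal sketch of annotating cells from compressed representations with a nearest-centroid rule. The classifier choice, the synthetic latent vectors, the labels `type_A` and `type_B`, and the helper names are all assumptions for illustration; they are not the classifier or the data used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)
latent_dim = 4


def make_cells(n_per_type):
    """Draw synthetic latent vectors for two well-separated, hypothetical cell types."""
    x = np.vstack([
        rng.normal(0.0, 1.0, (n_per_type, latent_dim)),
        rng.normal(6.0, 1.0, (n_per_type, latent_dim)),
    ])
    y = np.array(["type_A"] * n_per_type + ["type_B"] * n_per_type)
    return x, y


def fit_centroids(x, y):
    """Compute one mean latent vector per cell type label."""
    labels = np.unique(y)
    centroids = np.stack([x[y == label].mean(axis=0) for label in labels])
    return labels, centroids


def predict(x, labels, centroids):
    """Assign each cell the label of its nearest centroid (Euclidean distance)."""
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[distances.argmin(axis=1)]


train_x, train_y = make_cells(50)
test_x, test_y = make_cells(20)

labels, centroids = fit_centroids(train_x, train_y)
predicted = predict(test_x, labels, centroids)
accuracy = float(np.mean(predicted == test_y))
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The accuracy printed here only reflects how cleanly the synthetic clusters separate; it says nothing about performance on real scRNA-seq data.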

Technical Implementation

Technologies Used

Python: Core programming language
Scanpy: Single-cell analysis framework
PyTorch: Deep learning implementation
Transformers: NLP embeddings
scikit-learn: ML algorithms

Key Innovations

Novel Embedding Approach

Created gene embeddings using biological descriptions and Gene Ontology (GO) terms, capturing semantic relationships beyond expression values.

Hybrid Architecture

Combined traditional bioinformatics approaches with modern NLP techniques for improved biological interpretation.

Scalable Implementation

Designed for efficient processing of large-scale scRNA-seq datasets with millions of cells.
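
One common way to keep memory bounded at this scale is to process cells in fixed-size batches rather than materialising every intermediate for the whole dataset. The sketch below illustrates that pattern for projecting cells onto gene embeddings; the function names, the batch size, and the dense synthetic matrix are assumptions for the example, and a real pipeline would more likely read chunks from a sparse or on-disk matrix.

```python
import numpy as np


def iter_batches(n_cells, batch_size):
    """Yield (start, stop) index pairs that cover all cells in order."""
    for start in range(0, n_cells, batch_size):
        yield start, min(start + batch_size, n_cells)


def project_in_batches(expression, gene_embeddings, batch_size=1024):
    """Compute expression-weighted gene-embedding averages one batch at a time."""
    n_cells = expression.shape[0]
    out = np.empty((n_cells, gene_embeddings.shape[1]), dtype=np.float32)
    for start, stop in iter_batches(n_cells, batch_size):
        # Only this slice is converted to floating point at any one time.
        chunk = np.asarray(expression[start:stop], dtype=np.float32)
        totals = np.maximum(chunk.sum(axis=1, keepdims=True), 1.0)
        out[start:stop] = (chunk / totals) @ gene_embeddings
    return out


rng = np.random.default_rng(1)
expression = rng.poisson(1.0, size=(2500, 30))   # synthetic count matrix
gene_embeddings = rng.normal(size=(30, 8))       # synthetic gene embeddings

batched = project_in_batches(expression, gene_embeddings, batch_size=1000)
print(batched.shape)
```

Because each batch is independent, the same loop can feed mini-batches to a model during training or inference without changing the result.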

Research Impact

This research contributes to the growing field of computational biology by demonstrating how techniques from natural language processing can enhance biological data analysis. The methods developed have potential applications in precision medicine, drug discovery, and understanding cellular heterogeneity in complex tissues.

"Key is getting information from data"

- My research philosophy