about

Hi, I'm Gabriel Cha. I'm a Data Science undergraduate at UC San Diego and an incoming M.S. in Data Science student at Columbia Engineering. I'm broadly interested in aligning LLMs with human intent and building agentic LLM applications.

Previously, I conducted research at the DSTL Lab, where I worked with Professor Sam Lau to develop an LLM-powered JupyterLab extension that generates customized lecture content for instructors. Built with Python and TypeScript, the tool creates explanations, code examples, and practice problems from notebook content.

Before that, I was an intern at Edison, where I engineered data pipelines that automated third-party data requests, reducing retrieval time from 4 days to 10 minutes.

As a 9-time teaching assistant for ML/DS courses at UC San Diego, I've taught 1,200+ students in machine learning algorithms, statistical modeling, and practical data science applications.

LinkedIn | GitHub | Resume
Site last updated: May 2025


projects

Concept Bottleneck for LLMs
Creating a user interface for the Concept Bottleneck Large Language Model (CB-LLM), an interpretable LLM introduced by Lily Weng at the ICML 2024 MI Workshop. CB-LLM combines high accuracy, transparency, and scalability for enhanced interpretability.
Are you sad? Speech Emotion Recognition
Recipient of the HDSI scholarship ($6,500) for researching emotion classification in voice recordings. Compared feature extraction methods and selected the best-performing ones, transforming recordings into matrices using Fourier analysis. Applied data augmentation and optimized four models: Decision Tree, SVM, ViT, and CNN.
Gatsby-Inspired Language Model
Built an n-gram language model that estimates the probability of a word given the preceding words, using empirical frequencies. The goal is to uncover language patterns and enable statistical text generation.
CoDebug Python Package
Published a Python library that improves the debugging experience for developers in Jupyter Notebook by using linecache to read error messages and integrating with the OpenAI API to provide troubleshooting suggestions.
Predictive Outage Cause
Used sklearn's decision tree to classify the causes of severe power grid outages, normalizing the data and tuning hyperparameters to achieve 82.7% accuracy.
ML Modeling in Spark
Wrangled 25 GB of data with joins and aggregations to train a Word2Vec model in Apache Spark, while learning the fundamentals of systems for scalable analytics.
IoT Stratified Sampling
Generated a stratified random sample that reduced 4,300 GB of IoT recordings to approximately 34.3 GB while preserving representativeness and keeping data selection efficient.
Racoon Spottings
Integrated the Google Maps API to display campus locations and a NoSQL Firebase cloud database to track racoon spottings on the UC San Diego campus.


Since you made it this far ✌️ let's connect! My email is