Specializing in LLM Agents, RAG Systems, and Scalable ETL Pipelines.
MS in Machine Learning and Data Science at Northwestern University
Overview: Built an agentic RAG system over 800,000+ arXiv abstracts with query classification and semantic search.
Key Results: A query classifier routes each question to either a direct answer or a RAG-enhanced flow with a query rewriter, document retrieval, and answer generation. The system returns relevant abstracts and runs a fine-tuned 1.8B-parameter LLM abstract simplifier to explain technical jargon. Deployed as a Streamlit chatbot.
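The routing flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `route_query`, the `Answer` type, and every `toy_*` function are hypothetical stand-ins for the real classifier, rewriter, retriever, and generator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Answer:
    text: str
    sources: List[str] = field(default_factory=list)


def route_query(
    query: str,
    classify: Callable[[str], str],
    answer_directly: Callable[[str], str],
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Answer:
    """Send a query to a direct answer or through the RAG flow."""
    if classify(query) == "direct":
        # No retrieval needed: answer without sources.
        return Answer(text=answer_directly(query))
    rewritten = rewrite(query)       # query rewriter
    abstracts = retrieve(rewritten)  # document retrieval
    return Answer(text=generate(query, abstracts), sources=abstracts)


# Toy stand-ins (assumptions for illustration only).
def toy_classify(query: str) -> str:
    return "rag" if "paper" in query.lower() else "direct"


def toy_direct(query: str) -> str:
    return f"Direct answer to: {query}"


def toy_rewrite(query: str) -> str:
    return query.lower().rstrip("?")


def toy_retrieve(query: str) -> List[str]:
    return [f"abstract matching '{query}'"]


def toy_generate(query: str, abstracts: List[str]) -> str:
    return f"Answer grounded in {len(abstracts)} abstract(s)"


if __name__ == "__main__":
    for q in ("Hi there", "Which paper covers RAG?"):
        print(route_query(q, toy_classify, toy_direct, toy_rewrite,
                          toy_retrieve, toy_generate))
```

Passing the components in as callables keeps the router independent of any particular model or vector store, so each stage can be swapped or tested in isolation.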
Overview: Designed cloud-native data infrastructure processing 2.5M+ network traffic records for intrusion detection.
Key Results: Implemented Hive-style date partitioning in S3 and used AWS Glue Crawlers for automated schema discovery. PySpark ETL jobs in Glue transformed raw PCAP CSVs into a Redshift-ready schema; a Random Forest classifier was trained as part of the pipeline and served via a containerized FastAPI endpoint on ECS Fargate.
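For readers unfamiliar with Hive-style partitioning, the idea is that each partition column appears in the object key as a `name=value` path segment, which lets tools such as Glue Crawlers infer partition columns from the path. A minimal sketch follows; the `partition_key` helper, the `raw/traffic` prefix, and the year/month/day column names are assumptions for illustration, not the project's actual layout.

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style date partitions."""
    return (
        f"{prefix.rstrip('/')}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
        f"/{filename}"
    )


if __name__ == "__main__":
    # Hypothetical prefix and file name.
    print(partition_key("raw/traffic", date(2024, 3, 5), "part-0.csv"))
    # raw/traffic/year=2024/month=03/day=05/part-0.csv
```

Zero-padding the month and day keeps keys in lexicographic date order, which makes prefix listings and date-range filters simpler.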
Overview: A RAG system for question-answering tasks, deployed on Railway, using Pinecone as the vector database and Grok for natural language processing.
Overview: An automated data pipeline for fetching and processing Yahoo Finance data, using Fivetran for extraction, Snowflake for loading and storage, dbt for transformation, and Streamlit for visualization.
Key Results: Led a 3-person team and delivered 3 analytics tools (lineup predictor, synergy finder, tactical chatbot) in a 48-hour hackathon.
Overview: Comparative analysis of five ML models predicting customer churn on 165K records, with strategic threshold tuning to optimize retention interventions.
Key Results: LightGBM achieved 86.54% accuracy; threshold tuning raised recall on at-risk customers from 55% to 75%.
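Threshold tuning of this kind can be illustrated with a small sketch: instead of the default 0.5 cutoff, choose the highest probability threshold that still meets a target recall on the positive (at-risk) class. The helper names and the toy labels and scores below are made up for illustration and are not the project's data or code.

```python
from typing import Iterable, List, Optional


def recall_at(y_true: List[int], y_prob: List[float], threshold: float) -> float:
    """Recall of the positive class when predicting 1 for prob >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def pick_threshold(
    y_true: List[int],
    y_prob: List[float],
    target_recall: float,
    candidates: Iterable[float],
) -> Optional[float]:
    """Return the highest candidate threshold meeting the target recall."""
    for thr in sorted(candidates, reverse=True):
        if recall_at(y_true, y_prob, thr) >= target_recall:
            return thr
    return None  # no candidate reaches the target


if __name__ == "__main__":
    # Toy data: 4 churners (label 1) and 2 non-churners (label 0).
    y_true = [1, 1, 1, 1, 0, 0]
    y_prob = [0.9, 0.6, 0.4, 0.2, 0.3, 0.1]
    print(recall_at(y_true, y_prob, 0.5))
    print(pick_threshold(y_true, y_prob, 0.75, [0.5, 0.4, 0.3, 0.2]))
```

Preferring the highest threshold that meets the recall target limits the number of false positives, which matters when each flagged customer triggers a retention intervention with a real cost.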
Key Results: Forecasted sales for top 10 departments, identifying those with >90% projected ROI to guide strategic investment.
Automating user queries via multi-tool calling and RAG.
Established significant health indicators using Cox PH models and survival analysis.
Reduced manual review time by 90% (30 min → 3 min) using LangChain & HPC.
Cut experimental sample size by 33% via stratified sampling.
Overhauled research data infrastructure, doubling accessible data volume.
Northwestern University
Middlebury College