John He Portfolio

Featured Projects

RAG LLM System for Scientific Literature

LangGraph, Pinecone, LoRA, Gemini API, HuggingFace, Streamlit

Overview: Built an agentic RAG system over 800,000+ arXiv abstracts with query classification and semantic search.

Key Results: A query classifier routes each query either to a direct answer or to a RAG-enhanced flow with a query rewriter, document retrieval, and answer generation. The system returns relevant abstracts, and a fine-tuned 1.8B-parameter LLM simplifies them to explain technical jargon. Deployed as a Streamlit chatbot.

Cloud-Based Intrusion Detection System

Glue, S3, Redshift, ECS Fargate, PySpark, Spark MLlib, VPC, IAM, Docker, FastAPI

Overview: Designed cloud-native data infrastructure processing 2.5M+ network traffic records for intrusion detection.

Key Results: Implemented Hive-style date partitioning in S3 and used AWS Glue Crawlers for automated schema discovery. PySpark ETL jobs in Glue transformed raw PCAP CSVs into a Redshift-ready schema; a Random Forest classifier was trained as part of the pipeline and served via a containerized FastAPI endpoint on ECS Fargate.
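For readers unfamiliar with Hive-style partitioning, here is a minimal sketch of how a date-partitioned S3 object key can be built. The prefix, file name, and the `hive_partition_key` helper are hypothetical illustrations, not the project's actual layout or code.

```python
from datetime import date


def hive_partition_key(prefix: str, day: date, filename: str) -> str:
    """Build an object key with Hive-style year=/month=/day= partitions.

    The key=value folder names are what let crawlers and query engines
    infer partition columns from the path. All names here are illustrative.
    """
    return (
        f"{prefix.strip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical prefix and file name, for illustration only.
key = hive_partition_key("raw/network-traffic", date(2024, 1, 5), "flows.csv")
print(key)  # raw/network-traffic/year=2024/month=01/day=05/flows.csv
```

Zero-padding the month and day keeps partitions in lexicographic date order, which is a common convention rather than a requirement.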

More Projects

Deployed NLP Answering System

Grok, Pinecone, Railway, Docker

Overview: A RAG system for question answering, deployed on Railway, with Pinecone as the vector database and Grok for natural language processing.

YFinance Data Pipeline

Fivetran, Snowflake, dbt, Streamlit, Plotly

Overview: An automated pipeline for Yahoo Finance data: Fivetran extracts the data, Snowflake loads and stores it, dbt transforms it, and Streamlit visualizes it.

AI-Powered Soccer Analytics

scikit-learn, Random Forest, Collaborative Filtering, ChatGPT

Key Results: Led a 3-person team and delivered 3 analytics tools (lineup predictor, synergy finder, tactical chatbot) in a 48-hour hackathon.

Bank Customer Churn Prediction

LightGBM, XGBoost, PyTorch, scikit-learn, Threshold Tuning

Overview: Comparative analysis of five ML models predicting customer churn on 165K records, with strategic threshold tuning to optimize retention interventions.

Key Results: LightGBM achieved 86.54% accuracy; threshold tuning raised recall on at-risk customers from 55% to 75%.
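The idea behind threshold tuning can be sketched as follows: lowering the decision threshold on predicted churn probabilities trades some precision for higher recall on at-risk customers. This is a minimal illustration on toy scores and labels, not the project data; `recall_at` and `pick_threshold` are hypothetical helper names.

```python
def recall_at(threshold, scores, labels):
    """Recall of the positive (churn) class when predicting 1 for score >= threshold."""
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    false_neg = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return true_pos / (true_pos + false_neg)


def pick_threshold(scores, labels, target_recall, candidates):
    """Return the highest candidate threshold whose recall meets the target, else None."""
    for t in sorted(candidates, reverse=True):
        if recall_at(t, scores, labels) >= target_recall:
            return t
    return None


# Toy predicted probabilities and true labels (1 = churned), for illustration only.
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.2, 0.4, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(recall_at(0.5, scores, labels))                          # 0.5
print(pick_threshold(scores, labels, 0.75, [0.5, 0.4, 0.3]))   # 0.4
```

Choosing the highest threshold that still meets the recall target limits the number of extra false positives, which matters when each flagged customer triggers a retention intervention.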

Market Sales Forecasting

ARIMA, PostgreSQL

Key Results: Forecasted sales for top 10 departments, identifying those with >90% projected ROI to guide strategic investment.

Work Experience

Sep 2025 - Dec 2025

Data Science Contractor (Capstone)

Videspan | Evanston, IL

Automating user queries via multi-tool calling and RAG.

Deployed an agentic chatbot using a containerized, multi-service architecture (Docker Compose, FastAPI).
Engineered the agent's core conversational logic with stateful context management for MCP tool calling and elicitation.
Developed the system's multimodal RAG knowledge base with Qwen3 and LangGraph.

Sep 2025 - Nov 2025

Data Science Intern (Biostatistics)

Monopar Therapeutics | Wilmette, IL

Established significant health indicators using Cox PH models and survival analysis.

Constructed predictive models from R&D datasets to investigate biomarker relationships.
Analyzed clinical time-to-event data using generalized linear mixed models and survival analysis methods.

Jun 2025 - Aug 2025

Data Science Intern (LLM)

Alexion, AstraZeneca Rare Disease | Wilmington, DE

Reduced manual review time by 90% (30 min → 3 min) using LangChain & HPC.

Architected an LLM pipeline using LangChain to identify relevant research articles and extract 20+ variables.
Engineered a scalable, end-to-end data pipeline on a Slurm-managed HPC cluster to ingest 300+ scientific articles.
Resolved a failing few-shot classifier by visualizing embeddings with t-SNE, diagnosing labeling drift and guiding a new SOP for 10+ categories.

Sep 2024 - Jun 2025

Data Science Contractor (Industry Practicum)

Azul 3D | Skokie, IL

Cut experimental sample size by 33% via stratified sampling.

Led a 3-person team to analyze 50+ material compositions, building a predictive framework to model reaction speeds.
Devised a cost-effective experimental design methodology using stratified sampling.

Sep 2021 - Mar 2025

Data Coordinator (Data Engineer)

Brigham and Women's Hospital | Boston, MA

Overhauled research data infrastructure, doubling accessible data volume.

Built 10+ automated ETL pipelines, cutting project setup time from months to weeks.
Developed a Python web scraper to download and curate a novel dataset of 10,000+ images.
Built a Python NLP pipeline to parse 15GB+ of clinical notes.

Skills

Programming: Python, R, SQL

AI / ML & LLMs: PyTorch, HuggingFace, LangChain, LangGraph, Pinecone, MCP, LoRA, scikit-learn, RAG

Cloud & Data Eng: AWS Glue, Redshift, S3, Spark, Snowflake

MLOps & Tools: Docker, FastAPI, Streamlit, dbt, Fivetran, ECS Fargate

Education

MS in Machine Learning and Data Science

Northwestern University

BA in Molecular Biology and Biochemistry

Middlebury College

John He

AI Engineer & Data Scientist