
Cloud-Based Intrusion Detection System

Scalable ETL pipeline and real-time inference API processing 2.5M+ records using AWS Glue, Redshift, and ECS Fargate.

TL;DR: Real-time inference on ECS Fargate; Hive-style date partitioning in S3; PySpark ETL via AWS Glue with Glue Crawlers for schema discovery.

Architecture diagram: data flow from ingestion to inference.

System Demo & API Inference

Short demo of the containerized FastAPI inference service running on ECS Fargate. The video shows a Postman request and the model response.

FastAPI demo: containerized inference on ECS Fargate.

Automated Data Ingestion & Transformation

Date-partitioned raw data in S3 (optimized for daily loads of ~200 MB).

Implemented Hive-style partitioning (date=YYYY-MM-DD) in S3 so that queries can prune partitions and scan only the days they need, which suits the daily load of roughly 200 MB. AWS Glue Crawlers provide automated schema discovery and absorb schema drift, while PySpark ETL scripts orchestrated by AWS Glue clean, normalize, and transform the raw PCAP-derived CSV data into a Redshift-ready schema. Jobs are idempotent, versioned, and emit structured job metrics to CloudWatch.
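To make the partition layout concrete, here is a minimal sketch in plain Python of how `date=YYYY-MM-DD` prefixes can be built and read back. The bucket name `example-ids-raw` and the helper names are hypothetical, not taken from the project. The actual pipeline does this inside Glue and PySpark.

```python
from datetime import date, timedelta


def partition_prefix(base: str, day: date) -> str:
    """Return the Hive-style prefix that holds one day's data."""
    return f"{base.rstrip('/')}/date={day.isoformat()}/"


def prefixes_for_range(base: str, start: date, end: date) -> list[str]:
    """List the prefixes for an inclusive date range.

    A reader handed only these prefixes never lists or scans other days'
    objects, which is the effect partition pruning has in a query engine.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days)]


def partition_date(key: str) -> date:
    """Recover the partition date from an object key or URI."""
    for segment in key.split("/"):
        if segment.startswith("date="):
            return date.fromisoformat(segment[len("date="):])
    raise ValueError(f"no date= partition in {key!r}")


if __name__ == "__main__":
    # Hypothetical bucket and path, used only for illustration.
    base = "s3://example-ids-raw/flows"
    for prefix in prefixes_for_range(base, date(2024, 1, 14), date(2024, 1, 16)):
        print(prefix)
```

Because each day maps to exactly one deterministic prefix, a rerun that overwrites that prefix produces the same result. That is one common way to make a daily job idempotent, though the source does not say which mechanism the project used.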

AWS Glue job console showing ETL and model training orchestration.

Production-Grade Inference API

Modular FastAPI app with strict Pydantic validation and separation of concerns between API routes and model inference logic. Serves a Random Forest model trained on flow metadata; models are versioned and served via a lightweight container. Health checks and structured JSON responses make the API production-ready.

FastAPI code snapshot: `model_infer.py` and routes directory layout.

Network Security & VPC Configuration

Challenge: Enabling secure, private communication between AWS Glue and Redshift without exposing credentials or traffic to the public internet.

Solution: Used VPC interface and gateway endpoints to keep service traffic on a private network path, so JDBC traffic between Glue and Redshift never traverses the public internet. Data stays within private subnets, and IAM roles attached to each service replace hardcoded AWS access keys.
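As a rough sketch of the endpoint setup, the helper below builds the keyword arguments for the EC2 `create_vpc_endpoint` call: one gateway endpoint for S3 and one interface endpoint per additional service. Which services the project actually exposed through interface endpoints is not stated, so the `glue` default is an assumption. The function names and all resource IDs are hypothetical.

```python
def endpoint_requests(
    region: str,
    vpc_id: str,
    route_table_ids: list[str],
    subnet_ids: list[str],
    security_group_ids: list[str],
    interface_services: tuple[str, ...] = ("glue",),  # assumed service list
) -> list[dict]:
    """Build keyword arguments for EC2 create_vpc_endpoint calls."""
    # Gateway endpoints attach to route tables, not to subnets.
    requests = [
        {
            "VpcEndpointType": "Gateway",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.s3",
            "RouteTableIds": list(route_table_ids),
        }
    ]
    # Interface endpoints place network interfaces in the private subnets.
    for service in interface_services:
        requests.append(
            {
                "VpcEndpointType": "Interface",
                "VpcId": vpc_id,
                "ServiceName": f"com.amazonaws.{region}.{service}",
                "SubnetIds": list(subnet_ids),
                "SecurityGroupIds": list(security_group_ids),
                "PrivateDnsEnabled": True,
            }
        )
    return requests


def create_endpoints(requests: list[dict]) -> list[str]:
    """Send the requests with boto3 and return the new endpoint IDs.

    Credentials come from the caller's IAM role via boto3's default
    credential chain, so no access keys appear in code.
    """
    import boto3  # imported lazily so the builder above has no AWS dependency

    ec2 = boto3.client("ec2")
    return [
        ec2.create_vpc_endpoint(**req)["VpcEndpoint"]["VpcEndpointId"]
        for req in requests
    ]


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    for req in endpoint_requests(
        "us-east-1", "vpc-0example", ["rtb-0example"], ["subnet-0a"], ["sg-0a"]
    ):
        print(req["VpcEndpointType"], req["ServiceName"])
```

Keeping request construction separate from the boto3 call makes the configuration easy to review and test without touching an AWS account.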

VPC resource map highlighting subnets, route tables, and private endpoints.

Additional Details

Tech Stack

Languages: Python, PySpark, SQL
AWS Services: S3, Glue, Redshift Serverless, ECS Fargate, VPC
Frameworks & Tools: FastAPI, Docker, Spark MLlib
Ops & Other: Boto3 (Python SDK), CloudWatch (monitoring), ECR (registry)

Credits

Dataset: UNSW‑NB15.

This was a collaborative team project; I worked primarily on setting up the S3 buckets, the Glue infrastructure, and the ETL pipelines.