
Cloud-Based Intrusion Detection System

Scalable ETL pipeline and real-time inference API processing 2.5M+ records using AWS Glue, Redshift, and ECS Fargate.

TL;DR: Real-time inference on ECS Fargate; Hive-style date partitioning in S3; PySpark ETL via AWS Glue with Glue Crawlers for schema discovery.
Architecture diagram: data flow from ingestion to inference (S3 to Glue to Redshift to ECS Fargate).

System Demo & API Inference

Short demo of the containerized FastAPI inference service running on ECS Fargate. The video shows a Postman request and the model response.

FastAPI demo: containerized inference on ECS Fargate.

Automated Data Ingestion & Transformation

Date-partitioned raw data in S3 (optimized for daily loads of ~200 MB/day).

Implemented Hive-style partitioning (date=YYYY-MM-DD) in S3 to enable partition pruning, so queries over the daily loads (~200 MB/day) scan only the dates they need. AWS Glue Crawlers automate schema discovery and pick up schema drift, while PySpark ETL scripts run as AWS Glue jobs clean, normalize, and transform the raw PCAP-derived CSV data into a Redshift-ready schema. Jobs are idempotent, versioned, and emit structured job metrics to CloudWatch.
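The partition layout can be illustrated with a minimal sketch in plain Python. It only shows how `date=YYYY-MM-DD` prefixes are built and how a date-range query narrows to a handful of prefixes; it is not the project's Glue code. The bucket name and helper names are hypothetical.

```python
from datetime import date, timedelta


def partition_prefix(base: str, day: date) -> str:
    """Return the Hive-style prefix for one day's data, e.g. <base>/date=2024-01-31/."""
    return f"{base.rstrip('/')}/date={day.isoformat()}/"


def prefixes_for_range(base: str, start: date, end: date) -> list:
    """List the prefixes a query over [start, end] (inclusive) has to read.

    Every other date= prefix can be skipped, which is the effect of partition pruning.
    """
    days = (end - start).days
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days + 1)]


def parse_partition_date(key: str) -> date:
    """Extract the partition date from a key that contains a date=YYYY-MM-DD segment."""
    for segment in key.split("/"):
        if segment.startswith("date="):
            return date.fromisoformat(segment[len("date="):])
    raise ValueError(f"no date= partition in key: {key}")


# Hypothetical bucket and prefix, for illustration only.
BASE = "s3://example-bucket/raw"
three_days = prefixes_for_range(BASE, date(2024, 1, 30), date(2024, 2, 1))
```

Because the date is encoded in the key itself, a re-run for a given day always targets the same prefix, which is one common way to keep a daily load idempotent.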

AWS Glue Job Console showing ETL and model training orchestration.

Production-Grade Inference API

Modular FastAPI app with strict Pydantic validation and separation of concerns between API routes and model inference logic. Serves a Random Forest model trained on flow metadata; models are versioned and served via a lightweight container. Health checks and structured JSON responses make the API production-ready.

FastAPI code snapshot: `model_infer.py` and routes layout.

Network Security & VPC Configuration

Challenge: Enabling secure private communication between AWS Glue and Redshift without exposing credentials or traffic to the public internet.

Solution: Used VPC interface and gateway endpoints to keep service traffic on private network paths, so JDBC traffic between Glue and Redshift flows without traversing the public internet. This architecture isolates data within private subnets and relies on IAM roles attached to each service, which eliminates the need for hardcoded AWS access keys.
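As a rough sketch of the endpoint setup, the helpers below build the request parameters for the EC2 `create_vpc_endpoint` call in Boto3. AWS offers gateway endpoints for S3 and DynamoDB, so the gateway helper assumes S3; which interface endpoints the project actually created is not stated, so the service name is left as a parameter. All IDs, the region, and the helper names are hypothetical.

```python
def gateway_endpoint_params(vpc_id: str, region: str, route_table_ids: list) -> dict:
    """Parameters for an S3 gateway endpoint, attached to the private route tables."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def interface_endpoint_params(vpc_id: str, region: str, service: str,
                              subnet_ids: list, security_group_ids: list) -> dict:
    """Parameters for an interface endpoint (an ENI in the given private subnets)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,
    }


def create_endpoints(param_sets: list, region: str) -> list:
    """Create each endpoint and return the new endpoint IDs.

    Credentials come from the caller's IAM role, not from keys in code.
    boto3 is imported lazily so the builders above work without AWS access.
    """
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    return [ec2.create_vpc_endpoint(**params)["VpcEndpoint"]["VpcEndpointId"]
            for params in param_sets]


# Hypothetical IDs and region, for illustration only.
s3_gateway = gateway_endpoint_params("vpc-0123", "us-east-1", ["rtb-0abc"])
logs_interface = interface_endpoint_params(
    "vpc-0123", "us-east-1", "logs", ["subnet-0aaa"], ["sg-0bbb"]
)
```

Calling `create_endpoints([s3_gateway, logs_interface], "us-east-1")` would issue the actual requests; it is left out of the sketch so the example stays runnable without an AWS account.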

VPC resource map highlighting subnets, route tables, and private endpoints.

Additional Details

Tech Stack

  • Languages: Python, PySpark, SQL
  • AWS Services: S3, Glue, Redshift Serverless, ECS Fargate, VPC
  • Frameworks & Tools: FastAPI, Docker, Spark MLlib
  • Ops & Other: Boto3 (Python SDK), CloudWatch (monitoring), ECR (registry)

Credits

Dataset: UNSW‑NB15.

This was a collaborative team project; I worked primarily on setting up the S3 buckets, the Glue infrastructure, and the ETL pipelines.