Overview
541,909 e-commerce transactions — RFM segmentation on AWS at distributed scale
My contribution: the complete 60-cell PySpark EDA and the RFM feature engineering pipeline. The data engineering transfers directly to clinical pipelines: EHR records, PACS archives, and claims data follow the same distributed compute patterns.
The entire pipeline runs on AWS: raw data on S3, computation on EMR Spark clusters, outputs written back to S3. At 541,909 transactions the dataset itself still fits on a single machine, but running it through PySpark on a distributed EMR cluster exercises the same infrastructure used for large-scale EHR cohort analysis and population health data pipelines, where single-machine pandas genuinely breaks down.
Most operationally significant finding: A single customer (ID 16547) accounts for £1,772,220 — roughly 18% of all revenue from 541,909 transactions. This kind of concentration makes standard clustering approaches fragile without outlier handling. Analogous to outlier patients in clinical datasets (e.g., ICU readmissions): the presence of extreme cases requires deliberate decision-making about inclusion, not just preprocessing automation.
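The concentration check itself is simple. A minimal plain-Python sketch, where only customer 16547's figure comes from the analysis above and the other IDs and amounts are illustrative:

```python
# Hypothetical per-customer revenue totals; only 16547's amount is from the
# actual analysis, the other customers and figures are made up for illustration.
revenue_by_customer = {
    "16547": 1_772_220.0,
    "13047": 977_800.0,
    "17850": 512_430.0,
}

total = sum(revenue_by_customer.values())
shares = {cust: amount / total for cust, amount in revenue_by_customer.items()}

# Flag any customer whose revenue share exceeds a threshold; the 5% cutoff
# here is a modeling choice, not a statistical rule.
flagged = sorted(c for c, share in shares.items() if share > 0.05)
```

In the real pipeline this runs as a distributed aggregation over the full transaction table; what to do with flagged customers is the domain decision, not something the code resolves.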
RFM Pipeline
Recency, Frequency, Monetary — the segmentation features
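As a reference for the three features, here is a minimal plain-Python sketch of their definitions over a few hypothetical transaction records; the actual pipeline computes the same quantities with PySpark aggregations over the full dataset, and the snapshot date shown is an assumption:

```python
from datetime import date

# Hypothetical transaction records: (customer_id, invoice_no, invoice_date, amount).
transactions = [
    ("16547", "537001", date(2011, 12, 1), 250.0),
    ("16547", "537042", date(2011, 12, 8), 1200.0),
    ("13047", "536365", date(2011, 11, 2), 83.4),
]

snapshot = date(2011, 12, 10)  # reference date for recency (assumed)

# Accumulate per-customer state: last purchase date, distinct invoices, total spend.
state = {}
for cust, invoice, d, amount in transactions:
    s = state.setdefault(cust, {"last": d, "invoices": set(), "monetary": 0.0})
    s["last"] = max(s["last"], d)
    s["invoices"].add(invoice)
    s["monetary"] += amount

features = {
    cust: {
        "recency": (snapshot - s["last"]).days,   # days since last purchase
        "frequency": len(s["invoices"]),          # distinct invoice count
        "monetary": s["monetary"],                # total spend
    }
    for cust, s in state.items()
}
```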
What this project built that transfers to clinical AI
Distributed computing on real cloud infrastructure: First project running actual Spark jobs on AWS EMR with multiple EC2 nodes and data on S3. It made concrete how Spark's lazy evaluation, partitioning, and shuffle operations behave at scale compared with single-node PySpark; this is the same infrastructure used for clinical data lake pipelines.
Outlier detection as a business/clinical decision: When Customer 16547 accounts for 18% of all revenue, including or excluding them isn't a statistical decision — it's a domain decision. The same framing applies to clinical outliers: ICU readmissions, rare diagnoses, or extreme lab values. Data decisions have real-world consequences and should be made with domain reasoning, not just automated preprocessing.
RFM as a segmentation framework: Building Recency/Frequency/Monetary features from raw transaction records in PySpark (groupBy + agg + join chains) made the data engineering behind a conceptually simple framework tangible. The same join logic appears in clinical cohort construction from EHR data.
My contributions (team of 4)
- Complete 60-cell PySpark EDA — customer geography, product analysis, revenue trends, concentration analysis
- Full RFM feature engineering pipeline — Recency, Frequency, Monetary computation in PySpark, joined RFM DataFrame
- AWS pipeline execution — S3 data storage, EMR cluster computation, output saved to S3 for downstream clustering