// Big Data · PySpark · RFM Analysis · AWS EMR

Big Data Customer Segmentation — RFM Pipeline on AWS

541,909 e-commerce transactions analysed on AWS S3 + EMR using PySpark. Complete 60-cell EDA and RFM (Recency-Frequency-Monetary) feature engineering pipeline — the same distributed infrastructure used for clinical EHR cohort analysis and population health data pipelines.

541K Transactions · RFM Pipeline · AWS S3 + EMR · PySpark · Hadoop/YARN · Intro to Big Data, Spring 2025

Overview

541,909 e-commerce transactions — RFM segmentation on AWS at distributed scale

My contribution: the complete 60-cell PySpark EDA and RFM feature engineering pipeline. The data engineering infrastructure mirrors clinical data pipelines — EHR records, PACS archives, and claims data have the same distributed compute pattern.

The entire pipeline runs on AWS: raw data on S3, computation on EMR Spark clusters, outputs saved back to S3. At 541,909 transactions, the workload benefits from distributed parallelism that single-machine pandas cannot provide. PySpark on distributed EMR is the same infrastructure used for large-scale EHR cohort analysis and population health data pipelines.
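That S3 → EMR → S3 flow can be sketched as a job skeleton. This is a configuration sketch, not the project's actual script: bucket paths and the dropped-column choice are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Job skeleton for the S3 -> EMR -> S3 flow. Bucket and key names are
# hypothetical placeholders, not the project's actual locations.
spark = SparkSession.builder.appName("rfm-pipeline").getOrCreate()

# Read raw transactions from S3 (schema inferred from the CSV header).
raw = spark.read.csv("s3://<bucket>/raw/online_retail.csv",
                     header=True, inferSchema=True)

# Transformations (cleaning, EDA aggregates, RFM features) are lazy:
# Spark only builds a plan here, nothing executes yet.
cleaned = raw.dropna(subset=["CustomerID"])

# The write is an action; this is where the cluster actually runs the
# plan and persists the output back to S3.
cleaned.write.mode("overwrite").parquet("s3://<bucket>/output/cleaned/")
```

On EMR this runs unchanged via spark-submit; the lazy-transformation/action split is what makes the read-transform-write shape cheap to compose.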

Most operationally significant finding: A single customer (ID 16547) accounts for £1,772,220, roughly 18% of all revenue across 541,909 transactions. That level of concentration makes standard clustering approaches fragile without outlier handling. It is analogous to outlier patients in clinical datasets (e.g., ICU readmissions): extreme cases demand a deliberate inclusion decision, not just preprocessing automation.

Transactions: 541K (Dec 2010 – Dec 2011)
Top Customer: £1.77M (~18% of all revenue)
Platform: AWS EMR (S3 + Spark cluster)

RFM Pipeline

Recency, Frequency, Monetary — the segmentation features

Recency (R): Days since last purchase, relative to the latest invoice date in the dataset. Lower = more recent = higher engagement value.
Frequency (F): Count of unique invoice numbers per customer. High frequency = a loyal, engaged customer relationship.
Monetary (M): Total revenue (Quantity × UnitPrice) per customer. Identifies high-value vs low-value segments.

What this project built that transfers to clinical AI

Distributed computing on real cloud infrastructure: First project running actual Spark jobs on AWS EMR with multiple EC2 nodes and data on S3. Understanding how Spark's lazy evaluation, partitioning, and shuffle operations behave at scale versus single-node PySpark — the same infrastructure used for clinical data lake pipelines.

Outlier detection as a business/clinical decision: When Customer 16547 accounts for 18% of all revenue, including or excluding them isn't a statistical decision — it's a domain decision. The same framing applies to clinical outliers: ICU readmissions, rare diagnoses, or extreme lab values. Data decisions have real-world consequences and should be made with domain reasoning, not just automated preprocessing.
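The concentration check behind that decision can be sketched single-node in pandas on toy data. The customer IDs and amounts below are illustrative, and the 10% threshold is an assumed cutoff, not the project's actual rule.

```python
import pandas as pd

# Toy transactions (IDs and amounts are illustrative, not the real data).
tx = pd.DataFrame({
    "CustomerID": ["16547", "16547", "12583", "17850"],
    "Quantity":   [1000, 800, 10, 5],
    "UnitPrice":  [2.0, 2.5, 3.0, 4.0],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Revenue share per customer.
per_customer = tx.groupby("CustomerID")["Revenue"].sum()
share = per_customer / per_customer.sum()

# Flag anyone above a concentration threshold (10% here is an assumed
# cutoff) so inclusion/exclusion becomes an explicit, reviewable decision.
flagged = share[share > 0.10].sort_values(ascending=False)
print(flagged)
```

The point of the flag is to surface the decision, not make it: whether a flagged customer (or patient) stays in the clustering set is settled by domain reasoning.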

RFM as a segmentation framework: Building Recency/Frequency/Monetary features from raw transaction records in PySpark (groupBy + agg + join chains) made the data engineering behind a conceptually simple framework tangible. The same join logic appears in clinical cohort construction from EHR data.

My contributions (team of 4)

PySpark · AWS EMR · AWS S3 · Hadoop/YARN · EC2 · pandas

View PySpark EDA, RFM pipeline, and AWS architecture.

60-cell analysis, RFM pipeline script, and AWS architecture diagram.

GitHub →
Get in Touch