// Big Data · PySpark · RFM Analysis · AWS EMR

Big Data Customer Segmentation — RFM Pipeline on AWS

541,909 e-commerce transactions analysed on AWS S3 + EMR using PySpark. Complete 60-cell EDA and RFM (Recency-Frequency-Monetary) feature engineering pipeline — the same distributed infrastructure used for clinical EHR cohort analysis and population health data pipelines.

541K Transactions · RFM Pipeline · AWS S3 + EMR · PySpark · Hadoop/YARN · Intro to Big Data, Spring 2025

Overview

541,909 e-commerce transactions — RFM segmentation on AWS at distributed scale

My contribution: the complete 60-cell PySpark EDA and RFM feature engineering pipeline. The data engineering infrastructure mirrors clinical data pipelines — EHR records, PACS archives, and claims data have the same distributed compute pattern.

The entire pipeline runs on AWS: raw data on S3, computation on EMR Spark clusters, outputs saved back to S3. At 541,909 transactions, the workload benefits from distributed parallelism that single-machine pandas cannot provide. PySpark on distributed EMR is the same infrastructure used for large-scale EHR cohort analysis and population health data pipelines.
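That S3 → EMR → S3 flow can be sketched as a job skeleton. This is a configuration sketch, not the project's actual script: bucket paths and the dropped-column choice are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Job skeleton for the S3 -> EMR -> S3 flow. Bucket and key names are
# hypothetical placeholders, not the project's actual locations.
spark = SparkSession.builder.appName("rfm-pipeline").getOrCreate()

# Read raw transactions from S3 (schema inferred from the CSV header).
raw = spark.read.csv("s3://<bucket>/raw/online_retail.csv",
                     header=True, inferSchema=True)

# Transformations (cleaning, EDA aggregates, RFM features) are lazy:
# Spark only builds a plan here, nothing executes yet.
cleaned = raw.dropna(subset=["CustomerID"])

# The write is an action; this is where the cluster actually runs the
# plan and persists the output back to S3.
cleaned.write.mode("overwrite").parquet("s3://<bucket>/output/cleaned/")
```

On EMR this runs unchanged via spark-submit; the lazy-transformation/action split is what makes the read-transform-write shape cheap to compose.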

Most operationally significant finding: A single customer (ID 16547) accounts for £1,772,220, roughly 18% of all revenue across 541,909 transactions. That level of concentration makes standard clustering approaches fragile without outlier handling. It is analogous to outlier patients in clinical datasets (e.g., ICU readmissions): extreme cases demand a deliberate inclusion decision, not just preprocessing automation.

Transactions: 541K (Dec 2010 – Dec 2011)
Top Customer: £1.77M (~18% of all revenue)
Platform: AWS EMR (S3 + Spark cluster)

RFM Pipeline

Recency, Frequency, Monetary — the segmentation features

Recency (R): Days since last purchase, relative to the latest invoice date in the dataset. Lower = more recent = higher engagement value.
Frequency (F): Count of unique invoice numbers per customer. High frequency = a loyal, engaged customer relationship.
Monetary (M): Total revenue (Quantity × UnitPrice) per customer. Identifies high-value vs low-value segments.

What this project built that transfers to clinical AI

Distributed computing on real cloud infrastructure: First project running actual Spark jobs on AWS EMR with multiple EC2 nodes and data on S3. Understanding how Spark's lazy evaluation, partitioning, and shuffle operations behave at scale versus single-node PySpark — the same infrastructure used for clinical data lake pipelines.

Outlier detection as a business/clinical decision: When Customer 16547 accounts for 18% of all revenue, including or excluding them isn't a statistical decision — it's a domain decision. The same framing applies to clinical outliers: ICU readmissions, rare diagnoses, or extreme lab values. Data decisions have real-world consequences and should be made with domain reasoning, not just automated preprocessing.
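The concentration check behind that decision can be sketched single-node in pandas on toy data. The customer IDs and amounts below are illustrative, and the 10% threshold is an assumed cutoff, not the project's actual rule.

```python
import pandas as pd

# Toy transactions (IDs and amounts are illustrative, not the real data).
tx = pd.DataFrame({
    "CustomerID": ["16547", "16547", "12583", "17850"],
    "Quantity":   [1000, 800, 10, 5],
    "UnitPrice":  [2.0, 2.5, 3.0, 4.0],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Revenue share per customer.
per_customer = tx.groupby("CustomerID")["Revenue"].sum()
share = per_customer / per_customer.sum()

# Flag anyone above a concentration threshold (10% here is an assumed
# cutoff) so inclusion/exclusion becomes an explicit, reviewable decision.
flagged = share[share > 0.10].sort_values(ascending=False)
print(flagged)
```

The point of the flag is to surface the decision, not make it: whether a flagged customer (or patient) stays in the clustering set is settled by domain reasoning.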

RFM as a segmentation framework: Building Recency/Frequency/Monetary features from raw transaction records in PySpark (groupBy + agg + join chains) made the data engineering behind a conceptually simple framework tangible. The same join logic appears in clinical cohort construction from EHR data.

My contributions (team of 4)

PySpark · AWS EMR · AWS S3 · Hadoop/YARN · EC2 · pandas

View PySpark EDA, RFM pipeline, and AWS architecture.

60-cell analysis, RFM pipeline script, and AWS architecture diagram.

GitHub →
Get in Touch