Reduce the dataset by applying a KNN segmentation.

Parameters:

When working with transactional data, sampling — even stratified sampling — can lead to the loss of valuable information during the creation of predictive models. A better approach to maintain data quality while keeping the dataset size manageable is to group transactions from customers who exhibit similar purchase behavior.
Stratified sampling is useful when we aim to preserve the representativeness of a sample for inference purposes. However, it’s incorrect to assume that individuals of the same age, region, or ethnic background will have the same preferences or purchasing habits.
For example, it doesn’t significantly matter that on Fridays Daniel buys 6 beers, 2 Camembert, and 1 bread, while Frank buys 2 beers, 2 Brie, and 1 bread — they demonstrate essentially the same purchase behavior.
This technique reduces dataset size by identifying and grouping transactions that are behaviorally similar.
Algorithms:
Distance Definition:
Primary Key:
Unique transaction identifier.
Variables:
Set of columns used to calculate distances between rows.
