Identifies outliers in a dataset.

Parameters:

See dedicated page for more information.
R_MAH identifies multivariate outliers in a dataset using the Mahalanobis distance and a Chi-square significance threshold.
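For reference, the textbook definition of the squared Mahalanobis distance and the usual Chi-square flagging rule can be written as follows. The symbols are notation introduced here, and whether R_MAH estimates the center and covariance exactly this way is an assumption.

$$
D_i^2 = (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu), \qquad \text{row } i \text{ is flagged when } D_i^2 > \chi^2_{k,\,q}
$$

Here $x_i$ is the vector of selected values for row $i$, $\mu$ and $\Sigma$ are the mean vector and covariance matrix of the selected columns, $k$ is the number of selected columns, and $\chi^2_{k,\,q}$ is the $q$-quantile of the Chi-square distribution with $k$ degrees of freedom, where $q$ is the Chi Square Threshold. The rule relies on $D_i^2$ being approximately Chi-square distributed when the data are close to multivariate normal.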
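Unlike univariate rules (e.g., a z-score on a single column), Mahalanobis distance measures how far each record lies from the multivariate center of the selected variables while taking their covariance structure into account. This means it can flag a record whose individual values look unremarkable but whose combination is unusual given how the variables move together.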
Partitioning note
In many real-world datasets, “outliers” only make sense within a segment (by product line, region, customer cohort, etc.). R_MAH supports partitioned execution so the distance is computed independently per group, preventing global patterns from hiding local anomalies. Ensure the data is sorted by the partition key when you use column-based partitioning.
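One plausible reason for the sorting requirement is that column-based partitioning treats each contiguous run of equal key values as its own partition. That is an assumption about R_MAH's internals, not documented behavior. The sketch below uses Python's `itertools.groupby`, which has the same run-based semantics, on hypothetical `(region, amount)` rows:

```python
from itertools import groupby

def contiguous_partitions(rows, key):
    """Split rows into runs of consecutive rows that share the same key value."""
    return [(k, list(g)) for k, g in groupby(rows, key=key)]

# Hypothetical rows: (region, amount)
rows = [("EU", 10.0), ("US", 12.0), ("EU", 11.0), ("US", 13.0)]
by_region = lambda r: r[0]

unsorted_parts = contiguous_partitions(rows, key=by_region)
sorted_parts = contiguous_partitions(sorted(rows, key=by_region), key=by_region)

print(len(unsorted_parts))  # unsorted input: every region is fragmented into several runs
print(len(sorted_parts))    # sorted input: one run per region
```

If partitions were formed this way, unsorted input would yield many tiny groups, and distances computed inside them would be meaningless.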
R_MAH enriches the input table with:
MAH Outliers: the Mahalanobis distance (or squared distance, depending on implementation) for each row, based on the selected feature set.
Outlier Outliers: an indicator (0/1) showing whether the row is flagged as an outlier at the selected Chi-square threshold.
These columns appear alongside the original input columns and can be forwarded to downstream actions (filtering, labeling, exporting, etc.).
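To make the two output columns concrete, here is a minimal NumPy sketch of how a squared distance and a 0/1 flag could be derived. It is not R_MAH's actual implementation: the function name, the use of the sample mean and sample covariance, and the synthetic data are all assumptions. The cutoff uses the closed-form Chi-square quantile that holds for exactly two variables; for other column counts a statistics library would be needed.

```python
import math
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))   # sample covariance (ddof=1)
    inv_cov = np.linalg.inv(cov)                   # raises LinAlgError if singular
    # Row-wise quadratic form: centered[i] @ inv_cov @ centered[i]
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Hypothetical data: 200 positively correlated points plus one injected row
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X = np.vstack([X, [2.0, -2.0]])   # about 2 std devs per column, but against the correlation

d2 = mahalanobis_sq(X)                      # analogous to "MAH Outliers" (squared form)
cutoff = -2.0 * math.log(1.0 - 0.99)        # Chi-square 0.99 quantile, valid for 2 variables only
flag = (d2 > cutoff).astype(int)            # analogous to "Outlier Outliers"

print(d2[-1] > cutoff, int(flag.sum()))
```

The injected row would pass a per-column z-score check, yet its squared distance is far above the cutoff because it contradicts the positive correlation between the two columns.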

| Section / Id | UI label | What it does | Notes & tips |
|---|---|---|---|
| Partitioning (top of panel) | none / by column / fixed number of lines (excluding last) | Controls how the input is split before computing distances. | Use by column to compute outliers per group (e.g., per category). Make sure the dataset is sorted by the partition column upstream. |
| idxVars | Columns on which to identify outliers | The numeric variables used to compute Mahalanobis distances. | Choose only numeric, non-collinear columns. Remove IDs or near-constant fields; they destabilize the covariance matrix. |
| cN | Construct Name | Base name for the computed fields. | Default is Outliers. Downstream columns will use this base (e.g., MAH Outliers, Outlier Outliers). |
| CS | Chi Square Threshold (.975 for 5% significance) | The quantile of the Chi-square distribution used for flagging. | Typical values: 0.975, 0.99, 0.999 (upper-tail probabilities of 2.5%, 1%, and 0.1%). Higher values → fewer, more severe outliers. Lower values → more outliers. |
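
As an illustration of how the Chi Square Threshold translates into a cutoff on the squared distance, the sketch below covers the special case of two selected columns, where the Chi-square quantile has the closed form `-2 * ln(1 - q)`. The function name is hypothetical, and with a different number of columns the cutoffs change, so treat the numbers as indicative only.

```python
import math

def chi2_cutoff_2vars(cs):
    """Quantile `cs` of the Chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - cs)

for cs in (0.975, 0.99, 0.999):
    cutoff = chi2_cutoff_2vars(cs)
    expected_share = 1.0 - cs   # share of rows flagged if the data were exactly normal
    print(f"CS={cs}: flag when D^2 > {cutoff:.2f} (about {expected_share:.1%} of rows)")
```

Raising the threshold moves the cutoff further out, which is why higher values flag fewer rows.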
Connect input to R_MAH (e.g., from a CSV reader).
Open Parameters:
Columns on which to identify outliers (idxVars): pick the numeric features to analyze.
Construct Name (cN): keep the default or provide a label (e.g., Outliers).
Chi Square Threshold (CS): keep 0.99 or set it to your desired sensitivity.
Run the pipeline.
Inspect the Data tab: you’ll see MAH Outliers (distance) and Outlier Outliers (0/1 flag).
(Optional) Filter rows where Outlier Outliers = 1 for review, remediation, or downstream routing.
Distance scale: Larger MAH Outliers values indicate more extreme observations. Use the binary flag for primary decisions and the distance for ranking.
Threshold tuning: Start with 0.99. If you get too many false positives, raise it (e.g., 0.995/0.999). If you miss anomalies, lower it (e.g., 0.975).
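Feature hygiene: Select only numeric, non-collinear columns, and remove IDs or near-constant fields, which destabilize the covariance matrix.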
Segmented detection: Prefer partitioned runs when business logic is segment-dependent (e.g., “expensive” only within a product tier).
Post-processing: Combine the outlier flag with domain rules (capacity limits, business constraints) before taking automated actions.
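The ranking and post-processing tips can be sketched in plain Python on exported rows. The `id` and `amount` fields, the sample values, and the 1000 limit are hypothetical; only the two R_MAH column names come from this page.

```python
# Hypothetical rows as they might look after R_MAH has added its two columns
rows = [
    {"id": 1, "amount": 120.0,  "MAH Outliers": 2.1,  "Outlier Outliers": 0},
    {"id": 2, "amount": 1500.0, "MAH Outliers": 14.7, "Outlier Outliers": 1},
    {"id": 3, "amount": 300.0,  "MAH Outliers": 22.4, "Outlier Outliers": 1},
    {"id": 4, "amount": 90.0,   "MAH Outliers": 0.8,  "Outlier Outliers": 0},
]

# Use the binary flag for the primary decision
flagged = [r for r in rows if r["Outlier Outliers"] == 1]

# Use the distance to rank flagged rows, most extreme first
ranked = sorted(flagged, key=lambda r: r["MAH Outliers"], reverse=True)

# Combine with a domain rule before acting automatically (illustrative limit)
AMOUNT_LIMIT = 1000.0
auto_action = [r["id"] for r in ranked if r["amount"] > AMOUNT_LIMIT]
manual_review = [r["id"] for r in ranked if r["amount"] <= AMOUNT_LIMIT]

print(auto_action, manual_review)
```

Keeping the statistical flag and the business rule as separate steps makes it easier to adjust either one without rerunning the other.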
All distances are 0 or NA
Many rows flagged unexpectedly
Raise CS to a stricter value (e.g., 0.995 or 0.999).
“Singular matrix / covariance not invertible” (if shown in logs)
Segment-wise behavior missing
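For the “Singular matrix / covariance not invertible” case listed above, a quick check outside the tool can show whether the selected columns are the cause. This is a sketch with a hypothetical helper name and made-up data. It reports constant columns and compares the rank of the covariance matrix with the number of columns; a lower rank indicates redundant (collinear) features.

```python
import numpy as np

def covariance_diagnostics(X):
    """Report signs that the covariance matrix of X is not invertible."""
    X = np.asarray(X, dtype=float)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    constant = [j for j in range(X.shape[1]) if np.all(X[:, j] == X[0, j])]
    return {
        "n_columns": X.shape[1],
        "rank": int(np.linalg.matrix_rank(cov)),   # rank < n_columns means singular
        "constant_columns": constant,
    }

# Hypothetical data: the third column is the sum of the first two (collinear)
X = np.array([[1, 2, 3], [2, 1, 3], [3, 5, 8], [4, 3, 7], [5, 6, 11]])
report = covariance_diagnostics(X)
print(report)
```

Dropping one of the redundant columns, or any constant column, before running the action is the usual remedy.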
R_MAH provides a robust, statistically grounded way to detect multivariate anomalies. By leveraging Mahalanobis distance and a configurable Chi-square threshold—plus optional partitioned execution—it surfaces records that don’t fit the joint distribution of your chosen features. Use the distance to rank anomalies and the binary flag to act on them, adjusting sensitivity with the threshold and grouping strategy to match your business context.
