Python template.

Parameters:
- Developer mode
- Input table partitioning (1st pin): Whole table
- Add custom parameters (Add button)
- Append columns to input table
See the dedicated page for more information on each parameter.

pythonTemplate is a scripted action; the embedded code is accessible and customizable through the Code tab.
A general-purpose Python action to transform a table row-by-row (or in bulk), optionally installing extra pip packages and appending the result to the input.
- Takes one input table on Pin 0 and exposes it to your Python code as a pandas DataFrame named idTable. (pandas and NumPy are pre-imported; you don’t need to import them.)
- Optionally installs extra Python packages you specify before your code runs.
- Runs your transformation once per partition (or once for the whole table if no partitioning is set).
- Emits a result table that can be:
  - only your new columns, or
  - your new columns appended to the original input (controlled by the Append columns to input table toggle, a.k.a. idJoin).
Note: the Code tab appears only when Developer mode is enabled.
Think of it as three guarantees the engine needs from you:
- Shape: your code must produce a DataFrame whose row count matches the input partition currently being processed. If you’re appending to the input, the index alignment should match (the scaffold handles this for you).
- Schema: you declare the names of the new columns in the definitions section; your result must contain exactly those names, in that order.
- Determinism per partition: if you enable partitioning, your logic should not depend on rows that live in other partitions (unless you design an explicit aggregation strategy that is partition-safe).
When those three are true, the box returns a clean, predictable table every run.
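Here is a minimal sketch of a body that honors all three guarantees, assuming the box exposes the input as idTable; the function wrapper and the declared column name row_complete are illustrative, not the box’s actual scaffold:

```python
import pandas as pd  # pre-imported by the box; shown for self-containment

def transform(idTable: pd.DataFrame) -> pd.DataFrame:
    # Shape: build the result on the input's index, one row per input row.
    out = pd.DataFrame(index=idTable.index)
    # Schema: exactly the declared column names, in the declared order.
    out["row_complete"] = idTable.notna().all(axis=1)
    # Determinism: only rows of the current partition are consulted.
    return out
```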
Below are common patterns you can compose to get the results you want. They are described conceptually so you can adapt them to your own logic.
¶ Add derived columns
- Define one or more output column names.
- Compute values from existing columns in idTable (e.g., arithmetic, text transforms, dates).
- Return a DataFrame with those new columns only.
- With Append columns to input:
  - ON → original columns + your derived columns
  - OFF → only your derived columns (useful for isolated features or downstream joins)
A sketch of this pattern follows.
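The snippet below is illustrative, not the box’s exact scaffold: idTable is stubbed with toy data so it runs standalone, and the column names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the input the box would supply as `idTable`.
idTable = pd.DataFrame({"unit_price": [2.5, 4.0], "quantity": [3, 1],
                        "name": ["ada", "grace"]})

out = pd.DataFrame(index=idTable.index)                           # one row per input row
out["total_price"] = idTable["unit_price"] * idTable["quantity"]  # arithmetic
out["name_upper"] = idTable["name"].str.upper()                   # text transform
# Append ON: the box joins `out` onto the original columns.
# Append OFF: only total_price and name_upper are emitted.
```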
¶ Normalize or standardize
- Create columns that represent normalized versions of existing features (scales, min–max, z-scores, buckets).
- Keep partitioning off if your normalization requires the whole dataset’s statistics; with partitioning on, be deliberate about computing per-partition statistics. A sketch follows.
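For instance, z-score and min–max columns might look like this (a sketch with a hypothetical amount column; the statistics span whichever rows the code sees, so leave partitioning off for whole-dataset stats):

```python
import pandas as pd

idTable = pd.DataFrame({"amount": [10.0, 20.0, 40.0]})  # toy input

out = pd.DataFrame(index=idTable.index)
col = idTable["amount"]
out["amount_z"] = (col - col.mean()) / col.std()                    # z-score
out["amount_minmax"] = (col - col.min()) / (col.max() - col.min())  # min–max
```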
¶ Categorize
- Map raw values into categories (e.g., bins, labels, domain-specific groups).
- Ensure the resulting categories are represented as strings or categoricals in your output schema (see the sketch below).
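A sketch using pandas binning; the bin edges and labels are illustrative:

```python
import pandas as pd

idTable = pd.DataFrame({"age": [12, 34, 67]})  # toy input

out = pd.DataFrame(index=idTable.index)
# pd.cut yields a categorical; cast to str if your declared schema expects text.
out["age_band"] = pd.cut(idTable["age"],
                         bins=[0, 18, 65, 120],
                         labels=["minor", "adult", "senior"]).astype(str)
```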
¶ Text features
- After installing the text packages you need (declared in the packages list), compute text features (lengths, flags, extracted attributes), then emit them as your declared columns. A sketch follows.
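The features below need only built-in pandas string ops; heavier NLP features would come from whatever you list in packages. Column names are hypothetical:

```python
import pandas as pd

idTable = pd.DataFrame({"comment": ["great service!", "", "refund please"]})

out = pd.DataFrame(index=idTable.index)
out["comment_len"] = idTable["comment"].str.len()                   # length
out["is_empty"] = idTable["comment"].str.strip().eq("")             # flag
out["mentions_refund"] = idTable["comment"].str.contains("refund")  # extracted attribute
```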
¶ Date features
- From date columns in idTable, derive day/week/month, elapsed durations, or window-based flags (see the sketch below).
- If you need calendar-aware logic, install a date library and use it before you compute the outputs.
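A sketch using pandas’ .dt accessor; the order_date column and the reference date are hypothetical:

```python
import pandas as pd

idTable = pd.DataFrame({"order_date": ["2024-01-05", "2024-03-20"]})  # toy input

out = pd.DataFrame(index=idTable.index)
d = pd.to_datetime(idTable["order_date"])
out["order_day"] = d.dt.day_name()                              # day
out["order_week"] = d.dt.isocalendar().week.astype("int64")     # ISO week
out["days_elapsed"] = (pd.Timestamp("2024-06-01") - d).dt.days  # duration
```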
¶ Quality flags
- Emit boolean/indicator columns for missing, out-of-range, or contradictory values to drive downstream quality rules. A sketch follows.
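A hedged sketch; the columns and the valid range are hypothetical:

```python
import pandas as pd

idTable = pd.DataFrame({"age": [25, -3, None],
                        "start": [1, 4, 3], "end": [5, 2, 9]})  # toy input

out = pd.DataFrame(index=idTable.index)
out["age_missing"] = idTable["age"].isna()                      # missing
out["age_out_of_range"] = ~idTable["age"].between(0, 120)       # out-of-range (NaN flags too)
out["dates_contradictory"] = idTable["end"] < idTable["start"]  # contradictory
```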
¶ Aggregations
- If you aggregate (grouping, rolling windows), choose one:
  - Partition-safe approach: compute only within the current partition (repeatable per chunk).
  - Whole-table stats: leave partitioning set to none so the calculation sees the entire input.
- Align aggregated results back to each row if you intend to append them (see the sketch below).
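groupby().transform() returns one value per input row, already aligned to the input index, which makes the result safe to append. The region grouping is hypothetical; run with partitioning set to none if the stats must span the whole table:

```python
import pandas as pd

idTable = pd.DataFrame({"region": ["n", "n", "s"], "sales": [10, 30, 20]})

out = pd.DataFrame(index=idTable.index)
# transform("mean") broadcasts each group's mean back to its member rows.
out["region_avg_sales"] = idTable.groupby("region")["sales"].transform("mean")
out["vs_region_avg"] = idTable["sales"] - out["region_avg_sales"]
```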
¶ Partitioning: when to use it (and when not)
- Use none (the default) for most feature engineering so your code sees the entire table and produces globally consistent outputs.
- Use partitioning when the input is very large and your logic is strictly row-local or partition-local (e.g., pure per-row transforms). Your result must still have the same number of rows as the partition being processed.
¶ Dependencies (packages) and reproducibility
- You can list extra modules to install (the box performs a pip install prior to running your code).
- Prefer explicit versions (== pins or bounded ranges, as in the example below) to keep runs reproducible.
- If a package fails to install, the run stops and the error message in the Log tab indicates why (version conflict, network, etc.).
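For instance, a pinned packages list might look like the following; the package names and versions are purely illustrative, and the exact list format depends on the box’s configuration:

```
rapidfuzz==3.9.6
python-dateutil>=2.8,<3
```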
¶ Validate the output
- Preview output (Pin 0) and confirm:
  - Row count equals input.
  - Column names are exactly as you declared (spelling and casing matter).
  - Data types appear as intended (numbers, booleans, strings, datetimes).
- Toggle append to confirm both return modes work (append vs. return-only-new-columns).
- Re-run with partitioning OFF vs. ON (if you plan to use partitioning) and verify the result is as designed. The assertion sketch below can automate the first three checks.
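While developing, you can enforce the checklist with assertions; here idTable and out stand for the input and your result, and the declared names are hypothetical:

```python
declared = ["total_price", "name_upper"]  # hypothetical declared schema

assert len(out) == len(idTable), "row count must match the input (per partition)"
assert list(out.columns) == declared, "column names/order must match declarations"
assert str(out["total_price"].dtype) == "float64", "unexpected dtype"
```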
¶ Performance tips
- Prefer vectorized operations over per-row Python loops (see the sketch below).
- Select only the columns you need from idTable for heavy logic to reduce memory pressure.
- Be conscious of dtypes (especially large-text columns).
- Keep side effects (filesystem writes, network calls) minimal or controlled; the box is designed for pure table transformations.
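To illustrate the first two tips (toy data, hypothetical columns):

```python
import pandas as pd

idTable = pd.DataFrame({"a": range(5), "b": range(5), "notes": ["x"] * 5})

# Slow: per-row Python loop.
# totals = [row.a + row.b for row in idTable.itertuples()]

# Fast: vectorized, selecting only the columns the logic needs.
cols = idTable[["a", "b"]]
out = pd.DataFrame(index=idTable.index)
out["total"] = cols["a"] + cols["b"]
```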
¶ Troubleshooting
- “Column mismatch” → The DataFrame you produced doesn’t have exactly the declared new column names. Adjust the names (and their order) so the schema matches.
- “Row count mismatch” → Ensure the transformation returns one row per input row (per partition).
- “Package not found / install error” → Pin or change the package version in your packages list; check the Log tab for the exact error text.
- “Unexpected types in output” → Cast explicitly to the types you want before returning (see the sketch below).
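Explicit casts before returning might look like this (toy data, hypothetical columns):

```python
import pandas as pd

out = pd.DataFrame({"is_valid": [1, 0], "n_items": ["3", "7"],
                    "when": ["2024-01-01", "2024-02-02"]})

out["is_valid"] = out["is_valid"].astype(bool)
out["n_items"] = out["n_items"].astype("int64")
out["when"] = pd.to_datetime(out["when"])
```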
Design the columns you want to produce, make sure your output has exactly those columns and the same number of rows as the input, and choose whether to append them to the source table or output them standalone. The rest—packages, partitioning, and performance—are levers you use to match your data size and reproducibility needs.
